Commit 9fa09e8d authored by Bharath Ramsundar

Tutorial updates

parent 21d20383
+33 −19
%% Cell type:markdown id: tags:

# How to train your DragoNN tutorial

**Tutorial length**: 25-30 minutes with a CPU.

## Outline
    * How to use this tutorial
    * Review of patterns in transcription factor binding sites
    * Learning to localize homotypic motif density
    * Sequence model definition
    * Training and interpretation of
        - single layer, single filter DragoNN
        - single layer, multiple filters DragoNN
        - Multi-layer DragoNN
        - Regularized multi-layer DragoNN
    * Critical questions in this tutorial:
        - What is the "right" way to get insight from a DragoNN model?
        - What are the limitations of different interpretation methods?
        - Do those limitations depend on the model and the target pattern?
    * Suggestions for further exploration

GitHub issues on the dragonn repository with feedback, questions, and discussion are always welcome.


## How to use this tutorial

This tutorial uses a Jupyter/IPython Notebook, an interactive computational environment that combines live code, visualizations, and explanatory text. The notebook is organized into a series of cells. You can run the next cell by clicking the play button:
![play button](./tutorial_images/play_button.png)
You can also run all cells in a series by clicking "run all" in the Cell drop-down menu:
![play all button](./tutorial_images/play_all_button.png)
Half of the cells in this tutorial contain code; the other half contain visualizations and explanatory text. Code, visualizations, and text in cells can all be modified, and you are encouraged to modify the code as you advance through the tutorial. You can inspect the implementation of a function used in a cell by following these steps:
![inspecting code](./tutorial_images/inspecting_code.png)

We start by loading dragonn's tutorial utilities and reviewing properties of regulatory sequence that transcription factors bind.

%% Cell type:code id: tags:

``` python
%reload_ext autoreload
%autoreload 2
#from tutorial_utils import *
%matplotlib inline
```

%% Cell type:markdown id: tags:

![sequence properties 1](./tutorial_images/sequence_properties_1.jpg)
![sequence properties 2](./tutorial_images/sequence_properties_2.jpg)

# Learning to localize a homotypic motif density
In this tutorial we will learn how to localize a homotypic motif cluster. We will simulate a positive set of sequences with multiple instances of a motif in the center and a negative set of sequences with multiple motif instances positioned anywhere in the sequence:
![homotypic motif density localization](./tutorial_images/homotypic_motif_density_localization.jpg)
We will then train a binary classification model to classify the simulated sequences. To solve this task, the model will need to learn the motif pattern and whether instances of that pattern are present in the central part of the sequence.
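
To make the two classes concrete, here is a minimal sketch of the idea, not DragoNN's actual simulator. The helper names, the fixed motif string `CAGCTG`, and the center boundaries (425 to 575) are assumptions chosen for illustration; the real simulation samples motif instances from the TAL1 motif model.

```python
import random

def random_background(length, gc_fraction, rng):
    """Sample a background sequence (as a list of bases) with the given GC fraction."""
    weights = [(1 - gc_fraction) / 2, gc_fraction / 2,
               gc_fraction / 2, (1 - gc_fraction) / 2]  # A, C, G, T
    return rng.choices("ACGT", weights=weights, k=length)

def embed_motifs(seq_length, motif, count, lo, hi, gc_fraction, rng):
    """Embed `count` copies of `motif` so that each copy lies fully inside [lo, hi)."""
    seq = random_background(seq_length, gc_fraction, rng)
    starts = []
    for _ in range(count):
        start = rng.randint(lo, hi - len(motif))  # inclusive bounds
        seq[start:start + len(motif)] = motif     # later copies may overwrite earlier ones
        starts.append(start)
    return "".join(seq), starts

rng = random.Random(0)
motif = "CAGCTG"                 # hypothetical stand-in for a sampled motif instance
center_lo, center_hi = 425, 575  # assumed central 150 bp of a 1000 bp sequence

# Positive class: motifs restricted to the center. Negative class: motifs anywhere.
pos_seq, pos_starts = embed_motifs(1000, motif, 3, center_lo, center_hi, 0.4, rng)
neg_seq, neg_starts = embed_motifs(1000, motif, 3, 0, 1000, 0.4, rng)
```

Both classes contain the same number of motif instances, so the only signal separating them is where the instances sit.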

We start by getting the simulation data.

%% Cell type:markdown id: tags:

## Getting simulation data

DragoNN provides a set of simulation functions. We will use the simulate_motif_density_localization function to simulate homotypic motif density localization. First, we obtain documentation for the simulation parameters.

%% Cell type:code id: tags:

``` python
#print_simulation_info("simulate_motif_density_localization")
from simulations import simulate_motif_density_localization
print(simulate_motif_density_localization.__doc__)
```

%% Output

    
        Simulates two classes of seqeuences:
            - Positive class sequences with multiple motif instances
              in center of the sequence.
            - Negative class sequences with multiple motif instances
              anywhere in the sequence.
        The number of motif instances is uniformly sampled
        between minimum and maximum motif counts.
    
        Parameters
        ----------
        motif_name : str
            encode motif name
        seq_length : int
            length of sequence
        center_size : int
            length of central part of the sequence where motifs can be positioned
        min_motif_counts : int
            minimum number of motif instances
        max_motif_counts : int
            maximum number of motif instances
        num_pos : int
            number of positive class sequences
        num_neg : int
            number of negative class sequences
        GC_fraction : float
            GC fraction in background sequence
    
        Returns
        -------
        sequence_arr : 1darray
            Contains sequence strings.
        y : 1darray
            Contains labels.
        embedding_arr: 1darray
            Array of embedding objects.
    

%% Cell type:markdown id: tags:

Next, we define parameters for a TAL1 motif density localization in 1000bp long sequences, with 0.4 GC fraction, and 2-4 instances of the motif in the central 150bp for the positive sequences. We simulate a total of 3000 positive and 3000 negative sequences.

%% Cell type:code id: tags:

``` python
motif_density_localization_simulation_parameters = {
    "motif_name": "TAL1_known4",
    "seq_length": 1000,
    "center_size": 150,
    "min_motif_counts": 2,
    "max_motif_counts": 4,
    "num_pos": 3000,
    "num_neg": 3000,
    "GC_fraction": 0.4}
```

%% Cell type:markdown id: tags:

We get the simulation data by calling simulate_motif_density_localization with the simulation parameters as inputs. The following cell then holds out 30% of the 6000 sequences and divides the held-out part evenly, which leaves 4200 sequences in the training set, 900 in the validation set, and 900 in the test set.
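
The split cell below calls train_test_split twice, first with test_size=.3 and then with test_size=.5 on the held-out part. The resulting set sizes can be checked with a small index-based sketch. This is an illustration of the arithmetic only, not scikit-learn's implementation, and `three_way_split` is a hypothetical helper.

```python
import numpy as np

def three_way_split(n, held_out_fraction=0.3, seed=0):
    """Shuffle indices, hold out a fraction, and halve the held-out part."""
    idx = np.random.default_rng(seed).permutation(n)
    n_held_out = int(round(n * held_out_fraction))
    n_valid = n_held_out // 2
    n_train = n - n_held_out
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return train, valid, test

train_idx, valid_idx, test_idx = three_way_split(6000)
print(len(train_idx), len(valid_idx), len(test_idx))  # 4200 900 900
```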

%% Cell type:code id: tags:

``` python
#simulation_data = get_simulation_data("simulate_motif_density_localization",
#                                      motif_density_localization_simulation_parameters,
#                                      validation_set_size=1000, test_set_size=1000)
sequences, y, embed = simulate_motif_density_localization(**motif_density_localization_simulation_parameters)
```

%% Cell type:code id: tags:

``` python
import deepchem as dc
from utils import one_hot_encode
from sklearn.model_selection import train_test_split

splitter = dc.splits.RandomSplitter()
X = one_hot_encode(sequences)

X_train, X_rem, y_train, y_rem, embed_train, embed_rem = train_test_split(X, y, embed, test_size=.3)
X_valid, X_test, y_valid, y_test, embed_valid, embed_test = train_test_split(X_rem, y_rem, embed_rem, test_size=.5)

train = dc.data.NumpyDataset(X_train, y_train)
valid = dc.data.NumpyDataset(X_valid, y_valid)
test = dc.data.NumpyDataset(X_test, y_test)
print(type(X), type(y), type(embed))
print(X.shape, y.shape, len(embed))
print(sequences[:1])
print(y[:1])
print([print(emb) for emb in embed[0]])

print("X_train.shape")
print(X_train.shape)
print("X_valid.shape")
print(X_valid.shape)
print("X_test.shape")
print(X_test.shape)
print("y_train.shape")
print(y_train.shape)
print("y_valid.shape")
print(y_valid.shape)
print("y_test.shape")
print(y_test.shape)

print(X.shape)
print(X[0, :, :, :10])
```

%% Output

    <class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'list'>
    (6000, 1, 4, 1000) (6000, 1) 6000
    [ 'CAATCATTATCTTGCCATGTTGAAAGGGATAACATTTAGCATGATGACAATTTGGTCATAATAATGAGCTATTTTGTGGAGTGGGTGGAAATTTAAGTTGTACCTGGGGTTAATCAAATGACATATCTTGCATTATGAGAGTCGATCGGTATAAGGACCCCGTCCGGCATACGTTAGTACAATCATTTGATTTATACAGTTAGCGTAGCAAAGCGTATTCCTTGTATCTGATATTTTAACTGGCAATGATAATACCGGATTCGGTTAGGAACGGGTTCGATGCGTTCCGCCATAGTCTATTTAATCATATTTTACGTTATTGTCCACGTACTCGCTATTAGGTTCCTGTGTAAGTACCATTGGGTTTGCTGAATTTGCCGTGGAAGGAGGAAGCCATCTATCAATAAAAGAATAGCGCCTACTCCCGCTCATAGAGCCTATAGAAAGATCAGCTGGACTCCGGCATATCAGCTGGAGATCAGATGGCGCTCACCAGCAATGTAAATTTTAAGAGAGTGCTTGTTAACTTATGGAGATAATTTTTAAGCTCAGCTGAAGAAGCAGCTGGAAACTTAGTGGCCAATGCGAGATCTTGTGCGTTATGTGTGATCTAGGCGCTTTCGAATCGGTTATGGGGTCAATCTATAACTAATACATATTTATACATGTGACTAAGTGAATAAATCTGTTTATGACAGTGGTCCACCCCGTAATGTGAATCTGTTTAAGGTATAAGGGCTTTTAAAAATATACTATACATTCTAATCTGCTTCACCAAGTCGAAACCTCACTCGAATGTAACCGGAGAGTAAAAGATTTCTCTGTTTTGCAGAAACTGATCGTATTCCTGACAGAAGGATATTTAATTATTGGCTCTTTTCAGTTACTCAACGCCAAACTAATTTAGGCTCATTTCATAAGATCTAAATGGCCTAACGTCATCTTAAGCAGACAAATGATGTTATTTGGTAATAGATGCGGTTCAAATGAATGGAACTTA']
    [ 'ATTACCGTAATCTACTATTAAGTCACAACCAAACAATGGATTACTTTCTGCGTTGGAATCAGTGCCGTGCCAATTGCAGTTGTAGTGCAGTATTTTTGGCATGAGCCCGGGCAAAGTTTTATGAAATAAGCAAGAATCCCACCAATGAGTAAATATGGATTGAGCGCGAATTCTCTTCAATATTGATTTGCCAGCAAGACCTTAACTTCAGTTCTGCTATAATATGTCCATGTTAGAAATTTCGTCGAAATGTCATTAGAATAATCAAATATCTTAACAGAATAGCCATTTAAGTGGGAGCAAATCGGTCCCCTTCCGGGAAAAGAAGATCTAGTATATTTCAGTATATACTTTTTGACATGAATTGGTCCCAAGCGACAAATCCGAAGGGCGTACCACGTTAGAAGTATTCGTCTTGTTTGAAAGTAGTCGAACCAGCTGATTGTCAGCTGGTATTAGCAAATAACAATCAGGTGGATTTGCTTGCAATATGTTTAGGCGCGACCCCTGCGCAAAGTTGTCTCTTCAGGCACCCTGTTTCTGGGCGGTTTTTGCATTGATCCAGTCCGAGTGCGGAATAATGCGTAGTTAGAGATTCATTTATGCCTTAGATCTTGAGAATCTATTCACCAAAGTAATTTGCTAGACCCAATCGATCGTTATTCCGTCTTATAAGACACTTAATAAAAAGTGTGCGTGAATGCGGGACCATCGTTTTATATTTAATCCTAGCACTTAGATAACTCATTTCCAGCCTTGAGCTTTTGTTAACGACGATGCACCTGGATCCATCTTTTTTTAGGTTTTCTGCCTGAAGGTACTAAGAAAGGAATGATACTTATGTTCGGTTTTTAATTGCCTTGATCAAGAACTTTGGGGGTTCGCCGGGTAGTTGCCTTGTTTCCTCTCCCACCAAATGATAGTGCTCTCTTAATAAACTCTTAGTGGCTTAGCGAAACTTTGACGAGTATACTAGCTTACTGATCCATGTTAGCTTAAT']
    [[ True]]
    pos-559_revComp-TAL1_known4-AGCAGCTGGA
    pos-477_revComp-TAL1_known4-ATCAGATGGC
    pos-466_revComp-TAL1_known4-ATCAGCTGGA
    pos-447_revComp-TAL1_known4-ATCAGCTGGA
    [None, None, None, None]
    pos-433_TAL1_known4-ACCAGCTGAT
    pos-468_revComp-TAL1_known4-ATCAGGTGGA
    pos-444_revComp-TAL1_known4-GTCAGCTGGT
    [None, None, None]
    X_train.shape
    (4200, 1, 4, 1000)
    X_valid.shape
    (900, 1, 4, 1000)
    X_test.shape
    (900, 1, 4, 1000)
    y_train.shape
    (4200, 1)
    y_valid.shape
    (900, 1)
    y_test.shape
    (900, 1)
    (6000, 1, 4, 1000)
    [[[0 1 1 0 0 1 0 0 1 0]
      [1 0 0 0 1 0 0 0 0 0]
      [0 0 0 0 0 0 0 0 0 0]
      [0 0 0 1 0 0 1 1 0 1]]]
    [[[1 0 0 1 0 0 0 0 1 1]
      [0 0 0 0 1 1 0 0 0 0]
      [0 0 0 0 0 0 1 0 0 0]
      [0 1 1 0 0 0 0 1 0 0]]]

%% Cell type:markdown id: tags:

The split above provides training, validation, and test sets of input sequences X and sequence labels y. The inputs X are matrices with a one-hot-encoding of the sequences:
![one hot encoding](./tutorial_images/one_hot_encoding.png)
Here are the first 10bp of a sequence in our training data:

%% Cell type:code id: tags:

``` python
#simulation_data.X_train[0, :, :, :10]
X_train[0, :, :, :10]
```

%% Output

    array([[[0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 1, 1, 0, 1, 0],
            [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
            [1, 0, 0, 0, 1, 0, 0, 1, 0, 1]]], dtype=int32)
    array([[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
            [0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
            [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
            [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]]], dtype=int32)

%% Cell type:markdown id: tags:

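This matrix represents a 10bp sequence. Rows correspond to A, C, G, and T (top to bottom) and columns to positions, so the first array above decodes to TAAGTCCTCT. The split is random, so the exact sequence differs between runs.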
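
A minimal sketch of the encoding, assuming the row order A, C, G, T. That order is consistent with the first simulated sequence and the first `X[0, :, :, :10]` matrix printed earlier. The `one_hot` and `decode` helpers here are illustrative and are not the `one_hot_encode` function from `utils`.

```python
import numpy as np

BASES = "ACGT"  # assumed row order

def one_hot(seq):
    """Return a (4, len(seq)) 0/1 matrix with one row per base in BASES."""
    mat = np.zeros((len(BASES), len(seq)), dtype=np.int32)
    for col, base in enumerate(seq):
        mat[BASES.index(base), col] = 1
    return mat

def decode(mat):
    """Invert one_hot by reading off the row index of the 1 in each column."""
    return "".join(BASES[row] for row in mat.argmax(axis=0))

encoded = one_hot("CAATCATTAT")  # first 10 bp of the first sequence printed earlier
print(encoded)
print(decode(encoded))  # CAATCATTAT
```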

# The homotypic motif density localization task
The goal of the model is to take the positive and negative sequences simulated above and classify them:
![classification task](./tutorial_images/homotypic_motif_density_localization_task.jpg)

%% Cell type:markdown id: tags:

# DragoNN Models

A locally connected linear unit in a DragoNN model can represent a position-specific scoring matrix (PSSM) (part a). A sequence PSSM score is obtained by scanning the PSSM across the sequence, thresholding the PSSM scores, and taking the max (part b). A PSSM score can also be computed by a DragoNN model with tiled locally connected linear units, amounting to a convolutional layer with a single convolutional filter representing the PSSM, followed by ReLU thresholding and maxpooling (part c).
![dragonn vs pssm](./tutorial_images/dragonn_and_pssm.jpg)
By utilizing multiple convolutional layers with multiple convolutional filters, DragoNN models can represent a wide range of sequence features in a compositional fashion:
![dragonn model figure](./tutorial_images/dragonn_model_figure.jpg)
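
The equivalence between PSSM scanning and a single convolutional filter followed by ReLU and max pooling can be sketched in NumPy. The toy PSSM below (+1 for the matching base of `CAGCTG`, -1 otherwise) is an assumption for illustration and is not the TAL1 motif used in the simulation.

```python
import numpy as np

def one_hot(seq, bases="ACGT"):
    """Return a (4, len(seq)) one-hot matrix with rows ordered as in `bases`."""
    mat = np.zeros((4, len(seq)))
    for col, base in enumerate(seq):
        mat[bases.index(base), col] = 1.0
    return mat

def pssm_score(pssm, x):
    """Slide the (4, w) PSSM along the (4, L) sequence, apply ReLU, and take the max."""
    w = pssm.shape[1]
    scores = np.array([np.sum(pssm * x[:, i:i + w])
                       for i in range(x.shape[1] - w + 1)])  # convolution, stride 1
    return np.maximum(scores, 0.0).max()                      # ReLU, then global max pool

# Toy PSSM: +1 where the base matches CAGCTG at that position, -1 elsewhere.
pssm = 2.0 * one_hot("CAGCTG") - 1.0

with_motif = pssm_score(pssm, one_hot("TTTCAGCTGTTT"))     # 6.0: one perfect match
without_motif = pssm_score(pssm, one_hot("TTTTTTTTTTTT"))  # 0.0: all windows below threshold
```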

%% Cell type:markdown id: tags:

# Getting a DragoNN model

The main DragoNN model class is SequenceDNN, which provides a simple interface to a range of models and methods to train, test, and interpret DragoNNs. SequenceDNN uses [keras](http://keras.io/), a deep learning library for [Theano](https://github.com/Theano/Theano) and [TensorFlow](https://github.com/tensorflow/tensorflow), which are popular software packages for deep learning.

To get a DragoNN model we:

    1) Define the DragoNN architecture parameters
        - obtain description of architecture parameters using the inspect_SequenceDNN() function
    2) Call the get_SequenceDNN function, which takes as input the DragoNN architecture parameters, and outputs a
    randomly initialized DragoNN model.

%% Cell type:markdown id: tags:

To get a description of the architecture parameters we use the inspect_SequenceDNN function, which outputs documentation for the model class including the architecture parameters:

%% Cell type:code id: tags:

``` python
#inspect_SequenceDNN()
from deepchem.models.tensorgraph.models.sequence_dnn import SequenceDNN
print(SequenceDNN.__doc__)
```

%% Output

    
      Sequence DNN models.
    
      Parameters
      ----------
      seq_length : int
          length of input sequence.
      num_tasks : int, optional
          number of tasks. Default: 1.
      num_filters : list[int] | tuple[int]
          number of convolutional filters in each layer. Default: (15,).
      conv_width : list[int] | tuple[int]
          width of each layer's convolutional filters. Default: (15,).
      pool_width : int
          width of max pooling after the last layer. Default: 35.
      L1 : float
          strength of L1 penalty.
      dropout : float
          dropout probability in every convolutional layer. Default: 0.
      verbose: bool
          Verbose print statements activated if true.
    

%% Cell type:markdown id: tags:

The "Available methods" section displays what can be done with a SequenceDNN model. These include common operations such as training and testing the model, and more complex operations such as extracting insight from trained models. We define a simple DragoNN model with one convolutional layer with one convolutional filter, followed by maxpooling of width 35.
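
As a rough guide to what these parameters imply, the sketch below computes layer output lengths for a 1000bp input. It assumes an unpadded stride-1 convolution and non-overlapping pooling windows; the layers in a given SequenceDNN implementation may use different padding or pooling defaults, so treat the numbers as illustrative.

```python
def conv_output_length(seq_length, conv_width):
    """Positions where an unpadded, stride-1 filter fits entirely inside the sequence."""
    return seq_length - conv_width + 1

def pool_output_length(length, pool_width):
    """Number of complete, non-overlapping pooling windows."""
    return length // pool_width

conv_len = conv_output_length(1000, 10)      # 991
pool_len = pool_output_length(conv_len, 35)  # 28
print(conv_len, pool_len)
```

Under these assumptions the single filter yields 28 pooled values per sequence, each summarizing the best motif match in a 35-position window, and the final dense layer can weight the windows near the center more heavily.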

%% Cell type:code id: tags:

``` python
one_filter_dragonn_parameters = {
    'seq_length': 1000,
    'num_filters': [1],
    'conv_width': [10],
    'pool_width': 35}
```

%% Cell type:markdown id: tags:

We get a randomly initialized DragoNN model by passing one_filter_dragonn_parameters to the SequenceDNN constructor:

%% Cell type:code id: tags:

``` python
one_filter_dragonn = SequenceDNN(**one_filter_dragonn_parameters)
seq_dnn = SequenceDNN(1000, num_filters=1)
```

%% Cell type:markdown id: tags:

## Training a DragoNN model

Next, we train the one_filter_dragonn by calling train_SequenceDNN with one_filter_dragonn and simulation_data as the inputs. In each epoch, the one_filter_dragonn will perform a complete pass over the training data, and update its parameters to minimize the loss, which quantifies the error in the model predictions. After each epoch, the code prints performance metrics for the one_filter_dragonn on the validation data. Training stops once the loss on the validation set stops improving for multiple consecutive epochs. The performance metrics include balanced accuracy, area under the receiver-operating curve ([auROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)), area under the precision-recall curve ([auPRC](https://en.wikipedia.org/wiki/Precision_and_recall)), and recall for multiple false discovery rates (Recall at [FDR](https://en.wikipedia.org/wiki/False_discovery_rate)).
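
Two of these metrics are easy to compute by hand, which helps when reading the training log. The sketch below uses standard definitions (balanced accuracy as the mean of sensitivity and specificity, auROC as the probability that a random positive outscores a random negative). The labels and scores are made-up toy values, and this is not the code DragoNN uses to report metrics.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    sensitivity = np.mean(y_pred[y_true])
    specificity = np.mean(~y_pred[~y_true])
    return 0.5 * (sensitivity + specificity)

def auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 1/2."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y_true = [1, 1, 1, 0, 0, 0]              # toy labels
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]  # toy model outputs

bal_acc = balanced_accuracy(y_true, np.array(scores) > 0.5)  # sensitivity 2/3, specificity 2/3
area = auroc(y_true, scores)                                  # 8 of 9 pairs ranked correctly
```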

%% Cell type:code id: tags:

``` python
#train_SequenceDNN(one_filter_dragonn, simulation_data)
#seq_dnn.build()
seq_dnn.fit(train, "binary_crossentropy", nb_epoch=1)
```

%% Output

    Ending global_step 42: Average loss 0
    TIMING: model fitting took 79.998 s

%% Cell type:markdown id: tags:

A single layer, single filter model gets good performance and doesn't overfit much. Let's look at the learning curve to demonstrate this visually:

%% Cell type:code id: tags:

``` python
SequenceDNN_learning_curve(one_filter_dragonn)
```

%% Output

    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-11-7d6a6884d584> in <module>()
    ----> 1 SequenceDNN_learning_curve(one_filter_dragonn)

    NameError: name 'SequenceDNN_learning_curve' is not defined

%% Cell type:markdown id: tags:

# A multi-filter DragoNN model
Next, we modify the model to have 15 convolutional filters instead of just one filter. How does this model compare to the single filter model?

%% Cell type:code id: tags:

``` python
multi_filter_dragonn_parameters = {
    'seq_length': 1000,
    'num_filters': [15], ## notice the change from 1 filter to 15 filters
    'conv_width': [10],
    'pool_width': 35}
multi_filter_dragonn = get_SequenceDNN(multi_filter_dragonn_parameters)
train_SequenceDNN(multi_filter_dragonn, simulation_data)
SequenceDNN_learning_curve(multi_filter_dragonn)
```

%% Cell type:markdown id: tags:

It slightly outperforms the single filter model. Let's check if the learned filters capture the simulated pattern.

%% Cell type:code id: tags:

``` python
interpret_SequenceDNN_filters(multi_filter_dragonn, simulation_data)
```

%% Cell type:markdown id: tags:

Only some of the filters closely match the simulated pattern. This illustrates that interpreting model parameters directly works only partially for multi-filter models. Another way to deduce learned patterns is to examine feature importances for specific examples. Next, we explore methods for feature importance scoring.

%% Cell type:markdown id: tags:

# Interpreting data with a DragoNN model

Using in-silico mutagenesis (ISM) and [DeepLIFT](https://arxiv.org/pdf/1605.01713v2.pdf), we can obtain scores for a specific sequence indicating the importance of each position in the sequence. To assess these methods we compare ISM and DeepLIFT scores to motif scores for each simulated motif at each position in the sequence. These motif scores represent the "ground truth" importance of each position because they are based on the motifs used to simulate the data. We provide comparisons for a positive class sequence on the left and a negative class sequence on the right.
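
ISM itself is simple to sketch: substitute every base at every position, rescore the sequence, and record the change. In the example below a toy PSSM scorer (`toy_model`, an assumption for illustration) stands in for a trained model; with a real model you would pass its prediction function instead.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Return a (4, len(seq)) one-hot matrix with rows ordered A, C, G, T."""
    mat = np.zeros((4, len(seq)))
    for col, base in enumerate(seq):
        mat[BASES.index(base), col] = 1.0
    return mat

# Toy scorer standing in for a trained model: best thresholded match to CAGCTG.
PSSM = 2.0 * one_hot("CAGCTG") - 1.0

def toy_model(x):
    w = PSSM.shape[1]
    scores = [np.sum(PSSM * x[:, i:i + w]) for i in range(x.shape[1] - w + 1)]
    return max(0.0, max(scores))

def in_silico_mutagenesis(score_fn, x):
    """Return a (4, L) array of score changes for every single-base substitution."""
    base_score = score_fn(x)
    deltas = np.zeros(x.shape)
    for pos in range(x.shape[1]):
        for base in range(4):
            if x[base, pos] == 1:
                continue  # reference base: no change by definition
            mutant = x.copy()
            mutant[:, pos] = 0
            mutant[base, pos] = 1
            deltas[base, pos] = score_fn(mutant) - base_score
    return deltas

deltas = in_silico_mutagenesis(toy_model, one_hot("TTTCAGCTGTTT"))
importance = -deltas.min(axis=0)  # largest score drop at each position
print(importance)
```

Here the six motif positions each get an importance of 2 and the flanking positions get 0, because only mutations inside the motif lower the best match score.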

%% Cell type:code id: tags:

``` python
interpret_data_with_SequenceDNN(multi_filter_dragonn, simulation_data)
```

%% Cell type:markdown id: tags:

In the positive example (left side), ISM correctly highlights the two motif instances in the central 150bp. DeepLIFT highlights them as well. DeepLIFT also slightly highlights a false positive feature on the left side, but its score is sufficiently small that we can discriminate between the false positive feature and the true positive features. In the negative example (right side), ISM doesn't highlight anything, but DeepLIFT highlights a couple of false positive features almost as much as it highlights true positive features in the positive example.

%% Cell type:markdown id: tags:

# A multi-layer DragoNN model
Next, we train a 3 layer model for this task. Will it outperform the single layer model and to what extent will it overfit?

%% Cell type:code id: tags:

``` python
multi_layer_dragonn_parameters = {
    'seq_length': 1000,
    'num_filters': [15, 15, 15], ## notice the change to multiple filter values, one for each layer
    'conv_width': [10, 10, 10], ## one convolutional filter width per layer
    'pool_width': 35}

multi_layer_dragonn = get_SequenceDNN(multi_layer_dragonn_parameters)
train_SequenceDNN(multi_layer_dragonn, simulation_data)
SequenceDNN_learning_curve(multi_layer_dragonn)
```

%% Cell type:markdown id: tags:

This model performs about the same as the single layer model but it overfits more. We will try to address that with dropout regularization. But first, what do the first layer filters look like?

%% Cell type:code id: tags:

``` python
interpret_SequenceDNN_filters(multi_layer_dragonn, simulation_data)
```

%% Cell type:markdown id: tags:

The filters now make less sense than in the single layer model case. In multi-layered models, sequence features are learned compositionally across the layers. As a result, sequence filters in the first layer focus more on simple features that can be combined in higher layers to learn motif features more efficiently, and their interpretation becomes less clear based on simple visualizations. Let's see where ISM and DeepLIFT get us with this model.
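
One way to quantify this compositionality is the receptive field, the number of input bases that can influence a single unit in the last convolutional layer. The sketch below assumes stride-1 convolutions with no pooling between the convolutional layers, which matches the parameter description (pooling only after the last layer) but is still an assumption about the implementation.

```python
def receptive_field(conv_widths):
    """Input positions seen by one unit after stacked stride-1 convolutions."""
    field = 1
    for width in conv_widths:
        field += width - 1  # each layer extends the view by (width - 1) positions
    return field

print(receptive_field([10]))          # 10
print(receptive_field([10, 10, 10]))  # 28
```

With three layers of width 10, a third-layer unit sees 28bp, so a 10bp motif no longer has to be captured by any single first-layer filter.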

%% Cell type:code id: tags:

``` python
interpret_data_with_SequenceDNN(multi_layer_dragonn, simulation_data)
```

%% Cell type:markdown id: tags:

As in the single layer model case, ISM correctly highlights the two true positive features in the positive example (left side) and correctly ignores features in the negative example (right side). DeepLIFT still highlights the same false positive feature in the positive example as before, but we can still separate it from the true positive features. In the negative example, it still highlights some false positive features.

%% Cell type:markdown id: tags:

# A regularized multi-layer DragoNN model
Next, we regularize the 3-layer model using 0.2 dropout on every convolutional layer. Will dropout improve validation performance?
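
For reference, dropout with rate 0.2 zeroes each activation with probability 0.2 during training. The sketch below uses the common "inverted dropout" convention, which rescales the surviving activations so their expected value is unchanged; whether a particular framework applies exactly this convention is an assumption here, and the activation shape is a toy value.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return activations  # dropout is disabled at prediction time
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
acts = np.ones((15, 991))          # toy activations: 15 filters by 991 positions
dropped = dropout(acts, 0.2, rng)  # entries are either 0 or 1.25
print(dropped.mean())              # close to 1.0 on average
```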

%% Cell type:code id: tags:

``` python
regularized_multi_layer_dragonn_parameters = {
    'seq_length': 1000,
    'num_filters': [15, 15, 15],
    'conv_width': [10, 10, 10],
    'pool_width': 35,
    'dropout': 0.2} ## we introduce dropout of 0.2 on every convolutional layer for regularization
regularized_multi_layer_dragonn = get_SequenceDNN(regularized_multi_layer_dragonn_parameters)
train_SequenceDNN(regularized_multi_layer_dragonn, simulation_data)
SequenceDNN_learning_curve(regularized_multi_layer_dragonn)
```

%% Cell type:markdown id: tags:

As expected, dropout decreased the overfitting this model displayed previously and increased validation performance. Let's see the effect on feature discovery.

%% Cell type:code id: tags:

``` python
interpret_data_with_SequenceDNN(regularized_multi_layer_dragonn, simulation_data)
```

%% Cell type:markdown id: tags:

ISM now highlights a false positive feature in the positive example (left side) more than the true positive features. What happened? A sufficiently accurate model should not change its confidence that there are 2 or more features in the central 150 base pairs (bps) due to a single bp change. So it makes sense that in the limit of the "perfect" model ISM will actually lose its power to discover features in this example.

How about DeepLIFT? DeepLIFT correctly highlights the only two positive features in the positive example. So it seems that in the limit of the "perfect" model, DeepLIFT gets closer to the true positive features.

Why did this happen? Why, as we regularize the model and improve the performance, does ISM fail to highlight the true positive features? Here is a hint: in the limit of the "perfect" model for this simulation, will a single base pair perturbation to the positive example here change its confidence that it is still a positive example? I encourage you to open GitHub issues on the dragonn repo to discuss these questions.
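
The hint can be explored with a toy experiment. The "perfect" model below is an idealization invented for illustration: it outputs 1 when the center contains at least two exact copies of a fixed motif. Real models are softer than this, so the effect is less extreme in practice.

```python
MOTIF = "CAGCTG"       # hypothetical motif, matched exactly
CENTER = slice(20, 60)  # assumed central window of an 80 bp toy sequence

def perfect_model(seq, min_count=2):
    """Return 1.0 if the center holds at least `min_count` exact motif matches."""
    center = seq[CENTER]
    count = sum(center.startswith(MOTIF, i) for i in range(len(center)))
    return 1.0 if count >= min_count else 0.0

def ism_max_change(seq):
    """Largest change in model output over all single-base substitutions."""
    base = perfect_model(seq)
    changes = []
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt != ref:
                mutant = seq[:pos] + alt + seq[pos + 1:]
                changes.append(abs(perfect_model(mutant) - base))
    return max(changes)

def make_seq(n_motifs):
    """Poly-T background with non-overlapping motif copies inside the center."""
    seq = ["T"] * 80
    for k in range(n_motifs):
        start = 22 + 10 * k
        seq[start:start + len(MOTIF)] = MOTIF
    return "".join(seq)

print(ism_max_change(make_seq(3)))  # 0.0: no single mutation drops the count below 2
print(ism_max_change(make_seq(2)))  # 1.0: breaking either copy flips the prediction
```

With three copies, every single-base mutation leaves at least two intact, so ISM assigns zero importance to all positions even though the motifs drive the prediction. Redundant evidence is what makes single-position perturbations uninformative.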

Below is an overview of patterns and simulations for further exploration.

%% Cell type:markdown id: tags:

# For further exploration
In this tutorial we explored modeling of homotypic motif density. Other properties of regulatory DNA sequence include
![sequence properties 3](./tutorial_images/sequence_properties_3.jpg)
![sequence properties 4](./tutorial_images/sequence_properties_4.jpg)

DragoNN provides simulations that formulate learning these patterns into classification problems:
![sequence](./tutorial_images/sequence_simulations.png)

You can view the available simulation functions by running print_available_simulations:

%% Cell type:code id: tags:

``` python
print_available_simulations()
```
+8 −8
@@ -96,16 +96,16 @@ from matplotlib.lines import Line2D
#      print(method_name)


def get_SequenceDNN(SequenceDNN_parameters):
  return SequenceDNN(**SequenceDNN_parameters)
#def get_SequenceDNN(SequenceDNN_parameters):
#  return SequenceDNN(**SequenceDNN_parameters)


def train_SequenceDNN(dnn, simulation_data):
  assert issubclass(type(simulation_data), tuple)
  random.seed(1)
  np.random.seed(1)
  dnn.train(simulation_data.X_train, simulation_data.y_train,
            (simulation_data.X_valid, simulation_data.y_valid))
#def train_SequenceDNN(dnn, simulation_data):
#  assert issubclass(type(simulation_data), tuple)
#  random.seed(1)
#  np.random.seed(1)
#  dnn.train(simulation_data.X_train, simulation_data.y_train,
#            (simulation_data.X_valid, simulation_data.y_valid))


def SequenceDNN_learning_curve(dnn):
+3 −11
@@ -14,6 +14,9 @@ class SequenceDNN(Sequential):
  """
  Sequence DNN models.

  # TODO(rbharath): This model only supports one-conv layer. Extend
  # so that conv layers of greater depth can be implemented.

  Parameters
  ----------
  seq_length : int 
@@ -41,8 +44,6 @@ class SequenceDNN(Sequential):
               num_filters=15,
               kernel_size=15,
               pool_width=35,
               GRU_size=35,
               TDD_size=15,
               L1=0,
               dropout=0.0,
               verbose=True,
@@ -50,16 +51,7 @@ class SequenceDNN(Sequential):
    super(SequenceDNN, self).__init__(**kwargs)
    self.num_tasks = num_tasks
    self.verbose = verbose
    assert len(num_filters) == len(conv_width)
    self.add(layers.Conv2D(num_filters, kernel_size=kernel_size))
    self.add(layers.Dropout(dropout))
    self.add(layers.MaxPool2D())
    #if use_RNN:
    #  num_max_pool_outputs = self.model.layers[-1].output_shape[-1]
    #  self.add(Reshape((num_filters[-1], num_max_pool_outputs)))
    #  self.add(Permute((2, 1)))
    #  self.add(GRU(GRU_size, return_sequences=True))
    #  self.add(TimeDistributedDense(TDD_size, activation='relu'))
    self.add(layers.Flatten())
    self.add(layers.Dense(self.num_tasks, activation_fn=tf.nn.relu))
    #self.add(Activation('sigmoid'))
+16 −3
@@ -8,7 +8,7 @@ class TestSequenceDNN(unittest.TestCase):
    """Test SequenceDNN can be initialized."""
    model = dc.models.SequenceDNN(10)

  def test_seq_dnn_train(self):
  def test_seq_dnn_singlefilter_train(self):
    """Test SequenceDNN training works."""
    X = np.random.rand(10, 1, 4, 50)
    y = np.random.randint(0, 2, size=(10, 1))
@@ -18,5 +18,18 @@ class TestSequenceDNN(unittest.TestCase):
    #  #    False: num_sequences / num_negatives
    #  #} if not multitask else None,
    dataset = dc.data.NumpyDataset(X, y)
    model = dc.models.SequenceDNN(50)
    model = dc.models.SequenceDNN(50, num_filters=1)
    model.fit(dataset, "binary_crossentropy", nb_epoch=1)
    
  def test_seq_dnn_multifilter_train(self):
    """Test SequenceDNN training works."""
    X = np.random.rand(10, 1, 4, 50)
    y = np.random.randint(0, 2, size=(10, 1))
    #  # TODO(rbharath): Add a test with per-class weighting. 
    #  #class_weight={
    #  #    True: num_sequences / num_positives,
    #  #    False: num_sequences / num_negatives
    #  #} if not multitask else None,
    dataset = dc.data.NumpyDataset(X, y)
    model = dc.models.SequenceDNN(50, num_filters=15)
    model.fit(dataset, "binary_crossentropy", nb_epoch=1)