Tutorial updates (9fa09e8d) · Commits · 钟慕尧 / deepchem

contrib/dragonn/GTC_workshop_tutorial.ipynb

+33 −19

Original line number	Diff line number	Diff line
		%% Cell type:markdown id: tags:

		# How to train your DragoNN tutorial

		Tutorial length: 25-30 minutes with a CPU.

		## Outline
		* How to use this tutorial
		* Review of patterns in transcription factor binding sites
		* Learning to localize homotypic motif density
		* Sequence model definition
		* Training and interpretation of
		- single layer, single filter DragoNN
		- single layer, multiple filters DragoNN
		- Multi-layer DragoNN
		- Regularized multi-layer DragoNN
		* Critical questions in this tutorial:
		- What is the "right" way to get insight from a DragoNN model?
		- What are the limitations of different interpretation methods?
		- Do those limitations depend on the model and the target pattern?
		* Suggestions for further exploration

		Github issues on the dragonn repository with feedback, questions, and discussion are always welcome.


		## How to use this tutorial

		This tutorial utilizes a Jupyter/IPython Notebook - an interactive computational enviroment that combines live code, visualizations, and explanatory text. The notebook is organized into a series of cells. You can run the next cell by cliking the play button:
		![play button](./tutorial_images/play_button.png)
		You can also run all cells in a series by clicking "run all" in the Cell drop-down menu:
		![play all button](./tutorial_images/play_all_button.png)
		Half of the cells in this tutorial contain code, the other half contain visualizations and explanatory text. Code, visualizations, and text in cells can be modified - you are encouraged to modify the code as you advance through the tutorial. You can inspect the implementation of a function used in a cell by following these steps:
		![inspecting code](./tutorial_images/inspecting_code.png)

		We start by loading dragonn's tutorial utilities and reviewing properties of regulatory sequence that transcription factors bind.

		%% Cell type:code id: tags:

		``` python
		%reload_ext autoreload
		%autoreload 2
		#from tutorial_utils import *
		%matplotlib inline
		```

		%% Cell type:markdown id: tags:

		![sequence properties 1](./tutorial_images/sequence_properties_1.jpg)
		![sequence properties 2](./tutorial_images/sequence_properties_2.jpg)

		# Learning to localize a homotypic motif density
		In this tutorial we will learn how to localize a homotypic motif cluster. We will simulate a positive set of sequences with multiple instances of a motif in the center and a negative set of sequences with multiple motif instances positioned anywhere in the sequence:
		![honotypic motif density localization](./tutorial_images/homotypic_motif_density_localization.jpg)
		We will then train a binary classification model to classify the simulated sequences. To solve this task, the model will need to learn the motif pattern and whether instances of that pattern are present in the central part of the sequence.

		We start by getting the simulation data.

		%% Cell type:markdown id: tags:

		## Getting simulation data

		DragoNN provides a set of simulation functions. We will use the simulate_motif_density_localization function to simulate homotypic motif density localization. First, we obtain documentation for the simulation parameters.

		%% Cell type:code id: tags:

		``` python
		#print_simulation_info("simulate_motif_density_localization")
		from simulations import simulate_motif_density_localization
		print(simulate_motif_density_localization.__doc__)
		```

		%% Output


		Simulates two classes of seqeuences:
		- Positive class sequences with multiple motif instances
		in center of the sequence.
		- Negative class sequences with multiple motif instances
		anywhere in the sequence.
		The number of motif instances is uniformly sampled
		between minimum and maximum motif counts.

		Parameters
		----------
		motif_name : str
		encode motif name
		seq_length : int
		length of sequence
		center_size : int
		length of central part of the sequence where motifs can be positioned
		min_motif_counts : int
		minimum number of motif instances
		max_motif_counts : int
		maximum number of motif instances
		num_pos : int
		number of positive class sequences
		num_neg : int
		number of negative class sequences
		GC_fraction : float
		GC fraction in background sequence

		Returns
		-------
		sequence_arr : 1darray
		Contains sequence strings.
		y : 1darray
		Contains labels.
		embedding_arr: 1darray
		Array of embedding objects.


		%% Cell type:markdown id: tags:

		Next, we define parameters for a TAL1 motif density localization in 1500bp long sequence, with 0.4 GC fraction, and 2-4 instances of the motif in the central 150bp for the positive sequences. We simulate a total of 3000 positive and 3000 negative sequences.

		%% Cell type:code id: tags:

		``` python
		motif_density_localization_simulation_parameters = {
		"motif_name": "TAL1_known4",
		"seq_length": 1000,
		"center_size": 150,
		"min_motif_counts": 2,
		"max_motif_counts": 4,
		"num_pos": 3000,
		"num_neg": 3000,
		"GC_fraction": 0.4}
		```

		%% Cell type:markdown id: tags:

		We get the simulation data by calling the get_simulation_data function with the simulation name and the simulation parameters as inputs. 1000 sequences are held out for a test set, 1000 sequences for a validation set, and the remaining 4000 sequences are in the training set.

		%% Cell type:code id: tags:

		``` python
		#simulation_data = get_simulation_data("simulate_motif_density_localization",
		# motif_density_localization_simulation_parameters,
		# validation_set_size=1000, test_set_size=1000)
		sequences, y, embed = simulate_motif_density_localization(**motif_density_localization_simulation_parameters)
		```

		%% Cell type:code id: tags:

		``` python
		import deepchem as dc
		from utils import one_hot_encode
		from sklearn.model_selection import train_test_split

		splitter = dc.splits.RandomSplitter()
		X = one_hot_encode(sequences)

		X_train, X_rem, y_train, y_rem, embed_train, embed_rem = train_test_split(X, y, embed, test_size=.3)
		X_valid, X_test, y_valid, y_test, embed_valid, embed_test = train_test_split(X_rem, y_rem, embed_rem, test_size=.5)

		train = dc.data.NumpyDataset(X_train, y_train)
		valid = dc.data.NumpyDataset(X_valid, y_valid)
		test = dc.data.NumpyDataset(X_test, y_test)
		print(type(X), type(y), type(embed))
		print(X.shape, y.shape, len(embed))
		print(sequences[:1])
		print(y[:1])
		print([print(emb) for emb in embed[0]])

		print("X_train.shape")
		print(X_train.shape)
		print("X_valid.shape")
		print(X_valid.shape)
		print("X_test.shape")
		print(X_test.shape)
		print("y_train.shape")
		print(y_train.shape)
		print("y_valid.shape")
		print(y_valid.shape)
		print("y_test.shape")
		print(y_test.shape)

		print(X.shape)
		print(X[0, :, :, :10])
		```

		%% Output

		<class 'numpy.ndarray'> <class 'numpy.ndarray'> <class 'list'>
		(6000, 1, 4, 1000) (6000, 1) 6000
		[ 'CAATCATTATCTTGCCATGTTGAAAGGGATAACATTTAGCATGATGACAATTTGGTCATAATAATGAGCTATTTTGTGGAGTGGGTGGAAATTTAAGTTGTACCTGGGGTTAATCAAATGACATATCTTGCATTATGAGAGTCGATCGGTATAAGGACCCCGTCCGGCATACGTTAGTACAATCATTTGATTTATACAGTTAGCGTAGCAAAGCGTATTCCTTGTATCTGATATTTTAACTGGCAATGATAATACCGGATTCGGTTAGGAACGGGTTCGATGCGTTCCGCCATAGTCTATTTAATCATATTTTACGTTATTGTCCACGTACTCGCTATTAGGTTCCTGTGTAAGTACCATTGGGTTTGCTGAATTTGCCGTGGAAGGAGGAAGCCATCTATCAATAAAAGAATAGCGCCTACTCCCGCTCATAGAGCCTATAGAAAGATCAGCTGGACTCCGGCATATCAGCTGGAGATCAGATGGCGCTCACCAGCAATGTAAATTTTAAGAGAGTGCTTGTTAACTTATGGAGATAATTTTTAAGCTCAGCTGAAGAAGCAGCTGGAAACTTAGTGGCCAATGCGAGATCTTGTGCGTTATGTGTGATCTAGGCGCTTTCGAATCGGTTATGGGGTCAATCTATAACTAATACATATTTATACATGTGACTAAGTGAATAAATCTGTTTATGACAGTGGTCCACCCCGTAATGTGAATCTGTTTAAGGTATAAGGGCTTTTAAAAATATACTATACATTCTAATCTGCTTCACCAAGTCGAAACCTCACTCGAATGTAACCGGAGAGTAAAAGATTTCTCTGTTTTGCAGAAACTGATCGTATTCCTGACAGAAGGATATTTAATTATTGGCTCTTTTCAGTTACTCAACGCCAAACTAATTTAGGCTCATTTCATAAGATCTAAATGGCCTAACGTCATCTTAAGCAGACAAATGATGTTATTTGGTAATAGATGCGGTTCAAATGAATGGAACTTA']
		[ 'ATTACCGTAATCTACTATTAAGTCACAACCAAACAATGGATTACTTTCTGCGTTGGAATCAGTGCCGTGCCAATTGCAGTTGTAGTGCAGTATTTTTGGCATGAGCCCGGGCAAAGTTTTATGAAATAAGCAAGAATCCCACCAATGAGTAAATATGGATTGAGCGCGAATTCTCTTCAATATTGATTTGCCAGCAAGACCTTAACTTCAGTTCTGCTATAATATGTCCATGTTAGAAATTTCGTCGAAATGTCATTAGAATAATCAAATATCTTAACAGAATAGCCATTTAAGTGGGAGCAAATCGGTCCCCTTCCGGGAAAAGAAGATCTAGTATATTTCAGTATATACTTTTTGACATGAATTGGTCCCAAGCGACAAATCCGAAGGGCGTACCACGTTAGAAGTATTCGTCTTGTTTGAAAGTAGTCGAACCAGCTGATTGTCAGCTGGTATTAGCAAATAACAATCAGGTGGATTTGCTTGCAATATGTTTAGGCGCGACCCCTGCGCAAAGTTGTCTCTTCAGGCACCCTGTTTCTGGGCGGTTTTTGCATTGATCCAGTCCGAGTGCGGAATAATGCGTAGTTAGAGATTCATTTATGCCTTAGATCTTGAGAATCTATTCACCAAAGTAATTTGCTAGACCCAATCGATCGTTATTCCGTCTTATAAGACACTTAATAAAAAGTGTGCGTGAATGCGGGACCATCGTTTTATATTTAATCCTAGCACTTAGATAACTCATTTCCAGCCTTGAGCTTTTGTTAACGACGATGCACCTGGATCCATCTTTTTTTAGGTTTTCTGCCTGAAGGTACTAAGAAAGGAATGATACTTATGTTCGGTTTTTAATTGCCTTGATCAAGAACTTTGGGGGTTCGCCGGGTAGTTGCCTTGTTTCCTCTCCCACCAAATGATAGTGCTCTCTTAATAAACTCTTAGTGGCTTAGCGAAACTTTGACGAGTATACTAGCTTACTGATCCATGTTAGCTTAAT']
		[[ True]]
		pos-559_revComp-TAL1_known4-AGCAGCTGGA
		pos-477_revComp-TAL1_known4-ATCAGATGGC
		pos-466_revComp-TAL1_known4-ATCAGCTGGA
		pos-447_revComp-TAL1_known4-ATCAGCTGGA
		[None, None, None, None]
		pos-433_TAL1_known4-ACCAGCTGAT
		pos-468_revComp-TAL1_known4-ATCAGGTGGA
		pos-444_revComp-TAL1_known4-GTCAGCTGGT
		[None, None, None]
		X_train.shape
		(4200, 1, 4, 1000)
		X_valid.shape
		(900, 1, 4, 1000)
		X_test.shape
		(900, 1, 4, 1000)
		y_train.shape
		(4200, 1)
		y_valid.shape
		(900, 1)
		y_test.shape
		(900, 1)
		(6000, 1, 4, 1000)
		[[[0 1 1 0 0 1 0 0 1 0]
		[1 0 0 0 1 0 0 0 0 0]
		[0 0 0 0 0 0 0 0 0 0]
		[0 0 0 1 0 0 1 1 0 1]]]
		[[[1 0 0 1 0 0 0 0 1 1]
		[0 0 0 0 1 1 0 0 0 0]
		[0 0 0 0 0 0 1 0 0 0]
		[0 1 1 0 0 0 0 1 0 0]]]

		%% Cell type:markdown id: tags:

		simulation_data provides training, validation, and test sets of input sequences X and sequence labels y. The inputs X are matrices with a one-hot-encoding of the sequences:
		![one hot encoding](./tutorial_images/one_hot_encoding.png)
		Here are the first 10bp of a sequence in our training data:

		%% Cell type:code id: tags:

		``` python
		#simulation_data.X_train[0, :, :, :10]
		X_train[0, :, :, :10]
		```

		%% Output

		array([[[0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
		[0, 0, 0, 0, 0, 1, 1, 0, 1, 0],
		[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
		[1, 0, 0, 0, 1, 0, 0, 1, 0, 1]]], dtype=int32)
		array([[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
		[0, 0, 0, 1, 0, 0, 1, 0, 1, 1],
		[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
		[1, 1, 0, 0, 1, 1, 0, 0, 0, 0]]], dtype=int32)

		%% Cell type:markdown id: tags:

		This matrix represent the 10bp sequence AAATGGGCCG.

		# The homotypic motif density localization task
		The goal of the model is to take the positive and negative sequences simulated above and classify them:
		![classificatioin task](./tutorial_images/homotypic_motif_density_localization_task.jpg)

		%% Cell type:markdown id: tags:

		# DragoNN Models

		A locally connected linear unit in a DragoNN model can represent a PSSM (part a). A sequence PSSM score is obtained by multiplying the PSSM across the sequence, thersholding the PSSM scores, and taking the max (part b). A PSSM score can also be computed by a DragoNN model with tiled locally connected linear units, amounting to a convolutional layer with a single convolutional filter representing the PSSM, followed by ReLU thersholding and maxpooling (part c).
		![dragonn vs pssm](./tutorial_images/dragonn_and_pssm.jpg)
		By utilizing multiple convolutional layers with multiple convolutional filters, DragoNN models can represent a wide range of sequence features in a compositional fashion:
		![dragonn model figure](./tutorial_images/dragonn_model_figure.jpg)

		%% Cell type:markdown id: tags:

		# Getting a DragoNN model

		The main DragoNN model class is SequenceDNN, which provides a simple interface to a range of models and methods to train, test, and interpret DragoNNs. SequenceDNN uses [keras](http://keras.io/), a deep learning library for [Theano](https://github.com/Theano/Theano) and [TensorFlow](https://github.com/tensorflow/tensorflow), which are popular software packages for deep learning.

		To get a DragoNN model we:

		1) Define the DragoNN architecture parameters
		- obtain description of architecture parameters using the inspect_SequenceDNN() function
		2) Call the get_SequenceDNN function, which takes as input the DragoNN architecture parameters, and outputs a
		randomly initialized DragoNN model.

		%% Cell type:markdown id: tags:

		To get a description of the architecture parameters we use the inspect_SequenceDNN function, which outputs documentation for the model class including the architecture parameters:

		%% Cell type:code id: tags:

		``` python
		#inspect_SequenceDNN()
		from deepchem.models.tensorgraph.models.sequence_dnn import SequenceDNN
		print(SequenceDNN.__doc__)
		```

		%% Output


		Sequence DNN models.

		Parameters
		----------
		seq_length : int
		length of input sequence.
		num_tasks : int, optional
		number of tasks. Default: 1.
		num_filters : list[int] \| tuple[int]
		number of convolutional filters in each layer. Default: (15,).
		conv_width : list[int] \| tuple[int]
		width of each layer's convolutional filters. Default: (15,).
		pool_width : int
		width of max pooling after the last layer. Default: 35.
		L1 : float
		strength of L1 penalty.
		dropout : float
		dropout probability in every convolutional layer. Default: 0.
		verbose: bool
		Verbose print statements activated if true.


		%% Cell type:markdown id: tags:

		"Available methods" display what can be done with a SequenceDNN model. These include common operations such as training and testing the model, and more complex operations such as extracting insight from trained models. We define a simple DragoNN model with one convolutional layer with one convolutional filter, followed by maxpooling of width 35.

		%% Cell type:code id: tags:

		``` python
		one_filter_dragonn_parameters = {
		'seq_length': 1000,
		'num_filters': [1],
		'conv_width': [10],
		'pool_width': 35}
		```

		%% Cell type:markdown id: tags:

		we get a randomly initialized DragoNN model by calling the get_SequenceDNN function with one_filter_dragonn_parameters as the input

		%% Cell type:code id: tags:

		``` python
		one_filter_dragonn = SequenceDNN(**one_filter_dragonn_parameters)
		seq_dnn = SequenceDNN(1000, num_filters=1)
		```

		%% Cell type:markdown id: tags:

		## Training a DragoNN model

		Next, we train the one_filter_dragonn by calling train_SequenceDNN with one_filter_dragonn and simulation_data as the inputs. In each epoch, the one_filter_dragonn will perform a complete pass over the training data, and update its parameters to minimize the loss, which quantifies the error in the model predictions. After each epoch, the code prints performance metrics for the one_filter_dragonn on the validation data. Training stops once the loss on the validation stops improving for multiple consecutive epochs. The performance metrics include balanced accuracy, area under the receiver-operating curve ([auROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)), are under the precision-recall curve ([auPRC](https://en.wikipedia.org/wiki/Precision_and_recall)), and recall for multiple false discovery rates (Recall at [FDR](https://en.wikipedia.org/wiki/False_discovery_rate)).

		%% Cell type:code id: tags:

		``` python
		#train_SequenceDNN(one_filter_dragonn, simulation_data)
		#seq_dnn.build()
		seq_dnn.fit(train, "binary_crossentropy", nb_epoch=1)
		```

		%% Output

		Ending global_step 42: Average loss 0
		TIMING: model fitting took 79.998 s

		%% Cell type:markdown id: tags:

		A single layer, single filter model gets good performance and doesn't overfit much. Let's look at the learning curve to demonstrate this visually:

		%% Cell type:code id: tags:

		``` python
		SequenceDNN_learning_curve(one_filter_dragonn)
		```

		%% Output

		---------------------------------------------------------------------------
		NameError Traceback (most recent call last)
		<ipython-input-11-7d6a6884d584> in <module>()
		----> 1 SequenceDNN_learning_curve(one_filter_dragonn)

		NameError: name 'SequenceDNN_learning_curve' is not defined

		%% Cell type:markdown id: tags:

		# A multi-filter DragoNN model
		Next, we modify the model to have 15 convolutional filters instead of just one filter. How does this model compare to the single filter model?

		%% Cell type:code id: tags:

		``` python
		multi_filter_dragonn_parameters = {
		'seq_length': 1000,
		'num_filters': [15], ## notice the change from 1 filter to 15 filters
		'conv_width': [10],
		'pool_width': 35}
		multi_filter_dragonn = get_SequenceDNN(multi_filter_dragonn_parameters)
		train_SequenceDNN(multi_filter_dragonn, simulation_data)
		SequenceDNN_learning_curve(multi_filter_dragonn)
		```

		%% Cell type:markdown id: tags:

		It slightly outperforms the single filter model. Let's check if the learned filters capture the simulated pattern.

		%% Cell type:code id: tags:

		``` python
		interpret_SequenceDNN_filters(multi_filter_dragonn, simulation_data)
		```

		%% Cell type:markdown id: tags:

		Only some of the filters closesly match the simulated pattern. This illustrates that interpreting model parameters directly works partially for multi-filter models. Another way to deduce learned patterns is to examine feature importances for specific examples. Next, we explore methods for feature importance scoring.

		%% Cell type:markdown id: tags:

		# Interpreting data with a DragoNN model

		Using in-silico mutagenesis (ISM) and [DeepLIFT](https://arxiv.org/pdf/1605.01713v2.pdf), we can obtain scores for specific sequence indicating the importance of each position in the sequence. To assess these methods we compare ISM and DeepLIFT scores to motif scores for each simulated motif at each position in the sequence. These motif scores represent the "ground truth" importance of each position because they are based on the motifs used to simulate the data. We plot provide comaprisons for a positive class sequence on the left and a negative class sequence on the right.

		%% Cell type:code id: tags:

		``` python
		interpret_data_with_SequenceDNN(multi_filter_dragonn, simulation_data)
		```

		%% Cell type:markdown id: tags:

		In the positive example (left side), ISM correctly highlights the two motif instances in the central 150bp. DeepLIFT highlights them as well. DeepLIFT also slightly highlights false positive feature on the left side but its score is sufficiently small that we can discriminate between the false positive feature and the true positive features. In the negative example (right side), ISM doesn't highlight anything but DeepLIFT a couple false positive feature almost as much as it highlights true positive features in the positive example.

		%% Cell type:markdown id: tags:

		# A multi-layer DragoNN model
		Next, we train a 3 layer model for this task. Will it outperform the single layer model and to what extent will it overfit?

		%% Cell type:code id: tags:

		``` python
		multi_layer_dragonn_parameters = {
		'seq_length': 1000,
		'num_filters': [15, 15, 15], ## notice the change to multiple filter values, one for each layer
		'conv_width': [10, 10, 10], ## convolutional filter width has been modified to 25 from 45
		'pool_width': 35}

		multi_layer_dragonn = get_SequenceDNN(multi_layer_dragonn_parameters)
		train_SequenceDNN(multi_layer_dragonn, simulation_data)
		SequenceDNN_learning_curve(multi_layer_dragonn)
		```

		%% Cell type:markdown id: tags:

		This model performs about the same as the single layer model but it overfits more. We will try to address that with dropout regularization. But first, what do the first layer filters look like?

		%% Cell type:code id: tags:

		``` python
		interpret_SequenceDNN_filters(multi_layer_dragonn, simulation_data)
		```

		%% Cell type:markdown id: tags:

		The filters now make less sense than in the single layer model case. In multi-layered models, sequence features are learned compositionally across the layers. As a result, sequence filters in the first layer focus more on simple features that can be combined in higher layers to learn motif features more efficiently, and their interpretation becomes less clear based on simple visualizations. Let's see where ISM and DeepLIFT get us with this model.

		%% Cell type:code id: tags:

		``` python
		interpret_data_with_SequenceDNN(multi_layer_dragonn, simulation_data)
		```

		%% Cell type:markdown id: tags:

		As in the single layer model case, ISM correctly highlights the two true positive features in the positive example (left side) and correctly ignores features in the negative example (right side). DeepLIFT still highlight the same false positive feature example in the positive example as before, but we can still separate it from the true positive features. In the negative example, it still highlights some false positive features.

		%% Cell type:markdown id: tags:

		# A regularized multi-layer DragoNN model
		Next, we regularize the 3 layer using 0.2 dropout on every convolutional layer. Will dropout improve validation performance?

		%% Cell type:code id: tags:

		``` python
		regularized_multi_layer_dragonn_parameters = {
		'seq_length': 1000,
		'num_filters': [15, 15, 15],
		'conv_width': [10, 10, 10],
		'pool_width': 35,
		'dropout': 0.2} ## we introduce dropout of 0.2 on every convolutional layer for regularization
		regularized_multi_layer_dragonn = get_SequenceDNN(regularized_multi_layer_dragonn_parameters)
		train_SequenceDNN(regularized_multi_layer_dragonn, simulation_data)
		SequenceDNN_learning_curve(regularized_multi_layer_dragonn)
		```

		%% Cell type:markdown id: tags:

		As expected, dropout decreased the overfitting this model displayed previously and increased validation performance. Let's see the effect on feature discovery.

		%% Cell type:code id: tags:

		``` python
		interpret_data_with_SequenceDNN(regularized_multi_layer_dragonn, simulation_data)
		```

		%% Cell type:markdown id: tags:

		ISM now highlights a false positive feature in the positive example (left side) more than the true positive features. What happened? A sufficiently accurate model should not change its confidence that there are 2 or more features in the central 150 base pairs (bps) due to a single bp change. So it makes sense that in the limit of the "perfect" model ISM will actually lose its power to discover features in this example.

		How about DeepLIFT? DeepLIFT correctly highlights the only two positive features in the positive example. So it seems that in the limit of the "perfect" model, DeepLIFT gets closer to the true positive features.

		Why did this happen? Why, as we regularize the model and improve the performance, ISM fails to highlight the true positive features? Here is a hint: in the limit of the "perfect" model for this simulation, will a single base pair perturbation to the positive example here change its confidence that it is still a positive example? I encourage you to open github issues on the dragonn repo to discuss these questions.

		Below is an overview of patterns and simulations for further exploration.

		%% Cell type:markdown id: tags:

		# For further exploration
		In this tutorial we explored modeling of homotypic motif density. Other properties of regulatory DNA sequence include
		![sequence properties 3](./tutorial_images/sequence_properties_3.jpg)
		![sequence properties 4](./tutorial_images/sequence_properties_4.jpg)

		DragoNN provides simulations that formulate learning these patterns into classification problems:
		![sequence](./tutorial_images/sequence_simulations.png)

		You can view the available simulation functions by running print_available_simulations:

		%% Cell type:code id: tags:

		``` python
		print_available_simulations()
		```

contrib/dragonn/tutorial_utils.py

+8 −8

Original line number	Diff line number	Diff line
		@@ -96,16 +96,16 @@ from matplotlib.lines import Line2D
		# print(method_name)


		def get_SequenceDNN(SequenceDNN_parameters):
		return SequenceDNN(**SequenceDNN_parameters)
		#def get_SequenceDNN(SequenceDNN_parameters):
		# return SequenceDNN(**SequenceDNN_parameters)


		def train_SequenceDNN(dnn, simulation_data):
		assert issubclass(type(simulation_data), tuple)
		random.seed(1)
		np.random.seed(1)
		dnn.train(simulation_data.X_train, simulation_data.y_train,
		(simulation_data.X_valid, simulation_data.y_valid))
		#def train_SequenceDNN(dnn, simulation_data):
		# assert issubclass(type(simulation_data), tuple)
		# random.seed(1)
		# np.random.seed(1)
		# dnn.train(simulation_data.X_train, simulation_data.y_train,
		# (simulation_data.X_valid, simulation_data.y_valid))


		def SequenceDNN_learning_curve(dnn):

deepchem/models/tensorgraph/models/sequence_dnn.py

+3 −11

Original line number	Diff line number	Diff line
		@@ -14,6 +14,9 @@ class SequenceDNN(Sequential):
		"""
		Sequence DNN models.

		# TODO(rbharath): This model only supports one-conv layer. Extend
		# so that conv layers of greater depth can be implemented.

		Parameters
		----------
		seq_length : int
		@@ -41,8 +44,6 @@ class SequenceDNN(Sequential):
		num_filters=15,
		kernel_size=15,
		pool_width=35,
		GRU_size=35,
		TDD_size=15,
		L1=0,
		dropout=0.0,
		verbose=True,
		@@ -50,16 +51,7 @@ class SequenceDNN(Sequential):
		super(SequenceDNN, self).__init__(**kwargs)
		self.num_tasks = num_tasks
		self.verbose = verbose
		assert len(num_filters) == len(conv_width)
		self.add(layers.Conv2D(num_filters, kernel_size=kernel_size))
		self.add(layers.Dropout(dropout))
		self.add(layers.MaxPool2D())
		#if use_RNN:
		# num_max_pool_outputs = self.model.layers[-1].output_shape[-1]
		# self.add(Reshape((num_filters[-1], num_max_pool_outputs)))
		# self.add(Permute((2, 1)))
		# self.add(GRU(GRU_size, return_sequences=True))
		# self.add(TimeDistributedDense(TDD_size, activation='relu'))
		self.add(layers.Flatten())
		self.add(layers.Dense(self.num_tasks, activation_fn=tf.nn.relu))
		#self.add(Activation('sigmoid'))

deepchem/models/tensorgraph/tests/test_sequencednn.py

+16 −3

Original line number	Diff line number	Diff line
		@@ -8,7 +8,7 @@ class TestSequenceDNN(unittest.TestCase):
		"""Test SequenceDNN can be initialized."""
		model = dc.models.SequenceDNN(10)

		def test_seq_dnn_train(self):
		def test_seq_dnn_singlefilter_train(self):
		"""Test SequenceDNN training works."""
		X = np.random.rand(10, 1, 4, 50)
		y = np.random.randint(0, 2, size=(10, 1))
		@@ -18,5 +18,18 @@ class TestSequenceDNN(unittest.TestCase):
		# # False: num_sequences / num_negatives
		# #} if not multitask else None,
		dataset = dc.data.NumpyDataset(X, y)
		model = dc.models.SequenceDNN(50)
		model = dc.models.SequenceDNN(50, num_filters=1)
		model.fit(dataset, "binary_crossentropy", nb_epoch=1)

		def test_seq_dnn_multifilter_train(self):
		"""Test SequenceDNN training works."""
		X = np.random.rand(10, 1, 4, 50)
		y = np.random.randint(0, 2, size=(10, 1))
		# # TODO(rbharath): Add a test with per-class weighting.
		# #class_weight={
		# # True: num_sequences / num_positives,
		# # False: num_sequences / num_negatives
		# #} if not multitask else None,
		dataset = dc.data.NumpyDataset(X, y)
		model = dc.models.SequenceDNN(50, num_filters=15)
		model.fit(dataset, "binary_crossentropy", nb_epoch=1)

Admin message