Implemented KL cost annealing (07123c7f) · Commits · 钟慕尧 / deepchem

deepchem/models/tensorgraph/models/seqtoseq.py

+27 −1

Original line number	Diff line number	Diff line
		@@ -54,6 +54,13 @@ class SeqToSeq(TensorGraph):
		embedding vector to have a unit Gaussian distribution. You can then pick
		random vectors from a Gaussian distribution, and the output sequences should
		follow the same distribution as the training data.

		When training as a variational autoencoder, it is best to use KL cost
		annealing, as described in https://arxiv.org/abs/1511.06349. The constraint
		term in the loss is initially set to 0, so the optimizer just tries to
		minimize the reconstruction loss. Once it has made reasonable progress
		toward that, the constraint term can be gradually turned back on. The range
		of steps over which this happens is configurable.
		"""

		sequence_end = object()
		@@ -68,6 +75,8 @@ class SeqToSeq(TensorGraph):
		dropout=0.0,
		reverse_input=True,
		variational=False,
		annealing_start_step=5000,
		annealing_final_step=10000,
		**kwargs):
		"""Construct a SeqToSeq model.

		@@ -98,6 +107,12 @@ class SeqToSeq(TensorGraph):
		if True, train the model as a variational autoencoder. This adds random
		noise to the encoder, and also constrains the embedding to follow a unit
		Gaussian distribution.
		annealing_start_step: int
		the step (that is, batch) at which to begin turning on the constraint term
		for KL cost annealing
		annealing_final_step: int
		the step (that is, batch) at which to finish turning on the constraint term
		for KL cost annealing
		"""
		super(SeqToSeq, self).__init__(
		use_queue=False, **kwargs) # TODO can we make it work with the queue?
		@@ -111,6 +126,8 @@ class SeqToSeq(TensorGraph):
		self._output_dict = dict((x, i) for i, x in enumerate(output_tokens))
		self._max_output_length = max_output_length
		self._embedding_dimension = embedding_dimension
		self._annealing_final_step = annealing_final_step
		self._annealing_start_step = annealing_start_step
		self._features = layers.Feature(shape=(None, None, len(input_tokens)))
		self._labels = layers.Label(shape=(None, None, len(output_tokens)))
		self._gather_indices = layers.Feature(
		@@ -164,7 +181,16 @@ class SeqToSeq(TensorGraph):
		mean_sq = self._embedding_mean * self._embedding_mean
		stddev_sq = self._embedding_stddev * self._embedding_stddev
		kl = mean_sq + stddev_sq - layers.Log(stddev_sq) - 1
		loss += 0.5 * layers.ReduceMean(layers.ReduceSum(kl, axis=1))
		anneal_steps = self._annealing_final_step - self._annealing_start_step
		if anneal_steps > 0:
		current_step = tf.to_float(
		self.get_global_step()) - self._annealing_start_step
		anneal_frac = tf.maximum(0.0, current_step) / anneal_steps
		kl_scale = layers.TensorWrapper(
		tf.minimum(1.0, anneal_frac * anneal_frac))
		else:
		kl_scale = 1.0
		loss += 0.5 * kl_scale * layers.ReduceMean(layers.ReduceSum(kl, axis=1))
		return loss

		def fit_sequences(self,

examples/notebooks/seqtoseq_fingerprint.ipynb

+0 −5

Original line number	Diff line number	Diff line
		%% Cell type:markdown id: tags:

		SeqToSeq Fingerprint
		--------------------

		In this example, we will use a `SeqToSeq` model to generate fingerprints for classifying molecules. This is based on the following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).

		Many types of models require their inputs to have a fixed shape. Since molecules can vary widely in the numbers of atoms and bonds they contain, this makes it hard to apply those models to them. We need a way of generating a fixed length "fingerprint" for each molecule. Various ways of doing this have been designed, such as Extended-Connectivity Fingerprints (ECFPs). But in this example, instead of designing a fingerprint by hand, we will let a `SeqToSeq` model learn its own method of creating fingerprints.

		A `SeqToSeq` model performs sequence to sequence translation. For example, they are often used to translate text from one language to another. It consists of two parts called the "encoder" and "decoder". The encoder is a stack of recurrent layers. The input sequence is fed into it, one token at a time, and it generates a fixed length vector called the "embedding vector". The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input, and generates the output sequence. By training it on appropriately chosen input/output pairs, you can create a model that performs many sorts of transformations.

		In this case, we will use SMILES strings describing molecules as the input sequences. We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences. For that to work, the encoder must create embedding vectors that contain all information from the original sequence. That's exactly what we want in a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!

		Let's start by loading the data. We will use the MUV dataset. It includes 74,501 molecules in the training set, and 9313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.

		%% Cell type:code id: tags:

		``` python
		import deepchem as dc
		tasks, datasets, transformers = dc.molnet.load_muv()
		train_dataset, valid_dataset, test_dataset = datasets
		train_smiles = train_dataset.ids
		valid_smiles = valid_dataset.ids
		```

		%% Output

		About to load MUV dataset.
		Loading dataset from disk.
		Loading dataset from disk.
		Loading dataset from disk.

		%% Cell type:markdown id: tags:

		We need to define the "alphabet" for our `SeqToSeq` model, the list of all tokens that can appear in sequences. (It's also possible for input and output sequences to have different alphabets, but since we're training it as an autoencoder, they're identical in this case.) Make a list of every character that appears in any training sequence.

		%% Cell type:code id: tags:

		``` python
		tokens = set()
		for s in train_smiles:
		tokens = tokens.union(set(c for c in s))
		tokens = sorted(list(tokens))
		```

		%% Cell type:markdown id: tags:

		Create the model and define the optimization method to use. In this case, learning works much better if we gradually decrease the learning rate. We use an `ExponentialDecay` to multiply the learning rate by 0.9 after each epoch.

		%% Cell type:code id: tags:

		``` python
		from deepchem.models.tensorgraph.optimizers import Adam, ExponentialDecay
		max_length = max(len(s) for s in train_smiles)
		model = dc.models.SeqToSeq(tokens,
		tokens,
		max_length,
		encoder_layers=2,
		decoder_layers=2,
		embedding_dimension=256,
		model_dir='fingerprint')
		batches_per_epoch = len(train_smiles)/model.batch_size
		model.set_optimizer(Adam(learning_rate=ExponentialDecay(0.004, 0.9, batches_per_epoch)))
		```

		%% Cell type:markdown id: tags:

		Let's train it! The input to `fit_sequences()` is a generator that produces input/output pairs. On a good GPU, this should take a few hours or less.

		%% Cell type:code id: tags:

		``` python
		def generate_sequences(epochs):
		for i in range(epochs):
		for s in train_smiles:
		yield (s, s)

		model.fit_sequences(generate_sequences(40))
		```

		%% Output

		Ending global_step 999: Average loss 72.0029
		Ending global_step 1999: Average loss 40.7221
		Ending global_step 2999: Average loss 31.5364
		Ending global_step 3999: Average loss 26.4576
		Ending global_step 4999: Average loss 22.814
		Ending global_step 5999: Average loss 19.5248
		Ending global_step 6999: Average loss 16.4594
		Ending global_step 7999: Average loss 18.8898
		Ending global_step 8999: Average loss 13.476
		Ending global_step 9999: Average loss 11.5528
		Ending global_step 10999: Average loss 10.1594
		Ending global_step 11999: Average loss 10.6434
		Ending global_step 12999: Average loss 6.57057
		Ending global_step 13999: Average loss 6.46177
		Ending global_step 14999: Average loss 7.53559
		Ending global_step 15999: Average loss 4.95809
		Ending global_step 16999: Average loss 4.35039
		Ending global_step 17999: Average loss 3.39137
		Ending global_step 18999: Average loss 3.5216
		Ending global_step 19999: Average loss 3.08579
		Ending global_step 20999: Average loss 2.80738
		Ending global_step 21999: Average loss 2.92217
		Ending global_step 22999: Average loss 2.51032
		Ending global_step 23999: Average loss 1.86265
		Ending global_step 24999: Average loss 1.67088
		Ending global_step 25999: Average loss 1.87016
		Ending global_step 26999: Average loss 1.61166
		Ending global_step 27999: Average loss 1.40708
		Ending global_step 28999: Average loss 1.4488
		Ending global_step 29801: Average loss 1.33917
		TIMING: model fitting took 5619.924 s

		%% Cell type:markdown id: tags:

		Let's see how well it works as an autoencoder. We'll run the first 500 molecules from the validation set through it, and see how many of them are exactly reproduced.

		%% Cell type:code id: tags:

		``` python
		predicted = model.predict_from_sequences(valid_smiles[:500])
		count = 0
		for s,p in zip(valid_smiles[:500], predicted):
		if ''.join(p) == s:
		count += 1
		print('reproduced', count, 'of 500 validation SMILES strings')
		```

		%% Output

		reproduced 363 of 500 validation SMILES strings

		%% Cell type:markdown id: tags:

		Now we'll trying using the encoder as a way to generate molecular fingerprints. We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors. The amount of data is small enough that we can just store everything in memory.

		%% Cell type:code id: tags:

		``` python
		train_embeddings = model.predict_embeddings(train_smiles)
		train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
		train_dataset.y,
		train_dataset.w,
		train_dataset.ids)

		valid_embeddings = model.predict_embeddings(valid_smiles)
		valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
		valid_dataset.y,
		valid_dataset.w,
		valid_dataset.ids)
		```

		%% Cell type:markdown id: tags:

		For classification, we'll use a simple fully connected network with one hidden layer.

		%% Cell type:code id: tags:

		``` python
		classifier = dc.models.TensorGraphMultiTaskClassifier(n_tasks=len(tasks),
		n_features=256,
		layer_sizes=[512])
		classifier.fit(train_embeddings_dataset, nb_epoch=10)
		```

		%% Output

		Ending global_step 999: Average loss 829.805
		Ending global_step 1999: Average loss 450.42
		Ending global_step 2999: Average loss 326.079
		Ending global_step 3999: Average loss 265.199
		Ending global_step 4999: Average loss 246.724
		Ending global_step 5999: Average loss 224.64
		Ending global_step 6999: Average loss 202.624
		Ending global_step 7460: Average loss 213.885
		TIMING: model fitting took 19.780 s

		%% Cell type:markdown id: tags:

		Find out how well it worked. Compute the ROC AUC for the training and validation datasets.

		%% Cell type:code id: tags:

		``` python
		import numpy as np
		metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
		train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
		valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
		print('Training set ROC AUC:', train_score)
		print('Validation set ROC AUC:', valid_score)
		```

		%% Output

		computed_metrics: [0.97828427249789751, 0.98705973960125326, 0.966007068438685, 0.9874401066031584, 0.97794394675150698, 0.98021719680962449, 0.95318452689781941, 0.97185747562764213, 0.96389538770053473, 0.96798988621997473, 0.9690779239145807, 0.98544402211472004, 0.97762497271338133, 0.96843239633294886, 0.97753648081489997, 0.96504683675485614, 0.93547151958366914]
		computed_metrics: [0.90790686952512678, 0.79891461649782913, 0.61900937081659968, 0.75241212956581671, 0.58678903240426017, 0.72765072765072758, 0.34929006085192693, 0.83986814712005553, 0.82379943502824859, 0.61844636844636847, 0.863620199146515, 0.68106930272108857, 0.98020477815699669, 0.85073580939032944, 0.781015678254942, 0.75399733510992673, nan]
		Training set ROC AUC: {'mean-roc_auc_score': 0.97132433878689139}
		Validation set ROC AUC: {'mean-roc_auc_score': 0.74592061629292239}

		%% Cell type:code id: tags:

		``` python
		```

Admin message