Commit 07123c7f authored by peastman's avatar peastman
Browse files

Implemented KL cost annealing

parent b8f84490
Loading
Loading
Loading
Loading
+27 −1
Original line number Diff line number Diff line
@@ -54,6 +54,13 @@ class SeqToSeq(TensorGraph):
  embedding vector to have a unit Gaussian distribution.  You can then pick
  random vectors from a Gaussian distribution, and the output sequences should
  follow the same distribution as the training data.

  When training as a variational autoencoder, it is best to use KL cost
  annealing, as described in https://arxiv.org/abs/1511.06349.  The constraint
  term in the loss is initially set to 0, so the optimizer just tries to
  minimize the reconstruction loss.  Once it has made reasonable progress
  toward that, the constraint term can be gradually turned back on.  The range
  of steps over which this happens is configurable.
  """

  sequence_end = object()
@@ -68,6 +75,8 @@ class SeqToSeq(TensorGraph):
               dropout=0.0,
               reverse_input=True,
               variational=False,
               annealing_start_step=5000,
               annealing_final_step=10000,
               **kwargs):
    """Construct a SeqToSeq model.

@@ -98,6 +107,12 @@ class SeqToSeq(TensorGraph):
      if True, train the model as a variational autoencoder.  This adds random
      noise to the encoder, and also constrains the embedding to follow a unit
      Gaussian distribution.
    annealing_start_step: int
      the step (that is, batch) at which to begin turning on the constraint term
      for KL cost annealing
    annealing_final_step: int
      the step (that is, batch) at which to finish turning on the constraint term
      for KL cost annealing
    """
    super(SeqToSeq, self).__init__(
        use_queue=False, **kwargs)  # TODO can we make it work with the queue?
@@ -111,6 +126,8 @@ class SeqToSeq(TensorGraph):
    self._output_dict = dict((x, i) for i, x in enumerate(output_tokens))
    self._max_output_length = max_output_length
    self._embedding_dimension = embedding_dimension
    self._annealing_final_step = annealing_final_step
    self._annealing_start_step = annealing_start_step
    self._features = layers.Feature(shape=(None, None, len(input_tokens)))
    self._labels = layers.Label(shape=(None, None, len(output_tokens)))
    self._gather_indices = layers.Feature(
@@ -164,7 +181,16 @@ class SeqToSeq(TensorGraph):
      mean_sq = self._embedding_mean * self._embedding_mean
      stddev_sq = self._embedding_stddev * self._embedding_stddev
      kl = mean_sq + stddev_sq - layers.Log(stddev_sq) - 1
      loss += 0.5 * layers.ReduceMean(layers.ReduceSum(kl, axis=1))
      anneal_steps = self._annealing_final_step - self._annealing_start_step
      if anneal_steps > 0:
        current_step = tf.to_float(
            self.get_global_step()) - self._annealing_start_step
        anneal_frac = tf.maximum(0.0, current_step) / anneal_steps
        kl_scale = layers.TensorWrapper(
            tf.minimum(1.0, anneal_frac * anneal_frac))
      else:
        kl_scale = 1.0
      loss += 0.5 * kl_scale * layers.ReduceMean(layers.ReduceSum(kl, axis=1))
    return loss

  def fit_sequences(self,
+0 −5
Original line number Diff line number Diff line
%% Cell type:markdown id: tags:

SeqToSeq Fingerprint
--------------------

In this example, we will use a `SeqToSeq` model to generate fingerprints for classifying molecules.  This is based on the following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).

Many types of models require their inputs to have a fixed shape.  Since molecules can vary widely in the numbers of atoms and bonds they contain, this makes it hard to apply those models to them.  We need a way of generating a fixed length "fingerprint" for each molecule.  Various ways of doing this have been designed, such as Extended-Connectivity Fingerprints (ECFPs).  But in this example, instead of designing a fingerprint by hand, we will let a `SeqToSeq` model learn its own method of creating fingerprints.

A `SeqToSeq` model performs sequence to sequence translation.  For example, they are often used to translate text from one language to another.  It consists of two parts called the "encoder" and "decoder".  The encoder is a stack of recurrent layers.  The input sequence is fed into it, one token at a time, and it generates a fixed length vector called the "embedding vector".  The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input, and generates the output sequence.  By training it on appropriately chosen input/output pairs, you can create a model that performs many sorts of transformations.

In this case, we will use SMILES strings describing molecules as the input sequences.  We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences.  For that to work, the encoder must create embedding vectors that contain all information from the original sequence.  That's exactly what we want in a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!

Let's start by loading the data.  We will use the MUV dataset.  It includes 74,501 molecules in the training set, and 9313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.

%% Cell type:code id: tags:

``` python
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_muv()
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids
```

%% Output

    About to load MUV dataset.
    Loading dataset from disk.
    Loading dataset from disk.
    Loading dataset from disk.

%% Cell type:markdown id: tags:

We need to define the "alphabet" for our `SeqToSeq` model, the list of all tokens that can appear in sequences.  (It's also possible for input and output sequences to have different alphabets, but since we're training it as an autoencoder, they're identical in this case.)  Make a list of every character that appears in any training sequence.

%% Cell type:code id: tags:

``` python
tokens = set()
for s in train_smiles:
  tokens = tokens.union(set(c for c in s))
tokens = sorted(list(tokens))
```

%% Cell type:markdown id: tags:

Create the model and define the optimization method to use.  In this case, learning works much better if we gradually decrease the learning rate.  We use an `ExponentialDecay` to multiply the learning rate by 0.9 after each epoch.

%% Cell type:code id: tags:

``` python
from deepchem.models.tensorgraph.optimizers import Adam, ExponentialDecay
max_length = max(len(s) for s in train_smiles)
model = dc.models.SeqToSeq(tokens,
                           tokens,
                           max_length,
                           encoder_layers=2,
                           decoder_layers=2,
                           embedding_dimension=256,
                           model_dir='fingerprint')
batches_per_epoch = len(train_smiles)/model.batch_size
model.set_optimizer(Adam(learning_rate=ExponentialDecay(0.004, 0.9, batches_per_epoch)))
```

%% Cell type:markdown id: tags:

Let's train it!  The input to `fit_sequences()` is a generator that produces input/output pairs.  On a good GPU, this should take a few hours or less.

%% Cell type:code id: tags:

``` python
def generate_sequences(epochs):
  for i in range(epochs):
    for s in train_smiles:
      yield (s, s)

model.fit_sequences(generate_sequences(40))
```

%% Output

    Ending global_step 999: Average loss 72.0029
    Ending global_step 1999: Average loss 40.7221
    Ending global_step 2999: Average loss 31.5364
    Ending global_step 3999: Average loss 26.4576
    Ending global_step 4999: Average loss 22.814
    Ending global_step 5999: Average loss 19.5248
    Ending global_step 6999: Average loss 16.4594
    Ending global_step 7999: Average loss 18.8898
    Ending global_step 8999: Average loss 13.476
    Ending global_step 9999: Average loss 11.5528
    Ending global_step 10999: Average loss 10.1594
    Ending global_step 11999: Average loss 10.6434
    Ending global_step 12999: Average loss 6.57057
    Ending global_step 13999: Average loss 6.46177
    Ending global_step 14999: Average loss 7.53559
    Ending global_step 15999: Average loss 4.95809
    Ending global_step 16999: Average loss 4.35039
    Ending global_step 17999: Average loss 3.39137
    Ending global_step 18999: Average loss 3.5216
    Ending global_step 19999: Average loss 3.08579
    Ending global_step 20999: Average loss 2.80738
    Ending global_step 21999: Average loss 2.92217
    Ending global_step 22999: Average loss 2.51032
    Ending global_step 23999: Average loss 1.86265
    Ending global_step 24999: Average loss 1.67088
    Ending global_step 25999: Average loss 1.87016
    Ending global_step 26999: Average loss 1.61166
    Ending global_step 27999: Average loss 1.40708
    Ending global_step 28999: Average loss 1.4488
    Ending global_step 29801: Average loss 1.33917
    TIMING: model fitting took 5619.924 s

%% Cell type:markdown id: tags:

Let's see how well it works as an autoencoder.  We'll run the first 500 molecules from the validation set through it, and see how many of them are exactly reproduced.

%% Cell type:code id: tags:

``` python
predicted = model.predict_from_sequences(valid_smiles[:500])
count = 0
for s,p in zip(valid_smiles[:500], predicted):
  if ''.join(p) == s:
    count += 1
print('reproduced', count, 'of 500 validation SMILES strings')
```

%% Output

    reproduced 363 of 500 validation SMILES strings

%% Cell type:markdown id: tags:

Now we'll trying using the encoder as a way to generate molecular fingerprints.  We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors.  The amount of data is small enough that we can just store everything in memory.

%% Cell type:code id: tags:

``` python
train_embeddings = model.predict_embeddings(train_smiles)
train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
                                                train_dataset.y,
                                                train_dataset.w,
                                                train_dataset.ids)

valid_embeddings = model.predict_embeddings(valid_smiles)
valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
                                                valid_dataset.y,
                                                valid_dataset.w,
                                                valid_dataset.ids)
```

%% Cell type:markdown id: tags:

For classification, we'll use a simple fully connected network with one hidden layer.

%% Cell type:code id: tags:

``` python
classifier = dc.models.TensorGraphMultiTaskClassifier(n_tasks=len(tasks),
                                                      n_features=256,
                                                      layer_sizes=[512])
classifier.fit(train_embeddings_dataset, nb_epoch=10)
```

%% Output

    Ending global_step 999: Average loss 829.805
    Ending global_step 1999: Average loss 450.42
    Ending global_step 2999: Average loss 326.079
    Ending global_step 3999: Average loss 265.199
    Ending global_step 4999: Average loss 246.724
    Ending global_step 5999: Average loss 224.64
    Ending global_step 6999: Average loss 202.624
    Ending global_step 7460: Average loss 213.885
    TIMING: model fitting took 19.780 s

%% Cell type:markdown id: tags:

Find out how well it worked.  Compute the ROC AUC for the training and validation datasets.

%% Cell type:code id: tags:

``` python
import numpy as np
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
print('Training set ROC AUC:', train_score)
print('Validation set ROC AUC:', valid_score)
```

%% Output

    computed_metrics: [0.97828427249789751, 0.98705973960125326, 0.966007068438685, 0.9874401066031584, 0.97794394675150698, 0.98021719680962449, 0.95318452689781941, 0.97185747562764213, 0.96389538770053473, 0.96798988621997473, 0.9690779239145807, 0.98544402211472004, 0.97762497271338133, 0.96843239633294886, 0.97753648081489997, 0.96504683675485614, 0.93547151958366914]
    computed_metrics: [0.90790686952512678, 0.79891461649782913, 0.61900937081659968, 0.75241212956581671, 0.58678903240426017, 0.72765072765072758, 0.34929006085192693, 0.83986814712005553, 0.82379943502824859, 0.61844636844636847, 0.863620199146515, 0.68106930272108857, 0.98020477815699669, 0.85073580939032944, 0.781015678254942, 0.75399733510992673, nan]
    Training set ROC AUC: {'mean-roc_auc_score': 0.97132433878689139}
    Validation set ROC AUC: {'mean-roc_auc_score': 0.74592061629292239}

%% Cell type:code id: tags:

``` python
```