Commit 748a7347 authored by Peter Eastman's avatar Peter Eastman
Browse files

Bug fixes

parent 34f1d108
Loading
Loading
Loading
Loading
+1 −1
Original line number Diff line number Diff line
@@ -46,7 +46,7 @@ def load_muv(featurizer='ECFP',
    loaded, all_dataset, transformers = deepchem.utils.save.load_dataset_from_disk(
        save_folder)
    if loaded:
      return muv_tasks, all_dataset, transformers
      return MUV_tasks, all_dataset, transformers

  dataset_file = os.path.join(data_dir, "muv.csv.gz")
  if not os.path.exists(dataset_file):
+3 −1
Original line number Diff line number Diff line
%% Cell type:markdown id: tags:

SeqToSeq Fingerprint
--------------------

In this example, we will use a `SeqToSeq` model to generate fingerprints for classifying molecules.  This is based on the following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).

Many types of models require their inputs to have a fixed shape.  Since molecules can vary widely in the numbers of atoms and bonds they contain, this makes it hard to apply those models to them.  We need a way of generating a fixed length "fingerprint" for each molecule.  Various ways of doing this have been designed, such as Extended-Connectivity Fingerprints (ECFPs).  But in this example, instead of designing a fingerprint by hand, we will let a `SeqToSeq` model learn its own method of creating fingerprints.

A `SeqToSeq` model performs sequence to sequence translation.  For example, they are often used to translate text from one language to another.  It consists of two parts called the "encoder" and "decoder".  The encoder is a stack of recurrent layers.  The input sequence is fed into it, one token at a time, and it generates a fixed length vector called the "embedding vector".  The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input, and generates the output sequence.  By training it on appropriately chosen input/output pairs, you can create a model that performs many sorts of transformations.

In this case, we will use SMILES strings describing molecules as the input sequences.  We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences.  For that to work, the encoder must create embedding vectors that contain all information from the original sequence.  That's exactly what we want in a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!

Let's start by loading the data.  We will use the MUV dataset.  It includes 74,501 molecules in the training set, and 9313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.

%% Cell type:code id: tags:

``` python
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_muv()
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids
```

%% Output

    About to load MUV dataset.
    Loading dataset from disk.
    Loading dataset from disk.
    Loading dataset from disk.

%% Cell type:markdown id: tags:

We need to define the "alphabet" for our `SeqToSeq` model, the list of all tokens that can appear in sequences.  (It's also possible for input and output sequences to have different alphabets, but since we're training it as an autoencoder, they're identical in this case.)  Make a list of every character that appears in any training sequence.

%% Cell type:code id: tags:

``` python
tokens = set()
for s in train_smiles:
  tokens = tokens.union(set(c for c in s))
tokens = sorted(list(tokens))
```

%% Cell type:markdown id: tags:

Create the model and define the optimization method to use.  In this case, learning works much better if we gradually decrease the learning rate.  We use an `ExponentialDecay` to multiply the learning rate by 0.9 after each epoch.

%% Cell type:code id: tags:

``` python
from deepchem.models.optimizers import Adam, ExponentialDecay
max_length = max(len(s) for s in train_smiles)
batches_per_epoch = len(train_smiles)/model.batch_size
batch_size = 100
batches_per_epoch = len(train_smiles)/batch_size
model = dc.models.SeqToSeq(tokens,
                           tokens,
                           max_length,
                           encoder_layers=2,
                           decoder_layers=2,
                           embedding_dimension=256,
                           model_dir='fingerprint',
                           batch_size=batch_size,
                           learning_rate=ExponentialDecay(0.004, 0.9, batches_per_epoch))
```

%% Cell type:markdown id: tags:

Let's train it!  The input to `fit_sequences()` is a generator that produces input/output pairs.  On a good GPU, this should take a few hours or less.

%% Cell type:code id: tags:

``` python
def generate_sequences(epochs):
  for i in range(epochs):
    for s in train_smiles:
      yield (s, s)

model.fit_sequences(generate_sequences(40))
```

%% Output

    Ending global_step 999: Average loss 72.0029
    Ending global_step 1999: Average loss 40.7221
    Ending global_step 2999: Average loss 31.5364
    Ending global_step 3999: Average loss 26.4576
    Ending global_step 4999: Average loss 22.814
    Ending global_step 5999: Average loss 19.5248
    Ending global_step 6999: Average loss 16.4594
    Ending global_step 7999: Average loss 18.8898
    Ending global_step 8999: Average loss 13.476
    Ending global_step 9999: Average loss 11.5528
    Ending global_step 10999: Average loss 10.1594
    Ending global_step 11999: Average loss 10.6434
    Ending global_step 12999: Average loss 6.57057
    Ending global_step 13999: Average loss 6.46177
    Ending global_step 14999: Average loss 7.53559
    Ending global_step 15999: Average loss 4.95809
    Ending global_step 16999: Average loss 4.35039
    Ending global_step 17999: Average loss 3.39137
    Ending global_step 18999: Average loss 3.5216
    Ending global_step 19999: Average loss 3.08579
    Ending global_step 20999: Average loss 2.80738
    Ending global_step 21999: Average loss 2.92217
    Ending global_step 22999: Average loss 2.51032
    Ending global_step 23999: Average loss 1.86265
    Ending global_step 24999: Average loss 1.67088
    Ending global_step 25999: Average loss 1.87016
    Ending global_step 26999: Average loss 1.61166
    Ending global_step 27999: Average loss 1.40708
    Ending global_step 28999: Average loss 1.4488
    Ending global_step 29801: Average loss 1.33917
    TIMING: model fitting took 5619.924 s

%% Cell type:markdown id: tags:

Let's see how well it works as an autoencoder.  We'll run the first 500 molecules from the validation set through it, and see how many of them are exactly reproduced.

%% Cell type:code id: tags:

``` python
predicted = model.predict_from_sequences(valid_smiles[:500])
count = 0
for s,p in zip(valid_smiles[:500], predicted):
  if ''.join(p) == s:
    count += 1
print('reproduced', count, 'of 500 validation SMILES strings')
```

%% Output

    reproduced 363 of 500 validation SMILES strings

%% Cell type:markdown id: tags:

Now we'll trying using the encoder as a way to generate molecular fingerprints.  We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors.  The amount of data is small enough that we can just store everything in memory.

%% Cell type:code id: tags:

``` python
train_embeddings = model.predict_embeddings(train_smiles)
train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
                                                train_dataset.y,
                                                train_dataset.w,
                                                train_dataset.ids)

valid_embeddings = model.predict_embeddings(valid_smiles)
valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
                                                valid_dataset.y,
                                                valid_dataset.w,
                                                valid_dataset.ids)
```

%% Cell type:markdown id: tags:

For classification, we'll use a simple fully connected network with one hidden layer.

%% Cell type:code id: tags:

``` python
classifier = dc.models.MultitaskClassifier(n_tasks=len(tasks),
                                                      n_features=256,
                                                      layer_sizes=[512])
classifier.fit(train_embeddings_dataset, nb_epoch=10)
```

%% Output

    Ending global_step 999: Average loss 829.805
    Ending global_step 1999: Average loss 450.42
    Ending global_step 2999: Average loss 326.079
    Ending global_step 3999: Average loss 265.199
    Ending global_step 4999: Average loss 246.724
    Ending global_step 5999: Average loss 224.64
    Ending global_step 6999: Average loss 202.624
    Ending global_step 7460: Average loss 213.885
    TIMING: model fitting took 19.780 s

%% Cell type:markdown id: tags:

Find out how well it worked.  Compute the ROC AUC for the training and validation datasets.

%% Cell type:code id: tags:

``` python
import numpy as np
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
print('Training set ROC AUC:', train_score)
print('Validation set ROC AUC:', valid_score)
```

%% Output

    computed_metrics: [0.97828427249789751, 0.98705973960125326, 0.966007068438685, 0.9874401066031584, 0.97794394675150698, 0.98021719680962449, 0.95318452689781941, 0.97185747562764213, 0.96389538770053473, 0.96798988621997473, 0.9690779239145807, 0.98544402211472004, 0.97762497271338133, 0.96843239633294886, 0.97753648081489997, 0.96504683675485614, 0.93547151958366914]
    computed_metrics: [0.90790686952512678, 0.79891461649782913, 0.61900937081659968, 0.75241212956581671, 0.58678903240426017, 0.72765072765072758, 0.34929006085192693, 0.83986814712005553, 0.82379943502824859, 0.61844636844636847, 0.863620199146515, 0.68106930272108857, 0.98020477815699669, 0.85073580939032944, 0.781015678254942, 0.75399733510992673, nan]
    Training set ROC AUC: {'mean-roc_auc_score': 0.97132433878689139}
    Validation set ROC AUC: {'mean-roc_auc_score': 0.74592061629292239}