Commit 871d65c4 authored by Peter Eastman's avatar Peter Eastman
Browse files

Ran tutorial for full number of epochs on a fast GPU

parent c6bc2e3b
Loading
Loading
Loading
Loading
+7 −37
Original line number Diff line number Diff line
%% Cell type:markdown id: tags:

# Tutorial Part 16: Learning Unsupervised Embeddings for Molecules

In this tutorial, we will use a `SeqToSeq` model to generate fingerprints for classifying molecules.  This is based on the following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).

## Colab

This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/16_Learning_Unsupervised_Embeddings_for_Molecules.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine. This notebook will take a few hours to run on a GPU, so we encourage you to run it on Google colab unless you have a good GPU machine available.
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine. This notebook can take up to a few hours to run on a GPU, so we encourage you to run it on Google colab unless you have a good GPU machine available.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Output

      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    
      0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    100  3489  100  3489    0     0   8209      0 --:--:-- --:--:-- --:--:--  8209

    add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
    all packages is already installed

    # conda environments:
    #
    base                  *  /root/miniconda
    

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```

%% Output

    Requirement already satisfied: deepchem in /usr/local/lib/python3.6/dist-packages (2.4.0rc1.dev20200805143219)
    Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.22.2.post1)
    Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.0.5)
    Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.16.0)
    Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.18.5)
    Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.4.1)
    Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2018.9)
    Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2.8.1)
    Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->deepchem) (1.15.0)

    '2.4.0-rc1.dev'

%% Cell type:markdown id: tags:

# Learning Embeddings with SeqToSeq

Many types of models require their inputs to have a fixed shape.  Since molecules can vary widely in the numbers of atoms and bonds they contain, this makes it hard to apply those models to them.  We need a way of generating a fixed length "fingerprint" for each molecule.  Various ways of doing this have been designed, such as the Extended-Connectivity Fingerprints (ECFPs) we used in earlier tutorials.  But in this example, instead of designing a fingerprint by hand, we will let a `SeqToSeq` model learn its own method of creating fingerprints.

A `SeqToSeq` model performs sequence to sequence translation.  For example, they are often used to translate text from one language to another.  It consists of two parts called the "encoder" and "decoder".  The encoder is a stack of recurrent layers.  The input sequence is fed into it, one token at a time, and it generates a fixed length vector called the "embedding vector".  The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input, and generates the output sequence.  By training it on appropriately chosen input/output pairs, you can create a model that performs many sorts of transformations.

In this case, we will use SMILES strings describing molecules as the input sequences.  We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences.  For that to work, the encoder must create embedding vectors that contain all information from the original sequence.  That's exactly what we want in a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!

Let's start by loading the data.  We will use the MUV dataset.  It includes 74,501 molecules in the training set, and 9313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.

%% Cell type:code id: tags:

``` python
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_muv(split='stratified')
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids
```

%% Cell type:markdown id: tags:

We need to define the "alphabet" for our `SeqToSeq` model, the list of all tokens that can appear in sequences.  (It's also possible for input and output sequences to have different alphabets, but since we're training it as an autoencoder, they're identical in this case.)  Make a list of every character that appears in any training sequence.

%% Cell type:code id: tags:

``` python
tokens = set()
for s in train_smiles:
  tokens = tokens.union(set(c for c in s))
tokens = sorted(list(tokens))
```

%% Cell type:markdown id: tags:

Create the model and define the optimization method to use.  In this case, learning works much better if we gradually decrease the learning rate.  We use an `ExponentialDecay` to multiply the learning rate by 0.9 after each epoch.

%% Cell type:code id: tags:

``` python
from deepchem.models.optimizers import Adam, ExponentialDecay
max_length = max(len(s) for s in train_smiles)
batch_size = 100
batches_per_epoch = len(train_smiles)/batch_size
model = dc.models.SeqToSeq(tokens,
                           tokens,
                           max_length,
                           encoder_layers=2,
                           decoder_layers=2,
                           embedding_dimension=256,
                           model_dir='fingerprint',
                           batch_size=batch_size,
                           learning_rate=ExponentialDecay(0.004, 0.9, batches_per_epoch))
                           learning_rate=ExponentialDecay(0.001, 0.9, batches_per_epoch))
```

%% Cell type:markdown id: tags:

Let's train it!  The input to `fit_sequences()` is a generator that produces input/output pairs.  On a good GPU, this should take a few hours or less.

%% Cell type:code id: tags:

``` python
def generate_sequences(epochs):
  for i in range(epochs):
    for s in train_smiles:
      yield (s, s)

model.fit_sequences(generate_sequences(1))#40
model.fit_sequences(generate_sequences(40))
```

%% Cell type:markdown id: tags:

Let's see how well it works as an autoencoder.  We'll run the first 500 molecules from the validation set through it, and see how many of them are exactly reproduced.

%% Cell type:code id: tags:

``` python
predicted = model.predict_from_sequences(valid_smiles[:500])
count = 0
for s,p in zip(valid_smiles[:500], predicted):
  if ''.join(p) == s:
    count += 1
print('reproduced', count, 'of 500 validation SMILES strings')
```

%% Output

    reproduced 0 of 500 validation SMILES strings
    reproduced 161 of 500 validation SMILES strings

%% Cell type:markdown id: tags:

Now we'll trying using the encoder as a way to generate molecular fingerprints.  We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors.  The amount of data is small enough that we can just store everything in memory.

%% Cell type:code id: tags:

``` python
import numpy as np
train_embeddings = model.predict_embeddings(train_smiles)
train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
                                                train_dataset.y,
                                                train_dataset.w.astype(np.float32),
                                                train_dataset.ids)

valid_embeddings = model.predict_embeddings(valid_smiles)
valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
                                                valid_dataset.y,
                                                valid_dataset.w.astype(np.float32),
                                                valid_dataset.ids)
```

%% Cell type:markdown id: tags:

For classification, we'll use a simple fully connected network with one hidden layer.

%% Cell type:code id: tags:

``` python
classifier = dc.models.MultitaskClassifier(n_tasks=len(tasks),
                                                      n_features=256,
                                                      layer_sizes=[512])
classifier.fit(train_embeddings_dataset, nb_epoch=10)
```

%% Output

    0.002357203811407089
    0.0014195525646209716

%% Cell type:markdown id: tags:

Find out how well it worked.  Compute the ROC AUC for the training and validation datasets.

%% Cell type:code id: tags:

``` python
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
print('Training set ROC AUC:', train_score)
print('Validation set ROC AUC:', valid_score)
```

%% Output

    Training set ROC AUC: {'mean-roc_auc_score': 0.8140473860164172}
    Validation set ROC AUC: {'mean-roc_auc_score': 0.6620464489144489}
    Training set ROC AUC: {'mean-roc_auc_score': 0.9598792603154332}
    Validation set ROC AUC: {'mean-roc_auc_score': 0.7251350862464794}

%% Cell type:markdown id: tags:

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!