Commit 4984f61e authored by Peter Eastman

Minor fixes to notebooks
%% Cell type:markdown id: tags:

# Screening Zinc For HIV Inhibition

%% Cell type:markdown id: tags:

In this tutorial I will walk through how to efficiently screen a large compound library (ZINC) with DeepChem.  Screening a large compound library using machine learning is a CPU-bound, pleasingly parallel problem.  The code examples assume the resources available are a single very big machine (like an AWS c5.18xlarge), but should be readily adaptable to other systems (like a supercomputing cluster).  At a high level, we will:

1. Create a Machine Learning Model Over Labeled Data
2. Transform ZINC into "Work-Units"
3. Create an inference script which runs predictions over a "Work-Unit"
4. Load "Work-Unit" into a "Work Queue"
5. Consume work units from "Work Queue"
6. Gather Results

# 1. Train Model On Labeled Data

We are just going to knock out a simple model here.  In a real-world problem you would probably try several models and do a little hyperparameter searching.

%% Cell type:code id: tags:

``` python
from deepchem.molnet.load_function import hiv_datasets
```

%% Cell type:code id: tags:

``` python
from deepchem.models import GraphConvModel
from deepchem.data import NumpyDataset
from sklearn.metrics import average_precision_score
import numpy as np

tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv")
train, valid, test = [NumpyDataset.from_DiskDataset(x) for x in all_datasets]
model = GraphConvModel(1, mode="classification")
model.fit(train)
```

%% Output

    Loading dataset from disk.
    Loading dataset from disk.
    Loading dataset from disk.

    /home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

    94.7494862112294

%% Cell type:code id: tags:

``` python
y_true = np.squeeze(valid.y)
y_pred = model.predict(valid)[:,0,1]
print("Average Precision Score:%s" % average_precision_score(y_true, y_pred))
sorted_results = sorted(zip(y_pred, y_true), reverse=True)
hit_rate_100 = sum(x[1] for x in sorted_results[:100]) / 100
print("Hit Rate Top 100: %s" % hit_rate_100)
```

%% Output

    Average Precision Score:0.22277598937482013
    Hit Rate Top 100: 0.33

%% Cell type:markdown id: tags:

## Retrain Model Over Full Dataset For The Screen

%% Cell type:code id: tags:

``` python
tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv", split=None)

model = GraphConvModel(1, mode="classification", model_dir="/tmp/zinc/screen_model")
model.fit(all_datasets[0])
model.save()
```

%% Output

    Loading raw samples now.
    shard_size: 8192
    About to start loading CSV from /tmp/HIV.csv
    Loading shard 1 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 0 took 15.701 s
    Loading shard 2 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 1 took 15.869 s
    Loading shard 3 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 2 took 19.106 s
    Loading shard 4 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 3 took 16.267 s
    Loading shard 5 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 4 took 16.754 s
    Loading shard 6 of size 8192.
    Featurizing sample 0
    TIMING: featurizing shard 5 took 0.446 s
    TIMING: dataset construction took 98.214 s
    Loading dataset from disk.
    TIMING: dataset construction took 21.127 s
    Loading dataset from disk.

    /home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

%% Cell type:markdown id: tags:

# 2. Create Work-Units

1. Download All of ZINC15.

Go to http://zinc15.docking.org/tranches/home and download all non-empty tranches in .smi format.
I found it easiest to download the wget script and then run it.
For the rest of this tutorial I will assume ZINC was downloaded to /tmp/zinc.


The way ZINC serves the data isn't great for inference.  We want "Work-Units" that a single CPU can execute in a reasonable amount of time (10 minutes to an hour).  To accomplish this we are going to split the ZINC data into files of 500 thousand lines each.


```bash
mkdir /tmp/zinc/screen
find /tmp/zinc -name '*.smi' -exec cat {} \; | grep -iv "smiles" \
     | split -l 500000 /tmp/zinc/screen/segment
```

This bash command
1. Finds all .smi files
2. Prints the contents of each file to stdout
3. Removes header lines
4. Splits the stream into multiple files in /tmp/zinc/screen, 500k molecules each
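
The same chunking logic can be sketched in plain Python (a minimal illustration with a hypothetical `chunk_lines` helper and a toy chunk size; the bash pipeline above is what you would actually run):

``` python
import itertools

def chunk_lines(lines, chunk_size=500000):
    """Yield successive lists of at most chunk_size lines each."""
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# Toy demonstration: 7 "molecules" with a chunk size of 3 -> chunks of 3, 3, 1
lines = ["C" * (i + 1) + " ZINC%08d" % i for i in range(7)]
chunks = list(chunk_lines(lines, chunk_size=3))
```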

%% Cell type:markdown id: tags:

## 3. Create Inference Script

Now that we have work units, we need a program which ingests a work unit and logs the result.  It is important that the logging mechanism is thread safe!
For this example we will receive the work unit as a file path, and log the result to a file.
An easy extension for distributing over multiple computers would be to fetch the work unit via a URL, and log the results to a distributed queue.

Here is what mine looks like

inference.py
```python
import sys
import deepchem as dc
import numpy as np
from rdkit import Chem
import pickle
import os


def create_dataset(fname, batch_size=50000):
    featurizer = dc.feat.ConvMolFeaturizer()
    fin = open(fname)
    mols, orig_lines = [], []
    for line in fin:
        line = line.strip().split()
        try:
            mol = Chem.MolFromSmiles(line[0])
            if mol is None:
                continue
            mols.append(mol)
            orig_lines.append(line)
        except Exception:
            pass
        if len(mols) > 0 and len(mols) % batch_size == 0:
            features = featurizer.featurize(mols)
            y = np.ones(shape=(len(mols), 1))
            ds = dc.data.NumpyDataset(features, y)
            yield ds, orig_lines
            mols, orig_lines = [], []
    if len(mols) > 0:
        features = featurizer.featurize(mols)
        y = np.ones(shape=(len(mols), 1))
        ds = dc.data.NumpyDataset(features, y)
        yield ds, orig_lines


def evaluate(fname):
    fout_name = "%s_out.smi" % fname
    model = dc.models.TensorGraph.load_from_dir('screen_model')
    for ds, lines in create_dataset(fname):
        y_pred = np.squeeze(model.predict(ds), axis=1)
        with open(fout_name, 'a') as fout:
            for index, line in enumerate(lines):
                line.append(y_pred[index][1])
                line = [str(x) for x in line]
                line = "\t".join(line)
                fout.write("%s\n" % line)


if __name__ == "__main__":
    evaluate(sys.argv[1])
```

%% Cell type:markdown id: tags:

# 4. Load "Work-Unit" into a "Work Queue"

We are going to use a flat file as our distribution mechanism: a bash script calling our inference script for every work unit.  At an academic institution this would mean queueing your jobs with PBS/qsub/Slurm.  Options for cloud computing include RabbitMQ or Kafka.

%% Cell type:code id: tags:

``` python
import os
work_units = os.listdir('/tmp/zinc/screen')
with open('/tmp/zinc/work_queue.sh', 'w') as fout:
    fout.write("#!/bin/bash\n")
    for work_unit in work_units:
        full_path = os.path.join('/tmp/zinc/screen', work_unit)
        fout.write("python inference.py %s\n" % full_path)
```

%% Cell type:markdown id: tags:

# 5. Consume work units from "distribution mechanism"

We will consume work units from our work queue using a very simple process pool.  It takes lines from our "Work Queue" and runs them, running as many processes in parallel as we have CPUs.  If you are using a supercomputing cluster scheduler like PBS/qsub/Slurm, it will take care of this for you.  The key to highest throughput is to use one CPU per work unit, which we accomplish here with the Linux utility "taskset".

Using a c5.18xlarge on AWS, this will finish overnight.

process_pool.py
```python
import multiprocessing
import sys
from multiprocessing.pool import Pool

import delegator


def run_command(args):
  q, command = args
  cpu_id = q.get()
  try:
    command = "taskset -c %s %s" % (cpu_id, command)
    print("running %s" % command)
    c = delegator.run(command)
    print(c.err)
    print(c.out)
  except Exception as e:
    print(e)
  q.put(cpu_id)


def main(n_processors, command_file):
  commands = [x.strip() for x in open(command_file).readlines()]
  commands = list(filter(lambda x: not x.startswith("#"), commands))
  q = multiprocessing.Manager().Queue()
  for i in range(n_processors):
    q.put(i)
  argslist = [(q, x) for x in commands]
  pool = Pool(processes=n_processors)
  pool.map(run_command, argslist)


if __name__ == "__main__":
  processors = multiprocessing.cpu_count()
  main(processors, sys.argv[1])
```


```bash
>> python process_pool.py /tmp/zinc/work_queue.sh
```

%% Cell type:markdown id: tags:

# 6. Gather Results
Since we logged our results to \*_out.smi files, we now need to gather them all up and sort them by our predictions.  The resulting file will be > 40GB.  To analyze the data further you can use dask, or put the data in an RDKit PostgreSQL cartridge.

Here I show how to join and sort the data to get the "best" results.

```bash
find /tmp/zinc -name '*_out.smi' -exec cat {} \; > /tmp/zinc/screen/results.smi
sort -rg -k 3,3 /tmp/zinc/screen/results.smi > /tmp/zinc/screen/sorted_results.smi
# Put the top 100k scoring molecules in their own file
head -n 100000 /tmp/zinc/screen/sorted_results.smi > /tmp/zinc/screen/top_100k.smi
```

/tmp/zinc/screen/top_100k.smi is now a small enough file to investigate using standard tools like pandas.
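
As a minimal sketch of that investigation, assuming the three tab-separated columns written by the inference script (SMILES, ZINC ID, score), the results can be loaded with the standard library; `pandas.read_csv(..., sep="\t")` works equally well.  The demo file and `load_results` helper here are illustrative stand-ins:

``` python
import csv
import tempfile

def load_results(path, limit=None):
    """Read tab-separated (smiles, zinc_id, score) rows as written by inference.py."""
    rows = []
    with open(path) as fin:
        for i, row in enumerate(csv.reader(fin, delimiter="\t")):
            if limit is not None and i >= limit:
                break
            rows.append((row[0], row[1], float(row[2])))
    return rows

# Tiny stand-in for /tmp/zinc/screen/top_100k.smi
demo = tempfile.NamedTemporaryFile("w", suffix=".smi", delete=False)
demo.write("CCO\tZINC000001\t0.91\nCCN\tZINC000002\t0.85\n")
demo.close()
rows = load_results(demo.name, limit=100)
```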

%% Cell type:code id: tags:

``` python
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import SVG
from rdkit.Chem.Draw import rdMolDraw2D
with open('/tmp/zinc/screen/top_100k.smi') as fin:
    lines = [x.strip().split() for x in fin.readlines()[:100]]
best_mols = [Chem.MolFromSmiles(x[0]) for x in lines]
best_scores = [x[2] for x in lines]
```

%% Cell type:code id: tags:

``` python
print(best_scores[0])
best_mols[0]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394ebf940>

%% Cell type:code id: tags:

``` python
print(best_scores[1])
best_mols[1]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394e7abc0>

%% Cell type:code id: tags:

``` python
print(best_scores[2])
best_mols[2]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394e818f0>

%% Cell type:code id: tags:

``` python
print(best_scores[3])
best_mols[3]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394e81990>

%% Cell type:markdown id: tags:

The screen seems to favor molecules with one or more sulfur trioxide groups.  The top-scoring molecules also have low diversity.  When creating a "buy list" we want to optimize for more things than just activity, for instance diversity and drug-like MPO.
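
One simple way to trade activity against diversity when assembling such a list is a greedy pick: accept the highest-scoring molecule whose similarity to everything already accepted stays below a cutoff.  This is a sketch with toy bit-set fingerprints, not a DeepChem API; real code would compute RDKit fingerprints and Tanimoto similarity over them:

``` python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as bit-index sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def diverse_pick(candidates, max_sim=0.4, n_picks=2):
    """Greedily keep high scorers that are dissimilar to everything kept so far.

    candidates: (score, fingerprint_bits) pairs, sorted by score descending.
    """
    picked = []
    for score, fp in candidates:
        if all(tanimoto(fp, kept_fp) < max_sim for _, kept_fp in picked):
            picked.append((score, fp))
        if len(picked) == n_picks:
            break
    return picked

# Toy fingerprints: the top two candidates are near-duplicates
candidates = [(0.99, {1, 2, 3}), (0.98, {1, 2, 4}), (0.90, {7, 8, 9})]
picked = diverse_pick(candidates, max_sim=0.4, n_picks=2)
```

The second candidate (score 0.98) is skipped because it shares most of its bits with the first; the more distinct third candidate is kept instead.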

%% Cell type:code id: tags:

``` python
```
%% Cell type:markdown id: tags:

SeqToSeq Fingerprint
--------------------

In this example, we will use a `SeqToSeq` model to generate fingerprints for classifying molecules.  This is based on the following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).

Many types of models require their inputs to have a fixed shape.  Since molecules can vary widely in the numbers of atoms and bonds they contain, this makes it hard to apply those models to them.  We need a way of generating a fixed length "fingerprint" for each molecule.  Various ways of doing this have been designed, such as Extended-Connectivity Fingerprints (ECFPs).  But in this example, instead of designing a fingerprint by hand, we will let a `SeqToSeq` model learn its own method of creating fingerprints.

A `SeqToSeq` model performs sequence to sequence translation.  For example, they are often used to translate text from one language to another.  It consists of two parts called the "encoder" and "decoder".  The encoder is a stack of recurrent layers.  The input sequence is fed into it, one token at a time, and it generates a fixed length vector called the "embedding vector".  The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input, and generates the output sequence.  By training it on appropriately chosen input/output pairs, you can create a model that performs many sorts of transformations.

In this case, we will use SMILES strings describing molecules as the input sequences.  We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences.  For that to work, the encoder must create embedding vectors that contain all information from the original sequence.  That's exactly what we want in a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!

Let's start by loading the data.  We will use the MUV dataset.  It includes 74,501 molecules in the training set, and 9313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.

%% Cell type:code id: tags:

``` python
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_muv()
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids
```

%% Output

    About to load MUV dataset.
    Loading dataset from disk.
    Loading dataset from disk.
    Loading dataset from disk.

%% Cell type:markdown id: tags:

We need to define the "alphabet" for our `SeqToSeq` model, the list of all tokens that can appear in sequences.  (It's also possible for input and output sequences to have different alphabets, but since we're training it as an autoencoder, they're identical in this case.)  Make a list of every character that appears in any training sequence.

%% Cell type:code id: tags:

``` python
tokens = set()
for s in train_smiles:
  tokens = tokens.union(set(c for c in s))
tokens = sorted(list(tokens))
```

%% Cell type:markdown id: tags:

Create the model and define the optimization method to use.  In this case, learning works much better if we gradually decrease the learning rate.  We use an `ExponentialDecay` to multiply the learning rate by 0.9 after each epoch.
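
Assuming `ExponentialDecay(initial_rate, decay_rate, decay_steps)` applies a factor of `decay_rate` once every `decay_steps` training steps (one epoch here), the schedule has a simple closed form.  This is a staircase sketch (DeepChem's implementation may interpolate continuously), and `decay_steps=1000` is a hypothetical stand-in for `batches_per_epoch`:

``` python
def decayed_lr(step, initial_rate=0.004, decay_rate=0.9, decay_steps=1000):
    # Staircase schedule: drop the rate by a factor of decay_rate each epoch
    return initial_rate * decay_rate ** (step // decay_steps)

# Learning rate at the start of the first three epochs
epoch_rates = [decayed_lr(epoch * 1000) for epoch in range(3)]
```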

%% Cell type:code id: tags:

``` python
from deepchem.models.optimizers import Adam, ExponentialDecay
max_length = max(len(s) for s in train_smiles)
model = dc.models.SeqToSeq(tokens,
                           tokens,
                           max_length,
                           encoder_layers=2,
                           decoder_layers=2,
                           embedding_dimension=256,
                           model_dir='fingerprint')
batches_per_epoch = len(train_smiles) / model.batch_size
model.set_optimizer(Adam(learning_rate=ExponentialDecay(0.004, 0.9, batches_per_epoch)))
```

%% Cell type:markdown id: tags:

Let's train it!  The input to `fit_sequences()` is a generator that produces input/output pairs.  On a good GPU, this should take a few hours or less.

%% Cell type:code id: tags:

``` python
def generate_sequences(epochs):
  for i in range(epochs):
    for s in train_smiles:
      yield (s, s)

model.fit_sequences(generate_sequences(40))
```

%% Output

    Ending global_step 999: Average loss 72.0029
    Ending global_step 1999: Average loss 40.7221
    Ending global_step 2999: Average loss 31.5364
    Ending global_step 3999: Average loss 26.4576
    Ending global_step 4999: Average loss 22.814
    Ending global_step 5999: Average loss 19.5248
    Ending global_step 6999: Average loss 16.4594
    Ending global_step 7999: Average loss 18.8898
    Ending global_step 8999: Average loss 13.476
    Ending global_step 9999: Average loss 11.5528
    Ending global_step 10999: Average loss 10.1594
    Ending global_step 11999: Average loss 10.6434
    Ending global_step 12999: Average loss 6.57057
    Ending global_step 13999: Average loss 6.46177
    Ending global_step 14999: Average loss 7.53559
    Ending global_step 15999: Average loss 4.95809
    Ending global_step 16999: Average loss 4.35039
    Ending global_step 17999: Average loss 3.39137
    Ending global_step 18999: Average loss 3.5216
    Ending global_step 19999: Average loss 3.08579
    Ending global_step 20999: Average loss 2.80738
    Ending global_step 21999: Average loss 2.92217
    Ending global_step 22999: Average loss 2.51032
    Ending global_step 23999: Average loss 1.86265
    Ending global_step 24999: Average loss 1.67088
    Ending global_step 25999: Average loss 1.87016
    Ending global_step 26999: Average loss 1.61166
    Ending global_step 27999: Average loss 1.40708
    Ending global_step 28999: Average loss 1.4488
    Ending global_step 29801: Average loss 1.33917
    TIMING: model fitting took 5619.924 s

%% Cell type:markdown id: tags:

Let's see how well it works as an autoencoder.  We'll run the first 500 molecules from the validation set through it, and see how many of them are exactly reproduced.

%% Cell type:code id: tags:

``` python
predicted = model.predict_from_sequences(valid_smiles[:500])
count = 0
for s,p in zip(valid_smiles[:500], predicted):
  if ''.join(p) == s:
    count += 1
print('reproduced', count, 'of 500 validation SMILES strings')
```

%% Output

    reproduced 363 of 500 validation SMILES strings

%% Cell type:markdown id: tags:

Now we'll try using the encoder as a way to generate molecular fingerprints.  We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors.  The amount of data is small enough that we can just store everything in memory.

%% Cell type:code id: tags:

``` python
train_embeddings = model.predict_embeddings(train_smiles)
train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
                                                train_dataset.y,
                                                train_dataset.w,
                                                train_dataset.ids)

valid_embeddings = model.predict_embeddings(valid_smiles)
valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
                                                valid_dataset.y,
                                                valid_dataset.w,
                                                valid_dataset.ids)
```

%% Cell type:markdown id: tags:

For classification, we'll use a simple fully connected network with one hidden layer.

%% Cell type:code id: tags:

``` python
classifier = dc.models.MultitaskClassifier(n_tasks=len(tasks),
                                           n_features=256,
                                           layer_sizes=[512])
classifier.fit(train_embeddings_dataset, nb_epoch=10)
```

%% Output

    Ending global_step 999: Average loss 829.805
    Ending global_step 1999: Average loss 450.42
    Ending global_step 2999: Average loss 326.079
    Ending global_step 3999: Average loss 265.199
    Ending global_step 4999: Average loss 246.724
    Ending global_step 5999: Average loss 224.64
    Ending global_step 6999: Average loss 202.624
    Ending global_step 7460: Average loss 213.885
    TIMING: model fitting took 19.780 s

%% Cell type:markdown id: tags:

Find out how well it worked.  Compute the ROC AUC for the training and validation datasets.

%% Cell type:code id: tags:

``` python
import numpy as np
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
print('Training set ROC AUC:', train_score)
print('Validation set ROC AUC:', valid_score)
```

%% Output

    computed_metrics: [0.97828427249789751, 0.98705973960125326, 0.966007068438685, 0.9874401066031584, 0.97794394675150698, 0.98021719680962449, 0.95318452689781941, 0.97185747562764213, 0.96389538770053473, 0.96798988621997473, 0.9690779239145807, 0.98544402211472004, 0.97762497271338133, 0.96843239633294886, 0.97753648081489997, 0.96504683675485614, 0.93547151958366914]
    computed_metrics: [0.90790686952512678, 0.79891461649782913, 0.61900937081659968, 0.75241212956581671, 0.58678903240426017, 0.72765072765072758, 0.34929006085192693, 0.83986814712005553, 0.82379943502824859, 0.61844636844636847, 0.863620199146515, 0.68106930272108857, 0.98020477815699669, 0.85073580939032944, 0.781015678254942, 0.75399733510992673, nan]
    Training set ROC AUC: {'mean-roc_auc_score': 0.97132433878689139}
    Validation set ROC AUC: {'mean-roc_auc_score': 0.74592061629292239}
%% Cell type:markdown id: tags:

# Modeling Solubility
Written by Bharath Ramsundar and Evan Feinberg

Copyright 2016, Stanford University

%% Cell type:markdown id: tags:

Computationally predicting molecular solubility is useful for drug discovery. In this tutorial, we will use the `deepchem` library to fit a simple statistical model that predicts the solubility of drug-like compounds. The process of fitting this model involves four steps:

1. Loading a chemical dataset, consisting of a series of compounds along with aqueous solubility measurements.
2. Transforming each compound into a feature vector $v \in \mathbb{R}^n$ comprehensible to statistical learning methods.
3. Fitting a simple model that maps feature vectors to estimates of aqueous solubility.
4. Visualizing the results.

%% Cell type:markdown id: tags:

We need to load a dataset of estimated aqueous solubility measurements [1] into deepchem. The data is in CSV format and contains SMILES strings, predicted aqueous solubilities, and a number of molecular properties that are extraneous for our purposes. Here is an example line from the dataset:

|Compound ID|ESOL predicted log solubility in mols per litre|Minimum Degree|Molecular Weight|Number of H-Bond Donors|Number of Rings|Number of Rotatable Bonds|Polar Surface Area|measured log solubility in mols per litre|smiles|
|---|---|---|---|---|---|---|---|---|---|
|benzothiazole|-2.733|2|135.191|0|2|0|12.89|-1.5|c2ccc1scnc1c2|

Most of these fields are not useful for our purposes. The two fields that we will need are the "smiles" field and the "measured log solubility in mols per litre". The "smiles" field holds a SMILES string [2] that specifies the compound in question. Before we load this data into deepchem, we will load the data into python and do some simple preliminary analysis to gain some intuition for the dataset.

%% Cell type:code id: tags:

``` python
%load_ext autoreload
%autoreload 2
%pdb off
"""
Not Currently Working
"""
from deepchem.utils.save import load_from_disk

dataset_file= "../../datasets/delaney-processed.csv"
dataset = load_from_disk(dataset_file)
print("Columns of dataset: %s" % str(dataset.columns.values))
print("Number of examples in dataset: %s" % str(dataset.shape[0]))
```

%% Output

    ERROR:root:Line magic function `%autoreload` not found.

    Automatic pdb calling has been turned OFF

    /home/leswing/anaconda3/envs/deepchem27/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)

    Columns of dataset: ['Compound ID' 'ESOL predicted log solubility in mols per litre'
     'Minimum Degree' 'Molecular Weight' 'Number of H-Bond Donors'
     'Number of Rings' 'Number of Rotatable Bonds' 'Polar Surface Area'
     'measured log solubility in mols per litre' 'smiles']
    Number of examples in dataset: 1128

%% Cell type:markdown id: tags:

To gain a visual understanding of compounds in our dataset, let's draw them using rdkit. We define a couple of helper functions to get started.

%% Cell type:code id: tags:

``` python
import tempfile
from rdkit import Chem
from rdkit.Chem import Draw
from itertools import islice
from IPython.display import Image, HTML, display

def display_images(filenames):
    """Helper to pretty-print images."""
    imagesList=''.join(
        ["<img style='width: 140px; margin: 0px; float: left; border: 1px solid black;' src='%s' />"
         % str(s) for s in sorted(filenames)])
    display(HTML(imagesList))

def mols_to_pngs(mols, basename="test"):
    """Helper to write RDKit mols to png files."""
    filenames = []
    for i, mol in enumerate(mols):
        filename = "%s%d.png" % (basename, i)
        Draw.MolToFile(mol, filename)
        filenames.append(filename)
    return filenames
```

%% Cell type:markdown id: tags:

Now, we display some compounds from the dataset:

%% Cell type:code id: tags:

``` python
num_to_display = 14
molecules = []
for _, data in islice(dataset.iterrows(), num_to_display):
    molecules.append(Chem.MolFromSmiles(data["smiles"]))
display_images(mols_to_pngs(molecules))
```

%% Output


%% Cell type:markdown id: tags:

Analyzing the distribution of solubilities shows us a nice spread of data.

%% Cell type:code id: tags:

``` python
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

solubilities = np.array(dataset["measured log solubility in mols per litre"])
n, bins, patches = plt.hist(solubilities, 50, facecolor='green', alpha=0.75)
plt.xlabel('Measured log-solubility in mols/liter')
plt.ylabel('Number of compounds')
plt.title(r'Histogram of solubilities')
plt.grid(True)
plt.show()
```

%% Output


%% Cell type:markdown id: tags:

With our preliminary analysis completed, we return to the original goal of constructing a predictive statistical model of molecular solubility using `deepchem`. The first step in creating such a model is translating each compound into a vectorial format that can be understood by statistical learning techniques. This process is commonly called featurization. `deepchem` packages a number of commonly used featurizations for user convenience. In this tutorial, we will use ECFP4 fingerprints [3].

`deepchem` offers an object-oriented API for featurization. To get started with featurization, we first construct a ```Featurizer``` object. `deepchem` provides the ```CircularFingerprint``` class (a subclass of ```Featurizer``` that performs ECFP4 featurization).

%% Cell type:code id: tags:

``` python
import deepchem as dc

featurizer = dc.feat.CircularFingerprint(size=1024)
```

%% Cell type:markdown id: tags:

Now, let's perform the actual featurization. `deepchem` provides the ```CSVLoader``` class for this purpose. The ```featurize()``` method for this class loads data from disk and uses provided ```Featurizer``` instances to transform the provided data into feature vectors.

To perform machine learning upon these datasets, we need to convert the samples into datasets suitable for machine learning (that is, into a data matrix $X \in \mathbb{R}^{n\times d}$, where $n$ is the number of samples and $d$ the dimensionality of the feature vector, and a label vector $y \in \mathbb{R}^n$). `deepchem` provides the `Dataset` class to facilitate this transformation. This style lends itself easily to validation-set hyperparameter searches, which we illustrate below.
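
Concretely, for $n$ molecules featurized into 1024-bit circular fingerprints, $X$ has shape $n \times 1024$ and $y$ has length $n$.  A minimal numpy sketch with random placeholder values (not real fingerprints or solubilities):

``` python
import numpy as np

n, d = 5, 1024                              # 5 molecules, 1024-bit fingerprints
X = np.random.randint(0, 2, size=(n, d))    # placeholder binary feature matrix
y = np.random.randn(n)                      # placeholder log-solubility labels
```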

%% Cell type:code id: tags:

``` python
loader = dc.data.CSVLoader(
      tasks=["measured log solubility in mols per litre"], smiles_field="smiles",
      featurizer=featurizer)
dataset = loader.featurize(dataset_file)
```

%% Output

    Loading raw samples now.
    shard_size: 8192
    About to start loading CSV from ../../datasets/delaney-processed.csv
    Loading shard 1 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    TIMING: featurizing shard 0 took 2.016 s
    TIMING: dataset construction took 2.061 s
    Loading dataset from disk.

    /home/leswing/Documents/deepchem/deepchem/data/data_loader.py:42: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      if y[ind, task] == "":

%% Cell type:markdown id: tags:

When constructing statistical models, it's necessary to separate the provided data into train/test subsets. The train subset is used to learn the statistical model, while the test subset is used to evaluate the learned model. In practice, it's often useful to elaborate this split further and perform a train/validation/test split. The validation set is used to perform model selection: proposed models are evaluated on the validation set, and the best-performing model is then tested on the test set.

Choosing the proper method of performing a train/validation/test split can be challenging. Standard machine learning practice is to perform a random split of the data into train/validation/test, but random splits are not well suited for the purposes of chemical informatics. For our predictive models to be useful, we require them to have predictive power in portions of chemical space beyond the set of molecules in the training data. Consequently, our models should use splits of the data that separate compounds in the training set from those in the validation and test-sets. We use Bemis-Murcko scaffolds [5] to perform this separation (all compounds that share an underlying molecular scaffold will be placed into the same split in the train/test/validation split).
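
The splitting idea can be sketched independently of RDKit: group compounds by a precomputed scaffold key, then fill the train split with the largest groups first so no scaffold straddles two splits. The scaffold keys below are hypothetical placeholders; `ScaffoldSplitter` derives real Bemis-Murcko scaffolds from the SMILES:

``` python
from collections import defaultdict

def scaffold_split(compound_to_scaffold, frac_train=0.8):
    """Fill the train split with the largest scaffold groups first, so that
    no scaffold appears on both sides of the split."""
    groups = defaultdict(list)
    for cid, scaffold in compound_to_scaffold.items():
        groups[scaffold].append(cid)
    train, rest = [], []
    cutoff = frac_train * len(compound_to_scaffold)
    for members in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) + len(members) <= cutoff else rest).extend(members)
    return train, rest

# Hypothetical precomputed scaffold keys for five compounds
train_ids, rest_ids = scaffold_split(
    {"a": "s1", "b": "s1", "c": "s1", "d": "s2", "e": "s3"})
```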

%% Cell type:code id: tags:

``` python
splitter = dc.splits.ScaffoldSplitter(dataset_file)
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(
    dataset)
```

%% Output

    Computing train/valid/test indices
    About to generate scaffolds
    Generating scaffold 0/1128
    Generating scaffold 1000/1128
    About to sort in scaffold sets
    TIMING: dataset construction took 0.058 s
    Loading dataset from disk.
    TIMING: dataset construction took 0.027 s
    Loading dataset from disk.
    TIMING: dataset construction took 0.030 s
    Loading dataset from disk.
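
%% Cell type:markdown id: tags:

The scaffold-based grouping described above can be sketched in plain Python. The SMILES-to-scaffold pairs below are illustrative; in the real pipeline, RDKit computes the Bemis-Murcko scaffold of each molecule.

%% Cell type:code id: tags:

``` python
from collections import defaultdict

# Illustrative SMILES -> Bemis-Murcko scaffold pairs; in the real pipeline
# RDKit computes the scaffold of each molecule.
compound_scaffolds = {
    "Cc1ccccc1": "c1ccccc1",      # toluene -> benzene scaffold
    "CCc1ccccc1": "c1ccccc1",     # ethylbenzene -> benzene scaffold
    "CCCc1ccccc1": "c1ccccc1",    # propylbenzene -> benzene scaffold
    "OC1CCCCC1": "C1CCCCC1",      # cyclohexanol -> cyclohexane scaffold
    "CC1CCCCC1": "C1CCCCC1",      # methylcyclohexane -> cyclohexane scaffold
}

# Group compounds by scaffold; each group must land in a single split.
scaffold_sets = defaultdict(list)
for smiles, scaffold in compound_scaffolds.items():
    scaffold_sets[scaffold].append(smiles)

# Simplified allocation: the largest scaffold sets form the train split,
# and the rarer scaffolds are held out for validation/test.
ordered = sorted(scaffold_sets.values(), key=len, reverse=True)
train_split = ordered[0]
held_out = [s for group in ordered[1:] for s in group]
```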

%% Cell type:markdown id: tags:

Let's visually inspect some of the molecules in the separate splits to verify that they appear structurally dissimilar. Each split is a dataset object whose `ids` attribute holds the SMILES strings of the underlying compounds.

%% Cell type:code id: tags:

``` python
train_mols = [Chem.MolFromSmiles(compound)
              for compound in train_dataset.ids]
display_images(mols_to_pngs(train_mols, basename="train"))
```

%% Output


%% Cell type:code id: tags:

``` python
valid_mols = [Chem.MolFromSmiles(compound)
              for compound in valid_dataset.ids]
display_images(mols_to_pngs(valid_mols, basename="valid"))
```

%% Output


%% Cell type:markdown id: tags:

Notice the visual distinction between the train and validation splits. The most common scaffolds are reserved for the train split, while the rarer scaffolds are allotted to validation and test.

%% Cell type:markdown id: tags:

The performance of common machine-learning algorithms can be very sensitive to how the data is preprocessed. One common transformation is to normalize the data to have zero mean and unit standard deviation. We will apply this transformation to the log-solubility (as seen above, the log-solubility ranges from -12 to 2).

%% Cell type:code id: tags:

``` python
transformers = [
    dc.trans.NormalizationTransformer(transform_y=True, dataset=train_dataset)]

# Reassign each dataset so the transformed version is used downstream.
for transformer in transformers:
  train_dataset = transformer.transform(train_dataset)
  valid_dataset = transformer.transform(valid_dataset)
  test_dataset = transformer.transform(test_dataset)
```

%% Output

    TIMING: dataset construction took 0.042 s
    Loading dataset from disk.
    TIMING: dataset construction took 0.009 s
    Loading dataset from disk.
    TIMING: dataset construction took 0.009 s
    Loading dataset from disk.
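
%% Cell type:markdown id: tags:

The z-scoring that `NormalizationTransformer` performs on the labels can be sketched with the standard library alone; the label values below are illustrative, not the actual dataset.

%% Cell type:code id: tags:

``` python
import statistics

# Illustrative log-solubility labels (the real labels span roughly -12 to 2).
y_train = [-3.0, -1.0, 0.0, 2.0]

mean = statistics.fmean(y_train)
std = statistics.pstdev(y_train)  # population standard deviation

# z-score each label: subtract the mean, divide by the standard deviation.
y_normalized = [(y - mean) / std for y in y_train]
```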

%% Cell type:markdown id: tags:

The next step after processing the data is to start fitting simple learning models to our data. `deepchem` provides a number of machine-learning model classes.

In particular, `deepchem` provides a convenience class, `SklearnModel`, that wraps any machine-learning model available in scikit-learn [6]. We will start by building a simple random-forest regressor that attempts to predict the log-solubility from our computed ECFP4 features. To train the model, we instantiate the `SklearnModel` object, then call its `fit()` method on the `train_dataset` we constructed above.

%% Cell type:code id: tags:

``` python
from sklearn.ensemble import RandomForestRegressor

sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)
```

%% Cell type:markdown id: tags:

We next evaluate the model on the validation set to see its predictive power. `deepchem` provides the `Evaluator` class to facilitate this process. To evaluate the constructed `model` object, create a new `Evaluator` instance and call the `compute_model_performance()` method.

%% Cell type:code id: tags:

``` python
from deepchem.utils.evaluate import Evaluator

metric = dc.metrics.Metric(dc.metrics.r2_score)
evaluator = Evaluator(model, valid_dataset, transformers)
r2score = evaluator.compute_model_performance([metric])
print(r2score)
```

%% Output

    computed_metrics: [0.15666969194533098]
    {'r2_score': 0.15666969194533098}
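
%% Cell type:markdown id: tags:

The `r2_score` metric reported above is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. A small self-contained sketch on illustrative values (not the tutorial's actual predictions):

%% Cell type:code id: tags:

``` python
# Coefficient of determination: R^2 = 1 - SS_res / SS_tot,
# computed on illustrative true/predicted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

mean_true = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
r2 = 1.0 - ss_res / ss_tot
```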

%% Cell type:markdown id: tags:

The performance of this basic random-forest model isn't very strong. To construct stronger models, let's attempt to optimize the hyperparameters (choices made in the model-specification) to achieve better performance. For random forests, we can tweak `n_estimators` which controls the number of trees in the forest, and `max_features` which controls the number of features to consider when performing a split. We now build a series of `SklearnModel`s with different choices for `n_estimators` and `max_features` and evaluate performance on the validation set.

%% Cell type:code id: tags:

``` python
def rf_model_builder(model_params, model_dir):
  sklearn_model = RandomForestRegressor(**model_params)
  return dc.models.SklearnModel(sklearn_model, model_dir)

params_dict = {
    "n_estimators": [10, 100],
    "max_features": ["auto", "sqrt", "log2", None],
}

metric = dc.metrics.Metric(dc.metrics.r2_score)
optimizer = dc.hyper.HyperparamOpt(rf_model_builder)
best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, transformers,
    metric=metric)
```

%% Output

    Fitting model 1/8
    hyperparameters: {'n_estimators': 10, 'max_features': 'auto'}
    computed_metrics: [0.16084049792955879]
    Model 1/8, Metric r2_score, Validation set 0: 0.160840
    	best_validation_score so far: 0.160840
    Fitting model 2/8
    hyperparameters: {'n_estimators': 10, 'max_features': 'sqrt'}
    computed_metrics: [0.24188860561412873]
    Model 2/8, Metric r2_score, Validation set 1: 0.241889
    	best_validation_score so far: 0.241889
    Fitting model 3/8
    hyperparameters: {'n_estimators': 10, 'max_features': 'log2'}
    computed_metrics: [0.13923134159755857]
    Model 3/8, Metric r2_score, Validation set 2: 0.139231
    	best_validation_score so far: 0.241889
    Fitting model 4/8
    hyperparameters: {'n_estimators': 10, 'max_features': None}
    computed_metrics: [0.068066979889086832]
    Model 4/8, Metric r2_score, Validation set 3: 0.068067
    	best_validation_score so far: 0.241889
    Fitting model 5/8
    hyperparameters: {'n_estimators': 100, 'max_features': 'auto'}
    computed_metrics: [0.13350083343444363]
    Model 5/8, Metric r2_score, Validation set 4: 0.133501
    	best_validation_score so far: 0.241889
    Fitting model 6/8
    hyperparameters: {'n_estimators': 100, 'max_features': 'sqrt'}
    computed_metrics: [0.27500018179983798]
    Model 6/8, Metric r2_score, Validation set 5: 0.275000
    	best_validation_score so far: 0.275000
    Fitting model 7/8
    hyperparameters: {'n_estimators': 100, 'max_features': 'log2'}
    computed_metrics: [0.24719988102520696]
    Model 7/8, Metric r2_score, Validation set 6: 0.247200
    	best_validation_score so far: 0.275000
    Fitting model 8/8
    hyperparameters: {'n_estimators': 100, 'max_features': None}
    computed_metrics: [0.15709535184995427]
    Model 8/8, Metric r2_score, Validation set 7: 0.157095
    	best_validation_score so far: 0.275000
    computed_metrics: [0.94421662988139621]
    Best hyperparameters: (100, 'sqrt')
    train_score: 0.944217
    validation_score: 0.275000
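
%% Cell type:markdown id: tags:

The eight models fit above are just the Cartesian product of the two hyperparameter lists. The search loop can be sketched with the standard library, reusing the validation scores reported in the output:

%% Cell type:code id: tags:

``` python
from itertools import product

# Validation R^2 scores reported by the search above, keyed by
# (n_estimators, max_features).
val_scores = {
    (10, "auto"): 0.160840, (10, "sqrt"): 0.241889,
    (10, "log2"): 0.139231, (10, None): 0.068067,
    (100, "auto"): 0.133501, (100, "sqrt"): 0.275000,
    (100, "log2"): 0.247200, (100, None): 0.157095,
}

# Grid search enumerates the Cartesian product of the option lists
# (2 x 4 = 8 candidate models) and keeps the best validation score.
grid = list(product([10, 100], ["auto", "sqrt", "log2", None]))
best = max(grid, key=lambda combo: val_scores[combo])
```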

%% Cell type:markdown id: tags:

The best model achieves significantly higher $R^2$ on the validation set than the first model we constructed. Now, let's perform the same sort of hyperparameter search, but with a simple deep-network instead.

%% Cell type:code id: tags:

``` python
import numpy.random

params_dict = {"learning_rate": np.power(10., np.random.uniform(-5, -3, size=1)),
               "decay": np.power(10, np.random.uniform(-6, -4, size=1)),
               "nb_epoch": [20] }
n_features = train_dataset.get_data_shape()[0]
def model_builder(model_params, model_dir):
  model = dc.models.MultitaskRegressor(
    1, n_features, layer_sizes=[1000], dropouts=[.25],
    batch_size=50, **model_params)
  return model

optimizer = dc.hyper.HyperparamOpt(model_builder)
best_dnn, best_dnn_hyperparams, all_dnn_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, transformers,
    metric=metric)
```

%% Output

    Fitting model 1/1
    hyperparameters: {'learning_rate': 0.00023048114774615143, 'nb_epoch': 20, 'decay': 4.6386701591998651e-06}
    Training for 20 epochs
    On batch 0
    Ending epoch 0: Average loss 2.90304
    On batch 0
    Ending epoch 1: Average loss 1.80178
    On batch 0
    Ending epoch 2: Average loss 1.49158
    On batch 0
    Ending epoch 3: Average loss 1.32873
    On batch 0
    Ending epoch 4: Average loss 1.1639
    On batch 0
    Ending epoch 5: Average loss 1.00606
    On batch 0
    Ending epoch 6: Average loss 0.93537
    On batch 0
    Ending epoch 7: Average loss 0.866738
    On batch 0
    Ending epoch 8: Average loss 0.785585
    On batch 0
    Ending epoch 9: Average loss 0.735722
    On batch 0
    Ending epoch 10: Average loss 0.693006
    On batch 0
    Ending epoch 11: Average loss 0.620769
    On batch 0
    Ending epoch 12: Average loss 0.59393
    On batch 0
    Ending epoch 13: Average loss 0.565747
    On batch 0
    Ending epoch 14: Average loss 0.558274
    On batch 0
    Ending epoch 15: Average loss 0.522325
    On batch 0
    Ending epoch 16: Average loss 0.52158
    On batch 0
    Ending epoch 17: Average loss 0.483039
    On batch 0
    Ending epoch 18: Average loss 0.472697
    On batch 0
    Ending epoch 19: Average loss 0.463411
    On batch 0
    TIMING: model fitting took 4.142 s True
    computed_metrics: [0.19197438493808416]
    Model 1/1, Metric r2_score, Validation set 0: 0.191974
    	best_validation_score so far: 0.191974
    computed_metrics: [0.81127373811679659]
    Best hyperparameters: (0.00023048114774615143, 20, 4.6386701591998651e-06)
    train_score: 0.811274
    validation_score: 0.191974

%% Cell type:markdown id: tags:

Now that we have a reasonable choice of hyperparameters, let's evaluate the performance of our best models on the test-set.

%% Cell type:code id: tags:

``` python
rf_test_evaluator = Evaluator(best_rf, test_dataset, transformers)
rf_test_r2score = rf_test_evaluator.compute_model_performance([metric])
print("RF Test set R^2 %f" % (rf_test_r2score["r2_score"]))
```

%% Output

    computed_metrics: [0.36485280404796627]
    RF Test set R^2 0.364853

%% Cell type:code id: tags:

``` python
dnn_test_evaluator = Evaluator(best_dnn, test_dataset, transformers)
dnn_test_r2score = dnn_test_evaluator.compute_model_performance([metric])
print("DNN Test set R^2 %f" % (dnn_test_r2score["r2_score"]))
```

%% Output

    computed_metrics: [0.24496914941196501]
    DNN Test set R^2 0.244969

%% Cell type:markdown id: tags:

Now, let's plot the predicted log-solubilities against the true log-solubilities for the models we constructed.

%% Cell type:code id: tags:

``` python
task = "measured log solubility in mols per litre"
predicted_test = best_rf.predict(test_dataset)
true_test = test_dataset.y
plt.scatter(predicted_test, true_test)
plt.xlabel('Predicted log-solubility in mols/liter')
plt.ylabel('True log-solubility in mols/liter')
plt.title(r'RF predicted vs. true log-solubilities')
plt.show()
```

%% Output


%% Cell type:code id: tags:

``` python
task = "measured log solubility in mols per litre"
predicted_test = best_dnn.predict(test_dataset)
true_test = test_dataset.y
plt.scatter(predicted_test, true_test)
plt.xlabel('Predicted log-solubility in mols/liter')
plt.ylabel('True log-solubility in mols/liter')
plt.title(r'DNN predicted vs. true log-solubilities')
plt.show()
```

%% Output


%% Cell type:markdown id: tags:

[1] John S. Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal
of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.

[2] Anderson, Eric, Gilman D. Veith, and David Weininger. SMILES, a line notation and computerized
interpreter for chemical structures. US Environmental Protection Agency, Environmental Research Laboratory, 1987.

[3] Rogers, David, and Mathew Hahn. "Extended-connectivity fingerprints." Journal of chemical information
and modeling 50.5 (2010): 742-754.

[4] Van Der Walt, Stefan, S. Chris Colbert, and Gael Varoquaux.
"The NumPy array:a structure for efficient numerical computation." Computing in Science & Engineering 13.2 (2011): 22-30.

[5] Bemis, Guy W., and Mark A. Murcko. "The properties of known drugs. 1. Molecular frameworks."
Journal of medicinal chemistry 39.15 (1996): 2887-2893.

[6] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." The Journal of Machine Learning Research 12 (2011): 2825-2830.