Commit 9cede3c6 authored by leswing's avatar leswing
Browse files

Publish It

parent 659d9228
Loading
Loading
Loading
Loading
+8 −5
Original line number Diff line number Diff line
%% Cell type:markdown id: tags:

# Screening Zinc For HIV Inhibition

%% Cell type:markdown id: tags:

In this tutorial I will walk through how to efficiently screen a large compound library with DeepChem (ZINC).  Screening a large compound library using machine learning is a CPU bound pleasingly parrellel problem.  The actual code examples I will use assume the resources available are a single very big machine (like an AWS c5.18xlarge), but should be readily swappable for other systmes (like a super computing cluster).  At a high level what we will do is...

1. Create a Machine Learning Model Over Labeled Data
2. Transform ZINC into "Work-Units"
3. Create an inference script which runs predictions over a "Work-Unit"
4. Load "Work-Unit" into a "Work Queue"
5. Consume work units from "Work Queue"
6. Gather Results

# 1. Train Model On Labelled Data

We are just going to knock out a simple model here.  In a real world problem you will probably try several models and do a little hyper parameter searching.

%% Cell type:code id: tags:

``` python
from deepchem.molnet.load_function import hiv_datasets
```

%% Cell type:code id: tags:

``` python
from deepchem.models import GraphConvModel
from deepchem.data import NumpyDataset
from sklearn.metrics import average_precision_score
import numpy as np

tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv")
train, valid, test = [NumpyDataset.from_DiskDataset(x) for x in all_datasets]
model = GraphConvModel(1, mode="classification")
model.fit(train)
```

%% Output

    Loading dataset from disk.
    Loading dataset from disk.
    Loading dataset from disk.

    /home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

    94.7494862112294

%% Cell type:code id: tags:

``` python
y_true = np.squeeze(valid.y)
y_pred = model.predict(valid)[:,0,1]
print("Average Precision Score:%s" % average_precision_score(y_true, y_pred))
sorted_results = sorted(zip(y_pred, y_true), reverse=True)
hit_rate_100 = sum(x[1] for x in sorted_results[:100]) / 100
print("Hit Rate Top 100: %s" % hit_rate_100)
```

%% Output

    Average Precision Score:0.22277598937482013
    Hit Rate Top 100: 0.33

%% Cell type:markdown id: tags:

## Retrain Model Over Full Dataset For The Screen

%% Cell type:code id: tags:

``` python
tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv", split=None)

model = GraphConvModel(1, mode="classification", model_dir="/tmp/zinc/screen_model")
model.fit(all_datasets[0])
model.save()
```

%% Output

    Loading raw samples now.
    shard_size: 8192
    About to start loading CSV from /tmp/HIV.csv
    Loading shard 1 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 0 took 15.701 s
    Loading shard 2 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 1 took 15.869 s
    Loading shard 3 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 2 took 19.106 s
    Loading shard 4 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 3 took 16.267 s
    Loading shard 5 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 4 took 16.754 s
    Loading shard 6 of size 8192.
    Featurizing sample 0
    TIMING: featurizing shard 5 took 0.446 s
    TIMING: dataset construction took 98.214 s
    Loading dataset from disk.
    TIMING: dataset construction took 21.127 s
    Loading dataset from disk.

    /home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

%% Cell type:markdown id: tags:

# 2. Create Work-Units

1. Download All of ZINC15.

Go to http://zinc15.docking.org/tranches/home and download all non-empty tranches in .smi format.
I found it easiest to download the wget script and then run the wget script.
For the rest of this tutorial I will assume zinc was downloaded to /tmp/zinc.


The way zinc downloads the data isn't great for inference.  We want "Work-Units" which a single CPU can execute that takes a resonable amount of time (10 minutes to an hour).  To accomplish this we are going to split the zinc data into files each with 500 thousand lines.


```bash
mkdir /tmp/zinc/screen
find /tmp/zinc -name '*.smi' -exec cat {} \; | grep -iv "smiles" \
     | split -l 500000 /tmp/zinc/screen/segment
```

This bash command
1. Finds all smi files
2. prints to stdout the contents of the file
3. removes header lines
4. splits into multiple files in /tmp/zinc/screen that are 1 million molecules long
4. splits into multiple files in /tmp/zinc/screen that are 500k molecules long

%% Cell type:markdown id: tags:

## 3. Creat Inference Script

Now that we have work unit we need to construct a program which ingests a work unit and logs the result.  It is important that the logging mechanism is thread safe!
For this example we will get the work unit via a file-path, and log the result to a file.
An easy extensions to distribute over multiple computers would be to get the work unit via a url, and log the results to a distributed queue.

Here is what mine looks like

inference.py
```python
import sys
import deepchem as dc
import numpy as np
from rdkit import Chem
import pickle
import os


def create_dataset(fname, batch_size=50000):
    featurizer = dc.feat.ConvMolFeaturizer()
    fin = open(fname)
    mols, orig_lines = [], []
    for line in fin:
        line = line.strip().split()
        try:
            mol = Chem.MolFromSmiles(line[0])
            if mol is None:
                continue
            mols.append(mol)
            orig_lines.append(line)
        except:
            pass
        if len(mols) > 0 and len(mols) % batch_size == 0:
            features = featurizer.featurize(mols)
            y = np.ones(shape=(len(mols), 1))
            ds = dc.data.NumpyDataset(features, y)
            yield ds, orig_lines
            mols, orig_lines = [], []
    if len(mols) > 0 and len(mols) % batch_size == 0:
    if len(mols) > 0:
        features = featurizer.featurize(mols)
        y = np.ones(shape=(len(mols), 1))
        ds = dc.data.NumpyDataset(features, y)
        yield ds, orig_lines


def evaluate(fname):
    fout_name = "%s_out.smi" % fname
    if os.path.exists(fout_name):
        return
    model = dc.models.TensorGraph.load_from_dir('screen_model')
    for ds, lines in create_dataset(fname):
        y_pred = np.squeeze(model.predict(ds), axis=1)
        with open(fout_name, 'a') as fout:
            for index, line in enumerate(lines):
                line.append(y_pred[index][1])
                line = [str(x) for x in line]
                line = "\t".join(line)
                fout.write("%s\n" % line)


if __name__ == "__main__":
    evaluate(sys.argv[1])
```

%% Cell type:markdown id: tags:

# 4. Load "Work-Unit" into a "Work Queue"

We are going to use a flat file as our distribution mechanism.  It will be a bash script calling our inference script for every work unit.  If you are at an academic institution this would be queing your jobs in pbs/qsub/slurm.  An option for cloud computing would be rabbitmq or kafka.

%% Cell type:code id: tags:

``` python
import os
work_units = os.listdir('/tmp/zinc/screen')
with open('/tmp/zinc/work_queue.sh', 'w') as fout:
    fout.write("#!/bin/bash\n")
    for work_unit in work_units:
        full_path = os.path.join('/tmp/zinc', work_unit)
        fout.write("python inference.py %s" % full_path)
```

%% Cell type:markdown id: tags:

# 5. Consume work units from "distribution mechanism"

We will consume work units from our work queue using a very simple Process Pool.  It takes lines from our "Work Queue" and runs them, running as many processes in parrallel as we have cpus.  If  you are using a supercomputing cluster system like pbs/qsub/slurm it will take care of this for you.  The key is to use one CPU per work unit to get highest throughput.  We accomplish that here using the linux utility "taskset".

Using an c5.18xlarge on aws this will finish overnight.

process_pool.py
```python
import multiprocessing
import sys
from multiprocessing.pool import Pool

import delegator


def run_command(args):
  q, command = args
  cpu_id = q.get()
  try:
    command = "taskset -c %s %s" % (cpu_id, command)
    print("running %s" % command)
    c = delegator.run(command)
    print(c.err)
    print(c.out)
  except Exception as e:
    print(e)
  q.put(cpu_id)


def main(n_processors, command_file):
  commands = [x.strip() for x in open(command_file).readlines()]
  commands = list(filter(lambda x: not x.startswith("#"), commands))
  q = multiprocessing.Manager().Queue()
  for i in range(n_processors):
    q.put(i)
  argslist = [(q, x) for x in commands]
  pool = Pool(processes=n_processors)
  pool.map(run_command, argslist)


if __name__ == "__main__":
  processors = multiprocessing.cpu_count()
  main(processors, sys.argv[1])
```


```bash
>> python process_pool.py /tmp/zinc/work_queue.sh
```

%% Cell type:markdown id: tags:

# 6. Gather Results
Since we logged our results to \*_out.smi we now need to gather all of them up and sort them by our predictions.  The resulting file wile be > 40GB.  To analyze the data further you can use dask, or put the data in a rdkit postgres cartridge.

Here I show how to join the and sort the data to get the "best" results.

```bash
find /tmp/zinc -name '*_out.smi' -exec cat {} \; > /tmp/zinc/screen/results.smi
sort -rg -k 3,3 /tmp/zinc/screen/results.smi > /tmp/zinc/screen/sorted_results.smi
# Put the top 100k scoring molecules in their own file
head -n 50000 /tmp/zinc/screen/sorted_results.smi > /tmp/zinc/screen/top_100k.smi
```

/tmp/zinc/screen/top_100k.smi is now a small enough file to investigate using standard tools like pandas.

%% Cell type:code id: tags:

``` python
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import SVG
from rdkit.Chem.Draw import rdMolDraw2D
best_mols = [Chem.MolFromSmiles(x.strip().split()[0]) for x in open('/tmp/zinc/screen/top_100k.smi').readlines()[:100]]
best_scores = [x.strip().split()[2] for x in open('/tmp/zinc/screen/top_100k.smi').readlines()[:100]]
```

%% Cell type:code id: tags:

``` python
print(best_scores[0])
best_mols[0]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394ebf940>

%% Cell type:code id: tags:

``` python
print(best_scores[0])
best_mols[1]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394e7abc0>

%% Cell type:code id: tags:

``` python
print(best_scores[0])
best_mols[2]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394e818f0>

%% Cell type:code id: tags:

``` python
print(best_scores[0])
best_mols[3]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394e81990>

%% Cell type:markdown id: tags:

The screen seems to favor molecules with one or multiple sulfur trioxides.  The top scoring molecules also have low diversity.  When creating a "buy list" we want to optimize for more things than just activity, for instance diversity and drug like MPO.
The screen seems to favor molecules with one or multiple sulfur trioxides. The top scoring molecules also have low diversity.  When creating a "buy list" we want to optimize for more things than just activity, for instance diversity and drug like MPO.

%% Cell type:code id: tags:

``` python
```