Commit 2e4d1cec authored by leswing

Update

parent 57e538ed
+1 −1
%% Cell type:markdown id: tags:

# Screening Zinc For HIV Inhibition

%% Cell type:markdown id: tags:

In this tutorial I will walk through how to efficiently screen a large compound library (ZINC) with DeepChem.  Screening a large compound library with machine learning is a CPU-bound, pleasingly parallel problem.  The code examples assume the available resource is a single very large machine (like an AWS c5.18xlarge), but they should be readily adaptable to other systems (like a supercomputing cluster).  At a high level, what we will do is:

1. Create a Machine Learning Model Over Labeled Data
2. Transform ZINC into "Work-Units"
3. Create an inference script which runs predictions over a "Work-Unit"
4. Load "Work-Unit" into "distribution mechanism"
5. Consume work units from "distribution mechanism"
6. Gather Results

# 1. Train Model On Labelled Data

We are just going to knock out a simple model here.  In a real-world problem you will probably try several models and do a little hyperparameter searching.

%% Cell type:code id: tags:

``` python
from deepchem.molnet.load_function import hiv_datasets
```

%% Cell type:code id: tags:

``` python
from deepchem.models import GraphConvModel
from deepchem.data import NumpyDataset
from sklearn.metrics import average_precision_score
import numpy as np

tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv")
train, valid, test = [NumpyDataset.from_DiskDataset(x) for x in all_datasets]
model = GraphConvModel(1, mode="classification")
model.fit(train)
```

%% Output

    Loading dataset from disk.
    Loading dataset from disk.
    Loading dataset from disk.

    /home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

    94.7494862112294

%% Cell type:code id: tags:

``` python
y_true = np.squeeze(valid.y)
y_pred = model.predict(valid)[:,0,1]
print("Average Precision Score:%s" % average_precision_score(y_true, y_pred))
sorted_results = sorted(zip(y_pred, y_true), reverse=True)
hit_rate_100 = sum(x[1] for x in sorted_results[:100]) / 100
print("Hit Rate Top 100: %s" % hit_rate_100)
```

%% Output

    Average Precision Score:0.22277598937482013
    Hit Rate Top 100: 0.33

%% Cell type:markdown id: tags:
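
The "Hit Rate Top 100" figure reported for the validation set is the fraction of true actives among the 100 highest-scoring compounds. A minimal sketch of the same calculation as a reusable helper is shown here with made-up scores and labels. The name `hit_rate_at_k` is hypothetical and not part of DeepChem.

```python
def hit_rate_at_k(y_pred, y_true, k):
    """Fraction of true positives among the k highest-scoring examples."""
    # Sort by score only, so ties are not broken by the label.
    ranked = sorted(zip(y_pred, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy data: six compounds, three of them active (label 1).
scores = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
labels = [1, 0, 1, 0, 1, 0]
print(hit_rate_at_k(scores, labels, 3))
```

For a screen, this number is often more informative than a global metric, because only the top of the ranked list is ever followed up experimentally.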

## Retrain Model Over Full Dataset For The Screen

%% Cell type:code id: tags:

``` python
tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv", split=None)

model = GraphConvModel(1, mode="classification", model_dir="/tmp/zinc/screen_model")
model.fit(all_datasets[0])
model.save()
```

%% Output

    Loading raw samples now.
    shard_size: 8192
    About to start loading CSV from /tmp/HIV.csv
    Loading shard 1 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 0 took 15.701 s
    Loading shard 2 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 1 took 15.869 s
    Loading shard 3 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 2 took 19.106 s
    Loading shard 4 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 3 took 16.267 s
    Loading shard 5 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 4 took 16.754 s
    Loading shard 6 of size 8192.
    Featurizing sample 0
    TIMING: featurizing shard 5 took 0.446 s
    TIMING: dataset construction took 98.214 s
    Loading dataset from disk.
    TIMING: dataset construction took 21.127 s
    Loading dataset from disk.

    /home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

%% Cell type:markdown id: tags:

# 2. Create Work-Units

1. Download All of ZINC15.

Go to http://zinc15.docking.org/tranches/home and download all non-empty tranches in .smi format.
I found it easiest to download the wget script and then run it.
For the rest of this tutorial I will assume ZINC was downloaded to /tmp/zinc.


The way ZINC packages its downloads isn't great for inference.  We want "Work-Units" that a single CPU can process in a reasonable amount of time (10 minutes to an hour).  To accomplish this we are going to split the ZINC data into files of 500 thousand lines each.


```bash
mkdir -p /tmp/zinc/screen
find /tmp/zinc -name '*.smi' -exec cat {} \; | grep -iv "smiles" \
     | split -l 500000 - /tmp/zinc/screen/segment
```

This bash command
1. finds all .smi files
2. prints the contents of each file to stdout
3. removes header lines
4. splits the result into multiple files in /tmp/zinc/screen, each 500 thousand molecules long
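
For readers who prefer to stay in Python, the same chunking can be sketched with the standard library. This is an illustration rather than the command used in this tutorial: the function name `write_work_units` and the `segment%05d` file naming are assumptions, and the header filter mirrors `grep -iv "smiles"`.

```python
import itertools
import os


def write_work_units(smi_paths, out_dir, lines_per_unit=500000):
    """Concatenate .smi files, drop header lines, and write fixed-size chunks."""
    os.makedirs(out_dir, exist_ok=True)

    def molecule_lines():
        for path in smi_paths:
            with open(path) as fin:
                for line in fin:
                    # Same filter as `grep -iv "smiles"`: drop header lines.
                    if "smiles" not in line.lower():
                        yield line

    lines = molecule_lines()
    written = []
    for index in itertools.count():
        # Pull the next chunk lazily so the whole library never sits in memory.
        chunk = list(itertools.islice(lines, lines_per_unit))
        if not chunk:
            break
        out_path = os.path.join(out_dir, "segment%05d" % index)
        with open(out_path, "w") as fout:
            fout.writelines(chunk)
        written.append(out_path)
    return written
```

Calling `write_work_units(paths, "/tmp/zinc/screen")` with a list of downloaded .smi paths would produce one file per work unit, with the last file holding whatever lines remain.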

%% Cell type:markdown id: tags:

# 3. Create Inference Script

Now that we have work units, we need to construct a program which ingests a work unit and logs the result.
For this example we will get the work unit via a file path and log the result to a file.
An easy extension to distribute over multiple computers would be to get the work unit via a URL and log the results to a distributed queue.

Here is what mine looks like:

inference.py
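```python
import sys

import deepchem as dc
import numpy as np
from rdkit import Chem


def create_dataset(lines, batch_size=50000):
    """Yield (dataset, original lines) for each batch of parseable SMILES."""
    featurizer = dc.feat.ConvMolFeaturizer()
    for i in range(0, len(lines), batch_size):
        chunk = lines[i:i+batch_size]
        mols, orig_lines = [], []
        for line in chunk:
            try:
                mol = Chem.MolFromSmiles(line[0])
                if mol is None:
                    continue
                orig_lines.append(line)
                mols.append(mol)
            except Exception:
                # Skip blank or malformed lines.
                pass
        if not mols:
            continue
        features = featurizer.featurize(mols)
        # Labels are placeholders; only the features matter at inference time.
        ds = dc.data.NumpyDataset(features, np.ones(len(features)))
        yield ds, orig_lines


def evaluate(fname):
    lines = [x.strip().split() for x in open(fname).readlines()]
    # Assumption: restore the model saved in step 1 and append the predicted
    # probability as a third column, matching the *_out.smi files gathered
    # in step 6. Input lines are assumed to be "<smiles> <id>".
    model = dc.models.GraphConvModel(
        1, mode="classification", model_dir="/tmp/zinc/screen_model")
    model.restore()
    with open("%s_out.smi" % fname, "w") as fout:
        for ds, orig_lines in create_dataset(lines):
            y_pred = model.predict(ds)[:, 0, 1]
            for line, score in zip(orig_lines, y_pred):
                fout.write("%s\t%.6f\n" % ("\t".join(line), score))


if __name__ == "__main__":
    evaluate(sys.argv[1])
```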

%% Cell type:markdown id: tags:

# 4. Load "Work-Unit" into "distribution mechanism"

We are going to use a flat file as our distribution mechanism.  It will be a bash script calling our inference script for every work unit.

%% Cell type:code id: tags:

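``` python
import os

# One line per work unit; each line is a command that runs independently.
work_units = sorted(os.listdir('/tmp/zinc/screen'))
with open('/tmp/zinc/driver.sh', 'w') as fout:
    fout.write("#!/bin/bash\n")
    fout.write("export PATH=%s\n" % os.environ["PATH"])
    for work_unit in work_units:
        # Work units live in /tmp/zinc/screen, the directory listed above.
        full_path = os.path.join('/tmp/zinc/screen', work_unit)
        fout.write("python inference.py %s\n" % full_path)
```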

%% Cell type:markdown id: tags:

# 5. Consume work units from "distribution mechanism"

We will consume work units from our flat file using a very simple process_pool.  It takes lines from our "distribution mechanism" and runs them, running as many processes in parallel as we have CPUs.  While tensorflow-cpu does parallelize some matrix operations across multiple CPUs, we can get better throughput by pinning each process to a single CPU using taskset.

process_pool.py
```python
import multiprocessing
import sys
from multiprocessing.pool import Pool

import delegator


def run_command(args):
  q, command = args
  cpu_id = q.get()
  try:
    command = "taskset -c %s %s" % (cpu_id, command)
    print("running %s" % command)
    c = delegator.run(command)
    print(c.err)
    print(c.out)
  except Exception as e:
    print(e)
  q.put(cpu_id)


def main(n_processors, command_file):
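  commands = [x.strip() for x in open(command_file).readlines()]
  # Skip blank lines, comments (including the shebang), and `export` lines:
  # `export` is a shell builtin, so it cannot be launched through taskset.
  commands = [x for x in commands
              if x and not x.startswith(("#", "export "))]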
  q = multiprocessing.Manager().Queue()
  for i in range(n_processors):
    q.put(i)
  argslist = [(q, x) for x in commands]
  pool = Pool(processes=n_processors)
  pool.map(run_command, argslist)


if __name__ == "__main__":
  processors = multiprocessing.cpu_count()
  main(processors, sys.argv[1])
```


```bash
python process_pool.py /tmp/zinc/driver.sh
```
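
The key idea in `process_pool.py` is that the queue holds one token per CPU: a worker takes a CPU id before starting a command and returns it when done, so no two running commands are pinned to the same core. The sketch below demonstrates just that pattern, using threads and a short sleep as a stand-in for the `taskset` call. The names `run_with_cpu_token`, `in_use`, and `collisions` are illustrative only.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

N_CPUS = 4
cpu_tokens = queue.Queue()
for cpu_id in range(N_CPUS):
    cpu_tokens.put(cpu_id)

in_use = set()          # CPU ids currently held by a running task
lock = threading.Lock()
collisions = []         # filled in if two tasks ever hold the same CPU id


def run_with_cpu_token(task_id):
    cpu_id = cpu_tokens.get()          # blocks until a CPU id is free
    try:
        with lock:
            if cpu_id in in_use:
                collisions.append((task_id, cpu_id))
            in_use.add(cpu_id)
        time.sleep(0.01)               # stand-in for the real command
        return task_id, cpu_id
    finally:
        with lock:
            in_use.discard(cpu_id)
        cpu_tokens.put(cpu_id)         # always hand the CPU id back


with ThreadPoolExecutor(max_workers=N_CPUS) as pool:
    results = list(pool.map(run_with_cpu_token, range(20)))

print("collisions:", collisions)
```

Returning the token in a `finally` block means a failing task cannot leak a CPU id, which would otherwise leave one core idle for the rest of the run.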

%% Cell type:markdown id: tags:

# 6. Gather Results
Since we logged our results to \*_out.smi, we now need to gather all of them up and sort them by our predictions.

```bash
find /tmp/zinc -name '*_out.smi' -exec cat {} \; | sort -rn -k 3,3 > /tmp/zinc/screen/sorted_results.smi
# Put the top 100k scoring molecules in their own file
head -n 100000 /tmp/zinc/screen/sorted_results.smi > /tmp/zinc/screen/top_100k.smi
```

/tmp/zinc/screen/top_100k.smi is now a small enough file to investigate using standard tools like pandas.
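
If the combined results are too large to sort comfortably, the top-scoring rows can also be selected in a single streaming pass. Here is a minimal sketch using only the standard library, assuming each result line is whitespace-separated with the score in the third column (the same column the `sort` command keys on). `top_k_lines` is a hypothetical helper name, and the example rows are made up.

```python
import heapq


def top_k_lines(lines, k, score_column=2):
    """Return the k lines with the highest score, best first."""
    def score(line):
        return float(line.split()[score_column])

    # nlargest keeps only k candidates in memory while streaming over lines.
    return heapq.nlargest(k, (l for l in lines if l.strip()), key=score)


# Made-up rows in "<smiles> <id> <score>" form.
example = [
    "CCO mol_a 0.12\n",
    "c1ccccc1 mol_b 0.87\n",
    "CC(=O)O mol_c 3.1e-05\n",
    "CCN mol_d 0.55\n",
]
for line in top_k_lines(example, 2):
    print(line, end="")
```

Because `float()` parses scientific notation, very small scores such as `3.1e-05` rank correctly at the bottom. Passing an open file object as `lines` works the same way as passing a list.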
