Update (2e4d1cec) · Commits · 钟慕尧 / deepchem

examples/notebooks/Large_Scale_Chemical_Screens.ipynb

+1 −1

Original line number	Diff line number	Diff line
		%% Cell type:markdown id: tags:

		# Screening Zinc For HIV Inhibition

		%% Cell type:markdown id: tags:

		In this tutorial I will walk through how to efficiently screen a large compound library with DeepChem (ZINC). Screening a large compound library using machine learning is a CPU bound pleasingly parrellel problem. The actual code examples I will use assume the resources available are a single very big machine (like an AWS c5.18xlarge), but should be readily swappable for other systmes (like a super computing cluster). At a high level what we will do is...

		1. Create a Machine Learning Model Over Labeled Data
		2. Transform ZINC into "Work-Units"
		3. Create an inference script which runs predictions over a "Work-Unit"
		4. Load "Work-Unit" into "distribution mechanism"
		5. Consume work units from "distribution mechanism"
		6. Gather Results

		# 1. Train Model On Labelled Data

		We are just going to knock out a simple model here. In a real world problem you will probably try several models and do a little hyper parameter searching.

		%% Cell type:code id: tags:

		``` python
		from deepchem.molnet.load_function import hiv_datasets
		```

		%% Cell type:code id: tags:

		``` python
		from deepchem.models import GraphConvModel
		from deepchem.data import NumpyDataset
		from sklearn.metrics import average_precision_score
		import numpy as np

		tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv")
		train, valid, test = [NumpyDataset.from_DiskDataset(x) for x in all_datasets]
		model = GraphConvModel(1, mode="classification")
		model.fit(train)
		```

		%% Output

		Loading dataset from disk.
		Loading dataset from disk.
		Loading dataset from disk.

		/home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
		"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

		94.7494862112294

		%% Cell type:code id: tags:

		``` python
		y_true = np.squeeze(valid.y)
		y_pred = model.predict(valid)[:,0,1]
		print("Average Precision Score:%s" % average_precision_score(y_true, y_pred))
		sorted_results = sorted(zip(y_pred, y_true), reverse=True)
		hit_rate_100 = sum(x[1] for x in sorted_results[:100]) / 100
		print("Hit Rate Top 100: %s" % hit_rate_100)
		```

		%% Output

		Average Precision Score:0.22277598937482013
		Hit Rate Top 100: 0.33

		%% Cell type:markdown id: tags:

		## Retrain Model Over Full Dataset For The Screen

		%% Cell type:code id: tags:

		``` python
		tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv", split=None)

		model = GraphConvModel(1, mode="classification", model_dir="/tmp/zinc/screen_model")
		model.fit(all_datasets[0])
		model.save()
		```

		%% Output

		Loading raw samples now.
		shard_size: 8192
		About to start loading CSV from /tmp/HIV.csv
		Loading shard 1 of size 8192.
		Featurizing sample 0
		Featurizing sample 1000
		Featurizing sample 2000
		Featurizing sample 3000
		Featurizing sample 4000
		Featurizing sample 5000
		Featurizing sample 6000
		Featurizing sample 7000
		Featurizing sample 8000
		TIMING: featurizing shard 0 took 15.701 s
		Loading shard 2 of size 8192.
		Featurizing sample 0
		Featurizing sample 1000
		Featurizing sample 2000
		Featurizing sample 3000
		Featurizing sample 4000
		Featurizing sample 5000
		Featurizing sample 6000
		Featurizing sample 7000
		Featurizing sample 8000
		TIMING: featurizing shard 1 took 15.869 s
		Loading shard 3 of size 8192.
		Featurizing sample 0
		Featurizing sample 1000
		Featurizing sample 2000
		Featurizing sample 3000
		Featurizing sample 4000
		Featurizing sample 5000
		Featurizing sample 6000
		Featurizing sample 7000
		Featurizing sample 8000
		TIMING: featurizing shard 2 took 19.106 s
		Loading shard 4 of size 8192.
		Featurizing sample 0
		Featurizing sample 1000
		Featurizing sample 2000
		Featurizing sample 3000
		Featurizing sample 4000
		Featurizing sample 5000
		Featurizing sample 6000
		Featurizing sample 7000
		Featurizing sample 8000
		TIMING: featurizing shard 3 took 16.267 s
		Loading shard 5 of size 8192.
		Featurizing sample 0
		Featurizing sample 1000
		Featurizing sample 2000
		Featurizing sample 3000
		Featurizing sample 4000
		Featurizing sample 5000
		Featurizing sample 6000
		Featurizing sample 7000
		Featurizing sample 8000
		TIMING: featurizing shard 4 took 16.754 s
		Loading shard 6 of size 8192.
		Featurizing sample 0
		TIMING: featurizing shard 5 took 0.446 s
		TIMING: dataset construction took 98.214 s
		Loading dataset from disk.
		TIMING: dataset construction took 21.127 s
		Loading dataset from disk.

		/home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
		"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

		%% Cell type:markdown id: tags:

		# 2. Create Work-Units

		1. Download All of ZINC15.

		Go to http://zinc15.docking.org/tranches/home and download all non-empty tranches in .smi format.
		I found it easiest to download the wget script and then run the wget script.
		For the rest of this tutorial I will assume zinc was downloaded to /tmp/zinc.


		The way zinc downloads the data isn't great for inference. We want "Work-Units" which a single CPU can execute that takes a resonable amount of time (10 minutes to an hour). To accomplish this we are going to split the zinc data into files each with 500 thousand lines.


		```bash
		mkdir /tmp/zinc/screen
		find /tmp/zinc -name '*.smi' -exec cat {} \; \| grep -iv "smiles" \
		\| split -l 500000 /tmp/zinc/screen/segment
		```

		This bash command
		1. Finds all smi files
		2. prints to stdout the contents of the file
		3. removes header lines
		4. splits into multiple files in /tmp/zinc/screen that are 1 million molecules long

		%% Cell type:markdown id: tags:

		## 3. Creat Inference Script

		Now that we have work unit we need to construct a program which ingests a work unit and logs the result.
		For this example we will get the work unit via a file-path, and log the result to a file.
		An easy extensions to distribute over multiple computers would be to get the work unit via a url, and log the results to a distributed queue.

		Here is what mine looks like

		inference.py
		```python
		import deepchem as dc

		def create_dataset(lines, batch_size=50000):
		featurizer = dc.feat.ConvMolFeaturizer()
		for i in range(0, len(lines), batch_size):
		chunk = lines[i:i+batch_size]
		mols, orig_lines = [], []
		for line in chunk:
		try:
		mol = Chem.MolFromSmiles(line[0])
		if mol is None:
		continue
		orig_lines.append(line)
		mols.append(mol)
		except:
		pass
		features = featurizer.featurize(mols)
		ds = dc.data.NumpyDataset(features, np.ones(len(features))
		yield ds, orig_lines



		def evaluate(fname):
		lines = [x.strip().split() for x in open(fname).readlines()]

		if __name__ == "__main__":
		evaluate(sys.argv[1])
		```

		%% Cell type:markdown id: tags:

		# 4. Load "Work-Unit" into "distribution mechanism"

		We are going to use a flat file as our distribution mechanism. It will be a bash script calling our inference script for every work unit.

		%% Cell type:code id: tags:

		``` python
		import os
		work_units = os.listdir('/tmp/zinc/screen')
		with open('/tmp/zinc/driver.sh', 'w') as fout:
		fout.write("#!/bin/bash\n")
		fout.write("export PATH=%s" % os.environ["PATH"])
		for work_unit in work_units:
		full_path = os.path.join('/tmp/zinc', work_unit)
		fout.write("python inference.py %s" % full_path)
		```

		%% Cell type:markdown id: tags:

		# 5. Consume work units from "distribution mechanism"

		We will consume work units from our flat file using a very simple process_pool. It takes lines from our "distribution mechanism" and runs them, running as many processes in parrallel as we have cpus.
		We will consume work units from our flat file using a very simple process_pool. It takes lines from our "distribution mechanism" and runs them, running as many processes in parrallel as we have cpus. While tensorflow-cpu does parallelize to multiple cpus for some matrix operations we can get better throughput by pinning each process to a single cpu using taskset.

		process_pool.py
		```python
		import multiprocessing
		import sys
		from multiprocessing.pool import Pool

		import delegator


		def run_command(args):
		q, command = args
		cpu_id = q.get()
		try:
		command = "taskset -c %s %s" % (cpu_id, command)
		print("running %s" % command)
		c = delegator.run(command)
		print(c.err)
		print(c.out)
		except Exception as e:
		print(e)
		q.put(cpu_id)


		def main(n_processors, command_file):
		commands = [x.strip() for x in open(command_file).readlines()]
		commands = list(filter(lambda x: not x.startswith("#"), commands))
		q = multiprocessing.Manager().Queue()
		for i in range(n_processors):
		q.put(i)
		argslist = [(q, x) for x in commands]
		pool = Pool(processes=n_processors)
		pool.map(run_command, argslist)


		if __name__ == "__main__":
		processors = multiprocessing.cpu_count()
		main(processors, sys.argv[1])
		```


		```bash
		>> python process_pool.py /tmp/zinc/driver.sh
		```

		%% Cell type:markdown id: tags:

		# 6. Gather Results
		Since we logged our results to \*_out.smi we now need to gather all of them up and sort them by our predictions

		```bash
		find /tmp/zinc -name '*_out.smi' -exec cat {} \; \| sort -rn -k 3,3 > /tmp/zinc/screen/sorted_results.smi
		# Put the top 100k scoring molecules in their own file
		head -n 50000 /tmp/zinc/screen/sorted_results. > /tmp/zinc/screen/top_100k.smi
		```

		/tmp/zinc/screen/top_100k.smi is now a small enough file to investigate using standard tools like pandas.

		%% Cell type:code id: tags:

		``` python
		```

Admin message