Commit 4984f61e authored by Peter Eastman

Minor fixes to notebooks
%% Cell type:markdown id: tags:

# Screening Zinc For HIV Inhibition

%% Cell type:markdown id: tags:

In this tutorial I will walk through how to efficiently screen a large compound library (ZINC) with DeepChem.  Screening a large compound library using machine learning is a CPU-bound, pleasingly parallel problem.  The code examples assume the resources available are a single very big machine (like an AWS c5.18xlarge), but should be readily adaptable to other systems (like a supercomputing cluster).  At a high level, we will:

1. Create a Machine Learning Model Over Labeled Data
2. Transform ZINC into "Work-Units"
3. Create an inference script which runs predictions over a "Work-Unit"
4. Load "Work-Unit" into a "Work Queue"
5. Consume work units from "Work Queue"
6. Gather Results

# 1. Train Model On Labeled Data

We are just going to knock out a simple model here.  In a real-world problem you would probably try several models and do a little hyperparameter searching.

%% Cell type:code id: tags:

``` python
from deepchem.molnet.load_function import hiv_datasets
```

%% Cell type:code id: tags:

``` python
from deepchem.models import GraphConvModel
from deepchem.data import NumpyDataset
from sklearn.metrics import average_precision_score
import numpy as np

tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv")
train, valid, test = [NumpyDataset.from_DiskDataset(x) for x in all_datasets]
model = GraphConvModel(1, mode="classification")
model.fit(train)
```

%% Output

    Loading dataset from disk.
    Loading dataset from disk.
    Loading dataset from disk.

    /home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

    94.7494862112294

%% Cell type:code id: tags:

``` python
y_true = np.squeeze(valid.y)
y_pred = model.predict(valid)[:,0,1]
print("Average Precision Score:%s" % average_precision_score(y_true, y_pred))
sorted_results = sorted(zip(y_pred, y_true), reverse=True)
hit_rate_100 = sum(x[1] for x in sorted_results[:100]) / 100
print("Hit Rate Top 100: %s" % hit_rate_100)
```

%% Output

    Average Precision Score:0.22277598937482013
    Hit Rate Top 100: 0.33

%% Cell type:markdown id: tags:

## Retrain Model Over Full Dataset For The Screen

%% Cell type:code id: tags:

``` python
tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv", split=None)

model = GraphConvModel(1, mode="classification", model_dir="/tmp/zinc/screen_model")
model.fit(all_datasets[0])
model.save()
```

%% Output

    Loading raw samples now.
    shard_size: 8192
    About to start loading CSV from /tmp/HIV.csv
    Loading shard 1 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 0 took 15.701 s
    Loading shard 2 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 1 took 15.869 s
    Loading shard 3 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 2 took 19.106 s
    Loading shard 4 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 3 took 16.267 s
    Loading shard 5 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    Featurizing sample 2000
    Featurizing sample 3000
    Featurizing sample 4000
    Featurizing sample 5000
    Featurizing sample 6000
    Featurizing sample 7000
    Featurizing sample 8000
    TIMING: featurizing shard 4 took 16.754 s
    Loading shard 6 of size 8192.
    Featurizing sample 0
    TIMING: featurizing shard 5 took 0.446 s
    TIMING: dataset construction took 98.214 s
    Loading dataset from disk.
    TIMING: dataset construction took 21.127 s
    Loading dataset from disk.

    /home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

%% Cell type:markdown id: tags:

# 2. Create Work-Units

1. Download All of ZINC15.

Go to http://zinc15.docking.org/tranches/home and download all non-empty tranches in .smi format.
I found it easiest to download the wget script and then run it.
For the rest of this tutorial I will assume ZINC was downloaded to /tmp/zinc.


The way ZINC serves the data isn't great for inference.  We want "Work-Units" that a single CPU can execute in a reasonable amount of time (10 minutes to an hour).  To accomplish this we are going to split the ZINC data into files of 500 thousand lines each.


```bash
mkdir /tmp/zinc/screen
find /tmp/zinc -name '*.smi' -exec cat {} \; | grep -iv "smiles" \
     | split -l 500000 /tmp/zinc/screen/segment
```

This bash command
1. Finds all .smi files
2. Prints the contents of each file to stdout
3. Removes header lines
4. Splits the stream into multiple files in /tmp/zinc/screen, 500k molecules each
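
The same chunking logic can be sketched in plain Python (a minimal illustration with a hypothetical `chunk_lines` helper and a toy chunk size; the bash pipeline above is what you would actually run):

``` python
import itertools

def chunk_lines(lines, chunk_size=500000):
    """Yield successive lists of at most chunk_size lines each."""
    it = iter(lines)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# Toy demonstration: 7 "molecules" with a chunk size of 3 -> chunks of 3, 3, 1
lines = ["C" * (i + 1) + " ZINC%08d" % i for i in range(7)]
chunks = list(chunk_lines(lines, chunk_size=3))
```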

%% Cell type:markdown id: tags:

## 3. Create Inference Script

Now that we have work units, we need a program which ingests a work unit and logs the result.  It is important that the logging mechanism is thread safe!
For this example we will receive the work unit as a file path, and log the result to a file.
An easy extension for distributing over multiple computers would be to fetch the work unit via a URL, and log the results to a distributed queue.

Here is what mine looks like

inference.py
```python
import sys
import deepchem as dc
import numpy as np
from rdkit import Chem
import pickle
import os


def create_dataset(fname, batch_size=50000):
    featurizer = dc.feat.ConvMolFeaturizer()
    fin = open(fname)
    mols, orig_lines = [], []
    for line in fin:
        line = line.strip().split()
        try:
            mol = Chem.MolFromSmiles(line[0])
            if mol is None:
                continue
            mols.append(mol)
            orig_lines.append(line)
        except Exception:
            pass
        if len(mols) > 0 and len(mols) % batch_size == 0:
            features = featurizer.featurize(mols)
            y = np.ones(shape=(len(mols), 1))
            ds = dc.data.NumpyDataset(features, y)
            yield ds, orig_lines
            mols, orig_lines = [], []
    if len(mols) > 0:
        features = featurizer.featurize(mols)
        y = np.ones(shape=(len(mols), 1))
        ds = dc.data.NumpyDataset(features, y)
        yield ds, orig_lines


def evaluate(fname):
    fout_name = "%s_out.smi" % fname
    model = dc.models.TensorGraph.load_from_dir('screen_model')
    for ds, lines in create_dataset(fname):
        y_pred = np.squeeze(model.predict(ds), axis=1)
        with open(fout_name, 'a') as fout:
            for index, line in enumerate(lines):
                line.append(y_pred[index][1])
                line = [str(x) for x in line]
                line = "\t".join(line)
                fout.write("%s\n" % line)


if __name__ == "__main__":
    evaluate(sys.argv[1])
```

%% Cell type:markdown id: tags:

# 4. Load "Work-Unit" into a "Work Queue"

We are going to use a flat file as our distribution mechanism: a bash script calling our inference script for every work unit.  At an academic institution this would mean queueing your jobs with PBS/qsub/Slurm.  Options for cloud computing include RabbitMQ or Kafka.

%% Cell type:code id: tags:

``` python
import os
work_units = os.listdir('/tmp/zinc/screen')
with open('/tmp/zinc/work_queue.sh', 'w') as fout:
    fout.write("#!/bin/bash\n")
    for work_unit in work_units:
        full_path = os.path.join('/tmp/zinc/screen', work_unit)
        fout.write("python inference.py %s\n" % full_path)
```

%% Cell type:markdown id: tags:

# 5. Consume work units from "distribution mechanism"

We will consume work units from our work queue using a very simple process pool.  It takes lines from our "Work Queue" and runs them, running as many processes in parallel as we have CPUs.  If you are using a supercomputing cluster scheduler like PBS/qsub/Slurm, it will take care of this for you.  The key to highest throughput is to use one CPU per work unit, which we accomplish here with the Linux utility "taskset".

Using a c5.18xlarge on AWS, this will finish overnight.

process_pool.py
```python
import multiprocessing
import sys
from multiprocessing.pool import Pool

import delegator


def run_command(args):
  q, command = args
  cpu_id = q.get()
  try:
    command = "taskset -c %s %s" % (cpu_id, command)
    print("running %s" % command)
    c = delegator.run(command)
    print(c.err)
    print(c.out)
  except Exception as e:
    print(e)
  q.put(cpu_id)


def main(n_processors, command_file):
  commands = [x.strip() for x in open(command_file).readlines()]
  commands = list(filter(lambda x: not x.startswith("#"), commands))
  q = multiprocessing.Manager().Queue()
  for i in range(n_processors):
    q.put(i)
  argslist = [(q, x) for x in commands]
  pool = Pool(processes=n_processors)
  pool.map(run_command, argslist)


if __name__ == "__main__":
  processors = multiprocessing.cpu_count()
  main(processors, sys.argv[1])
```


```bash
>> python process_pool.py /tmp/zinc/work_queue.sh
```

%% Cell type:markdown id: tags:

# 6. Gather Results
Since we logged our results to \*_out.smi files, we now need to gather them all up and sort them by our predictions.  The resulting file will be > 40GB.  To analyze the data further you can use dask, or put the data in an RDKit PostgreSQL cartridge.

Here I show how to join and sort the data to get the "best" results.

```bash
find /tmp/zinc -name '*_out.smi' -exec cat {} \; > /tmp/zinc/screen/results.smi
sort -rg -k 3,3 /tmp/zinc/screen/results.smi > /tmp/zinc/screen/sorted_results.smi
# Put the top 100k scoring molecules in their own file
head -n 100000 /tmp/zinc/screen/sorted_results.smi > /tmp/zinc/screen/top_100k.smi
```

/tmp/zinc/screen/top_100k.smi is now a small enough file to investigate using standard tools like pandas.
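
As a minimal sketch of that investigation, assuming the three tab-separated columns written by the inference script (SMILES, ZINC ID, score), the results can be loaded with the standard library; `pandas.read_csv(..., sep="\t")` works equally well.  The demo file and `load_results` helper here are illustrative stand-ins:

``` python
import csv
import tempfile

def load_results(path, limit=None):
    """Read tab-separated (smiles, zinc_id, score) rows as written by inference.py."""
    rows = []
    with open(path) as fin:
        for i, row in enumerate(csv.reader(fin, delimiter="\t")):
            if limit is not None and i >= limit:
                break
            rows.append((row[0], row[1], float(row[2])))
    return rows

# Tiny stand-in for /tmp/zinc/screen/top_100k.smi
demo = tempfile.NamedTemporaryFile("w", suffix=".smi", delete=False)
demo.write("CCO\tZINC000001\t0.91\nCCN\tZINC000002\t0.85\n")
demo.close()
rows = load_results(demo.name, limit=100)
```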

%% Cell type:code id: tags:

``` python
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import SVG
from rdkit.Chem.Draw import rdMolDraw2D
with open('/tmp/zinc/screen/top_100k.smi') as fin:
    lines = [x.strip().split() for x in fin.readlines()[:100]]
best_mols = [Chem.MolFromSmiles(x[0]) for x in lines]
best_scores = [x[2] for x in lines]
```

%% Cell type:code id: tags:

``` python
print(best_scores[0])
best_mols[0]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394ebf940>

%% Cell type:code id: tags:

``` python
print(best_scores[1])
best_mols[1]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394e7abc0>

%% Cell type:code id: tags:

``` python
print(best_scores[2])
best_mols[2]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394e818f0>

%% Cell type:code id: tags:

``` python
print(best_scores[3])
best_mols[3]
```

%% Output

    0.98874843

    <rdkit.Chem.rdchem.Mol at 0x7f9394e81990>

%% Cell type:markdown id: tags:

The screen seems to favor molecules with one or more sulfur trioxide groups.  The top-scoring molecules also have low diversity.  When creating a "buy list" we want to optimize for more things than just activity, for instance diversity and drug-like MPO.
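
One simple way to trade activity against diversity when assembling such a list is a greedy pick: accept the highest-scoring molecule whose similarity to everything already accepted stays below a cutoff.  This is a sketch with toy bit-set fingerprints, not a DeepChem API; real code would compute RDKit fingerprints and Tanimoto similarity over them:

``` python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as bit-index sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def diverse_pick(candidates, max_sim=0.4, n_picks=2):
    """Greedily keep high scorers that are dissimilar to everything kept so far.

    candidates: (score, fingerprint_bits) pairs, sorted by score descending.
    """
    picked = []
    for score, fp in candidates:
        if all(tanimoto(fp, kept_fp) < max_sim for _, kept_fp in picked):
            picked.append((score, fp))
        if len(picked) == n_picks:
            break
    return picked

# Toy fingerprints: the top two candidates are near-duplicates
candidates = [(0.99, {1, 2, 3}), (0.98, {1, 2, 4}), (0.90, {7, 8, 9})]
picked = diverse_pick(candidates, max_sim=0.4, n_picks=2)
```

The second candidate (score 0.98) is skipped because it shares most of its bits with the first; the more distinct third candidate is kept instead.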

%% Cell type:code id: tags:

``` python
```
%% Cell type:markdown id: tags:

SeqToSeq Fingerprint
--------------------

In this example, we will use a `SeqToSeq` model to generate fingerprints for classifying molecules.  This is based on the following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).

Many types of models require their inputs to have a fixed shape.  Since molecules can vary widely in the numbers of atoms and bonds they contain, this makes it hard to apply those models to them.  We need a way of generating a fixed length "fingerprint" for each molecule.  Various ways of doing this have been designed, such as Extended-Connectivity Fingerprints (ECFPs).  But in this example, instead of designing a fingerprint by hand, we will let a `SeqToSeq` model learn its own method of creating fingerprints.

A `SeqToSeq` model performs sequence to sequence translation.  For example, they are often used to translate text from one language to another.  It consists of two parts called the "encoder" and "decoder".  The encoder is a stack of recurrent layers.  The input sequence is fed into it, one token at a time, and it generates a fixed length vector called the "embedding vector".  The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input, and generates the output sequence.  By training it on appropriately chosen input/output pairs, you can create a model that performs many sorts of transformations.

In this case, we will use SMILES strings describing molecules as the input sequences.  We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences.  For that to work, the encoder must create embedding vectors that contain all information from the original sequence.  That's exactly what we want in a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!

Let's start by loading the data.  We will use the MUV dataset.  It includes 74,501 molecules in the training set, and 9313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.

%% Cell type:code id: tags:

``` python
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_muv()
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids
valid_smiles = valid_dataset.ids
```

%% Output

    About to load MUV dataset.
    Loading dataset from disk.
    Loading dataset from disk.
    Loading dataset from disk.

%% Cell type:markdown id: tags:

We need to define the "alphabet" for our `SeqToSeq` model, the list of all tokens that can appear in sequences.  (It's also possible for input and output sequences to have different alphabets, but since we're training it as an autoencoder, they're identical in this case.)  Make a list of every character that appears in any training sequence.

%% Cell type:code id: tags:

``` python
tokens = set()
for s in train_smiles:
  tokens = tokens.union(set(c for c in s))
tokens = sorted(list(tokens))
```

%% Cell type:markdown id: tags:

Create the model and define the optimization method to use.  In this case, learning works much better if we gradually decrease the learning rate.  We use an `ExponentialDecay` to multiply the learning rate by 0.9 after each epoch.
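
Assuming `ExponentialDecay(initial_rate, decay_rate, decay_steps)` applies a factor of `decay_rate` once every `decay_steps` training steps (one epoch here), the schedule has a simple closed form.  This is a staircase sketch (DeepChem's implementation may interpolate continuously), and `decay_steps=1000` is a hypothetical stand-in for `batches_per_epoch`:

``` python
def decayed_lr(step, initial_rate=0.004, decay_rate=0.9, decay_steps=1000):
    # Staircase schedule: drop the rate by a factor of decay_rate each epoch
    return initial_rate * decay_rate ** (step // decay_steps)

# Learning rate at the start of the first three epochs
epoch_rates = [decayed_lr(epoch * 1000) for epoch in range(3)]
```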

%% Cell type:code id: tags:

``` python
from deepchem.models.optimizers import Adam, ExponentialDecay
max_length = max(len(s) for s in train_smiles)
model = dc.models.SeqToSeq(tokens,
                           tokens,
                           max_length,
                           encoder_layers=2,
                           decoder_layers=2,
                           embedding_dimension=256,
                           model_dir='fingerprint')
batches_per_epoch = len(train_smiles) / model.batch_size
model.set_optimizer(Adam(learning_rate=ExponentialDecay(0.004, 0.9, batches_per_epoch)))
```

%% Cell type:markdown id: tags:

Let's train it!  The input to `fit_sequences()` is a generator that produces input/output pairs.  On a good GPU, this should take a few hours or less.

%% Cell type:code id: tags:

``` python
def generate_sequences(epochs):
  for i in range(epochs):
    for s in train_smiles:
      yield (s, s)

model.fit_sequences(generate_sequences(40))
```

%% Output

    Ending global_step 999: Average loss 72.0029
    Ending global_step 1999: Average loss 40.7221
    Ending global_step 2999: Average loss 31.5364
    Ending global_step 3999: Average loss 26.4576
    Ending global_step 4999: Average loss 22.814
    Ending global_step 5999: Average loss 19.5248
    Ending global_step 6999: Average loss 16.4594
    Ending global_step 7999: Average loss 18.8898
    Ending global_step 8999: Average loss 13.476
    Ending global_step 9999: Average loss 11.5528
    Ending global_step 10999: Average loss 10.1594
    Ending global_step 11999: Average loss 10.6434
    Ending global_step 12999: Average loss 6.57057
    Ending global_step 13999: Average loss 6.46177
    Ending global_step 14999: Average loss 7.53559
    Ending global_step 15999: Average loss 4.95809
    Ending global_step 16999: Average loss 4.35039
    Ending global_step 17999: Average loss 3.39137
    Ending global_step 18999: Average loss 3.5216
    Ending global_step 19999: Average loss 3.08579
    Ending global_step 20999: Average loss 2.80738
    Ending global_step 21999: Average loss 2.92217
    Ending global_step 22999: Average loss 2.51032
    Ending global_step 23999: Average loss 1.86265
    Ending global_step 24999: Average loss 1.67088
    Ending global_step 25999: Average loss 1.87016
    Ending global_step 26999: Average loss 1.61166
    Ending global_step 27999: Average loss 1.40708
    Ending global_step 28999: Average loss 1.4488
    Ending global_step 29801: Average loss 1.33917
    TIMING: model fitting took 5619.924 s

%% Cell type:markdown id: tags:

Let's see how well it works as an autoencoder.  We'll run the first 500 molecules from the validation set through it, and see how many of them are exactly reproduced.

%% Cell type:code id: tags:

``` python
predicted = model.predict_from_sequences(valid_smiles[:500])
count = 0
for s,p in zip(valid_smiles[:500], predicted):
  if ''.join(p) == s:
    count += 1
print('reproduced', count, 'of 500 validation SMILES strings')
```

%% Output

    reproduced 363 of 500 validation SMILES strings

%% Cell type:markdown id: tags:

Now we'll try using the encoder as a way to generate molecular fingerprints.  We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors.  The amount of data is small enough that we can just store everything in memory.

%% Cell type:code id: tags:

``` python
train_embeddings = model.predict_embeddings(train_smiles)
train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
                                                train_dataset.y,
                                                train_dataset.w,
                                                train_dataset.ids)

valid_embeddings = model.predict_embeddings(valid_smiles)
valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
                                                valid_dataset.y,
                                                valid_dataset.w,
                                                valid_dataset.ids)
```

%% Cell type:markdown id: tags:

For classification, we'll use a simple fully connected network with one hidden layer.

%% Cell type:code id: tags:

``` python
classifier = dc.models.MultitaskClassifier(n_tasks=len(tasks),
                                           n_features=256,
                                           layer_sizes=[512])
classifier.fit(train_embeddings_dataset, nb_epoch=10)
```

%% Output

    Ending global_step 999: Average loss 829.805
    Ending global_step 1999: Average loss 450.42
    Ending global_step 2999: Average loss 326.079
    Ending global_step 3999: Average loss 265.199
    Ending global_step 4999: Average loss 246.724
    Ending global_step 5999: Average loss 224.64
    Ending global_step 6999: Average loss 202.624
    Ending global_step 7460: Average loss 213.885
    TIMING: model fitting took 19.780 s

%% Cell type:markdown id: tags:

Find out how well it worked.  Compute the ROC AUC for the training and validation datasets.

%% Cell type:code id: tags:

``` python
import numpy as np
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
print('Training set ROC AUC:', train_score)
print('Validation set ROC AUC:', valid_score)
```

%% Output

    computed_metrics: [0.97828427249789751, 0.98705973960125326, 0.966007068438685, 0.9874401066031584, 0.97794394675150698, 0.98021719680962449, 0.95318452689781941, 0.97185747562764213, 0.96389538770053473, 0.96798988621997473, 0.9690779239145807, 0.98544402211472004, 0.97762497271338133, 0.96843239633294886, 0.97753648081489997, 0.96504683675485614, 0.93547151958366914]
    computed_metrics: [0.90790686952512678, 0.79891461649782913, 0.61900937081659968, 0.75241212956581671, 0.58678903240426017, 0.72765072765072758, 0.34929006085192693, 0.83986814712005553, 0.82379943502824859, 0.61844636844636847, 0.863620199146515, 0.68106930272108857, 0.98020477815699669, 0.85073580939032944, 0.781015678254942, 0.75399733510992673, nan]
    Training set ROC AUC: {'mean-roc_auc_score': 0.97132433878689139}
    Validation set ROC AUC: {'mean-roc_auc_score': 0.74592061629292239}
%% Cell type:markdown id: tags:

# Modeling Solubility
Written by Bharath Ramsundar and Evan Feinberg

Copyright 2016, Stanford University

%% Cell type:markdown id: tags:

Computationally predicting molecular solubility is useful for drug discovery. In this tutorial, we will use the `deepchem` library to fit a simple statistical model that predicts the solubility of drug-like compounds. The process of fitting this model involves four steps:

1. Loading a chemical dataset, consisting of a series of compounds along with aqueous solubility measurements.
2. Transforming each compound into a feature vector $v \in \mathbb{R}^n$ comprehensible to statistical learning methods.
3. Fitting a simple model that maps feature vectors to estimates of aqueous solubility.
4. Visualizing the results.

%% Cell type:markdown id: tags:

We need to load a dataset of estimated aqueous solubility measurements [1] into deepchem. The data is in CSV format and contains SMILES strings, predicted aqueous solubilities, and a number of molecular properties that are extraneous for our purposes. Here is an example line from the dataset:

|Compound ID|ESOL predicted log solubility in mols per litre|Minimum Degree|Molecular Weight|Number of H-Bond Donors|Number of Rings|Number of Rotatable Bonds|Polar Surface Area|measured log solubility in mols per litre|smiles|
|---|---|---|---|---|---|---|---|---|---|
|benzothiazole|-2.733|2|135.191|0|2|0|12.89|-1.5|c2ccc1scnc1c2|

Most of these fields are not useful for our purposes. The two fields that we will need are the "smiles" field and the "measured log solubility in mols per litre". The "smiles" field holds a SMILES string [2] that specifies the compound in question. Before we load this data into deepchem, we will load the data into python and do some simple preliminary analysis to gain some intuition for the dataset.

%% Cell type:code id: tags:

``` python
%load_ext autoreload
%autoreload 2
%pdb off
"""
Not Currently Working
"""
from deepchem.utils.save import load_from_disk

dataset_file= "../../datasets/delaney-processed.csv"
dataset = load_from_disk(dataset_file)
print("Columns of dataset: %s" % str(dataset.columns.values))
print("Number of examples in dataset: %s" % str(dataset.shape[0]))
```

%% Output

    ERROR:root:Line magic function `%autoreload` not found.

    Automatic pdb calling has been turned OFF

    /home/leswing/anaconda3/envs/deepchem27/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)

    Columns of dataset: ['Compound ID' 'ESOL predicted log solubility in mols per litre'
     'Minimum Degree' 'Molecular Weight' 'Number of H-Bond Donors'
     'Number of Rings' 'Number of Rotatable Bonds' 'Polar Surface Area'
     'measured log solubility in mols per litre' 'smiles']
    Number of examples in dataset: 1128

%% Cell type:markdown id: tags:

To gain a visual understanding of compounds in our dataset, let's draw them using rdkit. We define a couple of helper functions to get started.

%% Cell type:code id: tags:

``` python
import tempfile
from rdkit import Chem
from rdkit.Chem import Draw
from itertools import islice
from IPython.display import Image, HTML, display

def display_images(filenames):
    """Helper to pretty-print images."""
    imagesList=''.join(
        ["<img style='width: 140px; margin: 0px; float: left; border: 1px solid black;' src='%s' />"
         % str(s) for s in sorted(filenames)])
    display(HTML(imagesList))

def mols_to_pngs(mols, basename="test"):
    """Helper to write RDKit mols to png files."""
    filenames = []
    for i, mol in enumerate(mols):
        filename = "%s%d.png" % (basename, i)
        Draw.MolToFile(mol, filename)
        filenames.append(filename)
    return filenames
```

%% Cell type:markdown id: tags:

Now, we display some compounds from the dataset:

%% Cell type:code id: tags:

``` python
num_to_display = 14
molecules = []
for _, data in islice(dataset.iterrows(), num_to_display):
    molecules.append(Chem.MolFromSmiles(data["smiles"]))
display_images(mols_to_pngs(molecules))
```

%% Output


%% Cell type:markdown id: tags:

Analyzing the distribution of solubilities shows us a nice spread of data.

%% Cell type:code id: tags:

``` python
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

solubilities = np.array(dataset["measured log solubility in mols per litre"])
n, bins, patches = plt.hist(solubilities, 50, facecolor='green', alpha=0.75)
plt.xlabel('Measured log-solubility in mols/liter')
plt.ylabel('Number of compounds')
plt.title(r'Histogram of solubilities')
plt.grid(True)
plt.show()
```

%% Output


%% Cell type:markdown id: tags:

With our preliminary analysis completed, we return to the original goal of constructing a predictive statistical model of molecular solubility using `deepchem`. The first step in creating such a model is translating each compound into a vectorial format that can be understood by statistical learning techniques. This process is commonly called featurization. `deepchem` packages a number of commonly used featurizations for user convenience. In this tutorial, we will use ECFP4 fingerprints [3].

`deepchem` offers an object-oriented API for featurization. To get started with featurization, we first construct a ```Featurizer``` object. `deepchem` provides the ```CircularFingerprint``` class (a subclass of ```Featurizer``` that performs ECFP4 featurization).

%% Cell type:code id: tags:

``` python
import deepchem as dc

featurizer = dc.feat.CircularFingerprint(size=1024)
```

%% Cell type:markdown id: tags:

Now, let's perform the actual featurization. `deepchem` provides the ```CSVLoader``` class for this purpose. The ```featurize()``` method for this class loads data from disk and uses provided ```Featurizer``` instances to transform the provided data into feature vectors.

To perform machine learning upon these datasets, we need to convert the samples into datasets suitable for machine learning (that is, into a data matrix $X \in \mathbb{R}^{n\times d}$, where $n$ is the number of samples and $d$ the dimensionality of the feature vector, and a label vector $y \in \mathbb{R}^n$). `deepchem` provides the `Dataset` class to facilitate this transformation. This style lends itself easily to validation-set hyperparameter searches, which we illustrate below.
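
Concretely, for $n$ molecules featurized into 1024-bit circular fingerprints, $X$ has shape $n \times 1024$ and $y$ has length $n$.  A minimal numpy sketch with random placeholder values (not real fingerprints or solubilities):

``` python
import numpy as np

n, d = 5, 1024                              # 5 molecules, 1024-bit fingerprints
X = np.random.randint(0, 2, size=(n, d))    # placeholder binary feature matrix
y = np.random.randn(n)                      # placeholder log-solubility labels
```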

%% Cell type:code id: tags:

``` python
loader = dc.data.CSVLoader(
      tasks=["measured log solubility in mols per litre"], smiles_field="smiles",
      featurizer=featurizer)
dataset = loader.featurize(dataset_file)
```

%% Output

    Loading raw samples now.
    shard_size: 8192
    About to start loading CSV from ../../datasets/delaney-processed.csv
    Loading shard 1 of size 8192.
    Featurizing sample 0
    Featurizing sample 1000
    TIMING: featurizing shard 0 took 2.016 s
    TIMING: dataset construction took 2.061 s
    Loading dataset from disk.

    /home/leswing/Documents/deepchem/deepchem/data/data_loader.py:42: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      if y[ind, task] == "":

%% Cell type:markdown id: tags:

When constructing statistical models, it's necessary to separate the provided data into train/test subsets. The train subset is used to learn the statistical model, while the test subset is used to evaluate the learned model. In practice, it's often useful to elaborate this split further and perform a train/validation/test split. The validation set is used to perform model selection: proposed models are evaluated on the validation set, and the best-performing model is then tested on the test set.

Choosing the proper method of performing a train/validation/test split can be challenging. Standard machine learning practice is to perform a random split of the data into train/validation/test, but random splits are not well suited for the purposes of chemical informatics. For our predictive models to be useful, we require them to have predictive power in portions of chemical space beyond the set of molecules in the training data. Consequently, our models should use splits of the data that separate compounds in the training set from those in the validation and test-sets. We use Bemis-Murcko scaffolds [5] to perform this separation (all compounds that share an underlying molecular scaffold will be placed into the same split in the train/test/validation split).
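
The splitting idea can be sketched independently of RDKit: group compounds by a precomputed scaffold key, then fill the train split with the largest groups first so no scaffold straddles two splits. The scaffold keys below are hypothetical placeholders; `ScaffoldSplitter` derives real Bemis-Murcko scaffolds from the SMILES:

``` python
from collections import defaultdict

def scaffold_split(compound_to_scaffold, frac_train=0.8):
    """Fill the train split with the largest scaffold groups first, so that
    no scaffold appears on both sides of the split."""
    groups = defaultdict(list)
    for cid, scaffold in compound_to_scaffold.items():
        groups[scaffold].append(cid)
    train, rest = [], []
    cutoff = frac_train * len(compound_to_scaffold)
    for members in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) + len(members) <= cutoff else rest).extend(members)
    return train, rest

# Hypothetical precomputed scaffold keys for five compounds
train_ids, rest_ids = scaffold_split(
    {"a": "s1", "b": "s1", "c": "s1", "d": "s2", "e": "s3"})
```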

%% Cell type:code id: tags:

``` python
splitter = dc.splits.ScaffoldSplitter(dataset_file)
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(
    dataset)
```

%% Output

    Computing train/valid/test indices
    About to generate scaffolds
    Generating scaffold 0/1128
    Generating scaffold 1000/1128
    About to sort in scaffold sets
    TIMING: dataset construction took 0.058 s
    Loading dataset from disk.
    TIMING: dataset construction took 0.027 s
    Loading dataset from disk.
    TIMING: dataset construction took 0.030 s
    Loading dataset from disk.
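
%% Cell type:markdown id: tags:

The scaffold-based grouping described above can be sketched in plain Python. The SMILES-to-scaffold pairs below are illustrative; in the real pipeline, RDKit computes the Bemis-Murcko scaffold of each molecule.

%% Cell type:code id: tags:

``` python
from collections import defaultdict

# Illustrative SMILES -> Bemis-Murcko scaffold pairs; in the real pipeline
# RDKit computes the scaffold of each molecule.
compound_scaffolds = {
    "Cc1ccccc1": "c1ccccc1",      # toluene -> benzene scaffold
    "CCc1ccccc1": "c1ccccc1",     # ethylbenzene -> benzene scaffold
    "CCCc1ccccc1": "c1ccccc1",    # propylbenzene -> benzene scaffold
    "OC1CCCCC1": "C1CCCCC1",      # cyclohexanol -> cyclohexane scaffold
    "CC1CCCCC1": "C1CCCCC1",      # methylcyclohexane -> cyclohexane scaffold
}

# Group compounds by scaffold; each group must land in a single split.
scaffold_sets = defaultdict(list)
for smiles, scaffold in compound_scaffolds.items():
    scaffold_sets[scaffold].append(smiles)

# Simplified allocation: the largest scaffold sets form the train split,
# and the rarer scaffolds are held out for validation/test.
ordered = sorted(scaffold_sets.values(), key=len, reverse=True)
train_split = ordered[0]
held_out = [s for group in ordered[1:] for s in group]
```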

%% Cell type:markdown id: tags:

Let's visually inspect some of the molecules in the separate splits to verify that they appear structurally dissimilar. Each split is a dataset object whose `ids` attribute holds the SMILES strings of the underlying compounds.

%% Cell type:code id: tags:

``` python
train_mols = [Chem.MolFromSmiles(compound)
              for compound in train_dataset.ids]
display_images(mols_to_pngs(train_mols, basename="train"))
```

%% Output


%% Cell type:code id: tags:

``` python
valid_mols = [Chem.MolFromSmiles(compound)
              for compound in valid_dataset.ids]
display_images(mols_to_pngs(valid_mols, basename="valid"))
```

%% Output


%% Cell type:markdown id: tags:

Notice the visual distinction between the train and validation splits. The most common scaffolds are reserved for the train split, while the rarer scaffolds are allotted to validation and test.

%% Cell type:markdown id: tags:

The performance of common machine-learning algorithms can be very sensitive to how the data is preprocessed. One common transformation is to normalize the data to have zero mean and unit standard deviation. We will apply this transformation to the log-solubility (as seen above, the log-solubility ranges from -12 to 2).

%% Cell type:code id: tags:

``` python
transformers = [
    dc.trans.NormalizationTransformer(transform_y=True, dataset=train_dataset)]

# Reassign each dataset so the transformed version is used downstream.
for transformer in transformers:
  train_dataset = transformer.transform(train_dataset)
  valid_dataset = transformer.transform(valid_dataset)
  test_dataset = transformer.transform(test_dataset)
```

%% Output

    TIMING: dataset construction took 0.042 s
    Loading dataset from disk.
    TIMING: dataset construction took 0.009 s
    Loading dataset from disk.
    TIMING: dataset construction took 0.009 s
    Loading dataset from disk.
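
%% Cell type:markdown id: tags:

The z-scoring that `NormalizationTransformer` performs on the labels can be sketched with the standard library alone; the label values below are illustrative, not the actual dataset.

%% Cell type:code id: tags:

``` python
import statistics

# Illustrative log-solubility labels (the real labels span roughly -12 to 2).
y_train = [-3.0, -1.0, 0.0, 2.0]

mean = statistics.fmean(y_train)
std = statistics.pstdev(y_train)  # population standard deviation

# z-score each label: subtract the mean, divide by the standard deviation.
y_normalized = [(y - mean) / std for y in y_train]
```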

%% Cell type:markdown id: tags:

The next step after processing the data is to start fitting simple learning models to our data. `deepchem` provides a number of machine-learning model classes.

In particular, `deepchem` provides a convenience class, `SklearnModel`, that wraps any machine-learning model available in scikit-learn [6]. We will start by building a simple random-forest regressor that attempts to predict the log-solubility from our computed ECFP4 features. To train the model, we instantiate the `SklearnModel` object, then call its `fit()` method on the `train_dataset` we constructed above.

%% Cell type:code id: tags:

``` python
from sklearn.ensemble import RandomForestRegressor

sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)
```

%% Cell type:markdown id: tags:

We next evaluate the model on the validation set to see its predictive power. `deepchem` provides the `Evaluator` class to facilitate this process. To evaluate the constructed `model` object, create a new `Evaluator` instance and call the `compute_model_performance()` method.

%% Cell type:code id: tags:

``` python
from deepchem.utils.evaluate import Evaluator

metric = dc.metrics.Metric(dc.metrics.r2_score)
evaluator = Evaluator(model, valid_dataset, transformers)
r2score = evaluator.compute_model_performance([metric])
print(r2score)
```

%% Output

    computed_metrics: [0.15666969194533098]
    {'r2_score': 0.15666969194533098}
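
%% Cell type:markdown id: tags:

The `r2_score` metric reported above is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. A small self-contained sketch on illustrative values (not the tutorial's actual predictions):

%% Cell type:code id: tags:

``` python
# Coefficient of determination: R^2 = 1 - SS_res / SS_tot,
# computed on illustrative true/predicted values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

mean_true = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
r2 = 1.0 - ss_res / ss_tot
```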

%% Cell type:markdown id: tags:

The performance of this basic random-forest model isn't very strong. To construct stronger models, let's attempt to optimize the hyperparameters (choices made in the model-specification) to achieve better performance. For random forests, we can tweak `n_estimators` which controls the number of trees in the forest, and `max_features` which controls the number of features to consider when performing a split. We now build a series of `SklearnModel`s with different choices for `n_estimators` and `max_features` and evaluate performance on the validation set.

%% Cell type:code id: tags:

``` python
def rf_model_builder(model_params, model_dir):
  sklearn_model = RandomForestRegressor(**model_params)
  return dc.models.SklearnModel(sklearn_model, model_dir)

params_dict = {
    "n_estimators": [10, 100],
    "max_features": ["auto", "sqrt", "log2", None],
}

metric = dc.metrics.Metric(dc.metrics.r2_score)
optimizer = dc.hyper.HyperparamOpt(rf_model_builder)
best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, transformers,
    metric=metric)
```

%% Output

    Fitting model 1/8
    hyperparameters: {'n_estimators': 10, 'max_features': 'auto'}
    computed_metrics: [0.16084049792955879]
    Model 1/8, Metric r2_score, Validation set 0: 0.160840
    	best_validation_score so far: 0.160840
    Fitting model 2/8
    hyperparameters: {'n_estimators': 10, 'max_features': 'sqrt'}
    computed_metrics: [0.24188860561412873]
    Model 2/8, Metric r2_score, Validation set 1: 0.241889
    	best_validation_score so far: 0.241889
    Fitting model 3/8
    hyperparameters: {'n_estimators': 10, 'max_features': 'log2'}
    computed_metrics: [0.13923134159755857]
    Model 3/8, Metric r2_score, Validation set 2: 0.139231
    	best_validation_score so far: 0.241889
    Fitting model 4/8
    hyperparameters: {'n_estimators': 10, 'max_features': None}
    computed_metrics: [0.068066979889086832]
    Model 4/8, Metric r2_score, Validation set 3: 0.068067
    	best_validation_score so far: 0.241889
    Fitting model 5/8
    hyperparameters: {'n_estimators': 100, 'max_features': 'auto'}
    computed_metrics: [0.13350083343444363]
    Model 5/8, Metric r2_score, Validation set 4: 0.133501
    	best_validation_score so far: 0.241889
    Fitting model 6/8
    hyperparameters: {'n_estimators': 100, 'max_features': 'sqrt'}
    computed_metrics: [0.27500018179983798]
    Model 6/8, Metric r2_score, Validation set 5: 0.275000
    	best_validation_score so far: 0.275000
    Fitting model 7/8
    hyperparameters: {'n_estimators': 100, 'max_features': 'log2'}
    computed_metrics: [0.24719988102520696]
    Model 7/8, Metric r2_score, Validation set 6: 0.247200
    	best_validation_score so far: 0.275000
    Fitting model 8/8
    hyperparameters: {'n_estimators': 100, 'max_features': None}
    computed_metrics: [0.15709535184995427]
    Model 8/8, Metric r2_score, Validation set 7: 0.157095
    	best_validation_score so far: 0.275000
    computed_metrics: [0.94421662988139621]
    Best hyperparameters: (100, 'sqrt')
    train_score: 0.944217
    validation_score: 0.275000
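
%% Cell type:markdown id: tags:

The eight models fit above are just the Cartesian product of the two hyperparameter lists. The search loop can be sketched with the standard library, reusing the validation scores reported in the output:

%% Cell type:code id: tags:

``` python
from itertools import product

# Validation R^2 scores reported by the search above, keyed by
# (n_estimators, max_features).
val_scores = {
    (10, "auto"): 0.160840, (10, "sqrt"): 0.241889,
    (10, "log2"): 0.139231, (10, None): 0.068067,
    (100, "auto"): 0.133501, (100, "sqrt"): 0.275000,
    (100, "log2"): 0.247200, (100, None): 0.157095,
}

# Grid search enumerates the Cartesian product of the option lists
# (2 x 4 = 8 candidate models) and keeps the best validation score.
grid = list(product([10, 100], ["auto", "sqrt", "log2", None]))
best = max(grid, key=lambda combo: val_scores[combo])
```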

%% Cell type:markdown id: tags:

The best model achieves significantly higher $R^2$ on the validation set than the first model we constructed. Now, let's perform the same sort of hyperparameter search, but with a simple deep-network instead.

%% Cell type:code id: tags:

``` python
import numpy.random

params_dict = {"learning_rate": np.power(10., np.random.uniform(-5, -3, size=1)),
               "decay": np.power(10, np.random.uniform(-6, -4, size=1)),
               "nb_epoch": [20] }
n_features = train_dataset.get_data_shape()[0]
def model_builder(model_params, model_dir):
  model = dc.models.MultitaskRegressor(
    1, n_features, layer_sizes=[1000], dropouts=[.25],
    batch_size=50, **model_params)
  return model

optimizer = dc.hyper.HyperparamOpt(model_builder)
best_dnn, best_dnn_hyperparams, all_dnn_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, transformers,
    metric=metric)
```

%% Output

    Fitting model 1/1
    hyperparameters: {'learning_rate': 0.00023048114774615143, 'nb_epoch': 20, 'decay': 4.6386701591998651e-06}
    Training for 20 epochs
    On batch 0
    Ending epoch 0: Average loss 2.90304
    On batch 0
    Ending epoch 1: Average loss 1.80178
    On batch 0
    Ending epoch 2: Average loss 1.49158
    On batch 0
    Ending epoch 3: Average loss 1.32873
    On batch 0
    Ending epoch 4: Average loss 1.1639
    On batch 0
    Ending epoch 5: Average loss 1.00606
    On batch 0
    Ending epoch 6: Average loss 0.93537
    On batch 0
    Ending epoch 7: Average loss 0.866738
    On batch 0
    Ending epoch 8: Average loss 0.785585
    On batch 0
    Ending epoch 9: Average loss 0.735722
    On batch 0
    Ending epoch 10: Average loss 0.693006
    On batch 0
    Ending epoch 11: Average loss 0.620769
    On batch 0
    Ending epoch 12: Average loss 0.59393
    On batch 0
    Ending epoch 13: Average loss 0.565747
    On batch 0
    Ending epoch 14: Average loss 0.558274
    On batch 0
    Ending epoch 15: Average loss 0.522325
    On batch 0
    Ending epoch 16: Average loss 0.52158
    On batch 0
    Ending epoch 17: Average loss 0.483039
    On batch 0
    Ending epoch 18: Average loss 0.472697
    On batch 0
    Ending epoch 19: Average loss 0.463411
    On batch 0
    TIMING: model fitting took 4.142 s True
    computed_metrics: [0.19197438493808416]
    Model 1/1, Metric r2_score, Validation set 0: 0.191974
    	best_validation_score so far: 0.191974
    computed_metrics: [0.81127373811679659]
    Best hyperparameters: (0.00023048114774615143, 20, 4.6386701591998651e-06)
    train_score: 0.811274
    validation_score: 0.191974

%% Cell type:markdown id: tags:

Now that we have a reasonable choice of hyperparameters, let's evaluate the performance of our best models on the test-set.

%% Cell type:code id: tags:

``` python
rf_test_evaluator = Evaluator(best_rf, test_dataset, transformers)
rf_test_r2score = rf_test_evaluator.compute_model_performance([metric])
print("RF Test set R^2 %f" % (rf_test_r2score["r2_score"]))
```

%% Output

    computed_metrics: [0.36485280404796627]
    RF Test set R^2 0.364853

%% Cell type:code id: tags:

``` python
dnn_test_evaluator = Evaluator(best_dnn, test_dataset, transformers)
dnn_test_r2score = dnn_test_evaluator.compute_model_performance([metric])
print("DNN Test set R^2 %f" % (dnn_test_r2score["r2_score"]))
```

%% Output

    computed_metrics: [0.24496914941196501]
    DNN Test set R^2 0.244969

%% Cell type:markdown id: tags:

Now, let's plot the predicted log-solubilities against the true log-solubilities for the models we constructed.

%% Cell type:code id: tags:

``` python
task = "measured log solubility in mols per litre"
predicted_test = best_rf.predict(test_dataset)
true_test = test_dataset.y
plt.scatter(predicted_test, true_test)
plt.xlabel('Predicted log-solubility in mols/liter')
plt.ylabel('True log-solubility in mols/liter')
plt.title(r'RF predicted vs. true log-solubilities')
plt.show()
```

%% Output


%% Cell type:code id: tags:

``` python
task = "measured log solubility in mols per litre"
predicted_test = best_dnn.predict(test_dataset)
true_test = test_dataset.y
plt.scatter(predicted_test, true_test)
plt.xlabel('Predicted log-solubility in mols/liter')
plt.ylabel('True log-solubility in mols/liter')
plt.title(r'DNN predicted vs. true log-solubilities')
plt.show()
```

%% Output


%% Cell type:markdown id: tags:

[1] John S. Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal
of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.

[2] Anderson, Eric, Gilman D. Veith, and David Weininger. SMILES, a line notation and computerized
interpreter for chemical structures. US Environmental Protection Agency, Environmental Research Laboratory, 1987.

[3] Rogers, David, and Mathew Hahn. "Extended-connectivity fingerprints." Journal of chemical information
and modeling 50.5 (2010): 742-754.

[4] Van Der Walt, Stefan, S. Chris Colbert, and Gael Varoquaux.
"The NumPy array:a structure for efficient numerical computation." Computing in Science & Engineering 13.2 (2011): 22-30.

[5] Bemis, Guy W., and Mark A. Murcko. "The properties of known drugs. 1. Molecular frameworks."
Journal of medicinal chemistry 39.15 (1996): 2887-2893.

[6] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." The Journal of Machine Learning Research 12 (2011): 2825-2830.