Merge branch 'master' of https://github.com/deepchem/deepchem into rmdir (f8880708) · Commits · 钟慕尧 / deepchem

deepchem/utils/save.py

+27 −14

		@@ -185,11 +185,28 @@ def load_pickle_from_disk(filename):


		def load_dataset_from_disk(save_dir):
		  """
		  Parameters
		  ----------
		  save_dir: str

		  Returns
		  -------
		  loaded: bool
		    Whether the load succeeded
		  all_dataset: (dc.data.Dataset, dc.data.Dataset, dc.data.Dataset)
		    The train, valid, test datasets
		  transformers: list of dc.trans.Transformer
		    The transformers used for this dataset

		  """

		  train_dir = os.path.join(save_dir, "train_dir")
		  valid_dir = os.path.join(save_dir, "valid_dir")
		  test_dir = os.path.join(save_dir, "test_dir")
		  if os.path.exists(train_dir) and os.path.exists(valid_dir) and os.path.exists(
		      test_dir):
		  if not os.path.exists(train_dir) or not os.path.exists(
		      valid_dir) or not os.path.exists(test_dir):
		    return False, None, list()
		  loaded = True
		  train = deepchem.data.DiskDataset(train_dir)
		  valid = deepchem.data.DiskDataset(valid_dir)
		@@ -197,10 +214,6 @@ def load_dataset_from_disk(save_dir):
		  all_dataset = (train, valid, test)
		  with open(os.path.join(save_dir, "transformers.pkl"), 'rb') as f:
		    transformers = pickle.load(f)
		  else:
		    loaded = False
		    all_dataset = None
		    transformers = []
		  return loaded, all_dataset, transformers
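
The guard-clause shape that this hunk moves to can be tried out without DeepChem. The following is only a minimal sketch: `load_splits_from_disk` is a hypothetical stand-in, and it returns the split directory paths instead of wrapping them in `DiskDataset` objects.

```python
import os
import pickle
import tempfile


def load_splits_from_disk(save_dir):
  """Hypothetical stand-in for load_dataset_from_disk (no DeepChem needed)."""
  split_dirs = [
      os.path.join(save_dir, name)
      for name in ("train_dir", "valid_dir", "test_dir")
  ]
  # Guard clause: return early if any split directory is missing.
  if not all(os.path.exists(d) for d in split_dirs):
    return False, None, list()
  with open(os.path.join(save_dir, "transformers.pkl"), "rb") as f:
    transformers = pickle.load(f)
  # The real function opens each directory as a dataset; this sketch
  # just returns the paths.
  return True, tuple(split_dirs), transformers


with tempfile.TemporaryDirectory() as save_dir:
  # Nothing has been saved yet, so the load reports failure.
  missing_result = load_splits_from_disk(save_dir)

  for name in ("train_dir", "valid_dir", "test_dir"):
    os.mkdir(os.path.join(save_dir, name))
  with open(os.path.join(save_dir, "transformers.pkl"), "wb") as f:
    pickle.dump(["normalize"], f)
  found_result = load_splits_from_disk(save_dir)

print(missing_result)
print(found_result[0], found_result[2])
```

Callers can unpack the three-element result the same way in both cases and branch on the first element.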

examples/notebooks/graph_convolutional_networks_for_tox21.ipynb

+45 −22

		%% Cell type:markdown id: tags:

		# Graph Convolutions For Tox21
		In this notebook, we will explore the use of TensorGraph to create graph convolutional models with DeepChem. In particular, we will build a graph convolutional network on the Tox21 dataset.

		Let's start with some basic imports.

		%% Cell type:code id: tags:

		``` python
		from __future__ import division
		from __future__ import print_function
		from __future__ import unicode_literals

		import numpy as np
		import tensorflow as tf
		import deepchem as dc
		from deepchem.models.tensorgraph.models.graph_models import GraphConvTensorGraph
		```

		%% Cell type:markdown id: tags:

		Now, let's use MoleculeNet to load the Tox21 dataset. We need to process the data in a way that graph convolutional networks can use. To do that, we set the featurizer option to 'GraphConv'. The MoleculeNet call will return a training set, a validation set, and a test set for us to use. The call also returns `transformers`, a list of data transformations that were applied to preprocess the dataset. (Most deep networks are quite finicky and require a set of data transformations to ensure that training proceeds stably.)

		%% Cell type:code id: tags:

		``` python
		# Load Tox21 dataset
		tox21_tasks, tox21_datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
		train_dataset, valid_dataset, test_dataset = tox21_datasets
		```

		%% Output

		Loading dataset from disk.
		Loading dataset from disk.
		Loading dataset from disk.

		%% Cell type:markdown id: tags:

		Let's now train a graph convolutional network on this dataset. DeepChem has the class `GraphConvTensorGraph`, which wraps a standard graph convolutional architecture under the hood for user convenience. Let's instantiate an object of this class and train it on our dataset.

		%% Cell type:code id: tags:

		``` python
		model = GraphConvTensorGraph(
		    len(tox21_tasks), batch_size=50, mode='classification')
		# Set nb_epoch=10 for better results.
		model.fit(train_dataset, nb_epoch=1)
		```

		%% Output

		/home/rbharath/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:91: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
		/home/leswing/miniconda3/envs/deepchem/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:95: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
		"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

		Starting epoch 0
		Ending global_step 129: Average loss 586.57
		TIMING: model fitting took 17.082 s
		Ending global_step 126: Average loss 590.739
		TIMING: model fitting took 7.317 s

		590.7391258118645

		%% Cell type:markdown id: tags:

		Let's try to evaluate the performance of the model we've trained. For this, we need to define a metric, a measure of model performance. `dc.metrics` holds a collection of metrics already. For this dataset, it is standard to use the ROC-AUC score, the area under the receiver operating characteristic curve (which summarizes the tradeoff between the true positive rate and the false positive rate across classification thresholds). Luckily, the ROC-AUC score is already available in DeepChem.

		To measure the performance of the model under this metric, we can use the convenience function `model.evaluate()`.
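
To build some intuition for the metric, here is a small pure-Python sketch of ROC-AUC for a single task. It uses the pairwise-ranking interpretation (the fraction of positive/negative pairs in which the positive example gets the higher score) and is not the implementation behind `dc.metrics.roc_auc_score`. The name `roc_auc` and the toy data are made up for illustration, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc(labels, scores):
  """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
  positives = [s for y, s in zip(labels, scores) if y == 1]
  negatives = [s for y, s in zip(labels, scores) if y == 0]
  if not positives or not negatives:
    raise ValueError("ROC-AUC needs at least one positive and one negative")
  correct = 0.0
  for p in positives:
    for n in negatives:
      if p > n:
        correct += 1.0
      elif p == n:
        correct += 0.5  # ties count as half a correct pair
  return correct / (len(positives) * len(negatives))


task_labels = [0, 0, 1, 1]
task_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(task_labels, task_scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The notebook computes one such score per Tox21 task and averages them with `np.mean`.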

		%% Cell type:code id: tags:

		``` python
		metric = dc.metrics.Metric(
		    dc.metrics.roc_auc_score, np.mean, mode="classification")

		print("Evaluating model")
		train_scores = model.evaluate(train_dataset, [metric], transformers)
		print("Training ROC-AUC Score: %f" % train_scores["mean-roc_auc_score"])
		valid_scores = model.evaluate(valid_dataset, [metric], transformers)
		print("Validation ROC-AUC Score: %f" % valid_scores["mean-roc_auc_score"])
		```

		%% Output

		Evaluating model
		computed_metrics: [0.82355634419908874, 0.84317769495469375, 0.83831967490229498, 0.78265073415417696, 0.67514582634690545, 0.78648566994877722, 0.71947133017380849, 0.68636368747841314, 0.77653631284916202, 0.66689869847282113, 0.77872709187364375, 0.74727748235240998]
		Training ROC-AUC Score: 0.760384
		computed_metrics: [0.76064908722109537, 0.82394038192827201, 0.83087572150072142, 0.7314288959563835, 0.60984975330966895, 0.60837196235426316, 0.59852764937510705, 0.65035114358925683, 0.71262721074992585, 0.58264411027568919, 0.71303258145363402, 0.66946153846153844]
		Validation ROC-AUC Score: 0.690980
		computed_metrics: [0.80045699830862893, 0.83618637604367374, 0.83908539936708681, 0.77873855933094183, 0.67692252993044244, 0.75578036941489168, 0.75895796821704797, 0.70234314980793855, 0.76387081283102387, 0.65924917162534913, 0.78448201448364341, 0.76675448900822296]
		Training ROC-AUC Score: 0.760236
		computed_metrics: [0.71533171721169553, 0.74090608465608465, 0.81106357802757933, 0.70627859684799188, 0.63177272727272715, 0.6326016835811501, 0.61491865697473169, 0.71286314850043442, 0.67676006592889104, 0.51656328658755846, 0.75414979999520937, 0.6603359173126615]
		Validation ROC-AUC Score: 0.681129

		%% Cell type:markdown id: tags:

		What's going on under the hood? Could we build `GraphConvTensorGraph` ourselves? Of course! The first step is to create a `TensorGraph` object. This object will hold the "computational graph" that defines the computation that a graph convolutional network will perform.

		%% Cell type:code id: tags:

		``` python
		from deepchem.models.tensorgraph.tensor_graph import TensorGraph

		tg = TensorGraph(use_queue=False)
		```

		%% Cell type:markdown id: tags:

		Let's now define the inputs to our model. Conceptually, graph convolutions require only the structure of the molecule in question and a vector of features for every atom that describes the local chemical environment. In practice, however, due to TensorFlow's limitations as a general programming environment, we also have to supply some preprocessed auxiliary information.

		`atom_features` holds a feature vector of length 75 for each atom. The other feature inputs are required to support minibatching in TensorFlow. `degree_slice` is an indexing convenience that makes it easy to locate atoms from all molecules with a given degree. `membership` determines the membership of atoms in molecules (atom `i` belongs to molecule `membership[i]`). `deg_adjs` is a list that contains adjacency lists grouped by atom degree. For more details, check out the [code](https://github.com/deepchem/deepchem/blob/master/deepchem/feat/mol_graphs.py).

		To define feature inputs in `TensorGraph`, we use the `Feature` layer. Conceptually, a `TensorGraph` is a mathematical graph composed of layer objects. `Feature` layers have to be the root nodes of the graph since they constitute inputs.
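
To make `membership` and `degree_slice` a little more concrete, here is a toy sketch in plain Python. It is not DeepChem's featurization code. The two-molecule batch is invented, and storing `degree_slice` as one `(start, count)` row per degree is an assumption made for illustration.

```python
# Toy batch: each molecule is a list of atom degrees (number of bonded
# neighbors). Molecule 0 has three atoms, molecule 1 has two.
molecules = [[1, 2, 1], [1, 1]]
max_degree = 2

# Flatten all atoms, remembering which molecule each one came from.
atoms = [(degree, mol_index)
         for mol_index, degrees in enumerate(molecules)
         for degree in degrees]
# Sort by degree so atoms of equal degree are contiguous. The sort is
# stable, so molecule order is preserved within each degree.
atoms.sort(key=lambda atom: atom[0])

# membership[i] is the molecule that (sorted) atom i belongs to.
membership = [mol_index for _, mol_index in atoms]

# One (start, count) row per degree, locating that degree's atoms.
degree_slice = []
start = 0
for degree in range(max_degree + 1):
  count = sum(1 for d, _ in atoms if d == degree)
  degree_slice.append((start, count))
  start += count

print(membership)    # [0, 0, 1, 1, 0]
print(degree_slice)  # [(0, 0), (0, 4), (4, 1)]
```

With this layout, all atoms of a given degree across the whole minibatch can be picked out with a single slice, which is the convenience the text attributes to `degree_slice`.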

		%% Cell type:code id: tags:

		``` python
		from deepchem.models.tensorgraph.layers import Feature

		atom_features = Feature(shape=(None, 75))
		degree_slice = Feature(shape=(None, 2), dtype=tf.int32)
		membership = Feature(shape=(None,), dtype=tf.int32)

		deg_adjs = []
		for i in range(0, 10 + 1):
		  deg_adj = Feature(shape=(None, i + 1), dtype=tf.int32)
		  deg_adjs.append(deg_adj)
		```

		%% Cell type:markdown id: tags:

		Let's now implement the body of the graph convolutional network. `TensorGraph` has a number of layers that encode various graph operations, namely the `GraphConv`, `GraphPool`, and `GraphGather` layers. We will also apply standard neural network layers such as `Dense` and `BatchNorm`.

		The layers we're adding effect a "feature transformation" that will create one vector for each molecule.
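
The "one vector for each molecule" step can be pictured as a reduction over atoms that share a `membership` index. The sketch below is a toy illustration in plain Python, not the `GraphGather` implementation. The helper name `gather_by_molecule` is hypothetical, and using a plain sum as the reduction is an assumption.

```python
def gather_by_molecule(atom_vectors, membership, n_molecules):
  """Sum per-atom feature vectors into one vector per molecule."""
  n_features = len(atom_vectors[0])
  molecule_vectors = [[0.0] * n_features for _ in range(n_molecules)]
  for vector, mol_index in zip(atom_vectors, membership):
    for j, value in enumerate(vector):
      molecule_vectors[mol_index][j] += value
  return molecule_vectors


# Four atoms with two features each; atoms 0, 1 and 3 belong to molecule 0.
atom_vectors = [[1.0, 0.0], [2.0, 1.0], [0.5, 0.5], [3.0, 3.0]]
membership = [0, 0, 1, 0]
molecule_vectors = gather_by_molecule(atom_vectors, membership, n_molecules=2)
print(molecule_vectors)  # [[6.0, 4.0], [0.5, 0.5]]
```

The output has a fixed number of rows (one per molecule) regardless of how many atoms each molecule has, which is what lets ordinary dense output layers follow.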

		%% Cell type:code id: tags:

		``` python
		from deepchem.models.tensorgraph.layers import Dense, GraphConv, BatchNorm
		from deepchem.models.tensorgraph.layers import GraphPool, GraphGather

		batch_size = 50

		gc1 = GraphConv(
		    64,
		    activation_fn=tf.nn.relu,
		    in_layers=[atom_features, degree_slice, membership] + deg_adjs)
		batch_norm1 = BatchNorm(in_layers=[gc1])
		gp1 = GraphPool(in_layers=[batch_norm1, degree_slice, membership] + deg_adjs)
		gc2 = GraphConv(
		    64,
		    activation_fn=tf.nn.relu,
		    in_layers=[gp1, degree_slice, membership] + deg_adjs)
		batch_norm2 = BatchNorm(in_layers=[gc2])
		gp2 = GraphPool(in_layers=[batch_norm2, degree_slice, membership] + deg_adjs)
		dense = Dense(out_channels=128, activation_fn=tf.nn.relu, in_layers=[gp2])
		batch_norm3 = BatchNorm(in_layers=[dense])
		readout = GraphGather(
		    batch_size=batch_size,
		    activation_fn=tf.nn.tanh,
		    in_layers=[batch_norm3, degree_slice, membership] + deg_adjs)
		```

		%% Cell type:markdown id: tags:

		Let's now make predictions from the `TensorGraph` model. Tox21 is a multitask dataset. That is, there are 12 different datasets grouped together, which share many common molecules, but with different outputs for each. As a result, we have to add a separate output layer for each task. We will use a `for` loop over the `tox21_tasks` list to make this happen. We need to add labels for each task.

		We also have to define a loss for the model which tells the network the objective to minimize during training.

		We have to tell `TensorGraph` which layers are outputs with `TensorGraph.add_output(layer)`. Similarly, we tell the network its loss with `TensorGraph.set_loss(loss)`.
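
As a rough picture of what such a multitask loss computes, here is a pure-Python sketch: one softmax cross-entropy per task, stacked into a per-sample row, then weighted and summed. This is only an illustration under stated assumptions. The function names are hypothetical, and the final sum reduction is an assumption about the weighted-error step rather than a statement of what the `WeightedError` layer does internally.

```python
import math


def softmax_cross_entropy(one_hot_label, logits):
  """Cross-entropy between a one-hot label and softmax(logits)."""
  shift = max(logits)  # subtract the max for numerical stability
  log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
  return -sum(y * (z - log_norm) for y, z in zip(one_hot_label, logits))


def weighted_multitask_loss(labels, logits, weights):
  """labels/logits are indexed [task][sample][class]; weights [sample][task]."""
  n_tasks = len(labels)
  total = 0.0
  for sample in range(len(weights)):
    # Stack the per-task costs for this sample into one row ...
    costs = [
        softmax_cross_entropy(labels[task][sample], logits[task][sample])
        for task in range(n_tasks)
    ]
    # ... then weight each entry and accumulate.
    total += sum(c * w for c, w in zip(costs, weights[sample]))
  return total


# Two tasks, two samples, two classes.
labels = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
logits = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [5.0, -5.0]]]
weights = [[1.0, 1.0], [1.0, 0.0]]  # sample 1 has no label for task 1
loss = weighted_multitask_loss(labels, logits, weights)
print(loss)
```

A zero weight removes a (sample, task) pair from the loss entirely, which is how missing labels in a multitask dataset can be ignored during training.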

		%% Cell type:code id: tags:

		``` python
		from deepchem.models.tensorgraph.layers import Dense, SoftMax, \
		SoftMaxCrossEntropy, WeightedError, Concat
		SoftMaxCrossEntropy, WeightedError, Stack
		from deepchem.models.tensorgraph.layers import Label, Weights

		costs = []
		labels = []
		for task in range(len(tox21_tasks)):
		  classification = Dense(
		      out_channels=2, activation_fn=None, in_layers=[readout])

		  softmax = SoftMax(in_layers=[classification])
		  tg.add_output(softmax)

		  label = Label(shape=(None, 2))
		  labels.append(label)
		  cost = SoftMaxCrossEntropy(in_layers=[label, classification])
		  costs.append(cost)
		all_cost = Concat(in_layers=costs, axis=1)
		all_cost = Stack(in_layers=costs, axis=1)
		weights = Weights(shape=(None, len(tox21_tasks)))
		loss = WeightedError(in_layers=[all_cost, weights])
		tg.set_loss(loss)
		```

		%% Cell type:markdown id: tags:

		Now that we've successfully defined our graph convolutional model in `TensorGraph`, we need to train it. We can call `fit()`, but we need to make sure that each minibatch of data populates all four `Feature` objects that we've created. For this, we need to create a Python generator that, given a batch of data, generates a dictionary whose keys are the `Feature` layers and whose values are the NumPy arrays we'd like to use for this step of training.
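
Before looking at the real generator, it may help to see the pattern in isolation. The sketch below is a simplified stand-in in plain Python: string keys take the place of the `Feature`, `Label`, and `Weights` layer objects, `to_one_hot` is a local toy helper rather than the DeepChem function of the same name, and the padding scheme (repeat the first example with zero weight) is an assumption chosen for illustration.

```python
def to_one_hot(y, n_classes=2):
  """Turn integer class labels into one-hot rows."""
  return [[1 if int(label) == c else 0 for c in range(n_classes)]
          for label in y]


def batch_generator(X, y, batch_size, pad_batches=True):
  """Yield one feed dictionary per minibatch."""
  for start in range(0, len(X), batch_size):
    X_b = list(X[start:start + batch_size])
    y_b = list(y[start:start + batch_size])
    w_b = [1.0] * len(X_b)
    if pad_batches:
      # Repeat the first example with zero weight until the batch is full.
      while len(X_b) < batch_size:
        X_b.append(X_b[0])
        y_b.append(y_b[0])
        w_b.append(0.0)
    # String keys stand in for the layer objects used as keys in the notebook.
    yield {"features": X_b, "label": to_one_hot(y_b), "weights": w_b}


batches = list(batch_generator(X=[10, 11, 12], y=[0, 1, 1], batch_size=2))
for batch in batches:
  print(batch)
```

The last batch is padded to the full batch size, and the padded row carries zero weight so it does not contribute to a weighted loss.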

		%% Cell type:code id: tags:

		``` python
		from deepchem.metrics import to_one_hot
		from deepchem.feat.mol_graphs import ConvMol

		def data_generator(dataset, epochs=1, predict=False, pad_batches=True):
		  for epoch in range(epochs):
		    if not predict:
		      print('Starting epoch %i' % epoch)
		    for ind, (X_b, y_b, w_b, ids_b) in enumerate(
		        dataset.iterbatches(
		            batch_size, pad_batches=True, deterministic=True)):
		            batch_size, pad_batches=pad_batches, deterministic=True)):
		      d = {}
		      for index, label in enumerate(labels):
		        d[label] = to_one_hot(y_b[:, index])
		      d[weights] = w_b
		      multiConvMol = ConvMol.agglomerate_mols(X_b)
		      d[atom_features] = multiConvMol.get_atom_features()
		      d[degree_slice] = multiConvMol.deg_slice
		      d[membership] = multiConvMol.membership
		      for i in range(1, len(multiConvMol.get_deg_adjacency_lists())):
		        d[deg_adjs[i - 1]] = multiConvMol.get_deg_adjacency_lists()[i]
		      yield d
		```

		%% Cell type:markdown id: tags:

		Now, we can train the model using `TensorGraph.fit_generator(generator)` which will use the generator we've defined to train the model.

		%% Cell type:code id: tags:

		``` python
		# Epochs set to 1 to render tutorials online.
		# Set epochs=10 for better results.
		tg.fit_generator(data_generator(train_dataset, epochs=1))
		```

		%% Output

		Starting epoch 0
		Ending global_step 129: Average loss 586.139
		TIMING: model fitting took 21.158 s
		Ending global_step 251: Average loss 530.84
		TIMING: model fitting took 6.949 s

		530.8396410260882

		%% Cell type:markdown id: tags:

		Now that we have trained our graph convolutional method, let's evaluate its performance. We again have to use our defined generator to evaluate model performance.

		%% Cell type:code id: tags:

		``` python
		metric = dc.metrics.Metric(
		    dc.metrics.roc_auc_score, np.mean, mode="classification")

		def reshape_y_pred(y_true, y_pred):
		  """
		  TensorGraph.Predict returns a list of arrays, one for each output.
		  We also have to remove the padding on the last batch.
		  Metrics take results of shape (samples, n_task, prob_of_class).
		  """
		  n_samples = len(y_true)
		  retval = np.stack(y_pred, axis=1)
		  return retval[:n_samples]


		print("Evaluating model")
		train_scores = tg.evaluate_generator(data_generator(train_dataset, predict=True),
		[metric], labels=labels, weights=[weights])
		print("Training ROC-AUC Score: %f" % train_scores["mean-roc_auc_score"])
		valid_scores = tg.evaluate_generator(data_generator(valid_dataset, predict=True),
		[metric], labels=labels, weights=[weights])
		print("Valid ROC-AUC Score: %f" % valid_scores["mean-roc_auc_score"])
		train_predictions = tg.predict_on_generator(data_generator(train_dataset, predict=True))
		train_predictions = reshape_y_pred(train_dataset.y, train_predictions)
		train_scores = metric.compute_metric(train_dataset.y, train_predictions, train_dataset.w)
		print("Training ROC-AUC Score: %f" % train_scores)

		valid_predictions = tg.predict_on_generator(data_generator(valid_dataset, predict=True))
		valid_predictions = reshape_y_pred(valid_dataset.y, valid_predictions)
		valid_scores = metric.compute_metric(valid_dataset.y, valid_predictions, valid_dataset.w)
		print("Valid ROC-AUC Score: %f" % valid_scores)
		```
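
The shape handling described in the `reshape_y_pred` docstring can be checked on dummy arrays. This is only a sketch with made-up sizes: the list of per-task arrays stands in for a prediction result, and no model is involved.

```python
import numpy as np


def reshape_y_pred(y_true, y_pred):
  """Stack per-task outputs and drop rows that came from batch padding."""
  n_samples = len(y_true)
  stacked = np.stack(y_pred, axis=1)  # (padded_samples, n_tasks, n_classes)
  return stacked[:n_samples]


n_tasks, padded_samples, n_classes = 3, 4, 2
# One (padded_samples, n_classes) array per task, filled with the task
# index so the task axis is easy to spot after stacking.
per_task_outputs = [
    np.full((padded_samples, n_classes), float(task))
    for task in range(n_tasks)
]
y_true = np.zeros((3, n_tasks))  # only 3 real samples; the 4th row is padding

predictions = reshape_y_pred(y_true, per_task_outputs)
print(predictions.shape)  # (3, 3, 2)
```

Stacking along `axis=1` puts the task dimension in the middle, and slicing to `len(y_true)` discards the padded rows at the end.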

		%% Output

		Evaluating model
		computed_metrics: [0.48489674611520039, 0.49062127172873471, 0.49497985043341852, 0.50670994717906093, 0.47636546814871089, 0.50538527032779901, 0.47652531174729351, 0.49406973662015952, 0.49345587239947747, 0.51160175820338205, 0.48764342697532359, 0.505495556795303]
		Training ROC-AUC Score: 0.493979
		computed_metrics: [0.49546025306674396, 0.44814469802825652, 0.51528679653679654, 0.50477055883689226, 0.50335989998437258, 0.5068829891838742, 0.57539376819037835, 0.50612128689380276, 0.45944076672265588, 0.58692564745196329, 0.48453889353012158, 0.48315384615384616]
		Valid ROC-AUC Score: 0.505790
		computed_metrics: [0.83463194036351052, 0.86218739964675661, 0.84894031662657832, 0.80217986671584707, 0.70559942152332189, 0.79751934844253025, 0.8103057689046107, 0.71659210162938414, 0.80849247997327445, 0.72071717294380933, 0.83433314746710274, 0.78304357554399506]
		Training ROC-AUC Score: 0.793712
		computed_metrics: [0.78221936377578793, 0.78993055555555547, 0.81705388431256543, 0.77777071682765631, 0.66802272727272727, 0.67197702777122181, 0.64295604015230179, 0.72305596655628368, 0.74692724275959499, 0.63050611290902547, 0.80023473616134511, 0.73880275624461667]
		Valid ROC-AUC Score: 0.732455

		%% Cell type:markdown id: tags:

		Success! The model we've constructed behaves nearly identically to `GraphConvTensorGraph`. If you're looking to build your own custom models, you can follow the example we've provided here to do so. We hope to see exciting constructions from your end soon!

		%% Cell type:code id: tags:

		``` python
		```
