Commit f8880708 authored by Bharath Ramsundar's avatar Bharath Ramsundar
Browse files

Merge branch 'master' of https://github.com/deepchem/deepchem into rmdir

parents 0b357519 ac6ceac0
Loading
Loading
Loading
Loading
+27 −14
Original line number Diff line number Diff line
@@ -185,11 +185,28 @@ def load_pickle_from_disk(filename):


def load_dataset_from_disk(save_dir):
  """
  Parameters
  ----------
  save_dir: str

  Returns
  -------
  loaded: bool
    Whether the load succeeded
  all_dataset: (dc.data.Dataset, dc.data.Dataset, dc.data.Dataset)
    The train, valid, test datasets
  transformers: list of dc.trans.Transformer
    The transformers used for this dataset

  """

  train_dir = os.path.join(save_dir, "train_dir")
  valid_dir = os.path.join(save_dir, "valid_dir")
  test_dir = os.path.join(save_dir, "test_dir")
  if os.path.exists(train_dir) and os.path.exists(valid_dir) and os.path.exists(
      test_dir):
  if not os.path.exists(train_dir) or not os.path.exists(
      valid_dir) or not os.path.exists(test_dir):
    return False, None, list()
  loaded = True
  train = deepchem.data.DiskDataset(train_dir)
  valid = deepchem.data.DiskDataset(valid_dir)
@@ -197,10 +214,6 @@ def load_dataset_from_disk(save_dir):
  all_dataset = (train, valid, test)
  with open(os.path.join(save_dir, "transformers.pkl"), 'rb') as f:
    transformers = pickle.load(f)
  else:
    loaded = False
    all_dataset = None
    transformers = []
    return loaded, all_dataset, transformers


+45 −22
Original line number Diff line number Diff line
%% Cell type:markdown id: tags:

# Graph Convolutions For Tox21
In this notebook, we will explore the use of TensorGraph to create graph convolutional models with DeepChem. In particular, we will build a graph convolutional network on the Tox21 dataset.

Let's start with some basic imports.

%% Cell type:code id: tags:

``` python
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import numpy as np
import tensorflow as tf
import deepchem as dc
from deepchem.models.tensorgraph.models.graph_models import GraphConvTensorGraph
```

%% Cell type:markdown id: tags:

Now, let's use MoleculeNet to load the Tox21 dataset. We need to make sure to process the data in a way that graph convolutional networks can use For that, we make sure to set the featurizer option to 'GraphConv'. The MoleculeNet call will return a training set, an validation set, and a test set for us to use. The call also returns `transformers`, a list of data transformations that were applied to preprocess the dataset. (Most deep networks are quite finicky and require a set of data transformations to ensure that training proceeds stably.)

%% Cell type:code id: tags:

``` python
# Load Tox21 dataset
tox21_tasks, tox21_datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = tox21_datasets
```

%% Output

    Loading dataset from disk.
    Loading dataset from disk.
    Loading dataset from disk.

%% Cell type:markdown id: tags:

Let's now train a graph convolutional network on this dataset. DeepChem has the class `GraphConvTensorGraph` that wraps a standard graph convolutional architecture underneath the hood for user convenience. Let's instantiate an object of this class and train it on our dataset.

%% Cell type:code id: tags:

``` python
model = GraphConvTensorGraph(
    len(tox21_tasks), batch_size=50, mode='classification')
# Set nb_epoch=10 for better results.
model.fit(train_dataset, nb_epoch=1)
```

%% Output

    /home/rbharath/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:91: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
    /home/leswing/miniconda3/envs/deepchem/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py:95: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
      "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

    Starting epoch 0
    Ending global_step 129: Average loss 586.57
    TIMING: model fitting took 17.082 s
    Ending global_step 126: Average loss 590.739
    TIMING: model fitting took 7.317 s

    590.7391258118645

%% Cell type:markdown id: tags:

Let's try to evaluate the performance of the model we've trained. For this, we need to define a metric, a measure of model performance. `dc.metrics` holds a collection of metrics already. For this dataset, it is standard to use the ROC-AUC score, the area under the receiver operating characteristic curve (which measures the tradeoff between precision and recall). Luckily, the ROC-AUC score is already available in DeepChem.

To measure the performance of the model under this metric, we can use the convenience function `model.evaluate()`.

%% Cell type:code id: tags:

``` python
metric = dc.metrics.Metric(
    dc.metrics.roc_auc_score, np.mean, mode="classification")

print("Evaluating model")
train_scores = model.evaluate(train_dataset, [metric], transformers)
print("Training ROC-AUC Score: %f" % train_scores["mean-roc_auc_score"])
valid_scores = model.evaluate(valid_dataset, [metric], transformers)
print("Validation ROC-AUC Score: %f" % valid_scores["mean-roc_auc_score"])
```

%% Output

    Evaluating model
    computed_metrics: [0.82355634419908874, 0.84317769495469375, 0.83831967490229498, 0.78265073415417696, 0.67514582634690545, 0.78648566994877722, 0.71947133017380849, 0.68636368747841314, 0.77653631284916202, 0.66689869847282113, 0.77872709187364375, 0.74727748235240998]
    Training ROC-AUC Score: 0.760384
    computed_metrics: [0.76064908722109537, 0.82394038192827201, 0.83087572150072142, 0.7314288959563835, 0.60984975330966895, 0.60837196235426316, 0.59852764937510705, 0.65035114358925683, 0.71262721074992585, 0.58264411027568919, 0.71303258145363402, 0.66946153846153844]
    Validation ROC-AUC Score: 0.690980
    computed_metrics: [0.80045699830862893, 0.83618637604367374, 0.83908539936708681, 0.77873855933094183, 0.67692252993044244, 0.75578036941489168, 0.75895796821704797, 0.70234314980793855, 0.76387081283102387, 0.65924917162534913, 0.78448201448364341, 0.76675448900822296]
    Training ROC-AUC Score: 0.760236
    computed_metrics: [0.71533171721169553, 0.74090608465608465, 0.81106357802757933, 0.70627859684799188, 0.63177272727272715, 0.6326016835811501, 0.61491865697473169, 0.71286314850043442, 0.67676006592889104, 0.51656328658755846, 0.75414979999520937, 0.6603359173126615]
    Validation ROC-AUC Score: 0.681129

%% Cell type:markdown id: tags:

What's going on under the hood? Could we build `GraphConvTensorGraph` ourselves? Of course! The first step is to create a `TensorGraph` object. This object will hold the "computational graph" that defines the computation that a graph convolutional network will perform.

%% Cell type:code id: tags:

``` python
from deepchem.models.tensorgraph.tensor_graph import TensorGraph

tg = TensorGraph(use_queue=False)
```

%% Cell type:markdown id: tags:

Let's now define the inputs to our model. Conceptually, graph convolutions just requires a the structure of the molecule in question and a vector of features for every atom that describes the local chemical environment. However in practice, due to TensorFlow's limitations as a general programming environment, we have to have some auxiliary information as well preprocessed.

`atom_features` holds a feature vector of length 75 for each atom. The other feature inputs are required to support minibatching in TensorFlow. `degree_slice` is an indexing convenience that makes it easy to locate atoms from all molecules with a given degree. `membership` determines the membership of atoms in molecules (atom `i` belongs to molecule `membership[i]`). `deg_adjs` is a list that contains adjacency lists grouped by atom degree For more details, check out the [code](https://github.com/deepchem/deepchem/blob/master/deepchem/feat/mol_graphs.py).

To define feature inputs in `TensorGraph`, we use the `Feature` layer. Conceptually, a `TensorGraph` is a mathematical graph composed of layer objects. `Features` layers have to be the root nodes of the graph since they consitute inputs.

%% Cell type:code id: tags:

``` python
from deepchem.models.tensorgraph.layers import Feature

atom_features = Feature(shape=(None, 75))
degree_slice = Feature(shape=(None, 2), dtype=tf.int32)
membership = Feature(shape=(None,), dtype=tf.int32)

deg_adjs = []
for i in range(0, 10 + 1):
    deg_adj = Feature(shape=(None, i + 1), dtype=tf.int32)
    deg_adjs.append(deg_adj)
```

%% Cell type:markdown id: tags:

Let's now implement the body of the graph convolutional network. `TensorGraph` has a number of layers that encode various graph operations. Namely, the `GraphConv`, `GraphPool` and `GraphGather` layers. We will also apply standard neural network layers such as `Dense` and `BatchNorm`.

The layers we're adding effect a "feature transformation" that will create one vector for each molecule.

%% Cell type:code id: tags:

``` python
from deepchem.models.tensorgraph.layers import Dense, GraphConv, BatchNorm
from deepchem.models.tensorgraph.layers import GraphPool, GraphGather

batch_size = 50

gc1 = GraphConv(
    64,
    activation_fn=tf.nn.relu,
    in_layers=[atom_features, degree_slice, membership] + deg_adjs)
batch_norm1 = BatchNorm(in_layers=[gc1])
gp1 = GraphPool(in_layers=[batch_norm1, degree_slice, membership] + deg_adjs)
gc2 = GraphConv(
    64,
    activation_fn=tf.nn.relu,
    in_layers=[gp1, degree_slice, membership] + deg_adjs)
batch_norm2 = BatchNorm(in_layers=[gc2])
gp2 = GraphPool(in_layers=[batch_norm2, degree_slice, membership] + deg_adjs)
dense = Dense(out_channels=128, activation_fn=tf.nn.relu, in_layers=[gp2])
batch_norm3 = BatchNorm(in_layers=[dense])
readout = GraphGather(
    batch_size=batch_size,
    activation_fn=tf.nn.tanh,
    in_layers=[batch_norm3, degree_slice, membership] + deg_adjs)
```

%% Cell type:markdown id: tags:

Let's now make predictions from the `TensorGraph` model. Tox21 is a multitask dataset. That is, there are 12 different datasets grouped together, which share many common molecules, but with different outputs for each. As a result, we have to add a separate output layer for each task. We will use a `for` loop over the `tox21_tasks` list to make this happen. We need to add labels for each

We also have to define a loss for the model which tells the network the objective to minimize during training.

We have to tell `TensorGraph` which layers are outputs with `TensorGraph.add_output(layer)`. Similarly, we tell the network its loss with `TensorGraph.set_loss(loss)`.

%% Cell type:code id: tags:

``` python
from deepchem.models.tensorgraph.layers import Dense, SoftMax, \
    SoftMaxCrossEntropy, WeightedError, Concat
    SoftMaxCrossEntropy, WeightedError, Stack
from deepchem.models.tensorgraph.layers import Label, Weights

costs = []
labels = []
for task in range(len(tox21_tasks)):
    classification = Dense(
        out_channels=2, activation_fn=None, in_layers=[readout])

    softmax = SoftMax(in_layers=[classification])
    tg.add_output(softmax)

    label = Label(shape=(None, 2))
    labels.append(label)
    cost = SoftMaxCrossEntropy(in_layers=[label, classification])
    costs.append(cost)
all_cost = Concat(in_layers=costs, axis=1)
all_cost = Stack(in_layers=costs, axis=1)
weights = Weights(shape=(None, len(tox21_tasks)))
loss = WeightedError(in_layers=[all_cost, weights])
tg.set_loss(loss)
```

%% Cell type:markdown id: tags:

Now that we've successfully defined our graph convolutional model in `TensorGraph`, we need to train it. We can call `fit()`, but we need to make sure that each minibatch of data populates all four `Feature` objects that we've created. For this, we need to create a Python generator that given a batch of data generates a dictionary whose keys are the `Feature` layers and whose values are Numpy arrays we'd like to use for this step of training.

%% Cell type:code id: tags:

``` python
from deepchem.metrics import to_one_hot
from deepchem.feat.mol_graphs import ConvMol

def data_generator(dataset, epochs=1, predict=False, pad_batches=True):
  for epoch in range(epochs):
    if not predict:
        print('Starting epoch %i' % epoch)
    for ind, (X_b, y_b, w_b, ids_b) in enumerate(
        dataset.iterbatches(
            batch_size, pad_batches=True, deterministic=True)):
            batch_size, pad_batches=pad_batches, deterministic=True)):
      d = {}
      for index, label in enumerate(labels):
        d[label] = to_one_hot(y_b[:, index])
      d[weights] = w_b
      multiConvMol = ConvMol.agglomerate_mols(X_b)
      d[atom_features] = multiConvMol.get_atom_features()
      d[degree_slice] = multiConvMol.deg_slice
      d[membership] = multiConvMol.membership
      for i in range(1, len(multiConvMol.get_deg_adjacency_lists())):
        d[deg_adjs[i - 1]] = multiConvMol.get_deg_adjacency_lists()[i]
      yield d
```

%% Cell type:markdown id: tags:

Now, we can train the model using `TensorGraph.fit_generator(generator)` which will use the generator we've defined to train the model.

%% Cell type:code id: tags:

``` python
# Epochs set to 1 to render tutorials online.
# Set epochs=10 for better results.
tg.fit_generator(data_generator(train_dataset, epochs=1))
```

%% Output

    Starting epoch 0
    Ending global_step 129: Average loss 586.139
    TIMING: model fitting took 21.158 s
    Ending global_step 251: Average loss 530.84
    TIMING: model fitting took 6.949 s

    530.8396410260882

%% Cell type:markdown id: tags:

Now that we have trained our graph convolutional method, let's evaluate its performance. We again have to use our defined generator to evaluate model performance.

%% Cell type:code id: tags:

``` python
metric = dc.metrics.Metric(
    dc.metrics.roc_auc_score, np.mean, mode="classification")

def reshape_y_pred(y_true, y_pred):
    """
    TensorGraph.Predict returns a list of arrays, one for each output
    We also have to remove the padding on the last batch
    Metrics taks results of shape (samples, n_task, prob_of_class)
    """
    n_samples = len(y_true)
    retval = np.stack(y_pred, axis=1)
    return retval[:n_samples]


print("Evaluating model")
train_scores = tg.evaluate_generator(data_generator(train_dataset, predict=True),
                                     [metric], labels=labels, weights=[weights])
print("Training ROC-AUC Score: %f" % train_scores["mean-roc_auc_score"])
valid_scores = tg.evaluate_generator(data_generator(valid_dataset, predict=True),
                                     [metric], labels=labels, weights=[weights])
print("Valid ROC-AUC Score: %f" % valid_scores["mean-roc_auc_score"])
train_predictions = tg.predict_on_generator(data_generator(train_dataset, predict=True))
train_predictions = reshape_y_pred(train_dataset.y, train_predictions)
train_scores = metric.compute_metric(train_dataset.y, train_predictions, train_dataset.w)
print("Training ROC-AUC Score: %f" % train_scores)

valid_predictions = tg.predict_on_generator(data_generator(valid_dataset, predict=True))
valid_predictions = reshape_y_pred(valid_dataset.y, valid_predictions)
valid_scores = metric.compute_metric(valid_dataset.y, valid_predictions, valid_dataset.w)
print("Valid ROC-AUC Score: %f" % valid_scores)
```

%% Output

    Evaluating model
    computed_metrics: [0.48489674611520039, 0.49062127172873471, 0.49497985043341852, 0.50670994717906093, 0.47636546814871089, 0.50538527032779901, 0.47652531174729351, 0.49406973662015952, 0.49345587239947747, 0.51160175820338205, 0.48764342697532359, 0.505495556795303]
    Training ROC-AUC Score: 0.493979
    computed_metrics: [0.49546025306674396, 0.44814469802825652, 0.51528679653679654, 0.50477055883689226, 0.50335989998437258, 0.5068829891838742, 0.57539376819037835, 0.50612128689380276, 0.45944076672265588, 0.58692564745196329, 0.48453889353012158, 0.48315384615384616]
    Valid ROC-AUC Score: 0.505790
    computed_metrics: [0.83463194036351052, 0.86218739964675661, 0.84894031662657832, 0.80217986671584707, 0.70559942152332189, 0.79751934844253025, 0.8103057689046107, 0.71659210162938414, 0.80849247997327445, 0.72071717294380933, 0.83433314746710274, 0.78304357554399506]
    Training ROC-AUC Score: 0.793712
    computed_metrics: [0.78221936377578793, 0.78993055555555547, 0.81705388431256543, 0.77777071682765631, 0.66802272727272727, 0.67197702777122181, 0.64295604015230179, 0.72305596655628368, 0.74692724275959499, 0.63050611290902547, 0.80023473616134511, 0.73880275624461667]
    Valid ROC-AUC Score: 0.732455

%% Cell type:markdown id: tags:

Success! The model we've constructed behaves nearly identically to `GraphConvTensorGraph`. If you're looking to build your own custom models, you can follow the example we've provided here to do so. We hope to see exciting constructions from your end soon!

%% Cell type:code id: tags:

``` python
```