Commit 15a1cf1b authored by Arun

moved asset to assets [skip ci]

parent 57a0a5a7
+5 −5
@@ -14,7 +14,7 @@
    "\n",
    "Mariia Matveieva, Pavel Polishchuk. Institute of Molecular and Translational Medicine, Palacky University, Olomouc, Czech Republic.\n",
    "\n",
    "<img src=\"atomic_contributions_tutorial_data/index.png\">\n",
    "<img src=\"assets/atomic_contributions_tutorial_data/index.png\">\n",
    "\n",
    "## Colab\n",
    "\n",
@@ -80,7 +80,7 @@
    "from rdkit.Chem.Draw import SimilarityMaps\n",
    "import tensorflow as tf\n",
    "\n",
    "DATASET_FILE ='atomic_contributions_tutorial_data/logBB.sdf'\n",
    "DATASET_FILE ='assets/atomic_contributions_tutorial_data/logBB.sdf'\n",
    "# Create RDKit mol objects, since we will need them later.\n",
    "mols = [m for m in Chem.SDMolSupplier(DATASET_FILE) if m is not None ]\n",
    "loader = dc.data.SDFLoader(tasks=[\"logBB_class\"], \n",
@@ -152,7 +152,7 @@
    }
   ],
   "source": [
    "TEST_DATASET_FILE = 'atomic_contributions_tutorial_data/logBB_test_.sdf'\n",
    "TEST_DATASET_FILE = 'assets/atomic_contributions_tutorial_data/logBB_test_.sdf'\n",
    "loader = dc.data.SDFLoader(tasks=[\"p_np\"], sanitize=True,\n",
    "                           featurizer=dc.feat.ConvMolFeaturizer())\n",
    "test_dataset = loader.create_dataset(TEST_DATASET_FILE, shard_size=2000)\n",
@@ -619,7 +619,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
    "DATASET_FILE ='atomic_contributions_tutorial_data/Tetrahymena_pyriformis_Work_set_OCHEM.sdf'\n",
    "DATASET_FILE ='assets/atomic_contributions_tutorial_data/Tetrahymena_pyriformis_Work_set_OCHEM.sdf'\n",
    "# create RDKit mol objects, we will need them later\n",
    "mols = [m for m in Chem.SDMolSupplier(DATASET_FILE) if m is not None ]\n",
    "loader = dc.data.SDFLoader(tasks=[\"IGC50\"], \n",
@@ -683,7 +683,7 @@
    }
   ],
   "source": [
    "TEST_DATASET_FILE = 'atomic_contributions_tutorial_data/Tetrahymena_pyriformis_Test_set_OCHEM.sdf'\n",
    "TEST_DATASET_FILE = 'assets/atomic_contributions_tutorial_data/Tetrahymena_pyriformis_Test_set_OCHEM.sdf'\n",
    "loader = dc.data.SDFLoader(tasks=[\"IGC50\"], sanitize= True,\n",
    "                           featurizer=dc.feat.ConvMolFeaturizer())\n",
    "test_dataset = loader.create_dataset(TEST_DATASET_FILE, shard_size=2000)\n",
+1 −1
%% Cell type:markdown id: tags:

#  Introduction to Graph Convolutions

In this tutorial we will learn more about "graph convolutions." These are one of the most powerful deep learning tools for working with molecular data. The reason for this is that molecules can be naturally viewed as graphs.

-![Molecular Graph](https://github.com/deepchem/deepchem/blob/master/examples/tutorials/basic_graphs.gif?raw=1)
+![Molecular Graph](https://github.com/deepchem/deepchem/blob/master/examples/tutorials/assets/basic_graphs.gif?raw=1)

Note how standard chemical diagrams of the sort we're used to from high school lend themselves naturally to visualizing molecules as graphs. In the remainder of this tutorial, we'll dig into this relationship in significantly more detail. This will let us get a deeper understanding of how these systems work.

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Introduction_to_Graph_Convolutions.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
```

%% Cell type:markdown id: tags:

# What are Graph Convolutions?

Consider a standard convolutional neural network (CNN) of the sort commonly used to process images.  The input is a grid of pixels.  There is a vector of data values for each pixel, for example the red, green, and blue color channels.  The data passes through a series of convolutional layers.  Each layer combines the data from a pixel and its neighbors to produce a new data vector for the pixel.  Early layers detect small scale local patterns, while later layers detect larger, more abstract patterns.  Often the convolutional layers alternate with pooling layers that perform some operation such as max or min over local regions.

Graph convolutions are similar, but they operate on a graph.  They begin with a data vector for each node of the graph (for example, the chemical properties of the atom that node represents).  Convolutional and pooling layers combine information from connected nodes (for example, atoms that are bonded to each other) to produce a new data vector for each node.
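
To make the idea of "combining information from connected nodes" concrete, here is a minimal sketch in plain Python. The toy graph, the feature values, and the `aggregate` helper are all hypothetical, and the update rule (a plain sum over a node and its neighbors) is a simplifying assumption: a real graph convolution layer also applies learned weights and a nonlinearity.

```python
# Hypothetical toy graph: three nodes in a chain 0 - 1 - 2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
# Hypothetical two-dimensional feature vector for each node.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}

def aggregate(features, neighbors):
    """Return a new vector per node: its own vector plus its neighbors' vectors."""
    updated = {}
    for node, vec in features.items():
        total = list(vec)
        for nb in neighbors[node]:
            total = [t + f for t, f in zip(total, features[nb])]
        updated[node] = total
    return updated

new_features = aggregate(features, neighbors)
print(new_features)
```

After one such step, each node's vector reflects its immediate neighbors. Stacking several steps lets information travel further across the graph, much as deeper CNN layers see larger image regions.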

# Training a GraphConvModel

Let's use the MoleculeNet suite to load the Tox21 dataset. To featurize the data in a way that graph convolutional networks can use, we set the featurizer option to `'GraphConv'`. The MoleculeNet call returns a training set, a validation set, and a test set for us to use. It also returns `tasks`, a list of the task names, and `transformers`, a list of data transformations that were applied to preprocess the dataset. (Most deep networks are quite finicky and require a set of data transformations to ensure that training proceeds stably.)

%% Cell type:code id: tags:

``` python
import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets
```

%% Cell type:markdown id: tags:

Let's now train a graph convolutional network on this dataset. DeepChem has the class `GraphConvModel` that wraps a standard graph convolutional architecture underneath the hood for user convenience. Let's instantiate an object of this class and train it on our dataset.

%% Cell type:code id: tags:

``` python
n_tasks = len(tasks)
model = dc.models.GraphConvModel(n_tasks, mode='classification')
model.fit(train_dataset, nb_epoch=50)
```

%% Output

    0.28185401916503905

%% Cell type:markdown id: tags:

Let's try to evaluate the performance of the model we've trained. For this, we need to define a metric, a measure of model performance. `dc.metrics` holds a collection of metrics already. For this dataset, it is standard to use the ROC-AUC score, the area under the receiver operating characteristic curve (which measures the tradeoff between the true positive rate and the false positive rate). Luckily, the ROC-AUC score is already available in DeepChem.

To measure the performance of the model under this metric, we can use the convenience function `model.evaluate()`.
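
If it helps to see what the ROC-AUC score measures, the sketch below computes it by hand for a single task. It relies on the standard interpretation of ROC-AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The `roc_auc` helper and the four labels and scores are made up for illustration; this is not how DeepChem computes the metric internally.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities of the positive class.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.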

%% Cell type:code id: tags:

``` python
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('Training set score:', model.evaluate(train_dataset, [metric], transformers))
print('Test set score:', model.evaluate(test_dataset, [metric], transformers))
```

%% Output

    Training set score: {'roc_auc_score': 0.96959686893055}
    Test set score: {'roc_auc_score': 0.795793783300876}

%% Cell type:markdown id: tags:

The results are pretty good, and `GraphConvModel` is very easy to use. But what's going on under the hood? Could we build GraphConvModel ourselves? Of course! DeepChem provides Keras layers for all the calculations involved in a graph convolution. We are going to apply the following layers from DeepChem.

- `GraphConv` layer: This layer implements the graph convolution. The graph convolution combines per-node feature vectors in a nonlinear fashion with the feature vectors for neighboring nodes.  This "blends" information in local neighborhoods of a graph.

- `GraphPool` layer: This layer does a max-pooling over the feature vectors of atoms in a neighborhood. You can think of this layer as analogous to a max-pooling layer for 2D convolutions but which operates on graphs instead.

- `GraphGather`: Many graph convolutional networks manipulate feature vectors per graph-node. For a molecule for example, each node might represent an atom, and the network would manipulate atomic feature vectors that summarize the local chemistry of the atom. However, at the end of the application, we will likely want to work with a molecule level feature representation. This layer creates a graph level feature vector by combining all the node-level feature vectors.

Apart from these, we are going to apply standard neural network layers such as [Dense](https://keras.io/api/layers/core_layers/dense/), [BatchNormalization](https://keras.io/api/layers/normalization_layers/batch_normalization/) and [Softmax](https://keras.io/api/layers/activation_layers/softmax/).
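
The pooling and gathering ideas in the list above can be sketched in a few lines of plain Python. The graph, the feature values, and the `graph_pool` and `graph_gather` helpers below are hypothetical stand-ins, not the DeepChem layers. In particular, summing the node vectors is only one assumed way of "combining" them into a graph-level vector.

```python
# Hypothetical toy graph: three atoms in a chain 0 - 1 - 2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 5.0], 1: [4.0, 2.0], 2: [3.0, 0.0]}

def graph_pool(features, neighbors):
    """Element-wise max over each node's own vector and its neighbors' vectors."""
    pooled = {}
    for node, vec in features.items():
        group = [vec] + [features[nb] for nb in neighbors[node]]
        pooled[node] = [max(column) for column in zip(*group)]
    return pooled

def graph_gather(features):
    """Sum all node vectors into a single graph-level vector."""
    return [sum(column) for column in zip(*features.values())]

pooled = graph_pool(features, neighbors)
graph_vector = graph_gather(pooled)
print(pooled)
print(graph_vector)
```

Note that pooling keeps one vector per node, while gathering collapses the whole graph into one vector that the final dense layers can consume.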

%% Cell type:code id: tags:

``` python
from deepchem.models.layers import GraphConv, GraphPool, GraphGather
import tensorflow as tf
import tensorflow.keras.layers as layers

batch_size = 100

class MyGraphConvModel(tf.keras.Model):

  def __init__(self):
    super(MyGraphConvModel, self).__init__()
    self.gc1 = GraphConv(128, activation_fn=tf.nn.tanh)
    self.batch_norm1 = layers.BatchNormalization()
    self.gp1 = GraphPool()

    self.gc2 = GraphConv(128, activation_fn=tf.nn.tanh)
    self.batch_norm2 = layers.BatchNormalization()
    self.gp2 = GraphPool()

    self.dense1 = layers.Dense(256, activation=tf.nn.tanh)
    self.batch_norm3 = layers.BatchNormalization()
    self.readout = GraphGather(batch_size=batch_size, activation_fn=tf.nn.tanh)

    self.dense2 = layers.Dense(n_tasks*2)
    self.logits = layers.Reshape((n_tasks, 2))
    self.softmax = layers.Softmax()

  def call(self, inputs):
    gc1_output = self.gc1(inputs)
    batch_norm1_output = self.batch_norm1(gc1_output)
    gp1_output = self.gp1([batch_norm1_output] + inputs[1:])

    gc2_output = self.gc2([gp1_output] + inputs[1:])
    batch_norm2_output = self.batch_norm2(gc2_output)
    gp2_output = self.gp2([batch_norm2_output] + inputs[1:])

    dense1_output = self.dense1(gp2_output)
    batch_norm3_output = self.batch_norm3(dense1_output)
    readout_output = self.readout([batch_norm3_output] + inputs[1:])

    logits_output = self.logits(self.dense2(readout_output))
    return self.softmax(logits_output)
```

%% Cell type:markdown id: tags:

We can now see more clearly what is happening.  There are two convolutional blocks, each consisting of a `GraphConv`, followed by batch normalization, followed by a `GraphPool` to do max pooling.  We finish up with a dense layer, another batch normalization, a `GraphGather` to combine the data from all the different nodes, and a final dense layer to produce the global output.

Let's now create the DeepChem model, which will be a wrapper around the Keras model that we just created. We will also specify the loss function so the model knows the objective to minimize.

%% Cell type:code id: tags:

``` python
model = dc.models.KerasModel(MyGraphConvModel(), loss=dc.models.losses.CategoricalCrossEntropy())
```

%% Cell type:markdown id: tags:

What are the inputs to this model?  A graph convolution requires a complete description of each molecule, including the list of nodes (atoms) and a description of which ones are bonded to each other.  In fact, if we inspect the dataset we see that the feature array contains Python objects of type `ConvMol`.

%% Cell type:code id: tags:

``` python
test_dataset.X[0]
```

%% Output

    <deepchem.feat.mol_graphs.ConvMol at 0x14d0b1650>

%% Cell type:markdown id: tags:

Models expect arrays of numbers as their inputs, not Python objects.  We must convert the `ConvMol` objects into the particular set of arrays expected by the `GraphConv`, `GraphPool`, and `GraphGather` layers.  Fortunately, the `ConvMol` class includes the code to do this, as well as to combine all the molecules in a batch to create a single set of arrays.

The following code creates a Python generator that given a batch of data generates the lists of inputs, labels, and weights whose values are Numpy arrays. `atom_features` holds a feature vector of length 75 for each atom. The other inputs are required to support minibatching in TensorFlow. `degree_slice` is an indexing convenience that makes it easy to locate atoms from all molecules with a given degree. `membership` determines the membership of atoms in molecules (atom `i` belongs to molecule `membership[i]`). `deg_adjs` is a list that contains adjacency lists grouped by atom degree. For more details, check out the [code](https://github.com/deepchem/deepchem/blob/master/deepchem/feat/mol_graphs.py).
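
As a rough illustration of what `membership` is for, the sketch below flattens a hypothetical batch of two tiny molecules into one list of atoms and records which molecule each atom came from. The data and variable names are invented for this example. The sketch also leaves out the grouping of atoms by degree that the description above attributes to `degree_slice` and `deg_adjs`.

```python
# Hypothetical batch: each molecule is a list of one-dimensional atom feature vectors.
molecules = [
    [[1.0], [2.0]],          # molecule 0 has two atoms
    [[3.0], [4.0], [5.0]],   # molecule 1 has three atoms
]

atom_features = []  # all atoms in the batch, stacked into one flat list
membership = []     # membership[i] is the index of the molecule that owns atom i
for mol_index, atoms in enumerate(molecules):
    for atom in atoms:
        atom_features.append(atom)
        membership.append(mol_index)

# The flat arrays are enough to recover per-molecule quantities, here a sum.
sums = [0.0] * len(molecules)
for atom, mol_index in zip(atom_features, membership):
    sums[mol_index] += atom[0]

print(membership)
print(sums)
```

Storing atoms this way lets a batch hold molecules of different sizes without padding each one to a common length.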

%% Cell type:code id: tags:

``` python
from deepchem.metrics import to_one_hot
from deepchem.feat.mol_graphs import ConvMol
import numpy as np

def data_generator(dataset, epochs=1):
  for ind, (X_b, y_b, w_b, ids_b) in enumerate(dataset.iterbatches(batch_size, epochs,
                                                                   deterministic=False, pad_batches=True)):
    multiConvMol = ConvMol.agglomerate_mols(X_b)
    inputs = [multiConvMol.get_atom_features(), multiConvMol.deg_slice, np.array(multiConvMol.membership)]
    for i in range(1, len(multiConvMol.get_deg_adjacency_lists())):
      inputs.append(multiConvMol.get_deg_adjacency_lists()[i])
    labels = [to_one_hot(y_b.flatten(), 2).reshape(-1, n_tasks, 2)]
    weights = [w_b]
    yield (inputs, labels, weights)
```

%% Cell type:markdown id: tags:

Now, we can train the model using `fit_generator(generator)` which will use the generator we've defined to train the model.

%% Cell type:code id: tags:

``` python
model.fit_generator(data_generator(train_dataset, epochs=50))
```

%% Output

    0.21941944122314452

%% Cell type:markdown id: tags:

Now that we have trained our graph convolutional method, let's evaluate its performance. We again have to use our defined generator to evaluate model performance.

%% Cell type:code id: tags:

``` python
print('Training set score:', model.evaluate_generator(data_generator(train_dataset), [metric], transformers))
print('Test set score:', model.evaluate_generator(data_generator(test_dataset), [metric], transformers))
```

%% Output

    Training set score: {'roc_auc_score': 0.8425638289185731}
    Test set score: {'roc_auc_score': 0.7378436684114341}

%% Cell type:markdown id: tags:

Success! The model we've constructed follows the same design as `GraphConvModel` and, in this run, reaches comparable though somewhat lower ROC-AUC scores. If you're looking to build your own custom models, you can follow the example we've provided here to do so. We hope to see exciting constructions from your end soon!

%% Cell type:markdown id: tags:

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!
+1 −1
@@ -23,7 +23,7 @@
    "\n",
    "[LIME](https://homes.cs.washington.edu/~marcotcr/blog/lime/) is a tool which can help with this problem.  It uses local perturbations of feature space to determine feature importance. In this tutorial, you'll learn how to use LIME alongside DeepChem to interpret what it is our models are learning. \n",
    "\n",
    "![Selection_110.png](https://github.com/deepchem/deepchem/blob/master/examples/tutorials/lime_dog.png?raw=1)\n",
    "![Selection_110.png](https://github.com/deepchem/deepchem/blob/master/examples/tutorials/assets/lime_dog.png?raw=1)\n",
    "\n",
    "So if this tool can work in human understandable ways for images can it work on molecules?  In this tutorial you will learn how to use LIME for model interpretability for any of our fixed-length featurization models.\n",
    "\n",
+1 −1
%% Cell type:markdown id: tags:

#  Using Reinforcement Learning to Play Pong

This tutorial demonstrates using reinforcement learning to train an agent to play Pong.  This task isn't directly related to chemistry, but video games make an excellent demonstration of reinforcement learning techniques.

-![title](pong.png)
+![title](assets/pong.png)

## Colab

This tutorial and the rest in this sequence can be done in Google Colab (although the visualization at the end doesn't work correctly on Colab, so you might prefer to run this tutorial locally). If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Using_Reinforcement_Learning_to_Play_Pong.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment. To install `gym` you should also use `pip install 'gym[atari]'` (we need the extra modifier since we'll be using an Atari game). We'll add this command onto our usual Colab installation commands for you.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```

%% Cell type:code id: tags:

``` python
!pip install 'gym[atari]'
```

%% Cell type:markdown id: tags:

## Reinforcement Learning

Reinforcement learning involves an *agent* that interacts with an *environment*.  In this case, the environment is the video game and the agent is the player.  By trial and error, the agent learns a *policy* that it follows to perform some task (winning the game).  As it plays, it receives *rewards* that give it feedback on how well it is doing.  In this case, it receives a positive reward every time it scores a point and a negative reward every time the other player scores a point.
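
The agent, environment, policy, and reward vocabulary can be illustrated with a toy loop in plain Python. Everything below (`ToyEnv`, `fixed_policy`, the target sequence) is hypothetical and has nothing to do with Pong or with DeepChem's classes. It only shows how rewards of +1 and -1 accumulate as an agent acts.

```python
class ToyEnv:
    """Hypothetical environment: the agent scores when it matches a hidden target."""

    def __init__(self, targets):
        self.targets = targets
        self.t = 0
        self.terminated = False

    def step(self, action):
        # +1 stands for "we scored a point", -1 for "the opponent scored".
        reward = 1.0 if action == self.targets[self.t] else -1.0
        self.t += 1
        self.terminated = self.t >= len(self.targets)
        return reward

def fixed_policy(step_index):
    """Trivial policy that ignores its input and always chooses action 0."""
    return 0

toy_env = ToyEnv(targets=[0, 1, 0, 0])
total_reward = 0.0
step_index = 0
while not toy_env.terminated:
    total_reward += toy_env.step(fixed_policy(step_index))
    step_index += 1

print(total_reward)  # three matches and one miss -> 2.0
```

A learning algorithm would adjust the policy so that the total reward grows over many such episodes. Here the policy is fixed, so nothing is learned.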

The first step is to create an `Environment` that implements this task.  Fortunately,
OpenAI Gym already provides an implementation of Pong (and many other tasks appropriate
for reinforcement learning).  DeepChem's `GymEnvironment` class provides an easy way to
use environments from OpenAI Gym.  We could just use it directly, but in this case we
subclass it and preprocess the screen image a little bit to make learning easier.

%% Cell type:code id: tags:

``` python
import deepchem as dc
import numpy as np

class PongEnv(dc.rl.GymEnvironment):
  def __init__(self):
    super(PongEnv, self).__init__('Pong-v0')
    self._state_shape = (80, 80)

  @property
  def state(self):
    # Crop everything outside the play area, reduce the image size,
    # and convert it to black and white.
    cropped = np.array(self._state)[34:194, :, :]
    reduced = cropped[0:-1:2, 0:-1:2]
    grayscale = np.sum(reduced, axis=2)
    bw = np.zeros(grayscale.shape)
    bw[grayscale != 233] = 1
    return bw

  def __deepcopy__(self, memo):
    return PongEnv()

env = PongEnv()
```

%% Cell type:markdown id: tags:

Next we create a model to implement our policy.  This model receives the current state of the environment (the pixels being displayed on the screen at this moment) as its input.  Given that input, it decides what action to perform.  In Pong there are three possible actions at any moment: move the paddle up, move it down, or leave it where it is.  The policy model produces a probability distribution over these actions.  It also produces a *value* output, which is interpreted as an estimate of how good the current state is.  This turns out to be important for efficient learning.
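
To see what "a probability distribution over these actions" looks like, here is a small sketch that turns three raw scores into probabilities with a softmax and then samples an action. The scores, the action names, and the random seed are made-up values for illustration. In the real policy these probabilities come from the network's output layer.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the three actions described above.
actions = ['up', 'down', 'stay']
action_scores = [2.0, 1.0, 0.0]
probs = softmax(action_scores)

# Sample one action according to the probabilities (seeded for repeatability).
rng = random.Random(0)
chosen = rng.choices(actions, weights=probs)[0]
print(probs, chosen)
```

Sampling, rather than always taking the most probable action, lets the agent keep exploring while it learns.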

The model begins with two convolutional layers to process the image.  That is followed by a dense (fully connected) layer to provide plenty of capacity for game logic.  We also add a small Gated Recurrent Unit (GRU).  That gives the network a little bit of memory, so it can keep track of which way the ball is moving.  Just from the screen image, you cannot tell whether the ball is moving to the left or to the right, so having memory is important.

We concatenate the dense and GRU outputs together, and use them as inputs to two final layers that serve as the
network's outputs.  One computes the action probabilities, and the other computes an estimate of the
state value function.

We also provide an input for the initial state of the GRU, and return its final state at the end.  This is required by the learning algorithm.

%% Cell type:code id: tags:

``` python
import tensorflow as tf
from tensorflow.keras.layers import Input, Concatenate, Conv2D, Dense, Flatten, GRU, Reshape

class PongPolicy(dc.rl.Policy):
    def __init__(self):
        super(PongPolicy, self).__init__(['action_prob', 'value', 'rnn_state'], [np.zeros(16)])

    def create_model(self, **kwargs):
        state = Input(shape=(80, 80))
        rnn_state = Input(shape=(16,))
        conv1 = Conv2D(16, kernel_size=8, strides=4, activation=tf.nn.relu)(Reshape((80, 80, 1))(state))
        conv2 = Conv2D(32, kernel_size=4, strides=2, activation=tf.nn.relu)(conv1)
        dense = Dense(256, activation=tf.nn.relu)(Flatten()(conv2))
        gru, rnn_final_state = GRU(16, return_state=True, return_sequences=True, time_major=True)(
            Reshape((-1, 256))(dense), initial_state=rnn_state)
        concat = Concatenate()([dense, Reshape((16,))(gru)])
        action_prob = Dense(env.n_actions, activation=tf.nn.softmax)(concat)
        value = Dense(1)(concat)
        return tf.keras.Model(inputs=[state, rnn_state], outputs=[action_prob, value, rnn_final_state])

policy = PongPolicy()
```

%% Cell type:markdown id: tags:

We will optimize the policy using the Advantage Actor Critic (A2C) algorithm.  There are lots of hyperparameters we could specify at this point, but the default values for most of them work well on this problem.  The only one we need to customize is the learning rate.
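
For intuition about the "advantage" in Advantage Actor Critic, the sketch below uses the textbook definition: the discounted return actually observed from a state minus the value the model predicted for that state. The rewards, value estimates, and discount of 0.9 are invented numbers, and DeepChem's `A2C` class may compute advantages differently in its details.

```python
def discounted_returns(rewards, discount):
    """Return G_t = r_t + discount * G_{t+1} for every step, computed backwards."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + discount * running
        returns.append(running)
    returns.reverse()
    return returns

# Hypothetical three-step episode: a point is scored on the last step.
rewards = [0.0, 0.0, 1.0]
# Hypothetical value estimates produced by the policy network's value output.
values = [0.5, 0.6, 0.7]

returns = discounted_returns(rewards, discount=0.9)
advantages = [g - v for g, v in zip(returns, values)]
print(returns)
print(advantages)
```

A positive advantage means the outcome was better than the value output predicted, so the actions taken in that state are made more likely. A negative advantage has the opposite effect.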

%% Cell type:code id: tags:

``` python
from deepchem.models.optimizers import Adam
a2c = dc.rl.A2C(env, policy, model_dir='model', optimizer=Adam(learning_rate=0.0002))
```

%% Cell type:markdown id: tags:

Optimize for as long as you have patience to.  By 1 million steps you should see clear signs of learning.  Around 3 million steps it should start to occasionally beat the game's built-in AI.  By 7 million steps it should be winning almost every time.  Running on my laptop, training takes about 20 minutes for every million steps.

%% Cell type:code id: tags:

``` python
# Change this to train as many steps as you have patience for.
a2c.fit(1000)
```

%% Cell type:markdown id: tags:

Let's watch it play and see how it does!

%% Cell type:code id: tags:

``` python
# This code doesn't work well on Colab
env.reset()
while not env.terminated:
    env.env.render()
    env.step(a2c.select_action(env.state))
```

%% Cell type:markdown id: tags:

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!
+0 −0

File moved.
