Unverified commit e5f0359e, authored by Bharath Ramsundar, committed by GitHub

Merge pull request #2203 from peastman/tutorials

More updates to tutorial sequence
parents 0d3f2285 dd5dc438
%% Cell type:markdown id: tags:

# Tutorial Part 2: Learning MNIST Digit Classifiers

In the previous tutorial, we learned some basics of how to load data into DeepChem and how to use the basic DeepChem objects to load and manipulate this data. In this tutorial, you'll put those pieces together and learn how to train a basic image classification model in DeepChem. Why learn this material in DeepChem? Image processing is an increasingly important part of AI for the life sciences, so knowing how to train image processing models will be very useful when you move on to some of DeepChem's more advanced features.

The MNIST dataset contains handwritten digits along with their human annotated labels. The learning challenge for this dataset is to train a model that maps the digit image to its true label. MNIST has been a standard benchmark for machine learning for decades at this point.

![MNIST](https://github.com/deepchem/deepchem/blob/master/examples/tutorials/mnist_examples.png?raw=1)

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/02_Learning_MNIST_Digit_Classifiers.ipynb)

## Setup

We recommend running this tutorial on Google Colab. You'll need to run the following cell of installation commands on Colab to get your environment set up. If you'd rather run the tutorial locally, make sure you don't run these commands, since they'll download and install a new Anaconda Python setup.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```

%% Cell type:markdown id: tags:

First let's import the libraries we will be using and load the data (which comes bundled with TensorFlow).

%% Cell type:code id: tags:

``` python
import deepchem as dc
import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Conv2D, Flatten, Dense

# Load MNIST, scale pixel values to [0, 1], and reshape each image to
# (28, 28, 1): height, width, and a single grayscale channel.
mnist = tf.keras.datasets.mnist.load_data(path='mnist.npz')
train_images = mnist[0][0].reshape((-1, 28, 28, 1))/255
valid_images = mnist[1][0].reshape((-1, 28, 28, 1))/255
train = dc.data.NumpyDataset(train_images, mnist[0][1])
valid = dc.data.NumpyDataset(valid_images, mnist[1][1])
```

%% Cell type:markdown id: tags:

Now create the model.  We use two convolutional layers followed by two dense layers.  The final layer outputs ten numbers for each sample.  These correspond to the ten possible digits.

How does the model know how to interpret the output?  That is determined by the loss function.  We specify `SparseSoftmaxCrossEntropy`.  This is a very convenient class that implements a common case:

1. Each label is an integer which is interpreted as a class index (i.e. which of the ten digits this sample is a drawing of).
2. The outputs are passed through a softmax function, and the result is interpreted as a probability distribution over those same classes.

The model learns to produce a large output for the correct class, and small outputs for all other classes.
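
As a quick aside, here is a tiny standalone numpy sketch (not part of the tutorial's DeepChem code, and using made-up logits) of how softmax converts raw outputs into a probability distribution over the ten classes:

%% Cell type:code id: tags:

``` python
import numpy as np

# Hypothetical raw outputs ("logits") for one sample: one number per digit class.
logits = np.array([1.0, 2.0, 0.5, 0.1, 0.0, -1.0, 0.3, 3.0, 0.2, 0.4])

# Softmax: exponentiate, then normalize so the values sum to 1.
probs = np.exp(logits) / np.sum(np.exp(logits))

print(probs.round(3))   # ten probabilities that sum to 1
print(probs.argmax())   # 7: the class with the largest raw output wins
```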

%% Cell type:code id: tags:

``` python
keras_model = tf.keras.Sequential([
    Conv2D(filters=32, kernel_size=5, activation=tf.nn.relu),
    Conv2D(filters=64, kernel_size=5, activation=tf.nn.relu),
    Flatten(),
    Dense(1024, activation=tf.nn.relu),
    Dense(10),
])
model = dc.models.KerasModel(keras_model, dc.models.losses.SparseSoftmaxCrossEntropy())
```

%% Cell type:markdown id: tags:

Fit the model on the training set.

%% Cell type:code id: tags:

``` python
model.fit(train, nb_epoch=2)
```

%% Output

    0.031744494438171386

%% Cell type:markdown id: tags:

Let's see how well it works.  We ask the model to predict the class of every sample in the validation set.  Remember there are ten outputs for each sample.  We use `argmax()` to identify the largest one, which corresponds to the predicted class.

%% Cell type:code id: tags:

``` python
prediction = np.argmax(model.predict_on_batch(valid.X), axis=1)
score = dc.metrics.accuracy_score(prediction, valid.y)
print('Validation set accuracy: ', score)
```

%% Output

    Validation set accuracy:  0.9891

%% Cell type:markdown id: tags:

It gets about 99% of samples correct.  Not too bad for such a simple model!

%% Cell type:markdown id: tags:

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

%% Cell type:markdown id: tags:

# Tutorial Part 11: Putting Multitask Learning to Work

This notebook walks through the creation of multitask models on MUV [1]. The goal is to demonstrate how multitask methods can provide improved performance in situations with little or very unbalanced data.

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/11_Putting_Multitask_Learning_to_Work.ipynb)


## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```

%% Cell type:markdown id: tags:

The MUV dataset is a challenging benchmark in molecular design that consists of 17 different "targets" with only a few "active" compounds per target. There are 93,087 compounds in total, yet no task has more than 30 active compounds, and many have even fewer. Training a model with such a small number of positive examples is very challenging.  Multitask models address this by training a single model that predicts all the different targets at once.  If a feature is useful for predicting one task, it is often useful for predicting several other tasks as well.  Each added task makes it easier to learn important features, which improves performance on the other tasks [2].

To get started, let's load the MUV dataset.  The MoleculeNet loader function automatically splits it into training, validation, and test sets.  Because there are so few positive examples, we use stratified splitting to ensure the test set has enough of them to evaluate.

%% Cell type:code id: tags:

``` python
import deepchem as dc
import numpy as np

tasks, datasets, transformers = dc.molnet.load_muv(split='stratified')
train_dataset, valid_dataset, test_dataset = datasets
```
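
%% Cell type:markdown id: tags:

To see why stratification matters, here is a small self-contained sketch (using made-up labels, not MUV itself) comparing how many positives land in the test set under a purely random split versus a stratified one that splits positives and negatives separately:

%% Cell type:code id: tags:

``` python
import numpy as np

rng = np.random.default_rng(0)
# Made-up labels: 10,000 samples with only 20 positives (imbalanced like MUV).
y = np.zeros(10000, dtype=int)
y[rng.choice(10000, size=20, replace=False)] = 1

# Purely random 80/20 split: the test set may end up with very few positives.
perm = rng.permutation(10000)
random_test = y[perm[:2000]]

# Stratified 80/20 split: take 20% of positives and 20% of negatives separately.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
strat_test = y[np.concatenate([pos[:4], neg[:1996]])]

print('random split positives in test:    ', random_test.sum())
print('stratified split positives in test:', strat_test.sum())  # always 4
```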

%% Cell type:markdown id: tags:

Now let's train a model on it.  We'll use a MultitaskClassifier, which is a simple stack of fully connected layers.

%% Cell type:code id: tags:

``` python
n_tasks = len(tasks)
n_features = train_dataset.get_data_shape()[0]
model = dc.models.MultitaskClassifier(n_tasks, n_features)
model.fit(train_dataset)
```

%% Output

    0.0004961589723825455

%% Cell type:markdown id: tags:

Let's see how well it does on the test set.  We loop over the 17 tasks and compute the ROC AUC for each one.

%% Cell type:code id: tags:

``` python
y_true = test_dataset.y
y_pred = model.predict(test_dataset)
metric = dc.metrics.roc_auc_score
for i in range(n_tasks):
    score = metric(dc.metrics.to_one_hot(y_true[:,i]), y_pred[:,i])
    print(tasks[i], score)
```

%% Output

    MUV-466 0.9207684040838259
    MUV-548 0.7480655561526062
    MUV-600 0.9927995701235895
    MUV-644 0.9974207415368082
    MUV-652 0.7823481998925309
    MUV-689 0.6636843990686011
    MUV-692 0.6319093677234462
    MUV-712 0.7787838079885365
    MUV-713 0.7910711087229088
    MUV-733 0.4401307540748701
    MUV-737 0.34679383843811573
    MUV-810 0.9564571019165323
    MUV-832 0.9991044241447251
    MUV-846 0.7519881783987103
    MUV-852 0.8516747268493642
    MUV-858 0.5906591438294824
    MUV-859 0.5962954008166774

%% Cell type:markdown id: tags:

Not bad!  Recall that random guessing would produce a ROC AUC score of 0.5, and a perfect predictor would score 1.0.  Most of the tasks did much better than random guessing, and many of them are above 0.9.
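
As a sanity check on those baselines (a standalone sketch, assuming scikit-learn is installed; not part of the tutorial's DeepChem code), you can verify that uninformative random scores give an ROC AUC near 0.5 while scores that match the labels give 1.0:

%% Cell type:code id: tags:

``` python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=100000)  # random binary labels
y_random = rng.random(100000)             # scores carrying no information
y_perfect = y_true.astype(float)          # scores that match the labels exactly

print('random predictor AUC: ', roc_auc_score(y_true, y_random))   # ~0.5
print('perfect predictor AUC:', roc_auc_score(y_true, y_perfect))  # 1.0
```

%% Cell type:markdown id: tags: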

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

# Bibliography

[1] https://pubs.acs.org/doi/10.1021/ci8002649

[2] https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00146
