Commit fea4c40a authored by peastman's avatar peastman

More updates to tutorial sequence

parent 0d3f2285
+0 −1334

File deleted. (Preview size limit exceeded, changes collapsed.)

+25 −28
%% Cell type:markdown id: tags:

# Tutorial Part 11: Putting Multitask Learning to Work

This notebook walks through the creation of multitask models on MUV [1]. The goal is to demonstrate how multitask methods can provide improved performance in situations with little or very unbalanced data.

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/11_Putting_Multitask_Learning_to_Work.ipynb)


## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```

%% Cell type:markdown id: tags:

The MUV dataset is a challenging benchmark in molecular design that consists of 17 different "targets" where there are only a few "active" compounds per target. There are 93,087 compounds in total, yet no task has more than 30 active compounds, and many have even fewer. Training a model with such a small number of positive examples is very challenging.  Multitask models address this by training a single model that predicts all the different targets at once. If a feature is useful for predicting one task, it is often useful for predicting several other tasks as well. Each added task makes it easier to learn important features, which improves performance on other tasks [2].

To get started, let's load the MUV dataset.  The MoleculeNet loader function automatically splits it into training, validation, and test sets.  Because there are so few positive examples, we use stratified splitting to ensure the test set has enough of them to evaluate.
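To see why this matters, here is a minimal sketch of stratified splitting. It uses scikit-learn's `train_test_split` purely for illustration, not DeepChem's own splitter, on toy labels mimicking MUV's imbalance:

``` python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels mimicking MUV's imbalance: 1000 compounds, only 5 actives.
y = np.zeros(1000, dtype=int)
y[:5] = 1
X = np.arange(1000).reshape(-1, 1)

# A purely random 80/20 split could easily leave 0 actives in the test set.
# Stratifying on y forces the split to preserve the class ratio, so the
# test set is guaranteed to contain some positives to evaluate on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(y_train.sum(), y_test.sum())  # 4 1
```

DeepChem's stratified splitter works on the full multitask label matrix rather than a single label vector, but the principle is the same.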

%% Cell type:code id: tags:

``` python
import deepchem as dc
import numpy as np

tasks, datasets, transformers = dc.molnet.load_muv(split='stratified')
train_dataset, valid_dataset, test_dataset = datasets
```

%% Cell type:markdown id: tags:

Now let's train a model on it.  We'll use a MultitaskClassifier, which is a simple stack of fully connected layers.

%% Cell type:code id: tags:

``` python
n_tasks = len(tasks)
n_features = train_dataset.get_data_shape()[0]
model = dc.models.MultitaskClassifier(n_tasks, n_features)
model.fit(train_dataset)
```

%% Output

    0.0005275170505046844
    0.0004961589723825455

%% Cell type:markdown id: tags:

Let's see how well it does on the test set.  We loop over the 17 tasks and compute the ROC AUC for each one.

%% Cell type:code id: tags:

``` python
y_true = test_dataset.y
y_pred = model.predict(test_dataset)
metric = dc.metrics.roc_auc_score
for i in range(n_tasks):
    score = metric(dc.metrics.to_one_hot(y_true[:,i]), y_pred[:,i])
    print(tasks[i], score)
```

%% Output

    MUV-466 0.9207684040838259
    MUV-548 0.7480655561526062
    MUV-600 0.9927995701235895
    MUV-644 0.9974207415368082
    MUV-652 0.7823481998925309
    MUV-689 0.6636843990686011
    MUV-692 0.6319093677234462
    MUV-712 0.7787838079885365
    MUV-713 0.7910711087229088
    MUV-733 0.4401307540748701
    MUV-737 0.34679383843811573
    MUV-810 0.9564571019165323
    MUV-832 0.9991044241447251
    MUV-846 0.7519881783987103
    MUV-852 0.8516747268493642
    MUV-858 0.5906591438294824
    MUV-859 0.5962954008166774

%% Cell type:markdown id: tags:

Not bad!  Recall that random guessing would produce a ROC AUC score of 0.5, and a perfect predictor would score 1.0.  Most of the tasks did much better than random guessing, and many of them are above 0.9.
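As a quick sanity check on what these numbers mean, here is a toy illustration using the standalone scikit-learn `roc_auc_score` metric (shown here just for illustration):

``` python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
print(roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))  # 1.0: every active ranked above every inactive
print(roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5]))  # 0.5: no information, equivalent to random guessing
print(roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1]))  # 0.0: a perfectly wrong ranking
```

ROC AUC depends only on how the predictions rank the samples, not on their absolute values, which makes it well suited to heavily imbalanced tasks like these.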

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

# Bibliography

[1] https://pubs.acs.org/doi/10.1021/ci8002649

[2] https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00146

+0 −189
%% Cell type:markdown id: tags:

# Tutorial Part 20: Converting DeepChem models to TensorFlow Estimators

So far, we've walked through a lot of the scientific details of molecular machine learning, but we haven't discussed much about how to use tools like DeepChem in production settings. This tutorial (and the last) focus on those practical matters.

When DeepChem was first created, TensorFlow had no standard interface for datasets or models.  We created the Dataset and Model classes to fill this hole.  More recently, TensorFlow has added the `tf.data` module as a standard interface for datasets, and the `tf.estimator` module as a standard interface for models.  To enable easy interoperability with other tools, we have added features to Dataset and Model to support these new standards. Using the Estimator interface may make it easier to deploy DeepChem models in production environments.

This example demonstrates how to use these features.  Let's begin by loading a dataset and creating a model to analyze it.  We'll use a simple MultitaskClassifier with one hidden layer.

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/20_Converting_DeepChem_Models_to_TensorFlow_Estimators.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Output

      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  3489  100  3489    0     0  28834      0 --:--:-- --:--:-- --:--:-- 28834

    add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
    all packages is already installed

    # conda environments:
    #
    base                  *  /root/miniconda
    

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```

%% Output

    Requirement already satisfied: deepchem in /usr/local/lib/python3.6/dist-packages (2.4.0rc1.dev20200805145942)
    Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.16.0)
    Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.4.1)
    Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.22.2.post1)
    Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.18.5)
    Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.0.5)
    Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2018.9)
    Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2.8.1)
    Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->deepchem) (1.15.0)

    '2.4.0-rc1.dev'

%% Cell type:code id: tags:

``` python
import deepchem as dc
import tensorflow as tf
import numpy as np

tasks, datasets, transformers = dc.molnet.load_tox21(reload=False)
train_dataset, valid_dataset, test_dataset = datasets
n_tasks = len(tasks)
n_features = train_dataset.X.shape[1]

model = dc.models.MultitaskClassifier(n_tasks, n_features, layer_sizes=[1000], dropouts=0.25)
```

%% Output

    smiles_field is deprecated and will be removed in a future version of DeepChem. Use feature_field instead.
    /usr/local/lib/python3.6/dist-packages/deepchem/data/data_loader.py:198: FutureWarning: featurize() is deprecated and has been renamed to create_dataset(). featurize() will be removed in DeepChem 3.0
      FutureWarning)

%% Cell type:markdown id: tags:

We want to train the model using the training set, then evaluate it on the test set.  As our evaluation metric we will use the ROC AUC, averaged over the 12 tasks included in the dataset.  First let's see how to do this with the DeepChem API.

%% Cell type:code id: tags:

``` python
model.fit(train_dataset, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(test_dataset, [metric]))
```

%% Output

    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.

    {'mean-roc_auc_score': 0.7669682534913908}

%% Cell type:markdown id: tags:

Simple enough.  Now let's see how to do the same thing with the TensorFlow APIs.  Fair warning: this is going to take a lot more code!

To begin with, TensorFlow doesn't allow a dataset to be passed directly to a model.  Instead, you need to write an "input function" to construct a particular set of tensors and return them in a particular format.  Fortunately, Dataset's `make_iterator()` method provides exactly the tensors we need in the form of a `tf.data.Iterator`.  This allows our input function to be very simple.

%% Cell type:code id: tags:

``` python
def input_fn(dataset, epochs):
    x, y, weights = dataset.make_iterator(batch_size=100, epochs=epochs).get_next()
    return {'x': x, 'weights': weights}, y
```

%% Cell type:markdown id: tags:

Next, you have to use the functions in the `tf.feature_column` module to create an object representing each feature and weight column (but curiously, *not* the label column—don't ask me why!).  These objects describe the data type and shape of each column, and give each one a name.  The names must match the keys in the dict returned by the input function.

%% Cell type:code id: tags:

``` python
x_col = tf.feature_column.numeric_column('x', shape=(n_features,))
weight_col = tf.feature_column.numeric_column('weights', shape=(n_tasks,))
```

%% Cell type:markdown id: tags:

Unlike DeepChem models, which allow arbitrary metrics to be passed to `evaluate()`, estimators require all metrics to be defined up front when you create the estimator.  Unfortunately, TensorFlow doesn't have very good support for multitask models.  It provides an AUC metric, but no easy way to average this metric over tasks.  We therefore must create a separate metric for every task, then define our own metric function to compute the average of them.

%% Cell type:code id: tags:

``` python
def mean_auc(labels, predictions, weights):
    metric_ops = []
    update_ops = []
    for i in range(n_tasks):
        metric, update = tf.metrics.auc(labels[:,i], predictions[:,i], weights[:,i])
        metric_ops.append(metric)
        update_ops.append(update)
    mean_metric = tf.reduce_mean(tf.stack(metric_ops))
    update_all = tf.group(*update_ops)
    return mean_metric, update_all
```

%% Cell type:markdown id: tags:

Now we create our `Estimator` by calling `make_estimator()` on the DeepChem model.  We provide as arguments the objects created above to represent the feature and weight columns, as well as our metric function.

%% Cell type:code id: tags:

``` python
estimator = model.make_estimator(feature_columns=[x_col],
                                 weight_column=weight_col,
                                 metrics={'mean_auc': mean_auc},
                                 model_dir='estimator')

%% Cell type:markdown id: tags:

We are finally ready to train and evaluate it!  Notice how the input function passed to each method is actually a lambda.  This allows us to write a single function, then use it with different datasets and numbers of epochs.

%% Cell type:code id: tags:

``` python
estimator.train(input_fn=lambda: input_fn(train_dataset, 100))
print(estimator.evaluate(input_fn=lambda: input_fn(test_dataset, 1)))
```

%% Cell type:markdown id: tags:

That's a lot of code for something DeepChem can do in three lines.  The TensorFlow API is verbose and somewhat confusing.  It has seemingly arbitrary limitations, like assuming a model will only ever have one output, and therefore only allowing one label.  But for better or worse, it's a standard.

Of course, if you just want to use a DeepChem model with a DeepChem dataset, there is no need for any of this.  Just use the DeepChem API.  But perhaps you want to use a DeepChem dataset with a model that has been implemented as an estimator.  In that case, `Dataset.make_iterator()` allows you to easily do that.  Or perhaps you have higher level workflow code that is written to work with estimators.  In that case, `make_estimator()` allows DeepChem models to easily fit into that workflow.