Unverified commit c33f55d0 authored by Bharath Ramsundar, committed by GitHub

Merge pull request #1759 from deepchem/tutorial_overhaul

Expanding the tutorial sequence
parents 8b52d26f 74fad77c
%% Cell type:markdown id: tags:

# Tutorial 1: The Basic Tools of the Deep Life Sciences
Welcome to DeepChem's introductory tutorial for the deep life sciences. This series of notebooks is a step-by-step guide to the tools and techniques you'll need to do deep learning for the life sciences. We'll start from the basics, assuming that you're new to machine learning and the life sciences, and build up a repertoire of tools and techniques that you can use to do meaningful work in the life sciences.

**Scope:** This tutorial will encompass both the machine learning and data handling needed to build systems for the deep life sciences.

## Colab

This tutorial and the rest of the sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/01_The_Basic_Tools_of_the_Deep_Life_Sciences.ipynb)

## Outline
* Part 1: The Basic Tools of the Deep Life Sciences
* Part 2: Introduction to Molecular Data Handling
* Part 3: Molecular Machine Learning
* Part 4:

## Why do the DeepChem Tutorial?

**1) Career Advancement:** Applying AI in the life sciences is a booming
industry at present. There are a host of newly funded startups and initiatives
at large pharmaceutical and biotech companies centered around AI. Learning and
mastering DeepChem will bring you to the forefront of this field and will
prepare you to enter a career in it.

**2) Humanitarian Considerations:** Disease is the oldest cause of human
suffering. From the dawn of human civilization, humans have suffered from pathogens,
cancers, and neurological conditions. One of the greatest achievements of
the last few centuries has been the development of effective treatments for
many diseases. By mastering the skills in this tutorial, you will be able to
stand on the shoulders of the giants of the past to help develop new
medicine.

**3) Lowering the Cost of Medicine:** The art of developing new medicine is
currently an elite skill that can only be practiced by a small core of expert
practitioners. By enabling the growth of open source tools for drug discovery,
you can help democratize these skills and open up drug discovery to more
competition. Increased competition can help drive down the cost of medicine.

## Getting Extra Credit
If you're excited about DeepChem and want to get more involved, there are a couple of things that you can do right now:
* Star DeepChem on GitHub! - https://github.com/deepchem/deepchem
* Join the DeepChem forums and introduce yourself! - https://forum.deepchem.io
* Say hi on the DeepChem gitter - https://gitter.im/deepchem/Lobby
* Make a YouTube video teaching the contents of this notebook.


## Part -1: Prerequisites

This tutorial will assume some basic familiarity with the Python data science ecosystem, including libraries such as Numpy, Pandas, and TensorFlow.

## Part 0: Setup
The first step is to get DeepChem up and running. We recommend using Google Colab to work through this tutorial series. You'll need to run the following commands to get DeepChem installed on your Colab notebook. Note that this will take something like 5 minutes to run on your Colab instance.

%% Cell type:code id: tags:

``` python
# Download the Anaconda installer and install it into /usr/local on this Colab instance
!wget -c https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
!chmod +x Anaconda3-2019.10-Linux-x86_64.sh
!bash ./Anaconda3-2019.10-Linux-x86_64.sh -b -f -p /usr/local
# Install the GPU build of DeepChem from the deepchem conda channel
!conda install -y -c deepchem -c rdkit -c conda-forge -c omnia deepchem-gpu=2.3.0
# Make the conda site-packages visible to Colab's Python
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
```

%% Cell type:markdown id: tags:

You can of course run this tutorial locally if you prefer. In that case, don't run the above cell, since it will download and install Anaconda on your local machine. Either way, we can now import the `deepchem` package to play with.

%% Cell type:code id: tags:

``` python
# Run this cell to see if things work
import deepchem as dc
```

%% Output

    /home/bharath/anaconda3/envs/deepchem/lib/python3.5/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
      from numpy.core.umath_tests import inner1d
    /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
      warnings.warn(msg, category=FutureWarning)
    RDKit WARNING: [16:55:20] Enabling RDKit 2019.09.3 jupyter extensions
    (... followed by repeated FutureWarnings from tensorflow and tensorboard about deprecated numpy dtype constructors, omitted here for brevity ...)

%% Cell type:markdown id: tags:

## The Basic Tools of the Deep Life Sciences
What does it take to do deep learning in the life sciences? Well, the first thing we'll need to do is actually handle some data. To start, let's take a look at some basic synthetic data.

To generate some basic synthetic data, we will use Numpy to create some basic arrays.

%% Cell type:code id: tags:

``` python
import numpy as np

data = np.random.random((4, 4))
labels = np.random.random((4,)) # one label per row of data
```

%% Cell type:markdown id: tags:

We've given these arrays some evocative names: "data" and "labels." For now, don't worry too much about the names, but just note that the arrays have different shapes. Let's take a quick look to get a feeling for these arrays.

%% Cell type:code id: tags:

``` python
data, labels
```

%% Output

    (array([[0.17153735, 0.72653504, 0.75818459, 0.64997769],
            [0.64356789, 0.37895973, 0.46143683, 0.3251195 ],
            [0.51409105, 0.20522909, 0.29532684, 0.35239749],
            [0.49242761, 0.62127102, 0.77898693, 0.90960543]]),
     array([0.01939268, 0.43336842, 0.91222562, 0.23498551]))

%% Cell type:markdown id: tags:

In order to be able to work with this data in DeepChem, we need to wrap these arrays so DeepChem knows how to handle them. DeepChem has a `Dataset` API that it uses to facilitate its handling of datasets. For Numpy arrays, we use DeepChem's `NumpyDataset` object.

%% Cell type:code id: tags:

``` python
from deepchem.data.datasets import NumpyDataset

dataset = NumpyDataset(data, labels)
```

%% Cell type:markdown id: tags:

Ok, now what? We have these arrays in a `NumpyDataset` object. What can we do with it? Let's try printing out the object.

%% Cell type:code id: tags:

``` python
dataset
```

%% Output

    <deepchem.data.datasets.NumpyDataset at 0x7ff02682c710>

%% Cell type:markdown id: tags:

Ok, that's not terribly informative. It's telling us that `dataset` is a Python object that lives somewhere in memory. Can we recover the two arrays that we used to construct this object? Luckily, the DeepChem API allows us to recover them via the `dataset.X` and `dataset.y` attributes.

%% Cell type:code id: tags:

``` python
dataset.X, dataset.y
```

%% Output

    (array([[0.17153735, 0.72653504, 0.75818459, 0.64997769],
            [0.64356789, 0.37895973, 0.46143683, 0.3251195 ],
            [0.51409105, 0.20522909, 0.29532684, 0.35239749],
            [0.49242761, 0.62127102, 0.77898693, 0.90960543]]),
     array([0.01939268, 0.43336842, 0.91222562, 0.23498551]))

%% Cell type:markdown id: tags:

This set of transformations raises a few questions. First, what was the point of it all? Why would we want to wrap objects this way instead of working with the raw Numpy arrays? The simple answer is to have a unified API for working with larger datasets. Suppose that `X` and `y` are so large that they can't fit easily into memory. What would we do then? Being able to work with an abstract `dataset` object proves very convenient. In fact, you'll have reason to use this feature of `Dataset` later in the tutorial series.

What else can we do with the `dataset` object? It turns out that it can be useful to be able to walk through the datapoints in the `dataset` one at a time. For that, we can use the `dataset.itersamples()` method.

%% Cell type:code id: tags:

``` python
for x, y, _, _ in dataset.itersamples():
    print(x, y)
```

%% Output

    [0.17153735 0.72653504 0.75818459 0.64997769] 0.019392679983928796
    [0.64356789 0.37895973 0.46143683 0.3251195 ] 0.43336841680990135
    [0.51409105 0.20522909 0.29532684 0.35239749] 0.9122256174354443
    [0.49242761 0.62127102 0.77898693 0.90960543] 0.23498551323364447

%% Cell type:markdown id: tags:

There are a couple of other fields that the `dataset` object tracks. The first is `dataset.ids`. This is a listing of unique identifiers for the datapoints in the dataset.

%% Cell type:code id: tags:

``` python
dataset.ids
```

%% Output

    array([0, 1, 2, 3], dtype=object)

%% Cell type:markdown id: tags:

In addition, the `dataset` object has a field `dataset.w`. This is the "example weight" associated with each datapoint. Since we haven't explicitly assigned the weights, this is simply going to be all ones.

%% Cell type:code id: tags:

``` python
dataset.w
```

%% Output

    array([1., 1., 1., 1.])
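
%% Cell type:markdown id: tags:

If you want non-uniform weights, you can pass them in when constructing the dataset. Here's a minimal sketch (assuming `NumpyDataset` accepts a `w` argument, as recent DeepChem versions do; the weight values below are purely illustrative):

%% Cell type:code id: tags:

``` python
weights = np.array([1.0, 0.5, 2.0, 1.0])  # hypothetical per-sample weights
weighted_dataset = NumpyDataset(data, labels, w=weights)
weighted_dataset.w
```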

%% Cell type:markdown id: tags:

Alright, we've seen some basic features. What if you want to learn more about `NumpyDataset`? You should check out our more in-depth [notebook](https://deepchem.io/docs/notebooks/Deepchem_NumpyDataset_tutorial.html) that goes into much more depth on how to work with `NumpyDataset` objects.
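
%% Cell type:markdown id: tags:

One more `Dataset` trick worth knowing: you can also iterate over the data in mini-batches rather than one sample at a time. Here's a minimal sketch (assuming `iterbatches` accepts a `batch_size` argument, as it does in recent DeepChem releases); like `itersamples()`, it yields `(X, y, w, ids)` tuples.

%% Cell type:code id: tags:

``` python
# Iterate over the dataset two samples at a time.
for X_b, y_b, w_b, ids_b in dataset.iterbatches(batch_size=2):
    print(X_b.shape, y_b.shape)
```
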
%% Cell type:markdown id: tags:

# Tutorial Part 2: Learning MNIST Digit Classifiers

In the previous tutorial, we learned some basics of how to load data into DeepChem and how to use the basic DeepChem objects to load and manipulate this data. In this tutorial, you'll put the parts together and learn how to train a basic image classification model in DeepChem. You might ask, why are we bothering to learn this material in DeepChem? Part of the reason is that image processing is an increasingly important part of AI for the life sciences. So learning how to train image processing models will be very useful for using some of the more advanced DeepChem features.

The MNIST dataset contains handwritten digits along with their human annotated labels. The learning challenge for this dataset is to train a model that maps the digit image to its true label. MNIST has been a standard benchmark for machine learning for decades at this point.

![MNIST](mnist_examples.png)

For convenience, TensorFlow has provided some loader methods to get access to the MNIST dataset. We'll make use of these loaders.

## Setup

We recommend running this tutorial on Google Colab. You'll need to run the following cell of installation commands on Colab to get your environment set up. If you'd rather run the tutorial locally, make sure you don't run these commands, since they'll download and install a new Anaconda Python setup.

%% Cell type:code id: tags:

``` python
!wget -c https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
!chmod +x Anaconda3-2019.10-Linux-x86_64.sh
!bash ./Anaconda3-2019.10-Linux-x86_64.sh -b -f -p /usr/local
!conda install -y -c deepchem -c rdkit -c conda-forge -c omnia deepchem-gpu=2.3.0
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
import deepchem as dc
```

%% Cell type:code id: tags:

``` python
from tensorflow.examples.tutorials.mnist import input_data
```

%% Cell type:code id: tags:

``` python
# TODO: This is deprecated. Let's replace with a DeepChem native loader for maintainability.
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
```

%% Output

    Extracting MNIST_data/train-images-idx3-ubyte.gz
    Extracting MNIST_data/train-labels-idx1-ubyte.gz
    Extracting MNIST_data/t10k-images-idx3-ubyte.gz
    Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
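
%% Cell type:markdown id: tags:

As the TODO above notes, `tensorflow.examples.tutorials.mnist` is deprecated. If it's unavailable in your TensorFlow build, a minimal alternative sketch using `tf.keras.datasets.mnist` (flattening and one-hot encoding to match the shapes produced by the deprecated loader) might look like this:

%% Cell type:code id: tags:

``` python
# Sketch of an alternative to the deprecated loader above; not used in the
# rest of this notebook, which relies on the `mnist` object loaded earlier.
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype(np.float32) / 255.0  # flatten 28x28 images
y_train = np.eye(10)[y_train]  # one-hot encode the labels
```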

%% Cell type:code id: tags:

``` python
import deepchem as dc
import tensorflow as tf
from tensorflow.keras.layers import Reshape, Conv2D, Flatten, Dense, Softmax
```

%% Cell type:code id: tags:

``` python
train = dc.data.NumpyDataset(mnist.train.images, mnist.train.labels)
valid = dc.data.NumpyDataset(mnist.validation.images, mnist.validation.labels)
```

%% Cell type:code id: tags:

``` python
keras_model = tf.keras.Sequential([
    Reshape((28, 28, 1)),
    Conv2D(filters=32, kernel_size=5, activation=tf.nn.relu),
    Conv2D(filters=64, kernel_size=5, activation=tf.nn.relu),
    Flatten(),
    Dense(1024, activation=tf.nn.relu),
    Dense(10),
    Softmax()
])
model = dc.models.KerasModel(keras_model, dc.models.losses.CategoricalCrossEntropy())
```

%% Cell type:code id: tags:

``` python
model.fit(train, nb_epoch=2)
```

%% Output

    0.0

%% Cell type:code id: tags:

``` python
from sklearn.metrics import roc_curve, auc
import numpy as np

print("Validation")
prediction = np.squeeze(model.predict_on_batch(valid.X))

fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(10):
    fpr[i], tpr[i], thresh = roc_curve(valid.y[:, i], prediction[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
    print("class %s:auc=%s" % (i, roc_auc[i]))
```

%% Output

    Validation
    class 0:auc=0.9998836328172079
    class 1:auc=0.9999571662641497
    class 2:auc=0.9998310516219043
    class 3:auc=0.9999563446718672
    class 4:auc=0.9999418111793702
    class 5:auc=0.9995639983771051
    class 6:auc=0.9998478260194437
    class 7:auc=0.9998357507660879
    class 8:auc=0.9999599342922393
    class 9:auc=0.9998551553268534
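
%% Cell type:markdown id: tags:

DeepChem also provides metric helpers that can compute scores like this in a single call. As a minimal sketch (assuming `dc.metrics.Metric` and `model.evaluate` behave as in DeepChem 2.3, averaging per-class ROC AUC with `np.mean`):

%% Cell type:code id: tags:

``` python
# Compute the macro-averaged ROC AUC over all ten digit classes.
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(valid, [metric]))
```
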
%% Cell type:markdown id: tags:

# Tutorial Part 4: Going Deeper On Molecular Featurizations

One of the most important steps of doing machine learning on molecular data is transforming the data into a form amenable to the application of learning algorithms. This process is broadly called "featurization" and involves turning a molecule into a vector or tensor of some sort. There are a number of different ways of doing such transformations, and the choice of featurization is often dependent on the problem at hand.

In this tutorial, we explore the different featurization methods available for molecules. These featurization methods include:

1. `ConvMolFeaturizer`
2. `WeaveFeaturizer`
3. `CircularFingerprint`
4. `RDKitDescriptors`
5. `BPSymmetryFunction`
6. `CoulombMatrix`
7. `CoulombMatrixEig`
8. `AdjacencyFingerprint`

%% Cell type:markdown id: tags:

Let's start with some basic imports

%% Cell type:code id: tags:

``` python
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals

import numpy as np
from rdkit import Chem

from deepchem.feat import ConvMolFeaturizer, WeaveFeaturizer, CircularFingerprint
from deepchem.feat import AdjacencyFingerprint, RDKitDescriptors
from deepchem.feat import BPSymmetryFunctionInput, CoulombMatrix, CoulombMatrixEig
from deepchem.utils import conformers
```

%% Cell type:markdown id: tags:

We use propane ($CH_3CH_2CH_3$) as a running example throughout this tutorial. Many of the featurization methods use conformers of the molecules. A conformer can be generated using the `ConformerGenerator` class in `deepchem.utils.conformers`.

%% Cell type:markdown id: tags:

### RDKitDescriptors

%% Cell type:markdown id: tags:

`RDKitDescriptors` featurizes a molecule by computing descriptor values for specified descriptors. Intrinsic to the featurizer is a set of allowed descriptors, which can be accessed using `RDKitDescriptors.allowedDescriptors`.

The featurizer uses the descriptors in `rdkit.Chem.Descriptors.descList`, checks if they are in the list of allowed descriptors and computes the descriptor value for the molecule.
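
To get a feel for what an individual descriptor computes, you can also call RDKit directly. A quick illustrative check (independent of DeepChem, using RDKit's `Descriptors` module):

%% Cell type:code id: tags:

``` python
from rdkit.Chem import Descriptors

# Molecular weight of propane (~44.1), one of the descriptors RDKit provides.
print(Descriptors.MolWt(Chem.MolFromSmiles("CCC")))
```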

%% Cell type:code id: tags:

``` python
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
```

%% Cell type:markdown id: tags:

Let's check the allowed list of descriptors. As you will see shortly, there's a wide range of chemical properties that RDKit computes for us.

%% Cell type:code id: tags:

``` python
for descriptor in RDKitDescriptors.allowedDescriptors:
    print(descriptor)
```

%% Output

    NumAromaticHeterocycles
    EState_VSA7
    EState_VSA6
    MolMR
    BertzCT
    SMR_VSA10
    NHOHCount
    MinPartialCharge
    HallKierAlpha
    MinEStateIndex
    Chi1n
    Chi4n
    ExactMolWt
    VSA_EState8
    SMR_VSA5
    SMR_VSA9
    NumAliphaticCarbocycles
    VSA_EState2
    SlogP_VSA6
    VSA_EState7
    PEOE_VSA7
    NumHeteroatoms
    Chi1v
    PEOE_VSA2
    SMR_VSA4
    PEOE_VSA9
    HeavyAtomCount
    NumRadicalElectrons
    EState_VSA3
    NumValenceElectrons
    EState_VSA5
    PEOE_VSA10
    EState_VSA11
    EState_VSA10
    SMR_VSA7
    Chi1
    RingCount
    NumHDonors
    LabuteASA
    VSA_EState1
    Chi2v
    NumSaturatedCarbocycles
    SMR_VSA8
    Chi3v
    EState_VSA9
    Kappa2
    NumAliphaticHeterocycles
    Chi0
    SMR_VSA1
    SMR_VSA2
    PEOE_VSA1
    MolLogP
    NumAliphaticRings
    MinAbsPartialCharge
    BalabanJ
    Kappa1
    PEOE_VSA13
    EState_VSA4
    SlogP_VSA11
    MolWt
    SMR_VSA3
    Chi2n
    VSA_EState3
    MaxEStateIndex
    PEOE_VSA11
    Ipc
    MaxAbsPartialCharge
    Chi0n
    VSA_EState10
    VSA_EState5
    EState_VSA1
    FractionCSP3
    Kappa3
    MaxPartialCharge
    PEOE_VSA6
    SlogP_VSA7
    NumHAcceptors
    NumAromaticCarbocycles
    SMR_VSA6
    Chi3n
    HeavyAtomMolWt
    SlogP_VSA8
    VSA_EState9
    PEOE_VSA3
    SlogP_VSA5
    NumRotatableBonds
    PEOE_VSA14
    SlogP_VSA9
    NOCount
    VSA_EState4
    PEOE_VSA5
    Chi0v
    NumSaturatedHeterocycles
    EState_VSA8
    SlogP_VSA12
    Chi4v
    SlogP_VSA10
    NumSaturatedRings
    MinAbsEStateIndex
    SlogP_VSA2
    SlogP_VSA1
    PEOE_VSA8
    PEOE_VSA4
    TPSA
    SlogP_VSA4
    SlogP_VSA3
    NumAromaticRings
    MaxAbsEStateIndex
    EState_VSA2
    VSA_EState6
    PEOE_VSA12

%% Cell type:code id: tags:

``` python
rdkit_desc = RDKitDescriptors()
features = rdkit_desc._featurize(example_mol)

print('The number of descriptors present is: ', len(features))
```

%% Output

    The number of descriptors present is:  111

%% Cell type:markdown id: tags:

### BPSymmetryFunction

%% Cell type:markdown id: tags:

`Behler-Parrinello Symmetry function` or `BPSymmetryFunction` featurizes a molecule by computing the atomic number and coordinates for each atom in the molecule. The features can be used as input for symmetry functions, like `RadialSymmetry`, `DistanceMatrix` and `DistanceCutoff`. More details on these symmetry functions can be found in [this paper](https://journals.aps.org/prl/pdf/10.1103/PhysRevLett.98.146401). These functions can be found in `deepchem.feat.coulomb_matrices`

The featurizer takes `max_atoms` as an argument. As input, it takes a conformer of the molecule and computes:

1. coordinates of every atom in the molecule (in Bohr units)
2. the atomic numbers for all atoms.

These features are concatenated and padded with zeros to account for differing numbers of atoms across molecules.

%% Cell type:code id: tags:

``` python
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)
```

%% Cell type:markdown id: tags:

Let's now take a look at the actual featurized matrix that comes out.

%% Cell type:code id: tags:

``` python
bp_sym = BPSymmetryFunctionInput(max_atoms=20)
features = bp_sym._featurize(mol=example_mol)
features
```

%% Output

    array([[ 6.        ,  2.33166293, -0.52962788, -0.48097309],
           [ 6.        ,  0.0948792 ,  1.07597567, -1.33579553],
           [ 6.        , -2.40436371, -0.29483572, -0.90388318],
           [ 1.        ,  2.18166462, -0.95639011,  1.569049  ],
           [ 1.        ,  4.1178375 ,  0.51816193, -0.81949623],
           [ 1.        ,  2.39319787, -2.32844253, -1.56157176],
           [ 1.        ,  0.29919987,  1.51730566, -3.37889252],
           [ 1.        ,  0.08875543,  2.88229706, -0.26437996],
           [ 1.        , -3.99100651,  0.92016315, -1.54358853],
           [ 1.        , -2.66167993, -0.71627602,  1.136556  ],
           [ 1.        , -2.45014726, -2.08833123, -1.99406318],
           [ 0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  0.        ],
           [ 0.        ,  0.        ,  0.        ,  0.        ]])

%% Cell type:markdown id: tags:

A simple check for the featurization would be to count the different atomic numbers present in the features.

%% Cell type:code id: tags:

``` python
atomic_numbers = features[:, 0]
from collections import Counter

unique_numbers = Counter(atomic_numbers)
print(unique_numbers)
```

%% Output

    Counter({0.0: 9, 1.0: 8, 6.0: 3})

%% Cell type:markdown id: tags:

For propane, we have $3$ `C-atoms` and $8$ `H-atoms`, and these numbers are in agreement with the results shown above. There are also $9$ rows of zero padding, to equalize with `max_atoms`.

%% Cell type:markdown id: tags:

### CoulombMatrix

%% Cell type:markdown id: tags:

`CoulombMatrix` featurizes a molecule by computing the Coulomb matrices for different conformers of the molecule and returning them as a list.

A Coulomb matrix tries to encode the energy structure of a molecule. The matrix is symmetric, with the off-diagonal elements capturing the Coulombic repulsion between pairs of atoms and the diagonal elements capturing atomic energies using the atomic numbers. More information on the functional forms used can be found [here](https://journals.aps.org/prl/pdf/10.1103/PhysRevLett.108.058301).
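
Concretely, the functional form from the paper linked above is

$$M_{ij} = \begin{cases} 0.5\, Z_i^{2.4} & i = j \\ \dfrac{Z_i Z_j}{|R_i - R_j|} & i \neq j \end{cases}$$

where $Z_i$ is the atomic number of atom $i$ and $R_i$ is its position.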

The featurizer takes in `max_atoms` as an argument and also has options for removing hydrogens from the molecule (`remove_hydrogens`), generating additional random Coulomb matrices (`randomize`), and getting only the upper triangular matrix (`upper_tri`).

%% Cell type:code id: tags:

``` python
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)

engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)

print("Number of available conformers for propane: ", len(example_mol.GetConformers()))
```

%% Output

    Number of available conformers for propane:  1

%% Cell type:code id: tags:

``` python
coulomb_mat = CoulombMatrix(max_atoms=20, randomize=False, remove_hydrogens=False, upper_tri=False)
features = coulomb_mat._featurize(mol=example_mol)
```

%% Cell type:markdown id: tags:

A simple check for the featurization is to see if the feature list has the same length as the number of conformers.

%% Cell type:code id: tags:

``` python
print(len(example_mol.GetConformers()) == len(features))
```

%% Output

    True

%% Cell type:markdown id: tags:

### CoulombMatrixEig

%% Cell type:markdown id: tags:

`CoulombMatrix` is invariant to molecular rotation and translation, since the interatomic distances and atomic numbers do not change. However, the matrix is not invariant to random permutations of the atom indices. To deal with this, the `CoulombMatrixEig` featurizer was introduced, which uses the eigenvalue spectrum of the Coulomb matrix and is invariant to random permutations of the atom indices.

`CoulombMatrixEig` inherits from `CoulombMatrix` and featurizes a molecule by first computing the Coulomb matrices for different conformers of the molecule and then computing the eigenvalues for each Coulomb matrix. These eigenvalues are then padded to account for variation in the number of atoms across molecules.

The featurizer takes in `max_atoms` as an argument and also has options for removing hydrogens from the molecule (`remove_hydrogens`) and generating additional random Coulomb matrices (`randomize`).

%% Cell type:code id: tags:

``` python
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)

engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)

print("Number of available conformers for propane: ", len(example_mol.GetConformers()))
```

%% Output

    Number of available conformers for propane:  1

%% Cell type:code id: tags:

``` python
coulomb_mat_eig = CoulombMatrixEig(max_atoms=20, randomize=False, remove_hydrogens=False)
features = coulomb_mat_eig._featurize(mol=example_mol)
```

%% Cell type:code id: tags:

``` python
print(len(example_mol.GetConformers()) == len(features))
```

%% Output

    True

%% Cell type:markdown id: tags:

### Adjacency Fingerprints

%% Cell type:code id: tags:

``` python
```
%% Cell type:markdown id: tags:

# Tutorial Part 5: Uncertainty in Deep Learning

A common criticism of deep learning models is that they tend to act as black boxes.  A model produces outputs, but doesn't give enough context to interpret them properly.  How reliable are the model's predictions?  Are some predictions more reliable than others?  If a model predicts a value of 5.372 for some quantity, should you assume the true value is between 5.371 and 5.373?  Or that it's between 2 and 8?  In some fields this situation might be good enough, but not in science.  For every value predicted by a model, we also want an estimate of the uncertainty in that value so we can know what conclusions to draw based on it.

DeepChem makes it very easy to estimate the uncertainty of predicted outputs (at least for the models that support it—not all of them do).  Let's start by seeing an example of how to generate uncertainty estimates.  We load a dataset, create a model, train it on the training set, and predict the output on the test set.

%% Cell type:code id: tags:

``` python
import deepchem as dc
import numpy as np
import matplotlib.pyplot as plot

tasks, datasets, transformers = dc.molnet.load_sampl()
train_dataset, valid_dataset, test_dataset = datasets

model = dc.models.MultitaskRegressor(len(tasks), 1024, uncertainty=True)
model.fit(train_dataset, nb_epoch=200)
y_pred, y_std = model.predict_uncertainty(test_dataset)
```

%% Output

    Loading dataset from disk.
    Loading dataset from disk.
    Loading dataset from disk.

%% Cell type:markdown id: tags:

All of this looks exactly like any other example, with just two differences.  First, we add the option `uncertainty=True` when creating the model.  This instructs it to add features to the model that are needed for estimating uncertainty.  Second, we call `predict_uncertainty()` instead of `predict()` to produce the output.  `y_pred` is the predicted outputs.  `y_std` is another array of the same shape, where each element is an estimate of the uncertainty (standard deviation) of the corresponding element in `y_pred`.  And that's all there is to it!  Simple, right?
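
For instance, here's a quick sanity check on the shapes (using the variables from the cell above):

%% Cell type:code id: tags:

``` python
# y_std holds one uncertainty estimate per predicted value.
print(y_pred.shape, y_std.shape)
```

%% Cell type:markdown id: tags: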

Of course, it isn't really that simple at all.  DeepChem is doing a lot of work to come up with those uncertainties.  So now let's pull back the curtain and see what is really happening.  (For the full mathematical details of calculating uncertainty, see https://arxiv.org/abs/1703.04977)

To begin with, what does "uncertainty" mean?  Intuitively, it is a measure of how much we can trust the predictions.  More formally, we expect that the true value of whatever we are trying to predict should usually be within a few standard deviations of the predicted value.  But uncertainty comes from many sources, ranging from noisy training data to bad modelling choices, and different sources behave in different ways.  It turns out there are two fundamental types of uncertainty we need to take into account.

### Aleatoric Uncertainty

Consider the following graph.  It shows the best fit linear regression to a set of ten data points.

%% Cell type:code id: tags:

``` python
# Generate some fake data and plot a regression line.

x = np.linspace(0, 5, 10)
y = 0.15*x + np.random.random(10)
plot.scatter(x, y)
fit = np.polyfit(x, y, 1)
line_x = np.linspace(-1, 6, 2)
plot.plot(line_x, np.poly1d(fit)(line_x))
plot.show()
```

%% Output


%% Cell type:markdown id: tags:

The line clearly does not do a great job of fitting the data.  There are many possible reasons for this.  Perhaps the measuring device used to capture the data was not very accurate.  Perhaps `y` depends on some other factor in addition to `x`, and if we knew the value of that factor for each data point we could predict `y` more accurately.  Maybe the relationship between `x` and `y` simply isn't linear, and we need a more complicated model to capture it.  Regardless of the cause, the model clearly does a poor job of predicting the training data, and we need to keep that in mind.  We cannot expect it to be any more accurate on test data than on training data.  This is known as *aleatoric uncertainty*.

How can we estimate the size of this uncertainty?  By training a model to do it, of course!  At the same time it is learning to predict the outputs, it is also learning to predict how accurately each output matches the training data.  For every output of the model, we add a second output that produces the corresponding uncertainty.  Then we modify the loss function to make it learn both outputs at the same time.
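
Concretely, a common form of this loss for a regression output (following the paper linked above, and assuming the model emits both a prediction $\hat{y}_i$ and an uncertainty $\sigma_i$ for each sample) is

$$L = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2} + \frac{1}{2}\log \sigma_i^2\right]$$

The first term rewards accurate predictions, while the $\log \sigma_i^2$ term keeps the model from driving the loss to zero by simply reporting enormous uncertainty everywhere.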

### Epistemic Uncertainty

Now consider these three curves.  They are fit to the same data points as before, but this time we are using 10th degree polynomials.

%% Cell type:code id: tags:

``` python
plot.figure(figsize=(12, 3))
line_x = np.linspace(0, 5, 50)
for i in range(3):
    plot.subplot(1, 3, i+1)
    plot.scatter(x, y)
    fit = np.polyfit(np.concatenate([x, [3]]), np.concatenate([y, [i]]), 10)
    plot.plot(line_x, np.poly1d(fit)(line_x))
plot.show()
```

%% Output


%% Cell type:markdown id: tags:

Each of them perfectly interpolates the data points, yet they clearly are different models.  (In fact, there are infinitely many 10th degree polynomials that exactly interpolate any ten data points.)  They make identical predictions for the data we fit them to, but for any other value of `x` they produce different predictions.  This is called *epistemic uncertainty*.  It means the data does not fully constrain the model.  Given the training data, there are many different models we could have found, and those models make different predictions.

The ideal way to measure epistemic uncertainty is to train many different models, each time using a different random seed and possibly varying hyperparameters.  Then use all of them for each input and see how much the predictions vary.  This is very expensive to do, since it involves repeating the whole training process many times.  Fortunately, we can approximate the same effect in a less expensive way: by using dropout.

Recall that when you train a model with dropout, you are effectively training a huge ensemble of different models all at once.  Each training sample is evaluated with a different dropout mask, corresponding to a different random subset of the connections in the full model.  Usually we only perform dropout during training and use a single averaged mask for prediction.  But instead, let's use dropout for prediction too.  We can compute the output for lots of different dropout masks, then see how much the predictions vary.  This turns out to give a reasonable estimate of the epistemic uncertainty in the outputs.
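
As a minimal illustrative sketch (assuming a Keras model `keras_model` that contains `Dropout` layers and an input batch `x_batch`, neither of which is defined in this notebook; DeepChem's `predict_uncertainty()` handles all of this for you):

%% Cell type:code id: tags:

``` python
# Monte Carlo dropout: evaluate the model many times with dropout active.
# Passing training=True keeps the dropout masks on at prediction time.
mc_preds = np.stack([keras_model(x_batch, training=True).numpy()
                     for _ in range(50)])
mean_pred = mc_preds.mean(axis=0)       # averaged prediction
epistemic_std = mc_preds.std(axis=0)    # spread across dropout masks
```

%% Cell type:markdown id: tags: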

### Uncertain Uncertainty?

Now we can combine the two types of uncertainty to compute an overall estimate of the error in each output:

$$\sigma_\text{total} = \sqrt{\sigma_\text{aleatoric}^2 + \sigma_\text{epistemic}^2}$$

This is the value DeepChem reports.  But how much can you trust it?  Remember how I started this tutorial: deep learning models should not be used as black boxes.  We want to know how reliable the outputs are.  Adding uncertainty estimates does not completely eliminate the problem; it just adds a layer of indirection.  Now we have estimates of how reliable the outputs are, but no guarantees that those estimates are themselves reliable.

Let's go back to the example we started with.  We trained a model on the SAMPL training set, then generated predictions and uncertainties for the test set.  Since we know the correct outputs for all the test samples, we can evaluate how well we did.  Here is a plot of the absolute error in the predicted output versus the predicted uncertainty.

%% Cell type:code id: tags:

``` python
abs_error = np.abs(y_pred.flatten()-test_dataset.y.flatten())
plot.scatter(y_std.flatten(), abs_error)
plot.xlabel('Standard Deviation')
plot.ylabel('Absolute Error')
plot.show()
```

%% Output


%% Cell type:markdown id: tags:

The first thing we notice is that the axes have similar ranges.  The model clearly has learned the overall magnitude of errors in the predictions.  There also is clearly a correlation between the axes.  Values with larger uncertainties tend on average to have larger errors.

Now let's see how well the values satisfy the expected distribution.  If the standard deviations are correct, and if the errors are normally distributed (which is certainly not guaranteed to be true!), we expect 95% of the values to be within two standard deviations, and 99% to be within three standard deviations.  Here is a histogram of errors as measured in standard deviations.

%% Cell type:code id: tags:

``` python
plot.hist(abs_error/y_std.flatten(), 20)
plot.show()
```

%% Output


%% Cell type:markdown id: tags:

Most of the values are in the expected range, but there are a handful of outliers at much larger values.  Perhaps this indicates the errors are not normally distributed, but it may also mean a few of the uncertainties are too low.  This is an important reminder: the uncertainties are just estimates, not rigorous measurements.  Most of them are pretty good, but you should not put too much confidence in any single value.

%% Cell type:markdown id: tags:

# Congratulations on finishing the series! Time to join the Community!

Congratulations on completing this tutorial notebook! This is currently the last tutorial in the DeepChem introductory tutorial series. If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to join the DeepChem Community and get involved:

## Star DeepChem on GitHub
Starring DeepChem on GitHub helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!