Commit dd73aac7 authored by Bharath Ramsundar, committed by GitHub

Merge pull request #2169 from peastman/tutorials

Overhaul tutorial sequence
parents 9f3d3995 391071b6
+1 −1
@@ -84,7 +84,7 @@ def load_tox21(featurizer='ECFP',

  loader = deepchem.data.CSVLoader(
      tasks=tox21_tasks, feature_field="smiles", featurizer=featurizer)
-  dataset = loader.featurize(dataset_file, shard_size=8192)
+  dataset = loader.create_dataset(dataset_file, shard_size=8192)

  if split == None:
    # Initialize transformers
+373 −1408

File changed.

Preview size limit exceeded, changes collapsed.

+997 −0

File added.

Preview size limit exceeded, changes collapsed.

+0 −707

File deleted.

Preview size limit exceeded, changes collapsed.

+155 −0
%% Cell type:markdown id: tags:

# Tutorial 4: Molecular Fingerprints

Molecules can be represented in many ways.  This tutorial introduces a type of representation called a "molecular fingerprint".  It is a very simple representation that often works well for small drug-like molecules.

## Colab

This tutorial and the rest in this sequence can be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/04_Molecular_Fingerprints.ipynb)


## Setup

To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
```

%% Cell type:markdown id: tags:

We can now import the `deepchem` package to play with.

%% Cell type:code id: tags:

``` python
import deepchem as dc
dc.__version__
```

%% Output

    '2.4.0-rc1.dev'

%% Cell type:markdown id: tags:

# What is a Fingerprint?

Deep learning models almost always take arrays of numbers as their inputs.  If we want to process molecules with them, we somehow need to represent each molecule as one or more arrays of numbers.

Many (but not all) types of models require their inputs to have a fixed size.  This can be a challenge for molecules, since different molecules have different numbers of atoms.  If we want to use these types of models, we somehow need to represent variable sized molecules with fixed sized arrays.

Fingerprints are designed to address these problems.  A fingerprint is a fixed length array, where different elements indicate the presence of different features in the molecule.  If two molecules have similar fingerprints, that indicates they contain many of the same features, and therefore will likely have similar chemistry.
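To make "similar fingerprints" concrete, here is a minimal sketch that compares toy binary fingerprints with the Tanimoto (Jaccard) coefficient. That measure is a common choice for bit-vector fingerprints, but it is an assumption here: this tutorial does not prescribe any particular similarity measure, and the 8-bit arrays are made up for illustration.

```python
import numpy as np

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    a = np.asarray(fp_a, dtype=bool)
    b = np.asarray(fp_b, dtype=bool)
    shared = np.logical_and(a, b).sum()  # bits set in both fingerprints
    total = np.logical_or(a, b).sum()    # bits set in at least one
    return 1.0 if total == 0 else shared / total

# Toy 8-bit fingerprints (hypothetical; real ECFPs are much longer).
fp1 = np.array([1, 0, 1, 1, 0, 0, 1, 0])
fp2 = np.array([1, 0, 1, 0, 0, 0, 1, 0])
fp3 = np.array([0, 1, 0, 0, 1, 1, 0, 0])

print(tanimoto(fp1, fp2))  # 0.75: three shared bits out of four set in either
print(tanimoto(fp1, fp3))  # 0.0: no features in common
```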

DeepChem supports a particular type of fingerprint called an "Extended Connectivity Fingerprint", or "ECFP" for short.  They are also sometimes called "circular fingerprints".  The ECFP algorithm begins by classifying atoms based only on their direct properties and bonds.  Each unique pattern is a feature.  For example, "carbon atom bonded to two hydrogens and two heavy atoms" would be a feature, and a particular element of the fingerprint is set to 1 for any molecule that contains that feature.  The algorithm then iteratively identifies new features by looking at larger circular neighborhoods.  One specific feature bonded to two other specific features becomes a higher level feature, and the corresponding element is set for any molecule that contains it.  This continues for a fixed number of iterations, most often two.
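The iteration described above can be sketched in plain Python. This is a toy illustration, not the ECFP implementation that DeepChem uses. The function names, the atom description (element plus number of bonded heavy atoms), the SHA-1 hashing, and the 64-bit length are all assumptions chosen to keep the example short.

```python
import hashlib

def stable_hash(text):
    """Deterministic integer hash (the built-in hash() of a str varies between runs)."""
    return int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)

def toy_circular_fingerprint(atoms, bonds, n_iterations=2, size=64):
    """Toy fingerprint. atoms: element symbols; bonds: (i, j) atom index pairs."""
    neighbors = [[] for _ in atoms]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Start: classify each atom by its element and how many atoms it is bonded to.
    ids = [stable_hash(f"{atoms[i]}:{len(neighbors[i])}") for i in range(len(atoms))]
    features = set(ids)

    # Each iteration merges an atom's identifier with those of its neighbors,
    # so the new identifier describes a neighborhood one bond larger.
    for _ in range(n_iterations):
        new_ids = []
        for i in range(len(atoms)):
            around = sorted(ids[j] for j in neighbors[i])
            new_ids.append(stable_hash(f"{ids[i]}|{around}"))
        ids = new_ids
        features.update(ids)

    # Fold every feature identifier into a fixed-length bit vector.
    bits = [0] * size
    for feature in features:
        bits[feature % size] = 1
    return bits

# Heavy atoms of ethanol, written in two different atom orders.
ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_reordered = toy_circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
print(len(ethanol), ethanol == ethanol_reordered)  # 64 True
```

Because each identifier depends only on an atom's own identifier and the sorted identifiers of its neighbors, the result does not depend on the order in which the atoms are listed. The fixed `size` is what lets molecules with different numbers of atoms share one array length.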

Let's take a look at a dataset that has been featurized with ECFP.

%% Cell type:code id: tags:

``` python
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)
```

%% Output

    <DiskDataset X.shape: (6264, 1024), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>

%% Cell type:markdown id: tags:

The feature array `X` has shape (6264, 1024).  That means there are 6264 samples in the training set.  Each one is represented by a fingerprint of length 1024.  Also notice that the label array `y` has shape (6264, 12): this is a multitask dataset.  Tox21 contains information about the toxicity of molecules.  12 different assays were used to look for signs of toxicity.  The dataset records the results of all 12 assays, each as a different task.

Let's also take a look at the weights array.

%% Cell type:code id: tags:

``` python
train_dataset.w
```

%% Output

    array([[1.0433141624730409, 1.0369942196531792, 8.53921568627451, ...,
            1.060388945752303, 1.1895710249165168, 1.0700990099009902],
           [1.0433141624730409, 1.0369942196531792, 1.1326397919375812, ...,
            0.0, 1.1895710249165168, 1.0700990099009902],
           [0.0, 0.0, 0.0, ..., 1.060388945752303, 0.0, 0.0],
           ...,
           [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
           [1.0433141624730409, 1.0369942196531792, 8.53921568627451, ...,
            1.060388945752303, 0.0, 0.0],
           [1.0433141624730409, 1.0369942196531792, 1.1326397919375812, ...,
            1.060388945752303, 1.1895710249165168, 1.0700990099009902]],
          dtype=object)

%% Cell type:markdown id: tags:

Notice that some elements are 0.  The weights are being used to indicate missing data.  Not all assays were actually performed on every molecule.  Setting the weight for a sample or sample/task pair to 0 causes it to be ignored during fitting and evaluation.  It will have no effect on the loss function or other metrics.
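A small NumPy sketch shows why a zero weight removes a sample/task pair from the loss. The weighted log loss below is an assumption made for illustration (including the choice to normalize by the sum of the weights); it is not claimed to be the exact loss DeepChem computes.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, w):
    """Weighted binary cross-entropy averaged over all sample/task pairs."""
    eps = 1e-7
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    per_entry = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return (w * per_entry).sum() / w.sum()

# Two samples, two tasks (all values are made up).
y = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[0.9, 0.2], [0.3, 0.6]])
w = np.array([[1.0, 1.0], [1.0, 0.0]])  # last entry: assay not performed

loss = weighted_log_loss(y, p, w)

# Changing the prediction for the zero-weight entry leaves the loss unchanged.
p_changed = p.copy()
p_changed[1, 1] = 0.01
loss_changed = weighted_log_loss(y, p_changed, w)
print(np.isclose(loss, loss_changed))  # True
```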

Most of the other weights are close to 1, but not exactly 1.  This is done to balance the overall weight of positive and negative samples on each task.  When training the model, we want each of the 12 tasks to contribute equally, and on each task we want to put equal weight on positive and negative samples.  Otherwise, the model might just learn that most of the training samples are non-toxic, and therefore become biased toward identifying other molecules as non-toxic.
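One way to obtain such balanced weights is to give every labeled sample a weight equal to the number of labeled samples on its task divided by the number of labeled samples in its class. The sketch below assumes that scheme. It is consistent with the values printed above, but it is not taken from DeepChem's source, and the function name and arguments are hypothetical.

```python
import numpy as np

def balancing_weights(y, observed):
    """y: (n_samples, n_tasks) 0/1 labels; observed: 0/1 flags, 1 where a label exists."""
    w = np.zeros(y.shape, dtype=float)
    for t in range(y.shape[1]):
        present = observed[:, t] == 1
        n_total = present.sum()
        for cls in (0, 1):
            members = present & (y[:, t] == cls)
            if members.any():
                # Every class ends up with the same total weight: n_total.
                w[members, t] = n_total / members.sum()
    return w

# One task with 769 negatives, 102 positives, and 5 missing labels.
# These counts are illustrative: they were chosen because they reproduce
# two of the weights shown above, not because the tutorial states them.
y = np.concatenate([np.zeros(769), np.ones(102), np.zeros(5)]).reshape(-1, 1)
observed = np.concatenate([np.ones(871), np.zeros(5)]).reshape(-1, 1)

w = balancing_weights(y, observed)
print(f"{w[0, 0]:.4f} {w[769, 0]:.4f} {w[-1, 0]:.1f}")  # 1.1326 8.5392 0.0
```

The rare positive class receives the larger weight, so the positives and negatives contribute the same total weight, while samples without a label keep weight 0.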

# Training a Model on Fingerprints

Let's train a model.  In earlier tutorials we used `GraphConvModel`, which is a fairly complicated architecture that takes a complex set of inputs.  Because fingerprints are so simple, just a single fixed length array, we can use a much simpler type of model.

%% Cell type:code id: tags:

``` python
model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])
```

%% Cell type:markdown id: tags:

`MultitaskClassifier` is a simple stack of fully connected layers.  In this example we tell it to use a single hidden layer of width 1000.  We also tell it that each input will have 1024 features, and that it should produce predictions for 12 different tasks.
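To show how the sizes fit together, here is a rough NumPy sketch of a forward pass through a network with the same dimensions. It is not DeepChem's implementation: the random weights, the ReLU activation, and the two-class softmax per task are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, hidden, n_tasks, n_classes = 1024, 1000, 12, 2

# Randomly initialized parameters; a real model would learn these.
W1 = rng.normal(scale=0.02, size=(n_features, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.02, size=(hidden, n_tasks * n_classes))
b2 = np.zeros(n_tasks * n_classes)

def forward(X):
    """Map fingerprints of shape (n, 1024) to class probabilities of shape (n, 12, 2)."""
    h = np.maximum(X @ W1 + b1, 0.0)  # hidden layer with ReLU
    logits = (h @ W2 + b2).reshape(-1, n_tasks, n_classes)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)  # softmax within each task

# Four random binary "fingerprints" stand in for real molecules.
X = rng.integers(0, 2, size=(4, n_features)).astype(float)
probs = forward(X)
print(probs.shape)  # (4, 12, 2)
```

All 12 tasks share the same hidden layer, and only the final layer has separate outputs for each task.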

Why not train a separate model for each task?  We could do that, but it turns out that training a single model for multiple tasks often works better.  We will see an example of that in a later tutorial.

Let's train and evaluate the model.

%% Cell type:code id: tags:

``` python
import numpy as np

model.fit(train_dataset, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))
```

%% Output

    training set score: {'roc_auc_score': 0.9550063590563469}
    test set score: {'roc_auc_score': 0.7781819573695475}

%% Cell type:markdown id: tags:

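That is not bad performance for such a simple model and featurization, although the gap between the training score (about 0.96) and the test score (about 0.78) shows that the model overfits the training set to some degree.  More sophisticated models do slightly better on this dataset, but not enormously better.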
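For readers unfamiliar with the metric, ROC AUC for a single task can be read as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below computes it directly from that definition. It ignores sample weights and the averaging over the 12 tasks, so it is a simplified illustration, not a replacement for `dc.metrics.roc_auc_score`.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()  # ties count as half
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.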

%% Cell type:markdown id: tags:

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!