Commit c72bf8ec authored by Arun

removed tutorial no. in notebook [skip ci]

parent aa4beb99
......@@ -7,7 +7,7 @@
"id": "tTuYGOlnh117"
},
"source": [
"# Tutorial Part 9: Advanced Model Training\n",
"# Advanced Model Training\n",
"\n",
"In the tutorials so far we have followed a simple procedure for training models: load a dataset, create a model, call `fit()`, evaluate it, and call ourselves done. That's fine for an example, but in real machine learning projects the process is usually more complicated. In this tutorial we will look at a more realistic workflow for training a model.\n",
"\n",
......
%% Cell type:markdown id: tags:
# Tutorial Part 9: Advanced Model Training
# Advanced Model Training
In the tutorials so far we have followed a simple procedure for training models: load a dataset, create a model, call `fit()`, evaluate it, and call ourselves done. That's fine for an example, but in real machine learning projects the process is usually more complicated. In this tutorial we will look at a more realistic workflow for training a model.
## Colab
This tutorial and the rest in this sequence can be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/09_Advanced_Model_Training.ipynb)
## Setup
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine.
%% Cell type:code id: tags:
``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```
%% Cell type:code id: tags:
``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```
%% Cell type:markdown id: tags:
## Hyperparameter Optimization
Let's start by loading the HIV dataset. It classifies over 40,000 molecules based on whether they inhibit HIV replication.
%% Cell type:code id: tags:
``` python
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_hiv(featurizer='ECFP', split='scaffold')
train_dataset, valid_dataset, test_dataset = datasets
```
%% Cell type:markdown id: tags:
Now let's train a model on it. We will use a `MultitaskClassifier`, which is just a stack of dense layers. But that still leaves a lot of options. How many layers should there be, and how wide should each one be? What dropout rate should we use? What learning rate?
These are called hyperparameters. The standard way to select them is to try lots of values, train each model on the training set, and evaluate it on the validation set. This lets us see which ones work best.
You could do that by hand, but usually it's easier to let the computer do it for you. DeepChem provides a selection of hyperparameter optimization algorithms, which are found in the `dc.hyper` package. For this example we'll use `GridHyperparamOpt`, which is the most basic method. We just give it a list of options for each hyperparameter and it exhaustively tries all combinations of them.
The lists of options are defined by a `dict` that we provide. For each of the model's arguments, we provide a list of values to try. In this example we consider three possible sets of hidden layers: a single layer of width 500, a single layer of width 1000, or two layers each of width 1000. We also consider two dropout rates (20% and 50%) and two learning rates (0.001 and 0.0001).
%% Cell type:code id: tags:
``` python
params_dict = {
    'n_tasks': [len(tasks)],
    'n_features': [1024],
    'layer_sizes': [[500], [1000], [1000, 1000]],
    'dropouts': [0.2, 0.5],
    'learning_rate': [0.001, 0.0001]
}
optimizer = dc.hyper.GridHyperparamOpt(dc.models.MultitaskClassifier)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
best_model, best_hyperparams, all_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, metric, transformers)
```
%% Cell type:markdown id: tags:
`hyperparam_search()` returns three values: the best model it found, the hyperparameters for that model, and a full listing of the validation score for every model. Let's take a look at the last one.
%% Cell type:code id: tags:
``` python
all_results
```
%%%% Output: execute_result
{'_dropouts_0.200000_layer_sizes[500]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.759624393738977,
'_dropouts_0.200000_layer_sizes[500]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7680791323731138,
'_dropouts_0.500000_layer_sizes[500]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.7623870149911817,
'_dropouts_0.500000_layer_sizes[500]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7552282358416618,
'_dropouts_0.200000_layer_sizes[1000]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.7689915858318636,
'_dropouts_0.200000_layer_sizes[1000]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7619292572996277,
'_dropouts_0.500000_layer_sizes[1000]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.7641491524593376,
'_dropouts_0.500000_layer_sizes[1000]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7609877155594749,
'_dropouts_0.200000_layer_sizes[1000, 1000]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.770716980207721,
'_dropouts_0.200000_layer_sizes[1000, 1000]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7750327625906329,
'_dropouts_0.500000_layer_sizes[1000, 1000]_learning_rate_0.001000_n_features_1024_n_tasks_1': 0.725972314079953,
'_dropouts_0.500000_layer_sizes[1000, 1000]_learning_rate_0.000100_n_features_1024_n_tasks_1': 0.7546280986674505}
%% Cell type:markdown id: tags:
We can see a few general patterns. Using two layers with the larger learning rate doesn't work very well. It seems the deeper model requires a smaller learning rate. We also see that 20% dropout usually works better than 50%. Once we narrow down the list of models based on these observations, all the validation scores are very close to each other, probably close enough that the remaining variation is mainly noise. It doesn't seem to make much difference which of the remaining hyperparameter sets we use, so let's arbitrarily pick a single layer of width 1000 and learning rate of 0.0001.
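A quick way to confirm this programmatically is to pull the top-scoring combination straight out of `all_results` (a minimal sketch; your exact numbers will differ from run to run):
%% Cell type:code id: tags:
``` python
# The key encodes the hyperparameter values; the value is the validation ROC AUC
best_key = max(all_results, key=all_results.get)
print(best_key, all_results[best_key])
```
%% Cell type:markdown id: tags: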
## Early Stopping
There is one other important hyperparameter we haven't considered yet: how long to train the model for. `GridHyperparamOpt` trains each model for a fixed, fairly small number of epochs. That isn't necessarily the best number.
You might expect that the longer you train, the better your model will get, but that isn't usually true. If you train too long, the model will usually start overfitting to irrelevant details of the training set. You can tell when this happens because the validation set score stops increasing and may even decrease, while the score on the training set continues to improve.
Fortunately, we don't need to train lots of different models for different numbers of steps to identify the optimal number. We just train the model once, monitor the validation score, and keep whichever parameters maximize it. This is called "early stopping". DeepChem's `ValidationCallback` class can do this for us automatically. In the example below, we have it compute the validation set's ROC AUC every 1000 training steps. If you add the `save_dir` argument, it will also save a copy of the best model parameters to disk.
%% Cell type:code id: tags:
``` python
model = dc.models.MultitaskClassifier(n_tasks=len(tasks),
                                      n_features=1024,
                                      layer_sizes=[1000],
                                      dropouts=0.2,
                                      learning_rate=0.0001)
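# compute the validation ROC AUC every 1000 steps; pass save_dir=... to also save the best parameters to disk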
callback = dc.models.ValidationCallback(valid_dataset, 1000, metric)
model.fit(train_dataset, nb_epoch=50, callbacks=callback)
```
%%%% Output: stream
Step 1000 validation: roc_auc_score=0.759757
Step 2000 validation: roc_auc_score=0.770685
Step 3000 validation: roc_auc_score=0.771588
Step 4000 validation: roc_auc_score=0.777862
Step 5000 validation: roc_auc_score=0.773894
Step 6000 validation: roc_auc_score=0.763762
Step 7000 validation: roc_auc_score=0.766361
Step 8000 validation: roc_auc_score=0.767026
Step 9000 validation: roc_auc_score=0.761239
Step 10000 validation: roc_auc_score=0.761279
Step 11000 validation: roc_auc_score=0.765363
Step 12000 validation: roc_auc_score=0.769481
Step 13000 validation: roc_auc_score=0.768523
Step 14000 validation: roc_auc_score=0.761306
Step 15000 validation: roc_auc_score=0.77397
Step 16000 validation: roc_auc_score=0.764848
%%%% Output: execute_result
0.8040038299560547
%% Cell type:markdown id: tags:
## Learning Rate Schedules
In the examples above we use a fixed learning rate throughout training. In some cases it works better to vary the learning rate during training. To do this in DeepChem, we simply specify a `LearningRateSchedule` object instead of a number for the `learning_rate` argument. In the following example we use a learning rate that decreases exponentially. It starts at 0.0002, then gets multiplied by 0.9 after every 1000 steps.
%% Cell type:code id: tags:
``` python
learning_rate = dc.models.optimizers.ExponentialDecay(0.0002, 0.9, 1000)
model = dc.models.MultitaskClassifier(n_tasks=len(tasks),
                                      n_features=1024,
                                      layer_sizes=[1000],
                                      dropouts=0.2,
                                      learning_rate=learning_rate)
model.fit(train_dataset, nb_epoch=50, callbacks=callback)
```
%%%% Output: stream
Step 1000 validation: roc_auc_score=0.736547
Step 2000 validation: roc_auc_score=0.758979
Step 3000 validation: roc_auc_score=0.768361
Step 4000 validation: roc_auc_score=0.764898
Step 5000 validation: roc_auc_score=0.775253
Step 6000 validation: roc_auc_score=0.779898
Step 7000 validation: roc_auc_score=0.76991
Step 8000 validation: roc_auc_score=0.771515
Step 9000 validation: roc_auc_score=0.773796
Step 10000 validation: roc_auc_score=0.776977
Step 11000 validation: roc_auc_score=0.778866
Step 12000 validation: roc_auc_score=0.777066
Step 13000 validation: roc_auc_score=0.77616
Step 14000 validation: roc_auc_score=0.775646
Step 15000 validation: roc_auc_score=0.772785
Step 16000 validation: roc_auc_score=0.769975
%%%% Output: execute_result
0.22854619979858398
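%% Cell type:markdown id: tags:
To get a feel for how fast this schedule decays: the rate is multiplied by 0.9 after each block of 1000 steps (assuming the default staircase behavior), so after the 16,000 steps above it has fallen to under a fifth of its starting value.
%% Cell type:code id: tags:
``` python
# Decayed learning rate after 16,000 steps: 0.0002 * 0.9**16 ≈ 3.7e-05
print(0.0002 * 0.9 ** (16000 // 1000))
```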
%% Cell type:markdown id: tags:
# Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!
......
......@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial 3: An Introduction To MoleculeNet\n",
"# An Introduction To MoleculeNet\n",
"\n",
"One of the most powerful features of DeepChem is that it comes \"batteries included\" with datasets to use. The DeepChem developer community maintains the MoleculeNet [1] suite of datasets which maintains a large collection of different scientific datasets for use in machine learning applications. The original MoleculeNet suite had 17 datasets mostly focused on molecular properties. Over the last several years, MoleculeNet has evolved into a broader collection of scientific datasets to facilitate the broad use and development of scientific machine learning tools.\n",
"\n",
......
%% Cell type:markdown id: tags:
# Tutorial 3: An Introduction To MoleculeNet
# An Introduction To MoleculeNet
One of the most powerful features of DeepChem is that it comes "batteries included" with datasets to use. The DeepChem developer community maintains the MoleculeNet [1] suite of datasets which maintains a large collection of different scientific datasets for use in machine learning applications. The original MoleculeNet suite had 17 datasets mostly focused on molecular properties. Over the last several years, MoleculeNet has evolved into a broader collection of scientific datasets to facilitate the broad use and development of scientific machine learning tools.
These datasets are integrated with the rest of the DeepChem suite, so you can conveniently access them through functions in the `dc.molnet` submodule. You've already seen a few of these loaders as you've worked through the tutorial series. The full documentation for the MoleculeNet suite is available in our docs [2].
[1] Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical science 9.2 (2018): 513-530.
[2] https://deepchem.readthedocs.io/en/latest/moleculenet.html
## Colab
This tutorial and the rest in this sequence can be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/03_An_Introduction_To_MoleculeNet.ipynb)
## Setup
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine.
%% Cell type:code id: tags:
``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```
%% Cell type:code id: tags:
``` python
!pip install --pre deepchem
```
%% Cell type:markdown id: tags:
We can now import the `deepchem` package to play with.
%% Cell type:code id: tags:
``` python
import deepchem as dc
dc.__version__
```
%%%% Output: execute_result
'2.4.0-rc1.dev'
%% Cell type:markdown id: tags:
# MoleculeNet Overview
In the last two tutorials we loaded the Delaney dataset of molecular solubilities. Let's load it one more time.
%% Cell type:code id: tags:
``` python
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv', splitter='random')
```
%% Cell type:markdown id: tags:
Notice that the loader function we invoked, `dc.molnet.load_delaney`, lives in the `dc.molnet` submodule of MoleculeNet loaders. Let's take a look at the full collection of loaders available to us.
%% Cell type:code id: tags:
``` python
[method for method in dir(dc.molnet) if "load_" in method]
```
%%%% Output: execute_result
['load_bace_classification',
'load_bace_regression',
'load_bandgap',
'load_bbbc001',
'load_bbbc002',
'load_bbbp',
'load_cell_counting',
'load_chembl',
'load_chembl25',
'load_clearance',
'load_clintox',
'load_delaney',
'load_factors',
'load_function',
'load_hiv',
'load_hopv',
'load_hppb',
'load_kaggle',
'load_kinase',
'load_lipo',
'load_mp_formation_energy',
'load_mp_metallicity',
'load_muv',
'load_nci',
'load_pcba',
'load_pcba_146',
'load_pcba_2475',
'load_pdbbind',
'load_pdbbind_from_dir',
'load_pdbbind_grid',
'load_perovskite',
'load_ppb',
'load_qm7',
'load_qm7_from_mat',
'load_qm7b_from_mat',
'load_qm8',
'load_qm9',
'load_sampl',
'load_sider',
'load_sweet',
'load_thermosol',
'load_tox21',
'load_toxcast',
'load_uspto',
'load_uv',
'load_zinc15']
%% Cell type:markdown id: tags:
The set of MoleculeNet loaders is actively maintained by the DeepChem community, and we work on adding new datasets to the collection. Let's see how many datasets there are in MoleculeNet today.
%% Cell type:code id: tags:
``` python
len([method for method in dir(dc.molnet) if "load_" in method])
```
%%%% Output: execute_result
46
%% Cell type:markdown id: tags:
# MoleculeNet Dataset Categories
There are a lot of different datasets in MoleculeNet. Let's do a quick overview of the different types of datasets available. We'll break datasets into different categories and list loaders which belong to those categories. More details on each of these datasets can be found at https://deepchem.readthedocs.io/en/latest/moleculenet.html. The original MoleculeNet paper [1] provides details about a subset of these datasets. We've marked those datasets as "V1" below. All remaining datasets are "V2" and not documented in the older paper.
## Quantum Mechanical Datasets
MoleculeNet's quantum mechanical datasets contain various quantum mechanical property prediction tasks. The current set of quantum mechanical datasets includes QM7, QM7b, QM8, and QM9. The associated loaders are:
- [`dc.molnet.load_qm7`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_qm7): V1
- [`dc.molnet.load_qm7b_from_mat`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_qm7): V1
- [`dc.molnet.load_qm8`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_qm8): V1
- [`dc.molnet.load_qm9`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_qm9): V1
## Physical Chemistry Datasets
The physical chemistry dataset collection contains a variety of tasks for predicting various physical properties of molecules.
- [`dc.molnet.load_delaney`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_delaney): V1. This dataset is also referred to as ESOL in the original paper.
- [`dc.molnet.load_sampl`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_sampl): V1. This dataset is also referred to as FreeSolv in the original paper.
- [`dc.molnet.load_lipo`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_lipo): V1. This dataset is also referred to as Lipophilicity in the original paper.
- [`dc.molnet.load_thermosol`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_thermosol): V2.
- [`dc.molnet.load_hppb`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_hppb): V2.
- [`dc.molnet.load_hopv`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_hopv): V2. This dataset is drawn from a recent publication [3].
## Chemical Reaction Datasets
These datasets hold chemical reactions for use in computational retrosynthesis / forward synthesis.
- [`dc.molnet.load_uspto`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_uspto)
## Biochemical/Biophysical Datasets
These datasets are drawn from various biochemical/biophysical assays that measure things like the binding affinity of compounds to proteins.
- [`dc.molnet.load_pcba`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_pcba): V1
- [`dc.molnet.load_nci`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_nci): V2.
- [`dc.molnet.load_muv`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_muv): V1
- [`dc.molnet.load_hiv`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_hiv): V1
- [`dc.molnet.load_ppb`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#ppb-datasets): V2.
- [`dc.molnet.load_bace_classification`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bace_classification): V1. This loader loads the classification task for the BACE dataset from the original MoleculeNet paper.
- [`dc.molnet.load_bace_regression`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bace_regression): V1. This loader loads the regression task for the BACE dataset from the original MoleculeNet paper.
- [`dc.molnet.load_kaggle`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_kaggle): V2. This dataset is from Merck's drug discovery kaggle contest and is described in [4].
- [`dc.molnet.load_factors`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_factors): V2. This dataset is from [4].
- [`dc.molnet.load_uv`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_uv): V2. This dataset is from [4].
- [`dc.molnet.load_kinase`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_kinase): V2. This dataset is from [4].
## Molecular Catalog Datasets
These loaders provide collections of molecules which have no associated properties beyond the raw SMILES string or structure. These types of datasets are useful for generative modeling tasks.
- [`dc.molnet.load_zinc15`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_zinc15): V2
- [`dc.molnet.load_chembl`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_chembl): V2
- [`dc.molnet.load_chembl25`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#chembl25-datasets): V2
## Physiology Datasets
These datasets measure physiological properties describing how molecules interact with the human body.
- [`dc.molnet.load_bbbp`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bbbp): V1
- [`dc.molnet.load_tox21`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_tox21): V1
- [`dc.molnet.load_toxcast`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_toxcast): V1
- [`dc.molnet.load_sider`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_sider): V1
- [`dc.molnet.load_clintox`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_clintox): V1
- [`dc.molnet.load_clearance`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_clearance): V2.
## Structural Biology Datasets
These datasets contain 3D structures of macromolecules along with associated properties.
- [`dc.molnet.load_pdbbind`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_pdbbind): V1
## Microscopy Datasets
These datasets contain microscopy images, typically of cell lines. These datasets were not in the original MoleculeNet paper.
- [`dc.molnet.load_bbbc001`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bbbc001): V2
- [`dc.molnet.load_bbbc002`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bbbc002): V2
- [`dc.molnet.load_cell_counting`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#cell-counting-datasets): V2
## Materials Properties Datasets
These datasets contain computed properties of various materials.
- [`dc.molnet.load_bandgap`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bandgap): V2
- [`dc.molnet.load_perovskite`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_perovskite): V2
- [`dc.molnet.load_mp_formation_energy`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_mp_formation_energy): V2
- [`dc.molnet.load_mp_metallicity`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_mp_metallicity): V2
[3] Lopez, Steven A., et al. "The Harvard organic photovoltaic dataset." Scientific data 3.1 (2016): 1-7.
[4] Ramsundar, Bharath, et al. "Is multitask deep learning practical for pharma?." Journal of chemical information and modeling 57.8 (2017): 2068-2076.
%% Cell type:markdown id: tags:
# MoleculeNet Loaders Explained
%% Cell type:markdown id: tags:
All MoleculeNet loader functions take the form `dc.molnet.load_X`. Loader functions return a tuple of values `(tasks, datasets, transformers)`. Let's walk through each of these return values and explain what we get:
1. `tasks`: This is a list of task-names. Many datasets in MoleculeNet are "multitask". That is, a given datapoint has multiple labels associated with it. These correspond to different measurements or values associated with this datapoint.
2. `datasets`: This field is a tuple of three `dc.data.Dataset` objects `(train, valid, test)`. These correspond to the training, validation, and test set for this MoleculeNet dataset.
3. `transformers`: This field is a list of `dc.trans.Transformer` objects which were applied to this dataset during processing.
This is abstract so let's take a look at each of these fields for the `dc.molnet.load_delaney` function we invoked above. Let's start with `tasks`.
%% Cell type:code id: tags:
``` python
tasks
```
%%%% Output: execute_result
['measured log solubility in mols per litre']
%% Cell type:markdown id: tags:
We have one task in this dataset which corresponds to the measured log solubility in mol/L. Let's now take a look at `datasets`:
%% Cell type:code id: tags:
``` python
datasets
```
%%%% Output: execute_result
(<DiskDataset X.shape: (902,), y.shape: (902, 1), w.shape: (902, 1), ids: ['CCC(C)Cl' 'O=C1NC(=O)NC(=O)C1(C(C)C)CC=C' 'Oc1ccccn1' ...
'CCCCCCCC(=O)OCC' 'O=Cc1ccccc1' 'CCCC=C(CC)C=O'], task_names: ['measured log solubility in mols per litre']>,
<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['CSc1nc(nc(n1)N(C)C)N(C)C' 'CC#N' 'CCCCCCCC#C' ... 'ClCCBr'
'CCN(CC)C(=O)CSc1ccc(Cl)nn1' 'CC(=O)OC3CCC4C2CCC1=CC(=O)CCC1(C)C2CCC34C '], task_names: ['measured log solubility in mols per litre']>,
<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['CCCCc1c(C)nc(nc1O)N(C)C '
'Cc3cc2nc1c(=O)[nH]c(=O)nc1n(CC(O)C(O)C(O)CO)c2cc3C'
'CSc1nc(NC(C)C)nc(NC(C)C)n1' ... 'O=c1[nH]cnc2[nH]ncc12 '
'CC(=C)C1CC=C(C)C(=O)C1' 'OC(C(=O)c1ccccc1)c2ccccc2'], task_names: ['measured log solubility in mols per litre']>)
%% Cell type:markdown id: tags:
As we mentioned previously, we see that `datasets` is a tuple of 3 datasets. Let's split them out.
%% Cell type:code id: tags:
``` python
train, valid, test = datasets
```
%% Cell type:code id: tags:
``` python
train
```
%%%% Output: execute_result
<DiskDataset X.shape: (902,), y.shape: (902, 1), w.shape: (902, 1), ids: ['CCC(C)Cl' 'O=C1NC(=O)NC(=O)C1(C(C)C)CC=C' 'Oc1ccccn1' ...
'CCCCCCCC(=O)OCC' 'O=Cc1ccccc1' 'CCCC=C(CC)C=O'], task_names: ['measured log solubility in mols per litre']>
%% Cell type:code id: tags:
``` python
valid
```
%%%% Output: execute_result
<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['CSc1nc(nc(n1)N(C)C)N(C)C' 'CC#N' 'CCCCCCCC#C' ... 'ClCCBr'
'CCN(CC)C(=O)CSc1ccc(Cl)nn1' 'CC(=O)OC3CCC4C2CCC1=CC(=O)CCC1(C)C2CCC34C '], task_names: ['measured log solubility in mols per litre']>
%% Cell type:code id: tags:
``` python
test
```
%%%% Output: execute_result
<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['CCCCc1c(C)nc(nc1O)N(C)C '
'Cc3cc2nc1c(=O)[nH]c(=O)nc1n(CC(O)C(O)C(O)CO)c2cc3C'
'CSc1nc(NC(C)C)nc(NC(C)C)n1' ... 'O=c1[nH]cnc2[nH]ncc12 '
'CC(=C)C1CC=C(C)C(=O)C1' 'OC(C(=O)c1ccccc1)c2ccccc2'], task_names: ['measured log solubility in mols per litre']>
%% Cell type:markdown id: tags:
Let's peek into one of the datapoints in the `train` dataset.
%% Cell type:code id: tags:
``` python
train.X[0]
```
%%%% Output: execute_result
<deepchem.feat.mol_graphs.ConvMol at 0x7fe1ef601438>
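%% Cell type:markdown id: tags:
If you want to peek inside, `ConvMol` exposes the per-atom feature matrix (a quick sketch; the first dimension is the number of atoms, so the exact shape depends on the molecule):
%% Cell type:code id: tags:
``` python
# Each row is the feature vector for one atom of the molecule
train.X[0].get_atom_features().shape
```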
%% Cell type:markdown id: tags:
Note that this is a `dc.feat.mol_graphs.ConvMol` object produced by `dc.feat.ConvMolFeaturizer`. We'll say more about how to control the choice of featurization shortly. Finally, let's take a look at the `transformers` field:
%% Cell type:code id: tags:
``` python
transformers
```
%%%% Output: execute_result
[<deepchem.trans.transformers.NormalizationTransformer at 0x7fe2029bdfd0>]
%% Cell type:markdown id: tags:
So we see that one transformer was applied, the `dc.trans.NormalizationTransformer`.
After reading through this description so far, you may be wondering what choices are made under the hood. As we've briefly mentioned previously, datasets can be processed with different choices of "featurizers". Can we control the choice of featurization here? In addition, how was the source dataset split into train/valid/test as three different datasets?
You can use the 'featurizer' and 'splitter' keyword arguments and pass in different strings. Common possible choices for 'featurizer' are 'ECFP', 'GraphConv', 'Weave' and 'smiles2img', corresponding to the `dc.feat.CircularFingerprint`, `dc.feat.ConvMolFeaturizer`, `dc.feat.WeaveFeaturizer` and `dc.feat.SmilesToImage` featurizers. Common possible choices for 'splitter' are `None`, 'index', 'random', 'scaffold' and 'stratified', corresponding to no split, `dc.splits.IndexSplitter`, `dc.splits.RandomSplitter`, `dc.splits.ScaffoldSplitter` and `dc.splits.SingletaskStratifiedSplitter`. We haven't talked much about splitters yet, but intuitively they're a way to partition a dataset based on different criteria. We'll say more in a future tutorial.
Instead of a string, you can also pass in any `Featurizer` or `Splitter` object. This is very useful when, for example, a featurizer has constructor arguments you can use to customize its behavior.
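For example, here is a minimal sketch of passing a `Featurizer` object, using the `radius` and `size` constructor arguments of `dc.feat.CircularFingerprint` to customize the fingerprint (featurizing will take a minute or two):
%% Cell type:code id: tags:
``` python
# A customized featurizer passed in place of the 'ECFP' string
custom_featurizer = dc.feat.CircularFingerprint(radius=4, size=2048)
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer=custom_featurizer, splitter='scaffold')
```
%% Cell type:markdown id: tags:
The string shortcuts cover the common cases: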
%% Cell type:code id: tags:
``` python
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="ECFP", splitter="scaffold")
```
%% Cell type:code id: tags:
``` python
(train, valid, test) = datasets
```
%% Cell type:code id: tags:
``` python
train
```
%%%% Output: execute_result
<DiskDataset X.shape: (902, 1024), y.shape: (902, 1), w.shape: (902, 1), ids: ['CC(C)=CCCC(C)=CC(=O)' 'CCCC=C' 'CCCCCCCCCCCCCC' ...
'Nc2cccc3nc1ccccc1cc23 ' 'C1CCCCCC1' 'OC1CCCCCC1'], task_names: ['measured log solubility in mols per litre']>
%% Cell type:code id: tags:
``` python
train.X[0]
```
%%%% Output: execute_result
array([0., 0., 0., ..., 0., 0., 0.])
%% Cell type:markdown id: tags:
Note that unlike the earlier invocation, we have numpy arrays produced by `dc.feat.CircularFingerprint` instead of `ConvMol` objects produced by `dc.feat.ConvMolFeaturizer`.
Give it a try for yourself. Try invoking MoleculeNet to load some other datasets and experiment with different featurizer/splitter options and see what happens!
%% Cell type:markdown id: tags:
# Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!
......
......@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# 28. Calculating Atomic Contributions for Molecules Based on a Graph Convolutional QSAR Model\n",
"# Calculating Atomic Contributions for Molecules Based on a Graph Convolutional QSAR Model\n",
"\n",
"In an earlier tutorial we introduced the concept of model interpretability: understanding why a model produced the result it did. In this tutorial we will learn about atomic contributions, a useful tool for interpreting models that operate on molecules.\n",
"\n",
......@@ -973,7 +973,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
......@@ -987,7 +987,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
"version": "3.8.5"
}
},
"nbformat": 4,
......@@ -7,7 +7,7 @@
"id": "gG-V_KZzqSSr"
},
"source": [
"# Tutorial Part 14: Conditional Generative Adversarial Network\n",
"# Conditional Generative Adversarial Network\n",
"\n",
"A Generative Adversarial Network (GAN) is a type of generative model. It consists of two parts called the \"generator\" and the \"discriminator\". The generator takes random values as input and transforms them into an output that (hopefully) resembles the training data. The discriminator takes a set of samples as input and tries to distinguish the real training samples from the ones created by the generator. Both of them are trained together. The discriminator tries to get better and better at telling real from false data, while the generator tries to get better and better at fooling the discriminator.\n",
"\n",
......
%% Cell type:markdown id: tags:
# Tutorial Part 14: Conditional Generative Adversarial Network
# Conditional Generative Adversarial Network
A Generative Adversarial Network (GAN) is a type of generative model. It consists of two parts called the "generator" and the "discriminator". The generator takes random values as input and transforms them into an output that (hopefully) resembles the training data. The discriminator takes a set of samples as input and tries to distinguish the real training samples from the ones created by the generator. Both of them are trained together. The discriminator tries to get better and better at telling real from false data, while the generator tries to get better and better at fooling the discriminator.
A Conditional GAN (CGAN) allows additional inputs to the generator and discriminator on which their output is conditioned. For example, this might be a class label, and the GAN tries to learn how the data distribution varies between classes.
## Colab
This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/14_Conditional_Generative_Adversarial_Networks.ipynb)
## Setup
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.
%% Cell type:code id: tags:
``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```
%% Cell type:code id: tags:
``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```
%% Cell type:markdown id: tags:
For this example, we will create a data distribution consisting of a set of ellipses in 2D, each with a random position, shape, and orientation. Each class corresponds to a different ellipse. Let's randomly generate the ellipses. For each one we select a random center position, X and Y size, and rotation angle. We then create a transformation matrix that maps the unit circle to the ellipse.
%% Cell type:code id: tags:
``` python
import deepchem as dc
import numpy as np
import tensorflow as tf
n_classes = 4
class_centers = np.random.uniform(-4, 4, (n_classes, 2))
class_transforms = []
for i in range(n_classes):
    xscale = np.random.uniform(0.5, 2)
    yscale = np.random.uniform(0.5, 2)
    angle = np.random.uniform(0, np.pi)
    m = [[xscale*np.cos(angle), -yscale*np.sin(angle)],
         [xscale*np.sin(angle), yscale*np.cos(angle)]]
    class_transforms.append(m)
class_transforms = np.array(class_transforms)
```
%% Cell type:markdown id: tags:
This function generates random data from the distribution. For each point it chooses a random class, then a random position within that class's ellipse.
%% Cell type:code id: tags:
``` python
def generate_data(n_points):
    classes = np.random.randint(n_classes, size=n_points)
    r = np.random.random(n_points)
    angle = 2*np.pi*np.random.random(n_points)
    points = (r*np.array([np.cos(angle), np.sin(angle)])).T
    points = np.einsum('ijk,ik->ij', class_transforms[classes], points)
    points += class_centers[classes]
    return classes, points
```
%% Cell type:markdown id: tags:
Let's plot a bunch of random points drawn from this distribution to see what it looks like. Points are colored based on their class label.
%% Cell type:code id: tags:
``` python
%matplotlib inline
import matplotlib.pyplot as plot
classes, points = generate_data(1000)
plot.scatter(x=points[:,0], y=points[:,1], c=classes)
```
%%%% Output: execute_result
<matplotlib.collections.PathCollection at 0x1584692d0>
%%%% Output: display_data
%% Cell type:markdown id: tags:
Now let's create the model for our CGAN. DeepChem's GAN class makes this very easy. We just subclass it and implement a few methods. The two most important are:
- `create_generator()` constructs a model implementing the generator. The model takes as input a batch of random noise plus any condition variables (in our case, the one-hot encoded class of each sample). Its output is a synthetic sample that is supposed to resemble the training data.
- `create_discriminator()` constructs a model implementing the discriminator. The model takes as input the samples to evaluate (which might be either real training data or synthetic samples created by the generator) and the condition variables. Its output is a single number for each sample, which will be interpreted as the probability that the sample is real training data.
In this case, we use very simple models. They just concatenate the inputs together and pass them through a few dense layers. Notice that the final layer of the discriminator uses a sigmoid activation. This ensures it produces an output between 0 and 1 that can be interpreted as a probability.
We also need to implement a few methods that define the shapes of the various inputs. We specify that the random noise provided to the generator should consist of ten numbers for each sample; that each data sample consists of two numbers (the X and Y coordinates of a point in 2D); and that the conditional input consists of `n_classes` numbers for each sample (the one-hot encoded class index).
%% Cell type:code id: tags:
``` python
from tensorflow.keras.layers import Concatenate, Dense, Input
class ExampleGAN(dc.models.GAN):

    def get_noise_input_shape(self):
        return (10,)

    def get_data_input_shapes(self):
        return [(2,)]

    def get_conditional_input_shapes(self):
        return [(n_classes,)]

    def create_generator(self):
        noise_in = Input(shape=(10,))
        conditional_in = Input(shape=(n_classes,))
        gen_in = Concatenate()([noise_in, conditional_in])
        gen_dense1 = Dense(30, activation=tf.nn.relu)(gen_in)
        gen_dense2 = Dense(30, activation=tf.nn.relu)(gen_dense1)
        generator_points = Dense(2)(gen_dense2)
        return tf.keras.Model(inputs=[noise_in, conditional_in], outputs=[generator_points])

    def create_discriminator(self):
        data_in = Input(shape=(2,))
        conditional_in = Input(shape=(n_classes,))
        discrim_in = Concatenate()([data_in, conditional_in])
        discrim_dense1 = Dense(30, activation=tf.nn.relu)(discrim_in)
        discrim_dense2 = Dense(30, activation=tf.nn.relu)(discrim_dense1)
        discrim_prob = Dense(1, activation=tf.sigmoid)(discrim_dense2)
        return tf.keras.Model(inputs=[data_in, conditional_in], outputs=[discrim_prob])

gan = ExampleGAN(learning_rate=1e-4)
```
%% Cell type:markdown id: tags:
Now to fit the model. We do this by calling `fit_gan()`. The argument is an iterator that produces batches of training data. More specifically, it needs to produce dicts that map all data inputs and conditional inputs to the values to use for them. In our case we can easily create as much random data as we need, so we define a generator that calls the `generate_data()` function defined above for each new batch.
%% Cell type:code id: tags:
``` python
def iterbatches(batches):
    for i in range(batches):
        classes, points = generate_data(gan.batch_size)
        classes = dc.metrics.to_one_hot(classes, n_classes)
        yield {gan.data_inputs[0]: points, gan.conditional_inputs[0]: classes}

gan.fit_gan(iterbatches(5000))
```
%%%% Output: stream
Ending global_step 999: generator average loss 0.87121, discriminator average loss 1.08472
Ending global_step 1999: generator average loss 0.968357, discriminator average loss 1.17393
Ending global_step 2999: generator average loss 0.710444, discriminator average loss 1.37858
Ending global_step 3999: generator average loss 0.699195, discriminator average loss 1.38131
Ending global_step 4999: generator average loss 0.694203, discriminator average loss 1.3871
TIMING: model fitting took 31.352 s
%% Cell type:markdown id: tags:
Have the trained model generate some data, and see how well it matches the training distribution we plotted before.
%% Cell type:code id: tags:
``` python
classes, points = generate_data(1000)
one_hot_classes = dc.metrics.to_one_hot(classes, n_classes)
gen_points = gan.predict_gan_generator(conditional_inputs=[one_hot_classes])
plot.scatter(x=gen_points[:,0], y=gen_points[:,1], c=classes)
```
%%%% Output: execute_result
<matplotlib.collections.PathCollection at 0x160dedf50>
%%%% Output: display_data
%% Cell type:markdown id: tags:
# Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!
......
......@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial Part 5: Creating Models with TensorFlow and PyTorch\n",
"# Creating Models with TensorFlow and PyTorch\n",
"\n",
"In the tutorials so far, we have used standard models provided by DeepChem. This is fine for many applications, but sooner or later you will want to create an entirely new model with an architecture you define yourself. DeepChem provides integration with both TensorFlow (Keras) and PyTorch, so you can use it with models from either of these frameworks.\n",
"\n",
......
%% Cell type:markdown id: tags:
# Tutorial Part 5: Creating Models with TensorFlow and PyTorch
# Creating Models with TensorFlow and PyTorch
In the tutorials so far, we have used standard models provided by DeepChem. This is fine for many applications, but sooner or later you will want to create an entirely new model with an architecture you define yourself. DeepChem provides integration with both TensorFlow (Keras) and PyTorch, so you can use it with models from either of these frameworks.
## Colab
This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/05_Creating_Models_with_TensorFlow_and_PyTorch.ipynb)
## Setup
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine.
%% Cell type:code id: tags:
``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```
%% Cell type:code id: tags:
``` python
!pip install --pre deepchem
```
%% Cell type:markdown id: tags:
There are actually two different approaches you can take to using TensorFlow or PyTorch models with DeepChem. It depends on whether you want to use TensorFlow/PyTorch APIs or DeepChem APIs for training and evaluating your model. For the former case, DeepChem's `Dataset` class has methods for easily adapting it to use with other frameworks. `make_tf_dataset()` returns a `tensorflow.data.Dataset` object that iterates over the data. `make_pytorch_dataset()` returns a `torch.utils.data.IterableDataset` that iterates over the data. This lets you use DeepChem's datasets, loaders, featurizers, transformers, splitters, etc. and easily integrate them into your existing TensorFlow or PyTorch code.
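As a minimal sketch of that first approach (assuming a dataset loaded as in the earlier tutorials; `batch_size` and `epochs` are arguments of `make_tf_dataset()`):
%% Cell type:code id: tags:
``` python
import deepchem as dc

# Load a dataset, then iterate over it with native TensorFlow tooling
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='ECFP', splitter='random')
train_dataset = datasets[0]
for X, y, w in train_dataset.make_tf_dataset(batch_size=64, epochs=1):
    print(X.shape, y.shape, w.shape)
    break
```
%% Cell type:markdown id: tags: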
But DeepChem also provides many other useful features. The other approach, which lets you use those features, is to wrap your model in a DeepChem `Model` object. Let's look at how to do that.
## KerasModel
`KerasModel` is a subclass of DeepChem's `Model` class. It acts as a wrapper around a `tensorflow.keras.Model`. Let's see an example of using it. For this example, we create a simple sequential model consisting of two dense layers.
%% Cell type:code id: tags:
``` python
import deepchem as dc
import tensorflow as tf
keras_model = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation='relu'),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(1)
])
model = dc.models.KerasModel(keras_model, dc.models.losses.L2Loss())
```
%% Cell type:markdown id: tags:
For this example, we used the Keras `Sequential` class. Our model consists of a dense layer with ReLU activation, 50% dropout to provide regularization, and a final layer that produces a scalar output. We also need to specify the loss function to use when training the model, in this case L<sub>2</sub> loss. We can now train and evaluate the model exactly as we would with any other DeepChem model. For example, let's load the Delaney solubility dataset. How does our model do at predicting the solubilities of molecules based on their extended-connectivity fingerprints (ECFPs)?
%% Cell type:code id: tags:
``` python
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='ECFP', splitter='random')
train_dataset, valid_dataset, test_dataset = datasets
model.fit(train_dataset, nb_epoch=50)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print('training set score:', model.evaluate(train_dataset, [metric]))
print('test set score:', model.evaluate(test_dataset, [metric]))
```
%%%% Output: stream
training set score: {'pearson_r2_score': 0.9787445597470444}
test set score: {'pearson_r2_score': 0.736905850092889}
%% Cell type:markdown id: tags:
## TorchModel
`TorchModel` works just like `KerasModel`, except it wraps a `torch.nn.Module`. Let's use PyTorch to create another model just like the previous one and train it on the same data.
%% Cell type:code id: tags:
``` python
import torch
pytorch_model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1000),
    torch.nn.ReLU(),
    torch.nn.Dropout(0.5),
    torch.nn.Linear(1000, 1)
)
model = dc.models.TorchModel(pytorch_model, dc.models.losses.L2Loss())
model.fit(train_dataset, nb_epoch=50)
print('training set score:', model.evaluate(train_dataset, [metric]))
print('test set score:', model.evaluate(test_dataset, [metric]))
```
%%%% Output: stream
training set score: {'pearson_r2_score': 0.9798256761766225}
test set score: {'pearson_r2_score': 0.7256745385608444}
%% Cell type:markdown id: tags:
## Computing Losses
Now let's see a more advanced example. In the above models, the loss was computed directly from the model's output. Often that is fine, but not always. Consider a classification model that outputs a probability distribution. While it is possible to compute the loss from the probabilities, it is more numerically stable to compute it from the logits.
To do this, we create a model that returns multiple outputs, both probabilities and logits. `KerasModel` and `TorchModel` let you specify a list of "output types". If a particular output has type `'prediction'`, that means it is a normal output that should be returned when you call `predict()`. If it has type `'loss'`, that means it should be passed to the loss function in place of the normal outputs.
Sequential models do not allow multiple outputs, so instead we use a subclassing-style model.
%% Cell type:code id: tags:
``` python
class ClassificationModel(tf.keras.Model):

    def __init__(self):
        super(ClassificationModel, self).__init__()
        self.dense1 = tf.keras.layers.Dense(1000, activation='relu')
        self.dense2 = tf.keras.layers.Dense(1)

    def call(self, inputs, training=False):
        y = self.dense1(inputs)
        if training:
            y = tf.nn.dropout(y, 0.5)
        logits = self.dense2(y)
        output = tf.nn.sigmoid(logits)
        return output, logits
keras_model = ClassificationModel()
output_types = ['prediction', 'loss']
model = dc.models.KerasModel(keras_model, dc.models.losses.SigmoidCrossEntropy(), output_types=output_types)
```
%% Cell type:markdown id: tags:
We can train our model on the BACE dataset. This is a binary classification task that tries to predict whether a molecule will inhibit the enzyme BACE-1.
%% Cell type:code id: tags:
``` python
tasks, datasets, transformers = dc.molnet.load_bace_classification(featurizer='ECFP', split='scaffold')
train_dataset, valid_dataset, test_dataset = datasets
model.fit(train_dataset, nb_epoch=100)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric]))
print('test set score:', model.evaluate(test_dataset, [metric]))
```
%%%% Output: stream
training set score: {'roc_auc_score': 0.9996116504854369}
test set score: {'roc_auc_score': 0.7701992753623188}
%% Cell type:markdown id: tags:
## Other Features
`KerasModel` and `TorchModel` have lots of other features. Here are some of the more important ones.
- Automatically saving checkpoints during training.
- Logging progress to the console, to [TensorBoard](https://www.tensorflow.org/tensorboard), or to [Weights & Biases](https://docs.wandb.com/).
- Custom loss functions that you define with a function of the form `f(outputs, labels, weights)`.
- Early stopping using the `ValidationCallback` class.
- Loading parameters from pre-trained models.
- Estimating uncertainty in model outputs.
- Identifying important features through saliency mapping.
By wrapping your own models in a `KerasModel` or `TorchModel`, you get immediate access to all these features. See the API documentation for full details on them.
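As one illustration, a custom loss only needs the signature `f(outputs, labels, weights)`, where each argument is a list of tensors. A minimal sketch under those assumptions (illustrative, not a built-in DeepChem loss):
%% Cell type:code id: tags:
``` python
# A custom mean-absolute-error loss wired into a KerasModel
def mae_loss(outputs, labels, weights):
    return tf.reduce_mean(tf.abs(outputs[0] - labels[0]) * weights[0])

mae_model = dc.models.KerasModel(
    tf.keras.Sequential([tf.keras.layers.Dense(1000, activation='relu'),
                         tf.keras.layers.Dense(1)]),
    mae_loss)
```
%% Cell type:markdown id: tags: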
# Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!
......
......@@ -7,7 +7,7 @@
"id": "6MNHvkiBl55x"
},
"source": [
"# Tutorial Part 10: Creating a High Fidelity Dataset from Experimental Data\n",
"# Creating a High Fidelity Dataset from Experimental Data\n",
"\n",
"In this tutorial, we will look at what is involved in creating a new Dataset from experimental data. As we will see, the mechanics of creating the Dataset object is only a small part of the process. Most real datasets need significant cleanup and QA before they are suitable for training models.\n",
"\n",
......
%% Cell type:markdown id: tags:
# Tutorial Part 10: Creating a High Fidelity Dataset from Experimental Data
# Creating a High Fidelity Dataset from Experimental Data
In this tutorial, we will look at what is involved in creating a new Dataset from experimental data. As we will see, the mechanics of creating the Dataset object is only a small part of the process. Most real datasets need significant cleanup and QA before they are suitable for training models.
## Colab
This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/10_Creating_a_high_fidelity_model_from_experimental_data.ipynb)
## Setup
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine.
%% Cell type:code id: tags:
``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```
%% Cell type:code id: tags:
``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```
%% Cell type:markdown id: tags:
## Working With Data Files
Suppose you were given data collected by an experimental collaborator. You would like to use this data to construct a machine learning model.
*How do you transform this data into a dataset capable of creating a useful model?*
Building models from novel data can present several challenges. Perhaps the data was not recorded in a convenient manner. Additionally, perhaps the data contains noise. This is a common occurrence with, for example, biological assays due to the large number of external variables and the difficulty and cost associated with collecting multiple samples. This is a problem because you do not want your model to fit to this noise.
Hence, there are two primary challenges:
* Parsing data
* De-noising data
In this tutorial, we will walk through an example of curating a dataset from an excel spreadsheet of experimental drug measurements. Before we dive into this example though, let's do a brief review of DeepChem's input file handling and featurization capabilities.
### Input Formats
DeepChem supports a whole range of input files. For example, accepted input formats include .csv, .sdf, .fasta, .png, and .tif. The loading for a particular file format is governed by the `Loader` class associated with that format. For example, to load a .csv file we use the `CSVLoader` class. To be read by `CSVLoader`, a .csv file must have:
1. A column containing SMILES strings.
2. A column containing an experimental measurement.
3. (Optional) A column containing a unique compound identifier.
Here's an example of a potential input file.
|Compound ID | measured log solubility in mols per litre | smiles |
|---------------|-------------------------------------------|----------------|
| benzothiazole | -1.5 | c2ccc1scnc1c2 |
Here the "smiles" column contains the SMILES string, the "measured log
solubility in mols per litre" contains the experimental measurement, and
"Compound ID" contains the unique compound identifier.
### Data Featurization
Most machine learning algorithms require that input data form vectors. However, input data for drug-discovery datasets routinely come in the form of lists of molecules and associated experimental readouts. To load the data, we use a subclass of `dc.data.DataLoader` such as `dc.data.CSVLoader` or `dc.data.SDFLoader`. Users can subclass `dc.data.DataLoader` to load arbitrary file formats. All loaders must be passed a `dc.feat.Featurizer` object, which specifies how to transform molecules into vectors. DeepChem provides a number of different subclasses of `dc.feat.Featurizer`.
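A minimal sketch of this flow for the .csv example above ('solubility.csv' is a hypothetical file name; note that the keyword naming the SMILES column has changed across DeepChem versions, with recent releases using `feature_field` and older ones `smiles_field`):
%% Cell type:code id: tags:
``` python
import deepchem as dc

# Featurize each SMILES string into a 1024-bit circular fingerprint
featurizer = dc.feat.CircularFingerprint(size=1024)
loader = dc.data.CSVLoader(tasks=['measured log solubility in mols per litre'],
                           feature_field='smiles',
                           featurizer=featurizer)
dataset = loader.create_dataset('solubility.csv')
```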
%% Cell type:markdown id: tags:
## Parsing Data
%% Cell type:markdown id: tags:
To read in the data, we will use the pandas data analysis library.
To convert the drug names into SMILES strings, we will use pubchempy. This isn't a standard DeepChem dependency, but you can install it with `conda install pubchempy`.
%% Cell type:code id: tags:
``` python
!conda install pubchempy
```
%% Cell type:code id: tags:
``` python
import os
import pandas as pd
from pubchempy import get_cids, get_compounds
```
%% Cell type:markdown id: tags:
Pandas is powerful, but it doesn't automatically know where to find your data of interest. You will likely have to look at the file first using a GUI.
We will now look at a screenshot of this dataset as rendered by LibreOffice.
To do this, we will import `Image` and `os`.
%% Cell type:code id: tags:
``` python
import os
from IPython.display import Image, display
current_dir = os.path.dirname(os.path.realpath('__file__'))
data_screenshot = os.path.join(current_dir, 'assets/dataset_preparation_gui.png')
display(Image(filename=data_screenshot))
```
%%%% Output: display_data
%% Cell type:markdown id: tags:
We see the data of interest is on the second sheet, contained in the columns "TA ID", "N #1 (%)", and "N #2 (%)".
Additionally, it appears much of this spreadsheet was formatted for human readability (multicolumn headers, column labels with spaces and symbols, etc.). This makes the creation of a neat dataframe object harder. For this reason we will cut everything that is unnecessary or inconvenient.
%% Cell type:code id: tags:
``` python
import deepchem as dc
dc.utils.download_url(
'https://github.com/deepchem/deepchem/raw/master/datasets/Positive%20Modulators%20Summary_%20918.TUC%20_%20v1.xlsx',
current_dir,
'Positive Modulators Summary_ 918.TUC _ v1.xlsx'
)
```
%% Cell type:code id: tags:
``` python
raw_data_file = os.path.join(current_dir, 'Positive Modulators Summary_ 918.TUC _ v1.xlsx')
raw_data_excel = pd.ExcelFile(raw_data_file)
# second sheet only
raw_data = raw_data_excel.parse(raw_data_excel.sheet_names[1])
```
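%% Cell type:markdown id: tags:
If you're not sure which sheet holds the data, you can list the workbook's sheet names first (a quick check using the `ExcelFile` object we just created).
%% Cell type:code id: tags:
``` python
# list all sheet names in the workbook
raw_data_excel.sheet_names
```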
%% Cell type:code id: tags:
``` python
# preview 5 rows of raw dataframe
raw_data.loc[raw_data.index[:5]]
```
%%%% Output: execute_result
Unnamed: 0 Unnamed: 1 Unnamed: 2 Metric #1 (-120 mV Peak) \
0 NaN NaN NaN Vehicle
1 TA ## Position TA ID Mean
2 1 1-A02 Penicillin V Potassium -12.8689
3 2 1-A03 Mycophenolate Mofetil -12.8689
4 3 1-A04 Metaxalone -12.8689
Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7
0 NaN 4 Replications NaN
1 SD Threshold (%) = Mean + 4xSD N #1 (%) N #2 (%)
2 6.74705 14.1193 -10.404 -18.1929
3 6.74705 14.1193 -12.4453 -11.7175
4 6.74705 14.1193 -8.65572 -17.7753
%% Cell type:markdown id: tags:
Note that the actual row headers are stored in row 1 and not 0 above.
%% Cell type:code id: tags:
``` python
# remove the two header rows (0 and 1), since we will supply our own labels
# keep only columns 2, 6, and 7: "TA ID", "N #1 (%)", and "N #2 (%)"
raw_data = raw_data.iloc[2:, [2, 6, 7]]
# reset the index so we keep the old row label but number from 0 again
raw_data.reset_index(inplace=True)
# rename columns
raw_data.columns = ['label', 'drug', 'n1', 'n2']
```
%% Cell type:code id: tags:
``` python
# preview cleaner dataframe
raw_data.loc[raw_data.index[:5]]
```
%%%% Output: execute_result
label drug n1 n2
0 2 Penicillin V Potassium -10.404 -18.1929
1 3 Mycophenolate Mofetil -12.4453 -11.7175
2 4 Metaxalone -8.65572 -17.7753
3 5 Terazosin·HCl -11.5048 16.0825
4 6 Fluvastatin·Na -11.1354 -14.553
%% Cell type:markdown id: tags:
This formatting is closer to what we need.
Now, let's take the drug names and get SMILES strings for them (the format needed for DeepChem).
%% Cell type:code id: tags:
``` python
drugs = raw_data['drug'].values
```
%% Cell type:markdown id: tags:
For many of these, we can retrieve the SMILES string via the `canonical_smiles` attribute of the Compound objects returned by `get_compounds` (from `pubchempy`).
%% Cell type:code id: tags:
``` python
get_compounds(drugs[1], 'name')
```
%%%% Output: execute_result
[Compound(5281078)]
%% Cell type:code id: tags:
``` python
get_compounds(drugs[1], 'name')[0].canonical_smiles
```
%%%% Output: execute_result
'CC1=C2COC(=O)C2=C(C(=C1OC)CC=C(C)CCC(=O)OCCN3CCOCC3)O'
%% Cell type:markdown id: tags:
However, some of these drug names contain variable spacing and symbols (·, (±), etc.) that PubChemPy may not be able to parse.
For this task, we will do a bit of hacking via regular expressions. We also notice that all ions are written in a shortened form that needs to be expanded, so we use a dictionary mapping the shortened ion names to versions recognizable by PubChemPy.
Unfortunately, you may hit several corner cases that require more hacking.
%% Cell type:code id: tags:
``` python
import re

# map shortened ion names to forms PubChemPy can recognize
ion_replacements = {
    'HBr': ' hydrobromide',
    '2Br': ' dibromide',
    'Br': ' bromide',
    'HCl': ' hydrochloride',
    '2H2O': ' dihydrate',
    'H20': ' hydrate',
    'Na': ' sodium'
}

# check longer ion names first so e.g. 'HBr' and '2Br' match before 'Br'
ion_keys = ['H20', 'HBr', 'HCl', '2Br', '2H2O', 'Br', 'Na']

def compound_to_smiles(cmpd):
    # remove spaces and irregular characters
    compound = re.sub(r'([^\s\w]|_)+', '', cmpd)
    # replace shortened ion names if needed
    for ion in ion_keys:
        if ion in compound:
            compound = compound.replace(ion, ion_replacements[ion])
    # query for the CID first in order to avoid a TimeoutError
    cid = get_cids(compound, 'name')[0]
    smiles = get_compounds(cid)[0].canonical_smiles
    return smiles
```
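%% Cell type:markdown id: tags:
As a quick usage example (using a drug name that appears in the preview above), the helper first rewrites the raw name into a PubChemPy-friendly query and then fetches the SMILES string. Note this requires a network call to PubChem.
%% Cell type:code id: tags:
``` python
# '·' is stripped and 'HCl' expands, so the query becomes 'Terazosin hydrochloride'
compound_to_smiles('Terazosin·HCl')
```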
%% Cell type:markdown id: tags:
Now let's actually convert all of these compounds to SMILES. This conversion will take a few minutes, so it's not a bad spot to go grab a coffee or tea while it runs! Note that the conversion will sometimes fail, so we've added error handling to catch those cases below.
%% Cell type:code id: tags:
``` python
smiles_map = {}
for i, compound in enumerate(drugs):
    try:
        smiles_map[compound] = compound_to_smiles(compound)
    except Exception:
        # a few drug names cannot be resolved; record the row index and move on
        print("Errored on %s" % i)
        continue
```
%%%% Output: stream
Errored on 162
Errored on 303
%% Cell type:code id: tags:
``` python
smiles_data = raw_data
# map drug names to SMILES strings, marking failed conversions with None
smiles_data['drug'] = smiles_data['drug'].apply(
    lambda x: smiles_map[x] if x in smiles_map else None)
```
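%% Cell type:markdown id: tags:
Before moving on, it's worth a quick sanity check (a small addition to the workflow above) to count how many names failed to convert; those rows now hold `None` in the drug column.
%% Cell type:code id: tags:
``` python
# number of drugs whose names could not be resolved to SMILES
smiles_data['drug'].isnull().sum()
```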
%% Cell type:code id: tags:
``` python
# preview smiles data
smiles_data.loc[smiles_data.index[:5]]
```
%%%% Output: execute_result
label drug n1 n2
0 2 CC1(C(N2C(S1)C(C2=O)NC(=O)COC3=CC=CC=C3)C(=O)[... -10.404 -18.1929
1 3 CC1=C2COC(=O)C2=C(C(=C1OC)CC=C(C)CCC(=O)OCCN3C... -12.4453 -11.7175
2 4 CC1=CC(=CC(=C1)OCC2CNC(=O)O2)C -8.65572 -17.7753
3 5 COC1=C(C=C2C(=C1)C(=NC(=N2)N3CCN(CC3)C(=O)C4CC... -11.5048 16.0825
4 6 CC(C)N1C2=CC=CC=C2C(=C1C=CC(CC(CC(=O)[O-])O)O)... -11.1354 -14.553
%% Cell type:markdown id: tags:
Hooray, we have mapped each drug name to its corresponding SMILES string.
Now, we need to look at the data and remove as much noise as possible.
%% Cell type:markdown id: tags:
## De-noising data
%% Cell type:markdown id: tags:
In machine learning, we know that there is no free lunch. You will need to spend time analyzing and understanding your data in order to frame your problem and determine the appropriate model framework. The treatment of your data will depend on the conclusions you draw from this process.
Questions to ask yourself:
* What are you trying to accomplish?
* What is your assay?
* What is the structure of the data?
* Does the data make sense?
* What has been tried previously?
For this project (respectively):
* I would like to build a model capable of predicting the affinity of an arbitrary small molecule drug to a particular ion channel protein
* For an input drug, data describing channel inhibition
* A few hundred drugs, with n=2
* Will need to look more closely at the dataset*
* Nothing on this particular protein
%% Cell type:markdown id: tags:
*This will involve plotting, so we will import matplotlib and seaborn; seaborn isn't a standard DeepChem dependency, but you can install it with `conda install seaborn`. We will also need to look at molecular structures, so we will import rdkit.
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('white')
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw, PyMol, rdFMCS
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
import numpy as np
```
%% Cell type:markdown id: tags:
Our goal is to build a small molecule model, so let's make sure our molecules are all small. This can be approximated by the length of each SMILES string.
%% Cell type:code id: tags:
``` python
# use SMILES length as a crude proxy for molecular size
# (failed conversions are None, which we count as length 0)
smiles_data['len'] = [len(i) if i is not None else 0 for i in smiles_data['drug']]
smiles_lens = smiles_data['len'].tolist()
sns.histplot(smiles_lens, stat='probability')
plt.xlabel('len(smiles)')
plt.ylabel('probability')
```
%%%% Output: execute_result
Text(0, 0.5, 'probability')
%%%% Output: display_data
%% Cell type:markdown id: tags:
Some of these look rather large, len(smiles) > 150. Let's see what they look like.
%% Cell type:code id: tags:
``` python
# indices of suspiciously large molecules
suspiciously_large = np.where(np.array(smiles_lens) > 150)[0]
# corresponding SMILES strings
long_smiles = smiles_data.loc[smiles_data.index[suspiciously_large]]['drug'].values
# draw them in a grid
Draw.MolsToGridImage([Chem.MolFromSmiles(i) for i in long_smiles], molsPerRow=6)
```
%%%% Output: execute_result
<PIL.PngImagePlugin.PngImageFile image mode=RGB size=1200x200 at 0x14C4E1C90>
%% Cell type:markdown id: tags:
As suspected, these are not small molecules, so we will remove them from the dataset. The argument here is that these molecules could register as inhibitors simply because they are large: they are more likely to sterically block the channel than to diffuse inside and bind, which is the mechanism we are interested in.
The lesson here is to remove data that does not fit your use case.
%% Cell type:code id: tags:
``` python
# drop large molecules
smiles_data = smiles_data[~smiles_data['drug'].isin(long_smiles)]
```
%% Cell type:markdown id: tags:
Now, let's look at the numerical structure of the dataset.
First, check for NaNs.
%% Cell type:code id: tags:
``` python
# rows with a NaN in any column
nan_rows = smiles_data[smiles_data.isnull().any(axis=1)]
nan_rows[['n1', 'n2']]
```
%%%% Output: execute_result
n1 n2
62 NaN -7.8266
162 -12.8456 -11.4627
175 NaN -6.61225
187 NaN -8.23326
233 -8.21781 NaN
262 NaN -12.8788
288 NaN -2.34264
300 NaN -8.19936
301 NaN -10.4633
303 -5.61374 8.42267
311 NaN -8.78722
%% Cell type:markdown id: tags:
I don't trust measurements with only a single replicate (n=1), so I will throw these rows out.
Then, let's examine the distribution of n1 and n2.
%% Cell type:code id: tags:
``` python
# drop any row with a missing measurement
df = smiles_data.dropna(axis=0, how='any')
# seaborn jointplot will allow us to compare n1 and n2, and plot each marginal
sns.jointplot(x='n1', y='n2', data=df)
```
%%%% Output: execute_result
<seaborn.axisgrid.JointGrid at 0x14c4e37d0>
%%%% Output: display_data