Commit b518a543 authored by peastman

Merge branch 'master' into tf2

parents a7318f9e 713b8128
+3 −3
@@ -237,12 +237,12 @@ class Metric(object):
       if self.metric.__name__ in [
           "roc_auc_score", "matthews_corrcoef", "recall_score",
           "accuracy_score", "kappa_score", "precision_score",
-          "balanced_accuracy_score", "prc_auc_score"
+          "balanced_accuracy_score", "prc_auc_score", "f1_score"
       ]:
         mode = "classification"
       elif self.metric.__name__ in [
           "pearson_r2_score", "r2_score", "mean_squared_error",
-          "mean_absolute_error", "rms_score", "mae_score"
+          "mean_absolute_error", "rms_score", "mae_score", "pearsonr"
       ]:
         mode = "regression"
       else:
@@ -250,7 +250,7 @@ class Metric(object):
     assert mode in ["classification", "regression"]
     if self.metric.__name__ in [
         "accuracy_score", "balanced_accuracy_score", "recall_score",
-        "matthews_corrcoef"
+        "matthews_corrcoef", "precision_score", "f1_score"
     ] and threshold is None:
       self.threshold = 0.5
     self.mode = mode
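
For context, here's a minimal illustration of what the added names enable (this snippet is not part of the diff; it assumes sklearn's `f1_score` is the metric being wrapped):

``` python
import sklearn.metrics
import deepchem as dc

# With this change, f1_score is recognized by name as a classification
# metric and receives the default 0.5 threshold when none is supplied.
metric = dc.metrics.Metric(sklearn.metrics.f1_score)
assert metric.mode == "classification"
assert metric.threshold == 0.5
```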
+2005 −0

File added.


+0 −29
%% Cell type:markdown id: tags:

### Input Formats
DeepChem supports a whole range of input files. For example, accepted input formats include .csv, .sdf, .fasta, .png, .tif, and other file formats. The loading of a particular file format is governed by the `Loader` class associated with that format. For example, with a .csv input, we use the `CSVLoader` class under the hood. A .csv file that fits the requirements of `CSVLoader` needs:

1. A column containing SMILES strings [1].
2. A column containing an experimental measurement.
3. (Optional) A column containing a unique compound identifier.

Here's an example of a potential input file.

|Compound ID    | measured log solubility in mols per litre | smiles         |
|---------------|-------------------------------------------|----------------|
| benzothiazole | -1.5                                      | c2ccc1scnc1c2  |


Here the "smiles" column contains the SMILES string, the "measured log
solubility in mols per litre" contains the experimental measurement and
"Compound ID" contains the unique compound identifier.

[1] Anderson, Eric, Gilman D. Veith, and David Weininger. "SMILES, a line
notation and computerized interpreter for chemical structures." US
Environmental Protection Agency, Environmental Research Laboratory, 1987.

### Data Featurization

Most machine learning algorithms require that input data be represented as vectors. However, input data for drug-discovery datasets routinely come as lists of molecules with associated experimental readouts. To transform lists of molecules into vectors, we use subclasses of the DeepChem loader class ```dc.data.DataLoader``` such as ```dc.data.CSVLoader``` or ```dc.data.SDFLoader```. Users can subclass ```dc.data.DataLoader``` to load arbitrary file formats. All loaders must be passed a ```dc.feat.Featurizer``` object, and DeepChem provides a number of different subclasses of ```dc.feat.Featurizer``` for convenience.
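
Here's a minimal sketch of a custom featurizer, assuming the convention that ```dc.feat.Featurizer``` subclasses implement `_featurize` on an RDKit molecule (the class name and the single heavy-atom-count feature are hypothetical, purely for illustration):

%% Cell type:code id: tags:

``` python
import numpy as np
import deepchem as dc

class HeavyAtomCountFeaturizer(dc.feat.Featurizer):
  """Toy custom featurizer: one feature, the molecule's heavy-atom count."""

  def _featurize(self, mol):
    # mol is an RDKit Mol; featurize() calls this once per molecule
    # and stacks the per-molecule arrays into a feature matrix.
    return np.array([mol.GetNumHeavyAtoms()])
```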
+42 −15
Original line number Diff line number Diff line
%% Cell type:markdown id: tags:

-# Quantum Machinery with gdb1k
+# Exploring Quantum Chemistry with GDB1k

%% Cell type:markdown id: tags:

Most of the tutorials we've walked you through so far have focused on applications to the drug discovery realm, but DeepChem's tool suite works for molecular design problems generally. In this tutorial, we're going to walk through an example of how to train a simple machine learning model for the task of predicting the atomization energy of a molecule. (Remember that the atomization energy is the energy required to form 1 mol of gaseous atoms from 1 mol of the molecule in its standard state under standard conditions.)

To get started, we'll do a few basic imports.

%% Cell type:code id: tags:

``` python
%load_ext autoreload
%autoreload 2
%pdb off
__author__ = "Joseph Gomes and Bharath Ramsundar"
__copyright__ = "Copyright 2016, Stanford University"
__license__ = "LGPL"

import os
import unittest

import numpy as np
import deepchem as dc
import numpy.random
from deepchem.utils.evaluate import Evaluator
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
```

%% Cell type:markdown id: tags:

Setting up model variables
The next step is to load our dataset. We're using a small dataset we've prepared that's pulled out of the larger GDB benchmarks. The dataset contains atomization energies for 1K small molecules.

%% Cell type:code id: tags:

``` python
featurizer = dc.feat.CoulombMatrixEig(23, remove_hydrogens=False)
tasks = ["atomization_energy"]
dataset_file = "../../datasets/gdb1k.sdf"
smiles_field = "smiles"
mol_field = "mol"
```

%% Cell type:markdown id: tags:

Load featurized data
We now need a way to represent molecules that is useful for predicting atomization energy. This representation draws on foundational work [1] that encodes a molecule's 3D electrostatic structure as a 2D matrix $C$ of charges scaled by inverse distances, where the $ij$-th off-diagonal element is

$C_{ij} = \frac{q_i q_j}{r_{ij}}$

with $q_i$ the nuclear charge of atom $i$ and $r_{ij}$ the distance between atoms $i$ and $j$. If you're observing carefully, you might ask: wait, doesn't this mean that molecules with different numbers of atoms generate matrices of different sizes? In practice the trick to get around this is that the matrices are "zero-padded." That is, if you're making Coulomb matrices for a set of molecules, you pick a maximum number of atoms $N$, make the matrices $N\times N$, and set all the extra entries for each molecule to zero. (There are a couple of extra tricks done under the hood beyond this. Check out reference [1] or read the source code in DeepChem!)
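
To make the padding concrete, here's a small numpy sketch of building one zero-padded Coulomb matrix from charges and coordinates (the function name is hypothetical, and the diagonal convention $0.5\,q_i^{2.4}$ is the one described in reference [1]; DeepChem's built-in featurizers handle all of this for you):

%% Cell type:code id: tags:

``` python
def coulomb_matrix(charges, coords, max_atoms):
  """Illustrative sketch: zero-padded Coulomb matrix for one molecule."""
  n = len(charges)
  C = np.zeros((max_atoms, max_atoms))  # padded entries stay zero
  for i in range(n):
    for j in range(n):
      if i == j:
        # Diagonal convention from reference [1].
        C[i, j] = 0.5 * charges[i] ** 2.4
      else:
        # Off-diagonal: charges scaled by inverse distance.
        C[i, j] = charges[i] * charges[j] / np.linalg.norm(coords[i] - coords[j])
  return C

# Toy example: H2 with a 0.74 Angstrom bond, padded to N = 4.
print(coulomb_matrix(np.array([1.0, 1.0]),
                     np.array([[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]]),
                     max_atoms=4))
```

%% Cell type:markdown id: tags: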

DeepChem has a built in featurization class `dc.feat.CoulombMatrixEig` that can generate these featurizations for you.

%% Cell type:code id: tags:

``` python
featurizer = dc.feat.CoulombMatrixEig(23, remove_hydrogens=False)
```

%% Cell type:markdown id: tags:

Note that in this case, we set the maximum number of atoms to $N = 23$. Let's now load our dataset file into DeepChem. As in the previous tutorials, we use a `Loader` class, in particular `dc.data.SDFLoader` to load our `.sdf` file into DeepChem. The following snippet shows how we do this:

%% Cell type:code id: tags:

``` python
loader = dc.data.SDFLoader(
      tasks=["atomization_energy"], smiles_field="smiles",
      featurizer=featurizer,
      mol_field="mol")
dataset = loader.featurize(dataset_file)
```

%% Cell type:markdown id: tags:

Perform Train, Validation, and Test Split
For the purposes of this tutorial, we're going to do a random split of the dataset into training, validation, and test sets. In general, this split is weak and will considerably overestimate the accuracy of our models, but for a simple tutorial it isn't a bad place to get started.

%% Cell type:code id: tags:

``` python
random_splitter = dc.splits.RandomSplitter()
train_dataset, valid_dataset, test_dataset = random_splitter.train_valid_test_split(dataset)
```

%% Cell type:markdown id: tags:

Transforming datasets
One issue with Coulomb matrix featurizations is that the range of entries in the matrix $C$ can be large: the charge term $q_i q_j / r_{ij}$ varies very widely across a molecule. In general, a wide range of input values can throw off learning for a neural network, and a common fix is to normalize the inputs so that they fall into a more standard range. Recall that the normalization transform applies to each feature $X_i$ of datapoint $X$:

$\hat{X_i} = \frac{X_i - \mu_i}{\sigma_i}$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the $i$-th feature. This transformation enables learning to proceed smoothly. A second point is that the atomization energies also span a wide range, so we apply an analogous normalization transformation to the outputs to scale the energies better. We use DeepChem's transformation API to make this happen:

%% Cell type:code id: tags:

``` python
transformers = [
    dc.trans.NormalizationTransformer(transform_X=True, dataset=train_dataset),
    dc.trans.NormalizationTransformer(transform_y=True, dataset=train_dataset)]

# transform() returns a new dataset rather than modifying in place,
# so rebind the dataset names to keep the transformed versions.
datasets = [train_dataset, valid_dataset, test_dataset]
for i, dataset in enumerate(datasets):
  for transformer in transformers:
      dataset = transformer.transform(dataset)
  datasets[i] = dataset
train_dataset, valid_dataset, test_dataset = datasets
```
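
%% Cell type:markdown id: tags:

As a quick sanity check (an optional snippet, not part of the original walkthrough; it assumes the dataset comfortably fits in memory), the transformed training features should now be roughly standardized:

%% Cell type:code id: tags:

``` python
# Roughly 0 for each feature; constant (zero-padded) features stay 0.
print(np.mean(train_dataset.X, axis=0)[:5])
# Roughly 1 for non-constant features.
print(np.std(train_dataset.X, axis=0)[:5])
```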

%% Cell type:markdown id: tags:

Fit Random Forest with hyperparameter search
Now that we have the data cleanly transformed, let's do some simple machine learning. We'll start by constructing a random forest on top of the data. We'll use DeepChem's hyperparameter tuning module to do this.

%% Cell type:code id: tags:

``` python
def rf_model_builder(model_params, model_dir):
  sklearn_model = RandomForestRegressor(**model_params)
  return dc.models.SklearnModel(sklearn_model, model_dir)

params_dict = {
    "n_estimators": [10, 100],
    "max_features": ["auto", "sqrt", "log2", None],
}

metric = dc.metrics.Metric(dc.metrics.mean_absolute_error)
optimizer = dc.hyper.HyperparamOpt(rf_model_builder)
best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, transformers,
    metric=metric)
```
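
%% Cell type:markdown id: tags:

To see how the best random forest does on held-out data, we can use the `Evaluator` imported at the top of the notebook (a brief sketch, not part of the original walkthrough; the same pattern applies to the kernel ridge model trained below):

%% Cell type:code id: tags:

``` python
rf_test_evaluator = Evaluator(best_rf, test_dataset, transformers)
rf_test_scores = rf_test_evaluator.compute_model_performance([metric])
print("RF test set MAE: %s" % str(rf_test_scores))
```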

%% Cell type:markdown id: tags:

Let's build one more model, a kernel ridge regression, on top of the same featurized data.

%% Cell type:code id: tags:

``` python
def krr_model_builder(model_params, model_dir):
  sklearn_model = KernelRidge(**model_params)
  return dc.models.SklearnModel(sklearn_model, model_dir)

params_dict = {
    "kernel": ["laplacian"],
    "alpha": [0.0001],
    "gamma": [0.0001]
}

metric = dc.metrics.Metric(dc.metrics.mean_absolute_error)
optimizer = dc.hyper.HyperparamOpt(krr_model_builder)
best_krr, best_krr_hyperparams, all_krr_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, transformers,
    metric=metric)
```

%% Cell type:markdown id: tags:

**Bibliography:**

[1] https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.98.146401