Commit 03f1709e authored by Bharath Ramsundar, committed by GitHub

Merge pull request #1492 from VIGS25/mol-feat-tut

[WIP] #1143: Molecular Featurization Tutorial
parents 92dcaf92 027b7fdc
+211 −0
%% Cell type:markdown id: tags:

## Molecular Featurization Tutorial

In this tutorial, we explore the different featurization methods available for molecules. These featurization methods include:

1. `ConvMolFeaturizer`
2. `WeaveFeaturizer`
3. `CircularFingerprint`
4. `RDKitDescriptors`
5. `BPSymmetryFunction`
6. `CoulombMatrix`
7. `CoulombMatrixEig`
8. `AdjacencyFingerprint`

%% Cell type:markdown id: tags:

Let's start with some basic imports.

%% Cell type:code id: tags:

``` python
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals

import numpy as np
from rdkit import Chem

from deepchem.feat import ConvMolFeaturizer, WeaveFeaturizer, CircularFingerprint
from deepchem.feat import AdjacencyFingerprint, RDKitDescriptors
from deepchem.feat import BPSymmetryFunction, CoulombMatrix, CoulombMatrixEig
from deepchem.utils import conformers
```

%% Cell type:markdown id: tags:

We use propane ($CH_3 CH_2 CH_3$) as a running example throughout this tutorial. Many of the featurization methods use conformers of the molecules. A conformer can be generated using the `ConformerGenerator` class in `deepchem.utils.conformers`.

%% Cell type:markdown id: tags:

### RDKitDescriptors

%% Cell type:markdown id: tags:

`RDKitDescriptors` featurizes a molecule by computing descriptor values for a specified set of descriptors. Intrinsic to the featurizer is a set of allowed descriptors, which can be accessed using `RDKitDescriptors.allowedDescriptors`.

The featurizer iterates over the descriptors in `rdkit.Chem.Descriptors.descList`, checks whether each one is in the list of allowed descriptors, and computes the value of every allowed descriptor for the molecule.

%% Cell type:code id: tags:

``` python
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
```

%% Cell type:markdown id: tags:

Let's check the list of allowed descriptors (uncomment the code below when needed).

%% Cell type:code id: tags:

``` python
# for descriptor in RDKitDescriptors.allowedDescriptors:
#     print(descriptor)
```

%% Cell type:code id: tags:

``` python
rdkit_desc = RDKitDescriptors()
features = rdkit_desc._featurize(example_mol)

print('The number of descriptors computed is: ', len(features))
```

%% Cell type:markdown id: tags:

### BPSymmetryFunction

%% Cell type:markdown id: tags:

The Behler-Parrinello symmetry function featurizer, `BPSymmetryFunction`, featurizes a molecule by computing the atomic number and coordinates of each atom in the molecule. The features can be used as input for symmetry functions such as `RadialSymmetry`, `DistanceMatrix` and `DistanceCutoff`. More details on these symmetry functions can be found in [this paper](https://journals.aps.org/prl/pdf/10.1103/PhysRevLett.98.146401). The functions themselves can be found in `deepchem.models.tensorgraph.symmetry_functions`.

The featurizer takes in `max_atoms` as an argument. As input, it takes in a conformer of the molecule and computes:

1. coordinates of every atom in the molecule (in Bohr units)
2. the atomic numbers for all atoms.

These features are concatenated and padded with zeros to account for the different numbers of atoms across molecules.
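
The concatenate-and-pad step can be illustrated with plain NumPy. This is only a minimal sketch of the idea, not the featurizer's actual implementation: the helper name `pad_atom_features`, the toy geometry and the conversion constant are all assumptions introduced for illustration.

```python
import numpy as np

# Approximate length of one Bohr radius in Angstrom (assumed conversion factor)
ANGSTROM_PER_BOHR = 0.52917721


def pad_atom_features(atomic_numbers, coords_angstrom, max_atoms):
    """Stack [Z, x, y, z] per atom (coordinates in Bohr) and zero-pad to max_atoms rows."""
    z = np.asarray(atomic_numbers, dtype=float).reshape(-1, 1)
    xyz = np.asarray(coords_angstrom, dtype=float) / ANGSTROM_PER_BOHR
    features = np.concatenate([z, xyz], axis=1)
    n_atoms = features.shape[0]
    if n_atoms > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    # Append all-zero rows so every molecule yields the same array shape
    return np.pad(features, ((0, max_atoms - n_atoms), (0, 0)), mode="constant")


# Made-up geometry for a three-atom fragment (one carbon, two hydrogens)
toy_numbers = [6, 1, 1]
toy_coords = [[0.0, 0.0, 0.0], [1.09, 0.0, 0.0], [-0.36, 1.03, 0.0]]
toy_features = pad_atom_features(toy_numbers, toy_coords, max_atoms=5)
print(toy_features.shape)  # (5, 4)
```

Rows beyond the real atoms are all zeros, so an atomic number of 0 marks a padded row.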

%% Cell type:code id: tags:

``` python
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)
```

%% Cell type:code id: tags:

``` python
bp_sym = BPSymmetryFunction(max_atoms=20)
features = bp_sym._featurize(mol=example_mol)
```

%% Cell type:markdown id: tags:

A simple check for the featurization would be to count the different atomic numbers present in the features.

%% Cell type:code id: tags:

``` python
from collections import Counter

# The first column holds the atomic number of each atom (0 for padded rows)
atomic_numbers = features[:, 0]
unique_numbers = Counter(atomic_numbers)
print(unique_numbers)
```

%% Cell type:markdown id: tags:

Propane has 3 carbon atoms and 8 hydrogen atoms, which agrees with the counts printed above. The remaining 9 rows are zero padding that brings the total up to `max_atoms`.

%% Cell type:markdown id: tags:

### CoulombMatrix

%% Cell type:markdown id: tags:

`CoulombMatrix` featurizes a molecule by computing the Coulomb matrices for different conformers of the molecule and returning them as a list.

A Coulomb matrix tries to encode the energy structure of a molecule. The matrix is symmetric, with the off-diagonal elements capturing the Coulombic repulsion between pairs of atoms and the diagonal elements capturing atomic energies using the atomic numbers. More information on the functional forms used can be found [here](https://journals.aps.org/prl/pdf/10.1103/PhysRevLett.108.058301).
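
As a rough illustration of that functional form, the sketch below builds a Coulomb matrix with NumPy, using diagonal entries $0.5 Z_i^{2.4}$ and off-diagonal entries $Z_i Z_j / |R_i - R_j|$ as defined in the linked paper. The function name `coulomb_matrix` and the toy coordinates are assumptions made for this example; it is not the featurizer's own code.

```python
import numpy as np


def coulomb_matrix(atomic_numbers, coords):
    """Build a Coulomb matrix from atomic numbers and Cartesian coordinates."""
    z = np.asarray(atomic_numbers, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise interatomic distances |R_i - R_j|
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    # Placeholder on the diagonal to avoid dividing by zero; overwritten below
    np.fill_diagonal(dist, 1.0)
    matrix = np.outer(z, z) / dist
    # Diagonal terms depend only on the atomic numbers
    np.fill_diagonal(matrix, 0.5 * z ** 2.4)
    return matrix


# Made-up three-atom geometry (one oxygen, two hydrogens), distances in Bohr
toy_z = [8, 1, 1]
toy_xyz = [[0.0, 0.0, 0.0], [1.8, 0.0, 0.0], [-0.45, 1.75, 0.0]]
toy_cm = coulomb_matrix(toy_z, toy_xyz)
print(np.allclose(toy_cm, toy_cm.T))  # True: the matrix is symmetric
```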

The featurizer takes in `max_atoms` as an argument and also has options for removing hydrogens from the molecule (`remove_hydrogens`), generating additional random Coulomb matrices (`randomize`), and returning only the upper triangular part of the matrix (`upper_tri`).
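
Because the matrix is symmetric, its upper triangle already carries all the information. The sketch below shows one way to extract it with NumPy; the helper `upper_triangle` is hypothetical, and the exact ordering used by the `upper_tri` option may differ.

```python
import numpy as np


def upper_triangle(matrix):
    """Return the upper-triangular entries (diagonal included) as a flat vector."""
    matrix = np.asarray(matrix)
    return matrix[np.triu_indices_from(matrix)]


# Small symmetric stand-in for a Coulomb matrix
toy_matrix = np.array([[1.0, 2.0, 3.0],
                       [2.0, 4.0, 5.0],
                       [3.0, 5.0, 6.0]])
toy_flat = upper_triangle(toy_matrix)
print(toy_flat)  # [1. 2. 3. 4. 5. 6.]
```

An $n \times n$ matrix yields $n(n+1)/2$ values this way, so a matrix padded to 20 atoms would give 210 features.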

%% Cell type:code id: tags:

``` python
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)

engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)

print("Number of available conformers for propane: ", len(example_mol.GetConformers()))
```

%% Cell type:code id: tags:

``` python
coulomb_mat = CoulombMatrix(max_atoms=20, randomize=False, remove_hydrogens=False, upper_tri=False)
features = coulomb_mat._featurize(mol=example_mol)
```

%% Cell type:markdown id: tags:

A simple check for the featurization is to see whether the feature list has the same length as the number of conformers.

%% Cell type:code id: tags:

``` python
print(len(example_mol.GetConformers()) == len(features))
```

%% Cell type:markdown id: tags:

### CoulombMatrixEig

%% Cell type:markdown id: tags:

`CoulombMatrix` is invariant to molecular rotation and translation, since neither the interatomic distances nor the atomic numbers change. However, the matrix is not invariant to permutations of the atom indices. To deal with this, the `CoulombMatrixEig` featurizer was introduced, which uses the eigenvalue spectrum of the Coulomb matrix and is therefore invariant to permutations of the atom indices.

`CoulombMatrixEig` inherits from `CoulombMatrix` and featurizes a molecule by first computing the Coulomb matrices for different conformers of the molecule and then computing the eigenvalues of each Coulomb matrix. These eigenvalues are then padded to account for variation in the number of atoms across molecules.
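
The permutation invariance can be checked numerically. The sketch below is a minimal illustration under stated assumptions: `padded_eigenvalues` is a hypothetical helper, the matrix values are made up, and sorting the spectrum in descending order is a choice made here that may not match the featurizer's own ordering.

```python
import numpy as np


def padded_eigenvalues(matrix, max_atoms):
    """Eigenvalues of a symmetric matrix, sorted descending and zero-padded."""
    eigvals = np.linalg.eigvalsh(matrix)[::-1]  # eigvalsh returns ascending order
    return np.pad(eigvals, (0, max_atoms - eigvals.size), mode="constant")


# Made-up symmetric matrix standing in for a three-atom Coulomb matrix
toy_cm = np.array([[36.9, 3.2, 1.5],
                   [3.2, 0.5, 0.4],
                   [1.5, 0.4, 0.5]])

# Relabel the atoms: apply the same permutation to rows and columns
perm = [2, 0, 1]
toy_cm_permuted = toy_cm[np.ix_(perm, perm)]

spec = padded_eigenvalues(toy_cm, max_atoms=5)
spec_permuted = padded_eigenvalues(toy_cm_permuted, max_atoms=5)

print(np.array_equal(toy_cm, toy_cm_permuted))  # False: the matrices differ
print(np.allclose(spec, spec_permuted))         # True: the spectra agree
```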

The featurizer takes in `max_atoms` as an argument and also has options for removing hydrogens from the molecule (`remove_hydrogens`) and for generating additional random Coulomb matrices (`randomize`).

%% Cell type:code id: tags:

``` python
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)

engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)

print("Number of available conformers for propane: ", len(example_mol.GetConformers()))
```

%% Cell type:code id: tags:

``` python
coulomb_mat_eig = CoulombMatrixEig(max_atoms=20, randomize=False, remove_hydrogens=False)
features = coulomb_mat_eig._featurize(mol=example_mol)
```

%% Cell type:code id: tags:

``` python
print(len(example_mol.GetConformers()) == len(features))
```

%% Cell type:markdown id: tags:

### Adjacency Fingerprints

%% Cell type:code id: tags:

``` python
```