Commit 15a1cf1b authored by Arun's avatar Arun
Browse files

moved asset to assets [skip ci]

parent 57a0a5a7
......@@ -14,7 +14,7 @@
"\n",
"Mariia Matveieva, Pavel Polishchuk. Institute of Molecular and Translational Medicine, Palacky University, Olomouc, Czech Republic.\n",
"\n",
"<img src=\"atomic_contributions_tutorial_data/index.png\">\n",
"<img src=\"assets/atomic_contributions_tutorial_data/index.png\">\n",
"\n",
"## Colab\n",
"\n",
......@@ -80,7 +80,7 @@
"from rdkit.Chem.Draw import SimilarityMaps\n",
"import tensorflow as tf\n",
"\n",
"DATASET_FILE ='atomic_contributions_tutorial_data/logBB.sdf'\n",
"DATASET_FILE ='assets/atomic_contributions_tutorial_data/logBB.sdf'\n",
"# Create RDKit mol objects, since we will need them later.\n",
"mols = [m for m in Chem.SDMolSupplier(DATASET_FILE) if m is not None ]\n",
"loader = dc.data.SDFLoader(tasks=[\"logBB_class\"], \n",
......@@ -152,7 +152,7 @@
}
],
"source": [
"TEST_DATASET_FILE = 'atomic_contributions_tutorial_data/logBB_test_.sdf'\n",
"TEST_DATASET_FILE = 'assets/atomic_contributions_tutorial_data/logBB_test_.sdf'\n",
"loader = dc.data.SDFLoader(tasks=[\"p_np\"], sanitize=True,\n",
" featurizer=dc.feat.ConvMolFeaturizer())\n",
"test_dataset = loader.create_dataset(TEST_DATASET_FILE, shard_size=2000)\n",
......@@ -619,7 +619,7 @@
"metadata": {},
"outputs": [],
"source": [
"DATASET_FILE ='atomic_contributions_tutorial_data/Tetrahymena_pyriformis_Work_set_OCHEM.sdf'\n",
"DATASET_FILE ='assets/atomic_contributions_tutorial_data/Tetrahymena_pyriformis_Work_set_OCHEM.sdf'\n",
"# create RDKit mol objects, we will need them later\n",
"mols = [m for m in Chem.SDMolSupplier(DATASET_FILE) if m is not None ]\n",
"loader = dc.data.SDFLoader(tasks=[\"IGC50\"], \n",
......@@ -683,7 +683,7 @@
}
],
"source": [
"TEST_DATASET_FILE = 'atomic_contributions_tutorial_data/Tetrahymena_pyriformis_Test_set_OCHEM.sdf'\n",
"TEST_DATASET_FILE = 'assets/atomic_contributions_tutorial_data/Tetrahymena_pyriformis_Test_set_OCHEM.sdf'\n",
"loader = dc.data.SDFLoader(tasks=[\"IGC50\"], sanitize= True,\n",
" featurizer=dc.feat.ConvMolFeaturizer())\n",
"test_dataset = loader.create_dataset(TEST_DATASET_FILE, shard_size=2000)\n",
%% Cell type:markdown id: tags:
 
# Calculating Atomic Contributions for Molecules Based on a Graph Convolutional QSAR Model
 
In an earlier tutorial we introduced the concept of model interpretability: understanding why a model produced the result it did. In this tutorial we will learn about atomic contributions, a useful tool for interpreting models that operate on molecules.
 
The idea is simple: remove a single atom from the molecule and see how the model's prediction changes. The "atomic contribution" for an atom is defined as the difference in activity between the whole molecule, and the fragment remaining after atom removal. It is a measure of how much that atom affects the prediction.
 
Contributions are also known as "attributions", "coloration", etc. in the literature. This is a model interpretation method [1], analogous to Similarity maps [2] in the QSAR domain, or occlusion methods in other fields (image classification, etc). Present implementation was used in [4].
 
Mariia Matveieva, Pavel Polishchuk. Institute of Molecular and Translational Medicine, Palacky University, Olomouc, Czech Republic.
 
<img src="atomic_contributions_tutorial_data/index.png">
<img src="assets/atomic_contributions_tutorial_data/index.png">
 
## Colab
 
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.
 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Atomic_Contributions_for_Molecules.ipynb)
 
## Setup
 
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine.
 
%% Cell type:code id: tags:
 
``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```
 
%% Cell type:code id: tags:
 
``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```
 
%% Cell type:markdown id: tags:
 
## A classification QSAR model for blood-brain barrier permeability
 
BBB permeability is the ability of compounds to enter the central nervous system. Here we use a dataset of relatively small compounds which are transported by diffusion without any carriers. The property is defined as log10(concentration in brain / concentration in blood). Compounds with a positive value (and 0) are labeled active, and others are labeled inactive. After modelling we will identify atoms favorable and unfavorable for diffusion.
 
First let's create the dataset. The molecules are stored in an SDF file.
 
%% Cell type:code id: tags:
 
``` python
import pandas as pd
import deepchem as dc
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw, PyMol, rdFMCS
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
from deepchem import metrics
from IPython.display import Image, display
from rdkit.Chem.Draw import SimilarityMaps
import tensorflow as tf
 
DATASET_FILE ='atomic_contributions_tutorial_data/logBB.sdf'
DATASET_FILE ='assets/atomic_contributions_tutorial_data/logBB.sdf'
# Create RDKit mol objects, since we will need them later.
mols = [m for m in Chem.SDMolSupplier(DATASET_FILE) if m is not None ]
loader = dc.data.SDFLoader(tasks=["logBB_class"],
featurizer=dc.feat.ConvMolFeaturizer(),
sanitize=True)
dataset = loader.create_dataset(DATASET_FILE, shard_size=2000)
```
 
%% Cell type:markdown id: tags:
 
Now let's build and train a GraphConvModel.
 
%% Cell type:code id: tags:
 
``` python
np.random.seed(2020)
tf.random.set_seed(2020)
```
 
%% Cell type:code id: tags:
 
``` python
m = dc.models.GraphConvModel(1, mode="classification", batch_normalize=False, batch_size=100)
m.fit(dataset, nb_epoch=10)
```
 
%%%% Output: execute_result
 
0.5348200480143229
 
%% Cell type:markdown id: tags:
 
Let's load a test set and see how well it works.
 
%% Cell type:code id: tags:
 
``` python
TEST_DATASET_FILE = 'atomic_contributions_tutorial_data/logBB_test_.sdf'
TEST_DATASET_FILE = 'assets/atomic_contributions_tutorial_data/logBB_test_.sdf'
loader = dc.data.SDFLoader(tasks=["p_np"], sanitize=True,
featurizer=dc.feat.ConvMolFeaturizer())
test_dataset = loader.create_dataset(TEST_DATASET_FILE, shard_size=2000)
pred = m.predict(test_dataset)
pred = np.argmax(np.squeeze(pred),axis=1)
ba = metrics.balanced_accuracy_score(y_true=test_dataset.y, y_pred=pred)
print(ba)
```
 
%%%% Output: stream
 
0.7444444444444445
 
%% Cell type:markdown id: tags:
 
The balanced accuracy is high enough. Now let's proceed to model interpretation and estimate the contributions of individual atoms to the prediction.
 
%% Cell type:markdown id: tags:
 
## A fragment dataset
 
Now let's prepare a dataset of fragments based on the training set. (Any other unseen data set of interest can also be used). These fragments will be used to evaluate the contributions of individual atoms.
 
For each molecule we will generate a list of ConvMol objects. Specifying `per_atom_fragmentation=True` tells it to iterate over all heavy atoms and featurize a single-atom-depleted version of the molecule with each one removed.
 
%% Cell type:code id: tags:
 
``` python
loader = dc.data.SDFLoader(tasks=[],# dont need task (moreover, passing the task can lead to inconsitencies in data shapes)
featurizer=dc.feat.ConvMolFeaturizer(per_atom_fragmentation=True),
sanitize=True)
frag_dataset = loader.create_dataset(DATASET_FILE, shard_size=5000)
```
 
%% Cell type:markdown id: tags:
 
The dataset still has the same number of samples as the original training set, but each sample is now represented as a list of ConvMol objects (one for each fragment) rather than a single ConvMol.
 
IMPORTANT: The order of fragments depends on the input format. If SDF, the fragment order is the same as the atom order in corresponding mol blocks. If SMILES (i.e. csv with molecules represented as SMILES), then the order is given by RDKit CanonicalRankAtoms
 
%% Cell type:code id: tags:
 
``` python
print(frag_dataset.X.shape)
```
 
%%%% Output: stream
 
(298,)
 
%% Cell type:markdown id: tags:
 
We really want to treat each fragment as a separate sample. We can use a FlatteningTransformer to flatten the fragments lists.
 
%% Cell type:code id: tags:
 
``` python
tr = dc.trans.FlatteningTransformer(frag_dataset)
frag_dataset = tr.transform(frag_dataset)
print(frag_dataset.X.shape)
```
 
%%%% Output: stream
 
(5111,)
 
%% Cell type:markdown id: tags:
 
## Predicting atomic contributions to activity
 
Now we will predict the activity for molecules and for their fragments. Then, for each fragment, we'll find the activity difference: the change in activity when removing one atom.
 
Note: Here, in classification context, we use the probability output of the model as the activity. So the contribution is the probability difference, i.e. "how much a given atom increases/decreases the probability of the molecule being active."
 
%% Cell type:code id: tags:
 
``` python
# whole molecules
pred = np.squeeze(m.predict(dataset))[:, 1] # probabilitiy of class 1
pred = pd.DataFrame(pred, index=dataset.ids, columns=["Molecule"]) # turn to dataframe for convinience
 
# fragments
pred_frags = np.squeeze(m.predict(frag_dataset))[:, 1]
pred_frags = pd.DataFrame(pred_frags, index=frag_dataset.ids, columns=["Fragment"])
```
 
%% Cell type:markdown id: tags:
 
We take the difference to find the atomic contributions.
 
%% Cell type:code id: tags:
 
``` python
# merge 2 dataframes by molecule names
df = pd.merge(pred_frags, pred, right_index=True, left_index=True)
# find contribs
df['Contrib'] = df["Molecule"] - df["Fragment"]
```
 
%% Cell type:code id: tags:
 
``` python
df
```
 
%%%% Output: execute_result
 
Fragment Molecule Contrib
C#CC1(O)CCC2C3C(C)CC4=C(CCC(=O)C4)C3CCC21C 0.756531 0.811546 0.055015
C#CC1(O)CCC2C3C(C)CC4=C(CCC(=O)C4)C3CCC21C 0.752750 0.811546 0.058796
C#CC1(O)CCC2C3C(C)CC4=C(CCC(=O)C4)C3CCC21C 0.747007 0.811546 0.064539
C#CC1(O)CCC2C3C(C)CC4=C(CCC(=O)C4)C3CCC21C 0.815875 0.811546 -0.004329
C#CC1(O)CCC2C3C(C)CC4=C(CCC(=O)C4)C3CCC21C 0.741801 0.811546 0.069745
... ... ... ...
c1cncc(C2CCCN2)c1 0.780478 0.813036 0.032558
c1cncc(C2CCCN2)c1 0.722650 0.813036 0.090386
c1cncc(C2CCCN2)c1 0.721609 0.813036 0.091427
c1cncc(C2CCCN2)c1 0.683302 0.813036 0.129734
c1cncc(C2CCCN2)c1 0.674457 0.813036 0.138578
[5111 rows x 3 columns]
 
%% Cell type:markdown id: tags:
 
We can use the SimilarityMaps feature of RDKit to visualize the results. Each atom is colored by how it affects activity.
 
%% Cell type:code id: tags:
 
``` python
def vis_contribs(mols, df, smi_or_sdf = "sdf"):
# input format of file, which was used to create dataset determines the order of atoms,
# so we take it into account for correct mapping!
maps = []
for mol in mols:
wt = {}
if smi_or_sdf == "smi":
for n,atom in enumerate(Chem.rdmolfiles.CanonicalRankAtoms(mol)):
wt[atom] = df.loc[mol.GetProp("_Name"),"Contrib"][n]
if smi_or_sdf == "sdf":
for n,atom in enumerate(range(mol.GetNumHeavyAtoms())):
wt[atom] = df.loc[Chem.MolToSmiles(mol),"Contrib"][n]
maps.append(SimilarityMaps.GetSimilarityMapFromWeights(mol,wt))
return maps
```
 
%% Cell type:markdown id: tags:
 
Let's look at some pictures:
 
%% Cell type:code id: tags:
 
``` python
np.random.seed(2000)
maps = vis_contribs(np.random.choice(np.array(mols),10), df)
```
 
%%%% Output: display_data
 
 
%%%% Output: display_data
 
 
%%%% Output: display_data
 
 
%%%% Output: display_data
 
 
%%%% Output: display_data
 
 
%%%% Output: display_data
 
 
%%%% Output: display_data
 
 
%%%% Output: display_data
 
 
%%%% Output: display_data
 
 
%%%% Output: display_data
 
 
%% Cell type:markdown id: tags:
 
We can see that aromatics or aliphatics have a positive impact on blood-brain barrier permeability, while polar or charged heteroatoms have a negative influence. This is generally consistent with literature data.
 
%% Cell type:markdown id: tags:
 
## A regression task
 
The example above used a classification model. The same techniques can also be used for regression models. Let's look at a regression task, aquatic toxicity (towards the water organism T. pyriformis).
 
Toxicity is defined as log10(IGC50) (concentration that inhibits colony growth by 50%). Toxicophores for T. pyriformis will be identified by atomic contributions.
 
All the above steps are the same: load data, featurize, build a model, create dataset of fragments, find contributions, and visualize them.
 
Note: this time as it is regression, contributions will be in activity units, not probability.
 
%% Cell type:code id: tags: