Edits (12b4a58a) · Commits · 钟慕尧 / deepchem

examples/notebooks/Protein_Ligand_Complex_Interactions.ipynb

0 → 100644

+746 −0

File added.

Preview size limit exceeded, changes collapsed.

examples/notebooks/protein_ligand_complex_notebook.ipynb

+3 −148

Original line number	Diff line number	Diff line
		%% Cell type:markdown id: tags:

		# Basic Protein-Ligand Affinity Models
		#Tutorial: Use machine learning to model protein-ligand affinity.

		%% Cell type:markdown id: tags:

		Written by Evan Feinberg and Bharath Ramsundar

		Copyright 2016, Stanford University

		This DeepChem tutorial demonstrates how to use mach.ine learning for modeling protein-ligand binding affinity

		%% Cell type:markdown id: tags:

		Overview:

		In this tutorial, you will trace an arc from loading a raw dataset to fitting a cutting edge ML technique for predicting binding affinities. This will be accomplished by writing simple commands to access the deepchem Python API, encompassing the following broad steps:

		1. Loading a chemical dataset, consisting of a series of protein-ligand complexes.
		2. Featurizing each protein-ligand complexes with various featurization schemes.
		3. Fitting a series of models with these featurized protein-ligand complexes.
		4. Visualizing the results.

		%% Cell type:markdown id: tags:

		First, let's point to a "dataset" file. This can come in the format of a CSV file or Pandas DataFrame. Regardless
		of file format, it must be columnar data, where each row is a molecular system, and each column represents
		a different piece of information about that system. For instance, in this example, every row reflects a
		protein-ligand complex, and the following columns are present: a unique complex identifier; the SMILES string
		of the ligand; the binding affinity (Ki) of the ligand to the protein in the complex; a Python `list` of all lines
		in a PDB file for the protein alone; and a Python `list` of all lines in a ligand file for the ligand alone.

		This should become clearer with the example. (Make sure to set `DISPLAY = True`)

		%% Cell type:code id: tags:

		``` python
		%load_ext autoreload
		%autoreload 2
		%pdb off
		# set DISPLAY = True when running tutorial
		DISPLAY = False
		# set PARALLELIZE to true if you want to use ipyparallel
		PARALLELIZE = False
		import warnings
		warnings.filterwarnings('ignore')
		```

		%% Output

		Automatic pdb calling has been turned OFF

		%% Cell type:code id: tags:

		``` python
		import deepchem as dc
		from deepchem.utils import download_url

		import os

		data_dir = dc.utils.get_data_dir()
		dataset_file = os.path.join(data_dir, "pdbbind_core_df.csv.gz")

		if not os.path.exists(dataset_file):
		print('File does not exist. Downloading file...')
		download_url("https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/pdbbind_core_df.csv.gz")
		print('File downloaded...')

		raw_dataset = dc.utils.save.load_from_disk(dataset_file)
		```

		%% Output

		File does not exist. Downloading file...
		File downloaded...

		%% Cell type:markdown id: tags:

		Let's see what `dataset` looks like:

		%% Cell type:code id: tags:

		``` python
		print("Type of dataset is: %s" % str(type(raw_dataset)))
		print(raw_dataset[:5])
		print("Shape of dataset is: %s" % str(raw_dataset.shape))
		```

		%% Output

		Type of dataset is: <class 'pandas.core.frame.DataFrame'>
		pdb_id smiles \
		0 2d3u CC1CCCCC1S(O)(O)NC1CC(C2CCC(CN)CC2)SC1C(O)O
		1 3cyx CC(C)(C)NC(O)C1CC2CCCCC2C[NH+]1CC(O)C(CC1CCCCC...
		2 3uo4 OC(O)C1CCC(NC2NCCC(NC3CCCCC3C3CCCCC3)N2)CC1
		3 1p1q CC1ONC(O)C1CC([NH3+])C(O)O
		4 3ag9 NC(O)C(CCC[NH2+]C([NH3+])[NH3+])NC(O)C(CCC[NH2...

		complex_id \
		0 2d3uCC1CCCCC1S(O)(O)NC1CC(C2CCC(CN)CC2)SC1C(O)O
		1 3cyxCC(C)(C)NC(O)C1CC2CCCCC2C[NH+]1CC(O)C(CC1C...
		2 3uo4OC(O)C1CCC(NC2NCCC(NC3CCCCC3C3CCCCC3)N2)CC1
		3 1p1qCC1ONC(O)C1CC([NH3+])C(O)O
		4 3ag9NC(O)C(CCC[NH2+]C([NH3+])[NH3+])NC(O)C(CCC...

		protein_pdb \
		0 ['HEADER 2D3U PROTEIN\n', 'COMPND 2D3U P...
		1 ['HEADER 3CYX PROTEIN\n', 'COMPND 3CYX P...
		2 ['HEADER 3UO4 PROTEIN\n', 'COMPND 3UO4 P...
		3 ['HEADER 1P1Q PROTEIN\n', 'COMPND 1P1Q P...
		4 ['HEADER 3AG9 PROTEIN\n', 'COMPND 3AG9 P...

		ligand_pdb \
		0 ['COMPND 2d3u ligand \n', 'AUTHOR GENERA...
		1 ['COMPND 3cyx ligand \n', 'AUTHOR GENERA...
		2 ['COMPND 3uo4 ligand \n', 'AUTHOR GENERA...
		3 ['COMPND 1p1q ligand \n', 'AUTHOR GENERA...
		4 ['COMPND 3ag9 ligand \n', 'AUTHOR GENERA...

		ligand_mol2 label
		0 ['### \n', '### Created by X-TOOL on Thu Aug 2... 6.92
		1 ['### \n', '### Created by X-TOOL on Thu Aug 2... 8.00
		2 ['### \n', '### Created by X-TOOL on Fri Aug 2... 6.52
		3 ['### \n', '### Created by X-TOOL on Thu Aug 2... 4.89
		4 ['### \n', '### Created by X-TOOL on Thu Aug 2... 8.05
		Shape of dataset is: (193, 7)

		%% Cell type:markdown id: tags:

		One of the missions of ```deepchem``` is to form a synapse between the chemical and the algorithmic worlds: to be able to leverage the powerful and diverse array of tools available in Python to analyze molecules. This ethos applies to visual as much as quantitative examination:

		%% Cell type:code id: tags:

		``` python
		import nglview
		import tempfile
		import os
		import mdtraj as md
		import numpy as np

		def combine_mdtraj(protein, ligand):
		chain = protein.topology.add_chain()
		residue = protein.topology.add_residue("LIG", chain, resSeq=1)
		for atom in ligand.topology.atoms:
		protein.topology.add_atom(atom.name, atom.element, residue)
		protein.xyz = np.hstack([protein.xyz, ligand.xyz])
		protein.topology.create_standard_bonds()
		return protein



		def visualize_complex(complex_mdtraj):
		ligand_atoms = [a.index for a in complex_mdtraj.topology.atoms if "LIG" in str(a.residue)]
		binding_pocket_atoms = md.compute_neighbors(complex_mdtraj, 0.5, ligand_atoms)[0]
		binding_pocket_residues = list(set([complex_mdtraj.topology.atom(a).residue.resSeq for a in binding_pocket_atoms]))
		binding_pocket_residues = [str(r) for r in binding_pocket_residues]
		binding_pocket_residues = " or ".join(binding_pocket_residues)

		traj = nglview.MDTrajTrajectory( complex_mdtraj ) # load file from RCSB PDB
		ngltraj = nglview.NGLWidget( traj )
		ngltraj.representations = [
		{ "type": "cartoon", "params": {
		"sele": "protein", "color": "residueindex"
		} },
		{ "type": "licorice", "params": {
		"sele": "(not hydrogen) and (%s)" % binding_pocket_residues
		} },
		{ "type": "ball+stick", "params": {
		"sele": "LIG"
		} }
		]
		return ngltraj

		def visualize_ligand(ligand_mdtraj):
		traj = nglview.MDTrajTrajectory( ligand_mdtraj ) # load file from RCSB PDB
		ngltraj = nglview.NGLWidget( traj )
		ngltraj.representations = [
		{ "type": "ball+stick", "params": {"sele": "all" } } ]
		return ngltraj

		def convert_lines_to_mdtraj(molecule_lines):
		molecule_lines = molecule_lines.strip('[').strip(']').replace("'","").replace("\\n", "").split(", ")
		tempdir = tempfile.mkdtemp()
		molecule_file = os.path.join(tempdir, "molecule.pdb")
		with open(molecule_file, "w") as f:
		for line in molecule_lines:
		f.write("%s\n" % line)
		molecule_mdtraj = md.load(molecule_file)
		return molecule_mdtraj

		first_protein, first_ligand = raw_dataset.iloc[0]["protein_pdb"], raw_dataset.iloc[0]["ligand_pdb"]
		protein_mdtraj = convert_lines_to_mdtraj(first_protein)
		ligand_mdtraj = convert_lines_to_mdtraj(first_ligand)
		complex_mdtraj = combine_mdtraj(protein_mdtraj, ligand_mdtraj)
		```

		%% Cell type:code id: tags:

		``` python
		ngltraj = visualize_complex(complex_mdtraj)
		ngltraj
		```

		%% Output


		%% Cell type:markdown id: tags:

		Now that we're oriented, let's use ML to do some chemistry.

		So, step (2) will entail featurizing the dataset.

		The available featurizations that come standard with deepchem are ECFP4 fingerprints, RDKit descriptors, NNScore-style bdescriptors, and hybrid binding pocket descriptors. Details can be found on ```deepchem.io```.

		%% Cell type:code id: tags:

		``` python
		grid_featurizer = dc.feat.RdkitGridFeaturizer(
		voxel_width=16.0, feature_types="voxel_combined",
		voxel_feature_types=["ecfp", "splif", "hbond", "pi_stack", "cation_pi", "salt_bridge"],
		ecfp_power=5, splif_power=5, parallel=True, flatten=True)
		compound_featurizer = dc.feat.CircularFingerprint(size=128)
		```

		%% Cell type:markdown id: tags:

		Note how we separate our featurizers into those that featurize individual chemical compounds, compound_featurizers, and those that featurize molecular complexes, complex_featurizers.

		Now, let's perform the actual featurization. Calling ```loader.featurize()``` will return an instance of class ```Dataset```. Internally, ```loader.featurize()``` (a) computes the specified features on the data, (b) transforms the inputs into ```X``` and ```y``` NumPy arrays suitable for ML algorithms, and (c) constructs a ```Dataset()``` instance that has useful methods, such as an iterator, over the featurized data. This is a little complicated, so we will use MoleculeNet to featurize the PDBBind core set for us.

		%% Cell type:code id: tags:

		``` python
		seed = 23
		np.random.seed(seed)
		PDBBIND_tasks, (train_dataset, valid_dataset, test_dataset), transformers = dc.molnet.load_pdbbind_grid()
		```

		%% Output

		Loading dataset from disk.
		TIMING: dataset construction took 0.022 s
		Loading dataset from disk.
		TIMING: dataset construction took 0.009 s
		Loading dataset from disk.
		TIMING: dataset construction took 0.009 s
		Loading dataset from disk.

		%% Cell type:markdown id: tags:

		Now, we conduct a train-test split. If you'd like, you can choose `splittype="scaffold"` instead to perform a train-test split based on Bemis-Murcko scaffolds.

		%% Cell type:markdown id: tags:

		We generate separate instances of the Dataset() object to hermetically seal the train dataset from the test dataset. This style lends itself easily to validation-set type hyperparameter searches, which we will illustate in a separate section of this tutorial.

		%% Cell type:markdown id: tags:

		The performance of many ML algorithms hinges greatly on careful data preprocessing. Deepchem comes standard with a few options for such preprocessing.

		%% Cell type:markdown id: tags:

		Now, we're ready to do some learning!

		To fit a deepchem model, first we instantiate one of the provided (or user-written) model classes. In this case, we have a created a convenience class to wrap around any ML model available in Sci-Kit Learn that can in turn be used to interoperate with deepchem. To instantiate an ```SklearnModel```, you will need (a) task_types, (b) model_params, another ```dict``` as illustrated below, and (c) a ```model_instance``` defining the type of model you would like to fit, in this case a ```RandomForestRegressor```.

		%% Cell type:code id: tags:

		``` python
		from sklearn.ensemble import RandomForestRegressor

		sklearn_model = RandomForestRegressor(n_estimators=10, max_features='sqrt')
		sklearn_model.random_state = seed
		model = dc.models.SklearnModel(sklearn_model)
		model.fit(train_dataset)
		```

		%% Cell type:code id: tags:

		``` python
		from deepchem.utils.evaluate import Evaluator
		import pandas as pd

		metric = dc.metrics.Metric(dc.metrics.r2_score)

		evaluator = Evaluator(model, train_dataset, transformers)
		train_r2score = evaluator.compute_model_performance([metric])
		print("RF Train set R^2 %f" % (train_r2score["r2_score"]))

		evaluator = Evaluator(model, valid_dataset, transformers)
		valid_r2score = evaluator.compute_model_performance([metric])
		print("RF Valid set R^2 %f" % (valid_r2score["r2_score"]))
		```

		%% Output

		computed_metrics: [0.82036547270294]
		RF Train set R^2 0.820365
		computed_metrics: [0.35669187005465053]
		RF Valid set R^2 0.356692

		%% Cell type:markdown id: tags:

		In this simple example, in few yet intuitive lines of code, we traced the machine learning arc from featurizing a raw dataset to fitting and evaluating a model.

		Here, we featurized only the ligand. The signal we observed in R^2 reflects the ability of circular fingerprints and random forests to learn general features that make ligands "drug-like."

		%% Cell type:code id: tags:

		``` python
		predictions = model.predict(test_dataset)
		print(predictions)
		```

		%% Output

		[6.198 5.127 6.812 6.104 5.065 6.184 6.265 6.668 5.614 6.408 6.531 7.665
		5.733 6.059 6.259 5.66 6.889 7.827 6.52 6.518]

		%% Cell type:markdown id: tags:

		# The protein-ligand complex view.

		%% Cell type:markdown id: tags:

		The preceding simple example, in few yet intuitive lines of code, traces the machine learning arc from featurizing a raw dataset to fitting and evaluating a model.

		In this next section, we illustrate ```deepchem```'s modularity, and thereby the ease with which one can explore different featurization schemes, different models, and combinations thereof, to achieve the best performance on a given dataset. We will demonstrate this by examining protein-ligand interactions.

		%% Cell type:markdown id: tags:

		In the previous section, we featurized only the ligand. The signal we observed in R^2 reflects the ability of grid fingerprints and random forests to learn general features that make ligands "drug-like." In this section, we demonstrate how to use hyperparameter searching to find a higher scoring ligands.


		%% Cell type:code id: tags:

		``` python
		def rf_model_builder(model_params, model_dir):
		sklearn_model = RandomForestRegressor(**model_params)
		sklearn_model.random_state = seed
		return dc.models.SklearnModel(sklearn_model, model_dir)

		params_dict = {
		"n_estimators": [10, 50, 100],
		"max_features": ["auto", "sqrt", "log2", None],
		}

		metric = dc.metrics.Metric(dc.metrics.r2_score)
		optimizer = dc.hyper.HyperparamOpt(rf_model_builder)
		best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(
		params_dict, train_dataset, valid_dataset, transformers,
		metric=metric)
		```

		%% Output

		Fitting model 1/12
		hyperparameters: {'max_features': 'auto', 'n_estimators': 10}
		computed_metrics: [0.15655050098132517]
		Model 1/12, Metric r2_score, Validation set 0: 0.156551
		best_validation_score so far: 0.156551
		Fitting model 2/12
		hyperparameters: {'max_features': 'auto', 'n_estimators': 50}
		computed_metrics: [0.14911977798860765]
		Model 2/12, Metric r2_score, Validation set 1: 0.149120
		best_validation_score so far: 0.156551
		Fitting model 3/12
		hyperparameters: {'max_features': 'auto', 'n_estimators': 100}
		computed_metrics: [0.17367491804249124]
		Model 3/12, Metric r2_score, Validation set 2: 0.173675
		best_validation_score so far: 0.173675
		Fitting model 4/12
		hyperparameters: {'max_features': 'sqrt', 'n_estimators': 10}
		computed_metrics: [0.35669187005465053]
		Model 4/12, Metric r2_score, Validation set 3: 0.356692
		best_validation_score so far: 0.356692
		Fitting model 5/12
		hyperparameters: {'max_features': 'sqrt', 'n_estimators': 50}
		computed_metrics: [0.14841994477124643]
		Model 5/12, Metric r2_score, Validation set 4: 0.148420
		best_validation_score so far: 0.356692
		Fitting model 6/12
		hyperparameters: {'max_features': 'sqrt', 'n_estimators': 100}
		computed_metrics: [0.13774397492248824]
		Model 6/12, Metric r2_score, Validation set 5: 0.137744
		best_validation_score so far: 0.356692
		Fitting model 7/12
		hyperparameters: {'max_features': 'log2', 'n_estimators': 10}
		computed_metrics: [-0.02403182574818996]
		Model 7/12, Metric r2_score, Validation set 6: -0.024032
		best_validation_score so far: 0.356692
		Fitting model 8/12
		hyperparameters: {'max_features': 'log2', 'n_estimators': 50}
		computed_metrics: [0.14740665686378163]
		Model 8/12, Metric r2_score, Validation set 7: 0.147407
		best_validation_score so far: 0.356692
		Fitting model 9/12
		hyperparameters: {'max_features': 'log2', 'n_estimators': 100}
		computed_metrics: [0.1676204979052549]
		Model 9/12, Metric r2_score, Validation set 8: 0.167620
		best_validation_score so far: 0.356692
		Fitting model 10/12
		hyperparameters: {'max_features': None, 'n_estimators': 10}
		computed_metrics: [0.15655050098132517]
		Model 10/12, Metric r2_score, Validation set 9: 0.156551
		best_validation_score so far: 0.356692
		Fitting model 11/12
		hyperparameters: {'max_features': None, 'n_estimators': 50}
		computed_metrics: [0.14911977798860765]
		Model 11/12, Metric r2_score, Validation set 10: 0.149120
		best_validation_score so far: 0.356692
		Fitting model 12/12
		hyperparameters: {'max_features': None, 'n_estimators': 100}
		computed_metrics: [0.17367491804249124]
		Model 12/12, Metric r2_score, Validation set 11: 0.173675
		best_validation_score so far: 0.356692
		computed_metrics: [0.82036547270294]
		Best hyperparameters: ('sqrt', 10)
		train_score: 0.820365
		validation_score: 0.356692

		%% Cell type:code id: tags:

		``` python
		%matplotlib inline

		import matplotlib
		import numpy as np
		import matplotlib.pyplot as plt

		rf_predicted_test = best_rf.predict(test_dataset)
		rf_true_test = test_dataset.y
		plt.scatter(rf_predicted_test, rf_true_test)
		plt.xlabel('Predicted pIC50s')
		plt.ylabel('True IC50')
		plt.title(r'RF predicted IC50 vs. True pIC50')
		plt.xlim([2, 11])
		plt.ylim([2, 11])
		plt.plot([2, 11], [2, 11], color='k')
		plt.show()
		```

		%% Output

Admin message