Merge pull request #1753 from deepchem/cleanup (63f1e506) · Commits · 钟慕尧 / deepchem

examples/notebooks/Creating_a_high_fidelity_model_from_experimental_data.ipynb

+29 −1

		%% Cell type:markdown id: tags:

		# Creating a high fidelity dataset from experimental data

		%% Cell type:markdown id: tags:

		Suppose you were given data collected by an experimental collaborator. You would like to use this data to construct a machine learning model.

		How do you transform this data into a dataset capable of creating a useful model?

		%% Cell type:markdown id: tags:

		Building models from novel data can present several challenges. Perhaps the data was not recorded in a convenient manner. Additionally, perhaps the data contains noise. This is a common occurrence with, for example, biological assays due to the large number of external variables and the difficulty and cost associated with collecting multiple samples. This is a problem because you do not want your model to fit to this noise.

		Hence, there are two primary challenges:
		* Parsing data
		* De-noising data

		In this tutorial, we will walk through an example of curating a dataset from an Excel spreadsheet of experimental drug measurements. Before we dive into this example though, let's do a brief review of DeepChem's input file handling and featurization capabilities.

		### Input Formats
		DeepChem supports a wide range of input files. For example, accepted input formats include .csv, .sdf, .fasta, .png, .tif and other file formats. The loading of a particular file format is governed by the `Loader` class associated with that format. For example, with a csv input, we use the `CSVLoader` class under the hood. A .csv file that fits the requirements of `CSVLoader` contains:

		1. A column containing SMILES strings [1].
		2. A column containing an experimental measurement.
		3. (Optional) A column containing a unique compound identifier.

		Here's an example of a potential input file.

		| Compound ID | measured log solubility in mols per litre | smiles |
		|---------------|-------------------------------------------|----------------|
		| benzothiazole | -1.5 | c2ccc1scnc1c2 |


		Here the "smiles" column contains the SMILES string, the "measured log
		solubility in mols per litre" contains the experimental measurement and
		"Compound ID" contains the unique compound identifier.

		[1] Anderson, Eric, Gilman D. Veith, and David Weininger. "SMILES, a line
		notation and computerized interpreter for chemical structures." US
		Environmental Protection Agency, Environmental Research Laboratory, 1987.
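
Before handing a file to a loader, it can help to confirm that it really has the layout described above. The sketch below does this with the standard library only. The `check_layout` helper and its arguments are hypothetical names used for illustration and are not part of DeepChem.

```python
import csv
import io

# The example row from the table above, written as CSV text.
csv_text = (
    "Compound ID,measured log solubility in mols per litre,smiles\n"
    "benzothiazole,-1.5,c2ccc1scnc1c2\n"
)

def check_layout(handle, smiles_col, task_col):
    """Return the parsed rows, raising ValueError if a required column is absent."""
    reader = csv.DictReader(handle)
    missing = {smiles_col, task_col} - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: %s" % sorted(missing))
    rows = list(reader)
    for row in rows:
        float(row[task_col])  # the measurement must parse as a number
    return rows

rows = check_layout(io.StringIO(csv_text), "smiles",
                    "measured log solubility in mols per litre")
print(rows[0]["smiles"])
```

The identifier column is not checked because it is optional.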

		### Data Featurization

		Most machine learning algorithms require that input data form vectors. However, input data for drug-discovery datasets routinely come in the format of lists of molecules and associated experimental readouts. To
		transform lists of molecules into vectors, we need to use subclasses of the DeepChem loader class `dc.data.DataLoader`, such as `dc.data.CSVLoader` or `dc.data.SDFLoader`. Users can subclass `dc.data.DataLoader` to
		load arbitrary file formats. All loaders must be passed a `dc.feat.Featurizer` object. DeepChem provides a number of different subclasses of `dc.feat.Featurizer` for convenience.
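
To make the idea of "molecule in, fixed-length vector out" concrete, here is a deliberately naive sketch. It hashes overlapping character pairs of a SMILES string into a count vector. This is a toy stand-in written for illustration: `toy_fingerprint` is a hypothetical function, it knows nothing about chemistry, and it is not how DeepChem's featurizers work.

```python
import zlib

def toy_fingerprint(smiles, n_bits=64):
    """Hash overlapping character pairs of a SMILES string into a fixed-length count vector."""
    vector = [0] * n_bits
    for i in range(len(smiles) - 1):
        pair = smiles[i:i + 2]
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vector[zlib.crc32(pair.encode("utf-8")) % n_bits] += 1
    return vector

molecules = ["c2ccc1scnc1c2", "CC1=CC(=CC(=C1)OCC2CNC(=O)O2)C"]
features = [toy_fingerprint(s) for s in molecules]
print(len(features), len(features[0]))
```

Whatever the length of the input string, every molecule ends up as a vector of the same size, which is the property a downstream model needs.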

		%% Cell type:markdown id: tags:

		## Parsing data

		%% Cell type:markdown id: tags:

		In order to read in the data, we will use the pandas data analysis library.

		In order to convert the drug names into SMILES strings, we will use pubchempy. This isn't a standard DeepChem dependency, but you can install this library with `pip install pubchempy`.

		%% Cell type:code id: tags:

		``` python
		import os
		import pandas as pd
		from pubchempy import get_cids, get_compounds
		```

		%% Cell type:markdown id: tags:

		Pandas is magic but it doesn't automatically know where to find your data of interest. You likely will have to look at it first using a GUI.

		We will now look at a screenshot of this dataset as rendered by LibreOffice.

		To do this, we will import Image and os.

		%% Cell type:code id: tags:

		``` python
		import os
		from IPython.display import Image, display
		```

		%% Cell type:code id: tags:

		``` python
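		# '__file__' is quoted on purpose: notebooks do not define __file__, so this resolves to the current working directory
		current_dir = os.path.dirname(os.path.realpath('__file__'))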
		```

		%% Cell type:code id: tags:

		``` python
		data_screenshot = os.path.join(current_dir, 'assets/dataset_preparation_gui.png')
		display(Image(filename=data_screenshot))
		```

		%% Output



		%% Cell type:markdown id: tags:

		We see the data of interest is on the second sheet, and contained in columns "TA ID", "N #1 (%)", and "N #2 (%)".

		Additionally, it appears much of this spreadsheet was formatted for human readability (multicolumn headers, column labels with spaces and symbols, etc.). This makes the creation of a neat dataframe object harder. For this reason we will cut everything that is unnecessary or inconvenient.

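		In order to load Excel data, pandas relies on an Excel engine under the hood. At the time of writing this was the `xlrd` library, which you can install with `pip install xlrd`. Note that `xlrd` 2.0 and later no longer read .xlsx files, and recent versions of pandas use `openpyxl` for them instead.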

		%% Cell type:code id: tags:

		``` python
		raw_data_file = os.path.join(current_dir, '../../datasets/Positive Modulators Summary_ 918.TUC _ v1.xlsx')
		raw_data_excel = pd.ExcelFile(raw_data_file)

		# second sheet only
		raw_data = raw_data_excel.parse(raw_data_excel.sheet_names[1])
		```

		%% Cell type:code id: tags:

		``` python
		# preview 5 rows of raw dataframe
		raw_data.loc[raw_data.index[:5]]
		```

		%% Output

		Unnamed: 0 Unnamed: 1 Unnamed: 2 Metric #1 (-120 mV Peak) \
		0 NaN NaN NaN Vehicle
		1 TA ## Position TA ID Mean
		2 1 1-A02 Penicillin V Potassium -12.8689
		3 2 1-A03 Mycophenolate Mofetil -12.8689
		4 3 1-A04 Metaxalone -12.8689

		Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7
		0 NaN 4 Replications NaN
		1 SD Threshold (%) = Mean + 4xSD N #1 (%) N #2 (%)
		2 6.74705 14.1193 -10.404 -18.1929
		3 6.74705 14.1193 -12.4453 -11.7175
		4 6.74705 14.1193 -8.65572 -17.7753

		%% Cell type:markdown id: tags:

		Note that the actual column headers are stored in row 1, not row 0, of the preview above.
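
An alternative to slicing by position is to promote that row to the column labels and then select columns by name. The sketch below does so on a small hand-built frame that imitates the layout of the preview (only four of the columns are reproduced, so this is an assumption about shape rather than the real sheet).

```python
import pandas as pd

# Toy frame imitating the spreadsheet: row 0 is a banner, row 1 holds the real labels.
raw = pd.DataFrame([
    [None, None, "Replications", None],
    ["TA ##", "TA ID", "N #1 (%)", "N #2 (%)"],
    [1, "Penicillin V Potassium", -10.404, -18.1929],
    [2, "Mycophenolate Mofetil", -12.4453, -11.7175],
])

header = raw.iloc[1]            # the row that holds the real column labels
tidy = raw.iloc[2:].copy()      # data rows only
tidy.columns = header.tolist()

# select by name, renumber the rows, and switch to code-friendly labels
tidy = tidy[["TA ID", "N #1 (%)", "N #2 (%)"]].reset_index(drop=True)
tidy.columns = ["drug", "n1", "n2"]
print(tidy)
```

Selecting by name makes the intent easier to read than numeric positions, at the cost of depending on the exact label text in the spreadsheet.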

		%% Cell type:code id: tags:

		``` python
		# drop the two header rows (0 and 1); we will assign our own column names below
		# keep only "TA ID" (column 2), "N #1 (%)" (column 6) and "N #2 (%)" (column 7)
		raw_data = raw_data.iloc[2:, [2, 6, 7]]
		print(raw_data.loc[raw_data.index[:5]])

		# reset the index so rows are numbered from 0 again; the old index is kept as a column
		raw_data.reset_index(inplace=True)

		# rename columns
		raw_data.columns = ['label', 'drug', 'n1', 'n2']
		```

		%% Output

		Unnamed: 2 Unnamed: 6 Unnamed: 7
		2 Penicillin V Potassium -10.404 -18.1929
		3 Mycophenolate Mofetil -12.4453 -11.7175
		4 Metaxalone -8.65572 -17.7753
		5 Terazosin·HCl -11.5048 16.0825
		6 Fluvastatin·Na -11.1354 -14.553

		%% Cell type:code id: tags:

		``` python
		# preview cleaner dataframe
		raw_data.loc[raw_data.index[:5]]
		```

		%% Output

		label drug n1 n2
		0 2 Penicillin V Potassium -10.404 -18.1929
		1 3 Mycophenolate Mofetil -12.4453 -11.7175
		2 4 Metaxalone -8.65572 -17.7753
		3 5 Terazosin·HCl -11.5048 16.0825
		4 6 Fluvastatin·Na -11.1354 -14.553

		%% Cell type:markdown id: tags:

		This formatting is closer to what we need.

		Now, let's take the drug names and get SMILES strings for them (the format DeepChem needs).

		%% Cell type:code id: tags:

		``` python
		drugs = raw_data['drug'].values
		```

		%% Cell type:markdown id: tags:

		For many of these, we can retrieve the SMILES string via the `canonical_smiles` attribute of the `Compound` objects returned by `get_compounds` (using `pubchempy`).

		%% Cell type:code id: tags:

		``` python
		get_compounds(drugs[1], 'name')
		```

		%% Output

		[Compound(5281078)]

		%% Cell type:code id: tags:

		``` python
		get_compounds(drugs[1], 'name')[0].canonical_smiles
		```

		%% Output

		'CC1=C2COC(=O)C2=C(C(=C1OC)CC=C(C)CCC(=O)OCCN3CCOCC3)O'

		%% Cell type:markdown id: tags:

		However, some of these drug names have variable spacing and symbols (·, (±), etc.), and may not be readable by pubchempy.

		For this task, we will do a bit of hacking via regular expressions. Also, we notice that all ions are written in a shortened form that will need to be expanded. For this reason we use a dictionary, mapping the shortened ion names to versions recognizable to pubchempy.

		Unfortunately, you may have several corner cases that will require more hacking.
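
One such corner case is that a plain substring replacement of `Na` would also rewrite a name that merely starts with those letters. A possible way around this, sketched below, is to expand abbreviations only when they follow the `·` separator. The `SALT_NAMES` table and `normalize_name` helper are hypothetical names for illustration, and the example inputs other than those shown in the preview are made up.

```python
import re

# Hypothetical abbreviation table in the spirit of the ion dictionary.
SALT_NAMES = {
    "HCl": "hydrochloride",
    "HBr": "hydrobromide",
    "2Br": "dibromide",
    "Br": "bromide",
    "Na": "sodium",
    "2H2O": "dihydrate",
    "H2O": "hydrate",
}

def normalize_name(raw):
    """Expand salt abbreviations that follow a '·' separator."""
    base, *suffixes = raw.split("·")
    # strip symbols such as (±) from the base name; keep letters, digits and spaces
    base = re.sub(r"[^\w\s]|_", "", base).strip()
    expanded = [SALT_NAMES.get(s.strip(), s.strip()) for s in suffixes]
    return " ".join([base] + expanded)

for name in ["Terazosin·HCl", "Fluvastatin·Na", "Naproxen", "(±)-Verapamil·HCl"]:
    print(name, "->", normalize_name(name))
```

Because each suffix is looked up as a whole token, the order of the dictionary keys no longer matters.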

		%% Cell type:code id: tags:

		``` python
		ion_replacements = {
		    'HBr': ' hydrobromide',
		    '2Br': ' dibromide',
		    'Br': ' bromide',
		    'HCl': ' hydrochloride',
		    '2H2O': ' dihydrate',
		    'H20': ' hydrate',
		    'Na': ' sodium'
		}

		ion_keys = ['H20', 'HBr', 'HCl', '2Br', '2H2O', 'Br', 'Na']
		```

		%% Cell type:code id: tags:

		``` python
		import re
		```

		%% Cell type:code id: tags:

		``` python
		def compound_to_smiles(cmpd):
		    # strip punctuation, symbols and underscores; letters, digits and whitespace are kept
		    compound = re.sub(r'([^\s\w]|_)+', '', cmpd)

		    # replace ion names if needed
		    for ion in ion_keys:
		        if ion in compound:
		            compound = compound.replace(ion, ion_replacements[ion])

		    # query for the cid first in order to avoid a TimeoutError
		    cid = get_cids(compound, 'name')[0]
		    smiles = get_compounds(cid)[0].canonical_smiles

		    return smiles
		```

		%% Cell type:markdown id: tags:

		Now let's actually convert all these compounds to SMILES. This conversion will take a few minutes, so this might not be a bad spot to grab a coffee or tea and take a break while it runs! Note that the conversion will sometimes fail, so we've added some error handling to catch these cases below.
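
If you also want to know why a lookup failed, one option is to record the failures instead of only printing them. The sketch below shows the pattern with a stand-in lookup function so that it runs offline. `convert_all`, `fake_lookup` and `fake_db` are hypothetical names, and the tuple of caught exceptions is an assumption that should be adjusted to whatever your real client raises.

```python
def convert_all(names, lookup):
    """Apply `lookup` to every name, separating successes from failures."""
    converted, failed = {}, {}
    for name in names:
        try:
            converted[name] = lookup(name)
        except (IndexError, KeyError, OSError) as err:
            # e.g. an empty result list (IndexError) or a network problem (OSError)
            failed[name] = repr(err)
    return converted, failed

# Stand-in for a remote lookup (hypothetical data, one entry taken from the preview).
fake_db = {"Metaxalone": "CC1=CC(=CC(=C1)OCC2CNC(=O)O2)C"}

def fake_lookup(name):
    return fake_db[name]

converted, failed = convert_all(["Metaxalone", "Not A Drug"], fake_lookup)
print(sorted(converted), sorted(failed))
```

Keeping the failed names around makes it easy to retry them later or to fix them by hand.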

		%% Cell type:code id: tags:

		``` python
		smiles_map = {}
		for i, compound in enumerate(drugs):
		    print("Converting %s to smiles" % i)
		    try:
		        smiles_map[compound] = compound_to_smiles(compound)
		    except Exception:
		        # lookups can fail for unrecognized names; skip these and move on
		        print("Errored on %s" % i)
		        continue
		```

		%% Output

		Converting 0 to smiles
		Converting 1 to smiles
		Converting 2 to smiles
		...
		Converting 162 to smiles
		Errored on 162
		...
		Converting 303 to smiles
		Errored on 303
		...
		Converting 429 to smiles

		%% Cell type:code id: tags:

		``` python
		smiles_data = raw_data
		# map drug name to smiles string
		smiles_data['drug'] = smiles_data['drug'].apply(lambda x: smiles_map[x] if x in smiles_map else None)
		```

		%% Cell type:code id: tags:

		``` python
		# preview smiles data
		smiles_data.loc[smiles_data.index[:5]]
		```

		%% Output

		label drug n1 n2
		0 2 CC1(C(N2C(S1)C(C2=O)NC(=O)COC3=CC=CC=C3)C(=O)[... -10.404 -18.1929
		1 3 CC1=C2COC(=O)C2=C(C(=C1OC)CC=C(C)CCC(=O)OCCN3C... -12.4453 -11.7175
		2 4 CC1=CC(=CC(=C1)OCC2CNC(=O)O2)C -8.65572 -17.7753
		3 5 COC1=C(C=C2C(=C1)C(=NC(=N2)N3CCN(CC3)C(=O)C4CC... -11.5048 16.0825
		4 6 CC(C)N1C2=CC=CC=C2C(=C1C=CC(CC(CC(=O)[O-])O)O)... -11.1354 -14.553

		%% Cell type:markdown id: tags:

		Hooray, we have mapped each drug name to its corresponding SMILES string (names whose lookup failed are set to None).

		Now, we need to look at the data and remove as much noise as possible.

		%% Cell type:markdown id: tags:

		## De-noising data

		%% Cell type:markdown id: tags:

		In machine learning, we know that there is no free lunch. You will need to spend time analyzing and understanding your data in order to frame your problem and determine the appropriate model framework. Treatment of your data will depend on the conclusions you gather from this process.

		Questions to ask yourself:
		* What are you trying to accomplish?
		* What is your assay?
		* What is the structure of the data?
		* Does the data make sense?
		* What has been tried previously?

		For this project (respectively):
		* I would like to build a model capable of predicting the affinity of an arbitrary small molecule drug to a particular ion channel protein
		* For an input drug, data describing channel inhibition
		* A few hundred drugs, with n=2
		* Will need to look more closely at the dataset*
		* Nothing on this particular protein

		%% Cell type:markdown id: tags:

		*This will involve plotting, so we will import matplotlib and seaborn (you can install the latter with `pip install seaborn`). We will also need to look at molecular structures, so we will import rdkit.

		%% Cell type:code id: tags:

		``` python
		import matplotlib.pyplot as plt
		%matplotlib inline

		import seaborn as sns
		sns.set_style('white')
		```

		%% Cell type:code id: tags:

		``` python
		from rdkit import Chem
		from rdkit.Chem import AllChem
		from rdkit.Chem import Draw, PyMol, rdFMCS
		from rdkit.Chem.Draw import IPythonConsole
		from rdkit import rdBase
		```

		%% Output

		RDKit WARNING: [17:10:53] Enabling RDKit 2019.09.3 jupyter extensions

		%% Cell type:code id: tags:

		``` python
		# numpy is used on occasion for manipulating arrays
		import numpy as np
		```

		%% Cell type:markdown id: tags:

		Our goal is to build a small molecule model, so let's make sure our molecules are all small. This can be approximated by the length of each SMILES string.
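
String length also counts bonds, ring-closure digits and parentheses, so it only tracks molecule size loosely. A slightly closer proxy is to count atom tokens with a regular expression, as sketched below. The `approx_heavy_atoms` helper is a hypothetical function for illustration and is only an approximation; a cheminformatics toolkit would give the exact count.

```python
import re

# Matches bracket atoms, the two-letter halogens, then one-letter organic-subset atoms.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def approx_heavy_atoms(smiles):
    """Rough atom count; bracketed hydrogens such as [H] would be miscounted."""
    return len(ATOM_PATTERN.findall(smiles))

metaxalone = "CC1=CC(=CC(=C1)OCC2CNC(=O)O2)C"
print(len(metaxalone), approx_heavy_atoms(metaxalone))
```

For a quick screen like the one in this tutorial, either proxy is enough to flag the obvious outliers.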

		%% Cell type:code id: tags:

		``` python
		smiles_data['len'] = [len(i) if i is not None else 0 for i in smiles_data['drug']]
		```

		%% Cell type:code id: tags:

		``` python
		smiles_lens = [len(i) if i is not None else 0 for i in smiles_data['drug']]
		sns.distplot(smiles_lens)
		plt.xlabel('len(smiles)')
		plt.ylabel('probability')
		```

		%% Output

		Text(0,0.5,'probability')



		%% Cell type:markdown id: tags:

		Some of these look rather large, len(smiles) > 150. Let's see what they look like.

		%% Cell type:code id: tags:

		``` python
		# indices of large looking molecules
		suspiciously_large = np.where(np.array(smiles_lens) > 150)[0]

		# corresponding smiles string
		long_smiles = smiles_data.loc[smiles_data.index[suspiciously_large]]['drug'].values
		```

		%% Cell type:code id: tags:

		``` python
		# look
		Draw._MolsToGridImage([Chem.MolFromSmiles(i) for i in long_smiles], molsPerRow=6)
		```

		%% Output


		<PIL.PngImagePlugin.PngImageFile image mode=RGB size=1200x200 at 0x11B19C7B8>

		%% Cell type:markdown id: tags:

		As suspected, these are not small molecules, so we will remove them from the dataset. The argument here is that these molecules could register as inhibitors simply because they are large. They are more likely to sterically block the channel than to diffuse inside and bind (which is what we are interested in).

		The lesson here is to remove data that does not fit your use case.

		%% Cell type:code id: tags:

		``` python
		# drop large molecules
		smiles_data = smiles_data[~smiles_data['drug'].isin(long_smiles)]
		```

		%% Cell type:markdown id: tags:

		Now, let's look at the numerical structure of the dataset.

		First, check for NaNs.

		%% Cell type:code id: tags:

		``` python
		nan_rows = smiles_data[smiles_data.isnull().any(axis=1)]
		nan_rows[['n1', 'n2']]
		```

		%% Output

		n1 n2
		62 NaN -7.8266
		162 -12.8456 -11.4627
		175 NaN -6.61225
		187 NaN -8.23326
		233 -8.21781 NaN
		262 NaN -12.8788
		288 NaN -2.34264
		300 NaN -8.19936
		301 NaN -10.4633
		303 -5.61374 8.42267
		311 NaN -8.78722

		%% Cell type:markdown id: tags:

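		Rows 162 and 303 show up because their SMILES lookup failed, which leaves the drug column empty; every other row is missing one of the two replicates. I don't trust n=1, so I will throw all of these rows out.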
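
It is worth being explicit about which columns decide whether a row is dropped. The sketch below uses a small made-up frame (the values are illustrative, not the assay data) to contrast dropping on the replicate columns only with dropping on any missing value.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 1 has no SMILES, rows 2 and 3 each lack one replicate.
toy = pd.DataFrame({
    "drug": ["CCO", None, "CCN", "CCC"],
    "n1": [-8.2, -12.8, np.nan, -5.6],
    "n2": [-7.8, -11.5, -6.6, np.nan],
})

both_replicates = toy.dropna(subset=["n1", "n2"])  # keeps row 1 despite the missing SMILES
fully_complete = toy.dropna(axis=0, how="any")     # also drops row 1

print(list(both_replicates.index), list(fully_complete.index))
```

Since a row without a SMILES string cannot be featurized, dropping on any missing value is the behavior we want here.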

		Then, let's examine the distribution of n1 and n2.

		%% Cell type:code id: tags:

		``` python
		df = smiles_data.dropna(axis=0, how='any')
		```

		%% Cell type:code id: tags:

		``` python
		# seaborn jointplot will allow us to compare n1 and n2, and plot each marginal
		sns.jointplot(x='n1', y='n2', data=smiles_data)
		```

		%% Output

		<seaborn.axisgrid.JointGrid at 0x1a20d49ef0>



		%% Cell type:markdown id: tags:

		We see that most of the data is contained in the Gaussian-ish blob centered a bit below zero. There are a few clearly active datapoints in the bottom left, and one in the top right. These are all distinguished from the majority of the data. How do we handle the data in the blob?

		Because n1 and n2 represent the same measurement, ideally they would have the same value. The plot would then be tightly aligned to the diagonal, and the Pearson correlation coefficient would be 1. We see this is not the case. This helps give us an idea of the error of our assay.
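
For reference, the Pearson coefficient can be computed directly from its definition. The sketch below does so for the first five replicate pairs shown in the preview earlier; five points are far too few to say anything about the assay, so treat this purely as an illustration of the calculation. `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five replicate pairs from the preview of the cleaned dataframe.
n1 = [-10.404, -12.4453, -8.65572, -11.5048, -11.1354]
n2 = [-18.1929, -11.7175, -17.7753, 16.0825, -14.553]

print(pearson_r(n1, n1))  # identical replicates
print(pearson_r(n1, n2))  # the measured replicates
```

Identical replicates give a coefficient of 1, which is the ideal the measured pairs fall short of.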

		Let's look at the error more closely by plotting the distribution of (n1 - n2).

		%% Cell type:code id: tags:

		``` python
		diff_df = df['n1'] - df['n2']

		sns.distplot(diff_df)
		plt.xlabel('difference in n')
		plt.ylabel('probability')
		```

		%% Output

		Text(0,0.5,'probability')



		%% Cell type:markdown id: tags:

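		This looks pretty Gaussian. Let's get an approximate 95% confidence interval by fitting a Gaussian via scipy and taking 2 times the standard deviation, since about 95% of a Gaussian's mass lies within 2 standard deviations of its mean.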
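
The "2 standard deviations" rule of thumb can be checked with the error function, and the maximum-likelihood fit of a Gaussian reduces to the sample mean and the population standard deviation. The sketch below shows both using only the standard library. The numbers in `diffs` are made up for illustration and are not the assay data.

```python
import math
import statistics

def gaussian_coverage(k):
    """Fraction of a normal distribution lying within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Illustrative differences (made-up numbers, not the assay data).
diffs = [7.8, -0.7, 9.1, -27.6, 3.4, -5.2, 1.9, -4.4]

mu = statistics.fmean(diffs)
sigma = statistics.pstdev(diffs)  # population std (ddof=0), the maximum-likelihood estimate
half_width = 2 * sigma

print(round(gaussian_coverage(2), 4))
print(round(gaussian_coverage(1.96), 4))
print(round(mu, 3), round(half_width, 3))
```

Two standard deviations cover slightly more than 95% of the distribution, so using 2 rather than 1.96 is a mildly conservative choice.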

		%% Cell type:code id: tags:

		``` python
		from scipy import stats
		```

		%% Cell type:code id: tags:

		``` python
		mean, std = stats.norm.fit(np.asarray(diff_df, dtype=np.float32))
		```

		%% Cell type:code id: tags:

		``` python
		ci_95 = std*2
		ci_95
		```

		%% Output

		17.75387954711914

		%% Cell type:markdown id: tags:

		Now, I don't trust the data outside of the confidence interval, and will therefore drop these datapoints from df.

		For example, in the plot above, at least one datapoint has n1-n2 > 60. This is disconcerting.

		%% Cell type:code id: tags:

		``` python
		noisy = diff_df[abs(diff_df) > ci_95]
		df = df.drop(noisy.index)
		```

		%% Cell type:code id: tags:

		``` python
		sns.jointplot(x='n1', y='n2', data=df)
		```

		%% Output

		<seaborn.axisgrid.JointGrid at 0x1a211eb5f8>



		%% Cell type:markdown id: tags:

		Now the data looks much better!

		So, let's average n1 and n2, and take the error bar to be ci_95.

		%% Cell type:code id: tags:

		``` python
		# copy the slice so that adding a column does not trigger SettingWithCopyWarning
		avg_df = df[['label', 'drug']].copy()
		avg_df['n'] = df[['n1', 'n2']].mean(axis=1)
		avg_df = avg_df.sort_values('n')
		```

		%% Cell type:markdown id: tags:

		Now, let's look at the sorted data with error bars.

		%% Cell type:code id: tags:

		``` python
		plt.errorbar(np.arange(avg_df.shape[0]), avg_df['n'], yerr=ci_95, fmt='o')
		plt.xlabel('drug, sorted')
		plt.ylabel('activity')
		```

		%% Output

		Text(0,0.5,'activity')



		%% Cell type:markdown id: tags:

		Now, let's identify our active compounds.

		In my case, this required domain knowledge. Having worked in this area, and having consulted with professors specializing in this channel, I am interested in compounds where the absolute value of the activity is greater than 25. This relates to the desired drug potency we would like to model.

		If you are not certain how to draw the line between active and inactive, this cutoff could potentially be treated as a hyperparameter.
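
To get a feel for how sensitive the number of actives is to this choice, you can sweep a few candidate cutoffs. The sketch below does this on made-up activities with an error bar close to the one estimated above. `count_actives` is a hypothetical helper name.

```python
def count_actives(values, error, cutoff):
    """Count measurements whose magnitude exceeds `cutoff` even after subtracting `error`."""
    return sum(1 for v in values if abs(v) - error > cutoff)

# Illustrative averaged activities and error bar (made-up numbers, not the assay data).
activities = [-62.0, -48.5, -12.3, -9.8, 3.1, 51.7, -30.2]
error_bar = 17.75

for cutoff in (10, 25, 40):
    print(cutoff, count_actives(activities, error_bar, cutoff))
```

If the count changes sharply between neighboring cutoffs, the labels are fragile and the cutoff deserves extra scrutiny.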

		%% Cell type:code id: tags:

		``` python
		actives = avg_df[abs(avg_df['n'])-ci_95 > 25]['n']

		plt.errorbar(np.arange(actives.shape[0]), actives, yerr=ci_95, fmt='o')
		```

		%% Output

		<ErrorbarContainer object of 3 artists>



		%% Cell type:code id: tags:

		``` python
		# summary
		print(raw_data.shape, avg_df.shape, len(actives.index))
		```

		%% Output

		(430, 5) (392, 3) 6

		%% Cell type:markdown id: tags:

		In summary, we have:
		* Removed data that did not address the question we hope to answer (small molecules only)
		* Dropped NaNs
		* Determined the noise of our measurements
		* Removed exceptionally noisy datapoints
		* Identified actives (using domain knowledge to determine a threshold)

		%% Cell type:markdown id: tags:

		## Determine model type, final form of dataset, and sanity load

		%% Cell type:markdown id: tags:

		Now, what model framework should we use?

		Given that we have 392 datapoints and 6 actives, this data will be used to build a low-data one-shot classifier (doi:10.1021/acscentsci.6b00367). If there were datasets of similar character, transfer learning could potentially be used, but this is not the case at the moment.


		Let's apply logic to our dataframe in order to cast it into a binary format, suitable for classification.

		%% Cell type:code id: tags:

		``` python
		# 1 if condition for active is met, 0 otherwise
		avg_df['active'] = (abs(avg_df['n'])-ci_95 > 25).astype(int)
		```

		%% Output

		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
		A value is trying to be set on a copy of a slice from a DataFrame.
		Try using .loc[row_indexer,col_indexer] = value instead

		See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


		%% Cell type:markdown id: tags:

		Now, save this to file.

		%% Cell type:code id: tags:

		``` python
		avg_df.to_csv('modulators.csv', index=False)
		```

		%% Cell type:markdown id: tags:

		Now, we will convert this dataframe to a DeepChem dataset.

		%% Cell type:code id: tags:

		``` python
		import deepchem as dc
		```

		%% Output

		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
		warnings.warn(msg, category=FutureWarning)
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint8 = np.dtype([("qint8", np.int8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint16 = np.dtype([("qint16", np.int16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint32 = np.dtype([("qint32", np.int32, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		np_resource = np.dtype([("resource", np.ubyte, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint8 = np.dtype([("qint8", np.int8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint16 = np.dtype([("qint16", np.int16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint32 = np.dtype([("qint32", np.int32, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		np_resource = np.dtype([("resource", np.ubyte, 1)])

		%% Cell type:code id: tags:

		``` python
		dataset_file = 'modulators.csv'
		task = ['active']
		featurizer_func = dc.feat.ConvMolFeaturizer()

		loader = dc.data.CSVLoader(tasks=task, smiles_field='drug', featurizer=featurizer_func)
		dataset = loader.featurize(dataset_file)
		```

		%% Output

		Loading raw samples now.
		shard_size: 8192
		About to start loading CSV from modulators.csv
		Loading shard 1 of size 8192.
		Featurizing sample 0
		TIMING: featurizing shard 0 took 0.689 s
		TIMING: dataset construction took 0.825 s
		Loading dataset from disk.

		%% Cell type:markdown id: tags:

		Lastly, it is often advantageous to numerically transform the data in some way. For example, sometimes it is useful to normalize the data, or to zero the mean. This depends on the task at hand.

		Built into DeepChem are many useful transformers, located in the `deepchem.trans` module.

		Because this is a classification model, and the number of actives is low, I will apply a balancing transformer. I treated this transformer as a hyperparameter when I began training models. It proved to unambiguously improve model performance.
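
The idea behind balancing can be sketched in plain Python: give each sample a weight so that the rare class carries as much total weight as the common one. This is one common weighting scheme, shown with invented labels and a hypothetical helper `balanced_weights`; it is not a copy of how `BalancingTransformer` computes its weights.

```python
from collections import Counter

# Hypothetical binary labels: 2 actives among 8 compounds.
labels = [0, 0, 1, 0, 0, 0, 1, 0]


def balanced_weights(y):
    """Weight each sample so that every class carries the same total weight."""
    counts = Counter(y)
    largest = max(counts.values())
    # Majority-class samples keep weight 1.0; rarer classes are scaled up.
    return [largest / counts[label] for label in y]


weights = balanced_weights(labels)
print(weights)
```

With these weights, a loss that sums over samples no longer rewards a model for simply predicting "inactive" everywhere.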

		%% Cell type:code id: tags:

		``` python
		transformer = dc.trans.BalancingTransformer(transform_w=True, dataset=dataset)
		dataset = transformer.transform(dataset)
		```

		%% Output

		TIMING: dataset construction took 0.160 s
		Loading dataset from disk.

		%% Cell type:markdown id: tags:

		Now let's save the balanced dataset object to disk, and then reload it as a sanity check.

		%% Cell type:code id: tags:

		``` python
		dc.utils.save.save_to_disk(dataset, 'balanced_dataset.joblib')
		balanced_dataset = dc.utils.save.load_from_disk('balanced_dataset.joblib')
		```

		%% Cell type:markdown id: tags:

		Tutorial written by Keri McKiernan (github.com/kmckiern) on September 8, 2016

examples/notebooks/Data Featurization Introduction.ipynb

deleted100644 → 0

+0 −29

Original line number	Diff line number	Diff line
		%% Cell type:markdown id: tags:

		### Input Formats
		DeepChem supports a whole range of input files. For example, accepted input formats for DeepChem include .csv, .sdf, .fasta, .png, .tif and other file formats. The loading for a particular file format is governed by the `Loader` class associated with that format. For example, with a csv input, we use the `CSVLoader` class under the hood. A .csv file that fits the requirements of `CSVLoader` contains:

		1. A column containing SMILES strings [1].
		2. A column containing an experimental measurement.
		3. (Optional) A column containing a unique compound identifier.

		Here's an example of a potential input file.

		\|Compound ID \| measured log solubility in mols per litre \| smiles \|
		\|---------------\|-------------------------------------------\|----------------\|
		\| benzothiazole \| -1.5 \| c2ccc1scnc1c2 \|


		Here the "smiles" column contains the SMILES string, the "measured log
		solubility in mols per litre" contains the experimental measurement and
		"Compound ID" contains the unique compound identifier.
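
To make the column roles concrete, here is a small sketch that reads the example row with Python's standard `csv` module. It only shows which column plays which role; it does not reproduce what `CSVLoader` does internally.

```python
import csv
import io

# The example row from the table above, written out as CSV text.
csv_text = (
    "Compound ID,measured log solubility in mols per litre,smiles\n"
    "benzothiazole,-1.5,c2ccc1scnc1c2\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    compound_id = row["Compound ID"]  # optional unique identifier
    smiles = row["smiles"]  # SMILES string describing the molecule
    measurement = float(row["measured log solubility in mols per litre"])
    print(compound_id, smiles, measurement)
```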

		[1] Anderson, Eric, Gilman D. Veith, and David Weininger. "SMILES, a line
		notation and computerized interpreter for chemical structures." US
		Environmental Protection Agency, Environmental Research Laboratory, 1987.

		### Data Featurization

		Most machine learning algorithms require that input data form vectors. However, input data for drug-discovery datasets routinely come in the format of lists of molecules and associated experimental readouts. To
		transform lists of molecules into vectors, we need to use subclasses of DeepChem's loader class ```dc.data.DataLoader``` such as ```dc.data.CSVLoader``` or ```dc.data.SDFLoader```. Users can subclass ```dc.data.DataLoader``` to
		load arbitrary file formats. All loaders must be passed a ```dc.feat.Featurizer``` object. DeepChem provides a number of different subclasses of ```dc.feat.Featurizer``` for convenience.
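
To illustrate what "turning molecules into vectors" means, the toy class below maps each SMILES string to a fixed-length vector of character counts. `ToyAtomCountFeaturizer` is a hypothetical name invented for this sketch; it is not a DeepChem class, and it ignores real chemistry (for example, it would miscount two-letter elements such as Cl).

```python
class ToyAtomCountFeaturizer:
    """Hypothetical featurizer: counts a few characters in a SMILES string.

    This is NOT a DeepChem class; it only shows the string -> vector idea.
    """

    symbols = ("c", "n", "o", "s")

    def featurize(self, smiles_list):
        # One fixed-length count vector per molecule, case-insensitive.
        return [[smi.lower().count(sym) for sym in self.symbols] for smi in smiles_list]


features = ToyAtomCountFeaturizer().featurize(["c2ccc1scnc1c2", "CCO"])
print(features)
```

Real featurizers encode far richer structure, but they serve the same purpose: every molecule becomes a numerical object that a model can consume.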

examples/notebooks/Synthetic_Feasibility_Scoring.ipynb

+140 −44

File changed.

Preview size limit exceeded, changes collapsed.

examples/notebooks/Deepchem_NumpyDataset_tutorial.ipynb→examples/notebooks/Using_DeepChem_Datasets.ipynb

+34 −37

Original line number	Diff line number	Diff line
		%% Cell type:markdown id: tags:

		# Using Deepchem Datasets
		In this tutorial we will have a look at various DeepChem `dataset` methods present in `deepchem.data.datasets`.

		%% Cell type:code id: tags:

		``` python
		import deepchem as dc
		import numpy as np
		import random
		```

		%% Output

		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
		warnings.warn(msg, category=FutureWarning)
		RDKit WARNING: [17:49:31] Enabling RDKit 2019.09.3 jupyter extensions
		RDKit WARNING: [23:36:48] Enabling RDKit 2019.09.3 jupyter extensions
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint8 = np.dtype([("qint8", np.int8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint16 = np.dtype([("qint16", np.int16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint32 = np.dtype([("qint32", np.int32, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		np_resource = np.dtype([("resource", np.ubyte, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint8 = np.dtype([("qint8", np.int8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint16 = np.dtype([("qint16", np.int16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint32 = np.dtype([("qint32", np.int32, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		np_resource = np.dtype([("resource", np.ubyte, 1)])

		%% Cell type:markdown id: tags:

		# Using NumpyDatasets

		The `dc.data.NumpyDataset` class is used when you have your data in numpy arrays. It provides a simple wrapper around a collection of numpy arrays.

		%% Cell type:code id: tags:

		``` python
		# data is your dataset in a numpy array of size 4x4.
		data = np.random.random((4, 4))
		labels = np.random.random((4,)) # labels of size 4
		```

		%% Cell type:code id: tags:

		``` python
		from deepchem.data.datasets import NumpyDataset # import NumpyDataset
		```

		%% Cell type:code id: tags:

		``` python
		dataset = NumpyDataset(data, labels) # creates numpy dataset object
		dataset
		```

		%% Output

		<deepchem.data.datasets.NumpyDataset at 0x1a3a047a20>
		<deepchem.data.datasets.NumpyDataset at 0x1a38c2beb8>

		%% Cell type:markdown id: tags:

		## Extracting X, y from NumpyDataset Object
		Extracting the data and labels from the NumpyDataset is very easy.

		%% Cell type:code id: tags:

		``` python
		dataset.X # Extracts the data (X) from the NumpyDataset Object
		```

		%% Output

		array([[0.77747145, 0.83258316, 0.76509785, 0.36074566],
		[0.28224673, 0.79519759, 0.93776705, 0.2213494 ],
		[0.54740751, 0.38403327, 0.12592795, 0.94350571],
		[0.02717497, 0.75938816, 0.9477633 , 0.80792975]])
		array([[0.01109257, 0.62277556, 0.02058281, 0.83395641],
		[0.29591717, 0.69673525, 0.67001604, 0.68693823],
		[0.84156563, 0.02039639, 0.83506678, 0.4422977 ],
		[0.39966698, 0.57210768, 0.4434791 , 0.16909073]])

		%% Cell type:code id: tags:

		``` python
		dataset.y # Extracts the labels (y) from the NumpyDataset Object
		```

		%% Output

		array([0.76098746, 0.12423036, 0.24516253, 0.84793405])
		array([0.90696188, 0.45977404, 0.96922696, 0.24167064])

		%% Cell type:markdown id: tags:

		## Weights of a dataset - w
		Apart from `X` and `y`, which are the data and the labels, you can also assign weights `w` to each data instance. The dimension of `w` is the same as that of `y` (which is `Nx1` where `N` is the number of data instances).

		NOTE: By default `w` is a vector initialized with equal weights (all being 1).

		%% Cell type:code id: tags:

		``` python
		dataset.w # printing the weights that are assigned by default. Notice that they are a vector of 1's
		```

		%% Output

		array([[1.],
		[1.],
		[1.],
		[1.]])
		array([1., 1., 1., 1.], dtype=float32)

		%% Cell type:code id: tags:

		``` python
		w = np.random.random((4,)) # initializing weights with a random vector of size 4
		dataset_with_weights = NumpyDataset(data, labels, w) # creates numpy dataset object
		```

		%% Cell type:code id: tags:

		``` python
		dataset_with_weights.w
		```

		%% Output

		array([0.10909932, 0.54252096, 0.70115951, 0.39749864])
		array([0.76645723, 0.44698502, 0.34730918, 0.40243847])

		%% Cell type:markdown id: tags:

		## Iterating over NumpyDataset
		In order to iterate over a NumpyDataset, we use the `itersamples` method. We iterate over 4 quantities, namely `X`, `y`, `w` and `ids`. The first three quantities are the same as discussed above and `ids` is the id of the data instance. By default the id is given in order starting from `0`.
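
The iteration pattern can be mimicked with a small stand-in written in plain Python. `ToyDataset` is a hypothetical class invented for this sketch, not DeepChem's `NumpyDataset`; it only mirrors the idea of yielding one `(x, y, w, id)` tuple per data instance, with equal default weights and ids counted from 0.

```python
class ToyDataset:
    """Simplified stand-in for a dataset of (X, y, w, ids); not DeepChem's NumpyDataset."""

    def __init__(self, X, y, w=None, ids=None):
        n = len(X)
        self.X = X
        self.y = y
        # Default to equal weights and position-based ids when none are given.
        self.w = w if w is not None else [1.0] * n
        self.ids = ids if ids is not None else list(range(n))

    def itersamples(self):
        # Yield one (x, y, w, id) tuple per data instance.
        yield from zip(self.X, self.y, self.w, self.ids)


toy = ToyDataset(X=[[0.1, 0.2], [0.3, 0.4]], y=[0.5, 0.6])
samples = list(toy.itersamples())
for x, y, w, sample_id in samples:
    print(x, y, w, sample_id)
```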

		%% Cell type:code id: tags:

		``` python
		for x, y, w, id in dataset.itersamples():
		    print(x, y, w, id)
		```

		%% Output

		[0.77747145 0.83258316 0.76509785 0.36074566] 0.7609874556128873 1.0 0
		[0.28224673 0.79519759 0.93776705 0.2213494 ] 0.1242303578243128 1.0 1
		[0.54740751 0.38403327 0.12592795 0.94350571] 0.2451625327575474 1.0 2
		[0.02717497 0.75938816 0.9477633 0.80792975] 0.8479340478005098 1.0 3
		[0.01109257 0.62277556 0.02058281 0.83395641] 0.9069618791084421 1.0 0
		[0.29591717 0.69673525 0.67001604 0.68693823] 0.45977403789888005 1.0 1
		[0.84156563 0.02039639 0.83506678 0.4422977 ] 0.9692269600648693 1.0 2
		[0.39966698 0.57210768 0.4434791 0.16909073] 0.24167063686752832 1.0 3

		%% Cell type:markdown id: tags:

		You can also extract the ids by `dataset.ids`. This would return a numpy array consisting of the ids of the data instances.

		%% Cell type:code id: tags:

		``` python
		dataset.ids
		```

		%% Output

		array([0, 1, 2, 3], dtype=object)

		%% Cell type:markdown id: tags:

		## MNIST Example
		Just to get a better understanding, let's read the MNIST data and use `NumpyDataset` to store it.

		%% Cell type:code id: tags:

		``` python
		from tensorflow.examples.tutorials.mnist import input_data
		```

		%% Cell type:code id: tags:

		``` python
		mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
		```

		%% Output

		WARNING:tensorflow:From <ipython-input-14-a839aeb82f4b>:1: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
		WARNING:tensorflow:From <ipython-input-13-a839aeb82f4b>:1: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
		Instructions for updating:
		Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
		WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
		Instructions for updating:
		Please write your own downloading logic.
		WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
		Instructions for updating:
		Please use tf.data to implement this functionality.
		Extracting MNIST_data/train-images-idx3-ubyte.gz
		WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
		Instructions for updating:
		Please use tf.data to implement this functionality.
		Extracting MNIST_data/train-labels-idx1-ubyte.gz
		WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
		Instructions for updating:
		Please use tf.one_hot on tensors.
		Extracting MNIST_data/t10k-images-idx3-ubyte.gz
		Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
		WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
		Instructions for updating:
		Please use alternatives such as official/mnist/dataset.py from tensorflow/models.

		%% Cell type:code id: tags:

		``` python
		# Load the numpy data of MNIST into NumpyDataset
		train = NumpyDataset(mnist.train.images, mnist.train.labels)
		valid = NumpyDataset(mnist.validation.images, mnist.validation.labels)
		```

		%% Cell type:code id: tags:

		``` python
		import matplotlib.pyplot as plt
		```

		%% Cell type:code id: tags:

		``` python
		# Visualize one sample
		sample = np.reshape(train.X[5], (28, 28))
		plt.imshow(sample)
		plt.show()
		```

		%% Output



		%% Cell type:markdown id: tags:

		## Converting a Numpy Array to tf.data.dataset()


		Let's say you want to use the `tf.data` module instead of DeepChem's data handling library. Doing this is straightforward and is quite similar to getting a `NumpyDataset` object from numpy arrays.

		%% Cell type:code id: tags:

		``` python
		import tensorflow as tf
		data_small = np.random.random((4,5))
		label_small = np.random.random((4,))
		dataset = tf.data.Dataset.from_tensor_slices((data_small, label_small))
		print ("Data\n")
		print (data_small)
		print ("\n Labels")
		print (label_small)
		```

		%% Output

		Data

		[[0.09102272 0.07158817 0.85294433 0.72889589 0.00564065]
		[0.26971883 0.51840485 0.69322473 0.85085169 0.11202028]
		[0.14868434 0.83661216 0.32333968 0.64312229 0.44279518]
		[0.15123109 0.3443811 0.04610284 0.66125549 0.26025301]]
		[[0.78113293 0.01674453 0.48489516 0.69356293 0.91605677]
		[0.70025413 0.66522493 0.03279785 0.1810656 0.34951665]
		[0.8357952 0.68600992 0.19022591 0.6087858 0.61117143]
		[0.02318132 0.85849407 0.31825101 0.83070808 0.13985736]]

		Labels
		[0.75613112 0.97179618 0.33262846 0.54677704]
		[0.41737946 0.83331863 0.89246031 0.21424502]

		%% Cell type:markdown id: tags:

		## Extracting the numpy dataset from tf.data

		In order to extract the numpy array from the `tf.data`, you first need to define an `iterator` to iterate over the `tf.data.Dataset` object and then in the tensorflow session, run over the iterator to get the data instances. Let's have a look at how it's done.

		%% Cell type:code id: tags:

		``` python
		iterator = dataset.make_one_shot_iterator() # iterator
		next_element = iterator.get_next()
		numpy_data = np.zeros((4, 5))
		numpy_label = np.zeros((4,))
		sess = tf.Session() # tensorflow session
		for i in range(4):
		    data_, label_ = sess.run(next_element) # data_ contains the data and label_ contains the labels that we fed in the previous step
		    numpy_data[i, :] = data_
		    numpy_label[i] = label_

		print ("Numpy Data")
		print(numpy_data)
		print ("\n Numpy Label")
		print(numpy_label)
		```

		%% Output

		WARNING:tensorflow:From <ipython-input-19-f67e6d094179>:1: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
		WARNING:tensorflow:From <ipython-input-18-f67e6d094179>:1: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
		Instructions for updating:
		Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.
		Numpy Data
		[[0.09102272 0.07158817 0.85294433 0.72889589 0.00564065]
		[0.26971883 0.51840485 0.69322473 0.85085169 0.11202028]
		[0.14868434 0.83661216 0.32333968 0.64312229 0.44279518]
		[0.15123109 0.3443811 0.04610284 0.66125549 0.26025301]]
		[[0.78113293 0.01674453 0.48489516 0.69356293 0.91605677]
		[0.70025413 0.66522493 0.03279785 0.1810656 0.34951665]
		[0.8357952 0.68600992 0.19022591 0.6087858 0.61117143]
		[0.02318132 0.85849407 0.31825101 0.83070808 0.13985736]]

		Numpy Label
		[0.75613112 0.97179618 0.33262846 0.54677704]
		[0.41737946 0.83331863 0.89246031 0.21424502]

		%% Cell type:markdown id: tags:

		Now that you have the numpy arrays of `data` and `labels`, you can convert it to `NumpyDataset`.

		%% Cell type:code id: tags:

		``` python
		dataset_ = NumpyDataset(numpy_data, numpy_label) # convert to NumpyDataset
		dataset_.X # printing just to check if the data is same!!
		```

		%% Output

		array([[0.09102272, 0.07158817, 0.85294433, 0.72889589, 0.00564065],
		[0.26971883, 0.51840485, 0.69322473, 0.85085169, 0.11202028],
		[0.14868434, 0.83661216, 0.32333968, 0.64312229, 0.44279518],
		[0.15123109, 0.3443811 , 0.04610284, 0.66125549, 0.26025301]])
		array([[0.78113293, 0.01674453, 0.48489516, 0.69356293, 0.91605677],
		[0.70025413, 0.66522493, 0.03279785, 0.1810656 , 0.34951665],
		[0.8357952 , 0.68600992, 0.19022591, 0.6087858 , 0.61117143],
		[0.02318132, 0.85849407, 0.31825101, 0.83070808, 0.13985736]])

		%% Cell type:markdown id: tags:

		## Converting NumpyDataset to `tf.data`
		This can be easily done by the `make_iterator()` method of `NumpyDataset`. This converts the `NumpyDataset` to `tf.data`. Let's look at how it's done!

		%% Cell type:code id: tags:

		``` python
		iterator_ = dataset_.make_iterator() # Using make_iterator for converting NumpyDataset to tf.data
		next_element_ = iterator_.get_next()

		sess = tf.Session() # tensorflow session
		data_and_labels = sess.run(next_element_) # data_and_labels[0] holds the data and data_and_labels[1] holds the labels


		print ("Numpy Data")
		print(data_and_labels[0]) # Data in the first index
		print ("\n Numpy Label")
		print(data_and_labels[1]) # Labels in the second index
		```

		%% Output

		WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:494: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
		Instructions for updating:
		tf.py_func is deprecated in TF V2. Instead, there are two
		options available in V2.
		- tf.py_function takes a python function which manipulates tf eager
		tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
		an ndarray (just call tensor.numpy()) but having access to eager tensors
		means `tf.py_function`s can use accelerators such as GPUs as well as
		being differentiable using a gradient tape.
		- tf.numpy_function maintains the semantics of the deprecated tf.py_func
		(it is not differentiable, and manipulates numpy arrays). It drops the
		stateful argument making all functions stateful.

		Numpy Data
		[[0.26971883 0.51840485 0.69322473 0.85085169 0.11202028]
		[0.14868434 0.83661216 0.32333968 0.64312229 0.44279518]
		[0.09102272 0.07158817 0.85294433 0.72889589 0.00564065]
		[0.15123109 0.3443811 0.04610284 0.66125549 0.26025301]]
		[[0.02318132 0.85849407 0.31825101 0.83070808 0.13985736]
		[0.78113293 0.01674453 0.48489516 0.69356293 0.91605677]
		[0.8357952 0.68600992 0.19022591 0.6087858 0.61117143]
		[0.70025413 0.66522493 0.03279785 0.1810656 0.34951665]]

		Numpy Label
		[0.97179618 0.33262846 0.75613112 0.54677704]
		[0.21424502 0.41737946 0.89246031 0.83331863]
