Commit 63f1e506 (unverified), authored by Bharath Ramsundar, committed by GitHub

Merge pull request #1753 from deepchem/cleanup

Continue Cleanup of Notebooks
parents 03f1709e 18f2a2a6
+29 −1
%% Cell type:markdown id: tags:

# Creating a high fidelity dataset from experimental data

%% Cell type:markdown id: tags:

Suppose you were given data collected by an experimental collaborator.  You would like to use this data to construct a machine learning model.

*How do you transform this data into a dataset capable of creating a useful model?*

%% Cell type:markdown id: tags:

Building models from novel data can present several challenges. The data may not have been recorded in a convenient format, and it may contain noise. Noise is common in, for example, biological assays, because of the large number of external variables and the difficulty and cost of collecting multiple samples. This is a problem because you do not want your model to fit this noise.

Hence, there are two primary challenges:
* Parsing data
* De-noising data

In this tutorial, we will walk through an example of curating a dataset from an Excel spreadsheet of experimental drug measurements. Before we dive into this example though, let's do a brief review of DeepChem's input file handling and featurization capabilities.

### Input Formats
DeepChem supports a whole range of input files. For example, accepted input formats include .csv, .sdf, .fasta, .png, .tif and other file formats. The loading for a particular file format is governed by the `Loader` class associated with that format. For example, with a csv input, we use the `CSVLoader` class under the hood. A .csv file that fits the requirements of `CSVLoader` contains:

1. A column containing SMILES strings [1].
2. A column containing an experimental measurement.
3. (Optional) A column containing a unique compound identifier.

Here's an example of a potential input file.

|Compound ID    | measured log solubility in mols per litre | smiles         |
|---------------|-------------------------------------------|----------------|
| benzothiazole | -1.5                                      | c2ccc1scnc1c2  |


Here the "smiles" column contains the SMILES string, the "measured log
solubility in mols per litre" contains the experimental measurement and
"Compound ID" contains the unique compound identifier.

[1] Anderson, Eric, Gilman D. Veith, and David Weininger. "SMILES, a line
notation and computerized interpreter for chemical structures." US
Environmental Protection Agency, Environmental Research Laboratory, 1987.
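As a rough illustration of the file layout described in this section, the sketch below parses the sample row with Python's standard `csv` module and checks that the expected columns are present. It does not use DeepChem's `CSVLoader`; the helper name `check_columns` and the in-memory file are assumptions made for this example only.

```python
import csv
import io

# The sample row from the table above, written as CSV text.
sample_csv = (
    "Compound ID,measured log solubility in mols per litre,smiles\n"
    "benzothiazole,-1.5,c2ccc1scnc1c2\n"
)

def check_columns(csv_text, required):
    """Parse CSV text and return its rows; raise if a required column is absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [name for name in required if name not in reader.fieldnames]
    if missing:
        raise ValueError("missing columns: %s" % missing)
    return list(reader)

rows = check_columns(
    sample_csv, ["smiles", "measured log solubility in mols per litre"]
)
print(rows[0]["smiles"], float(rows[0]["measured log solubility in mols per litre"]))
```

A check like this can catch a misnamed column before any featurization is attempted.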

### Data Featurization

Most machine learning algorithms require that input data form vectors. However, input data for drug-discovery datasets routinely come in the format of lists of molecules and associated experimental readouts. To transform lists of molecules into vectors, we need to use subclasses of the DeepChem loader class `dc.data.DataLoader` such as `dc.data.CSVLoader` or `dc.data.SDFLoader`. Users can subclass `dc.data.DataLoader` to load arbitrary file formats. All loaders must be passed a `dc.feat.Featurizer` object. DeepChem provides a number of different subclasses of `dc.feat.Featurizer` for convenience.

%% Cell type:markdown id: tags:

## Parsing data

%% Cell type:markdown id: tags:

In order to read in the data, we will use the pandas data analysis library.

In order to convert the drug names into smiles strings, we will use pubchempy. This isn't a standard DeepChem dependency, but you can install this library with `pip install pubchempy`.

%% Cell type:code id: tags:

``` python
import os
import pandas as pd
from pubchempy import get_cids, get_compounds
```

%% Cell type:markdown id: tags:

Pandas is magic but it doesn't automatically know where to find your data of interest.  You likely will have to look at it first using a GUI.

We will now look at a screenshot of this dataset as rendered by LibreOffice.

To do this, we will import Image and os.

%% Cell type:code id: tags:

``` python
import os
from IPython.display import Image, display
```

%% Cell type:code id: tags:

``` python
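# '__file__' is quoted on purpose: notebooks do not define __file__, so this resolves to the working directory
current_dir = os.path.dirname(os.path.realpath('__file__'))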
```

%% Cell type:code id: tags:

``` python
data_screenshot = os.path.join(current_dir, 'assets/dataset_preparation_gui.png')
display(Image(filename=data_screenshot))
```

%% Output


%% Cell type:markdown id: tags:

We see the data of interest is on the second sheet, and contained in columns "TA ID", "N #1 (%)", and "N #2 (%)".

Additionally, it appears much of this spreadsheet was formatted for human readability (multicolumn headers, column labels with spaces and symbols, etc.). This makes the creation of a neat dataframe object harder. For this reason we will cut everything that is unnecessary or inconvenient.

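In order to load Excel data in Python, pandas uses the `xlrd` library under the hood. You can install this library with `pip install xlrd`. (Recent pandas releases read .xlsx files through `openpyxl` instead, so you may need `pip install openpyxl`.)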

%% Cell type:code id: tags:

``` python
raw_data_file = os.path.join(current_dir, '../../datasets/Positive Modulators Summary_ 918.TUC _ v1.xlsx')
raw_data_excel = pd.ExcelFile(raw_data_file)

# second sheet only
raw_data = raw_data_excel.parse(raw_data_excel.sheet_names[1])
```

%% Cell type:code id: tags:

``` python
# preview 5 rows of raw dataframe
raw_data.loc[raw_data.index[:5]]
```

%% Output

      Unnamed: 0 Unnamed: 1              Unnamed: 2 Metric #1 (-120 mV Peak)  \
    0        NaN        NaN                     NaN                  Vehicle
    1      TA ##   Position                   TA ID                     Mean
    2          1      1-A02  Penicillin V Potassium                 -12.8689
    3          2      1-A03   Mycophenolate Mofetil                 -12.8689
    4          3      1-A04              Metaxalone                 -12.8689
    
      Unnamed: 4                   Unnamed: 5    Unnamed: 6 Unnamed: 7
    0        NaN                            4  Replications        NaN
    1         SD  Threshold (%) = Mean + 4xSD      N #1 (%)   N #2 (%)
    2    6.74705                      14.1193       -10.404   -18.1929
    3    6.74705                      14.1193      -12.4453   -11.7175
    4    6.74705                      14.1193      -8.65572   -17.7753

%% Cell type:markdown id: tags:

Note that the actual column headers are stored in row 1 and not row 0 above.
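To make the header layout concrete, here is a small sketch in plain Python (no pandas) that treats row 1 as the header and looks up the positions of the three columns of interest by name. The list `rows` is transcribed by hand from the preview above, with `None` standing for an empty cell; the variable names are illustrative.

```python
# First rows of the sheet as previewed above (None stands for an empty cell).
rows = [
    [None, None, None, 'Vehicle', None, 4, 'Replications', None],
    ['TA ##', 'Position', 'TA ID', 'Mean', 'SD',
     'Threshold (%) = Mean + 4xSD', 'N #1 (%)', 'N #2 (%)'],
    [1, '1-A02', 'Penicillin V Potassium', -12.8689, 6.74705, 14.1193, -10.404, -18.1929],
    [2, '1-A03', 'Mycophenolate Mofetil', -12.8689, 6.74705, 14.1193, -12.4453, -11.7175],
]

header, body = rows[1], rows[2:]
wanted = ['TA ID', 'N #1 (%)', 'N #2 (%)']

# Find each wanted column's position by name instead of hard-coding it.
positions = [header.index(name) for name in wanted]
selected = [[row[i] for i in positions] for row in body]

print(positions)    # [2, 6, 7]
print(selected[0])
```

Looking positions up by name makes it easier to check that hard-coded column indices point at the intended columns.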

%% Cell type:code id: tags:

``` python
# drop the two header rows (0 and 1), as we will replace the column labels
# keep only columns "TA ID" (2), "N #1 (%)" (6) and "N #2 (%)" (7)
raw_data = raw_data.iloc[2:, [2, 6, 7]]
print(raw_data.loc[raw_data.index[:5]])

# reset the index so rows are numbered from 0 again;
# the old index is kept as a new first column
raw_data.reset_index(inplace=True)

# rename columns
raw_data.columns = ['label', 'drug', 'n1', 'n2']
```

%% Output

                   Unnamed: 2 Unnamed: 6 Unnamed: 7
    2  Penicillin V Potassium    -10.404   -18.1929
    3   Mycophenolate Mofetil   -12.4453   -11.7175
    4              Metaxalone   -8.65572   -17.7753
    5           Terazosin·HCl   -11.5048    16.0825
    6          Fluvastatin·Na   -11.1354    -14.553

%% Cell type:code id: tags:

``` python
# preview cleaner dataframe
raw_data.loc[raw_data.index[:5]]
```

%% Output

       label                    drug       n1       n2
    0      2  Penicillin V Potassium  -10.404 -18.1929
    1      3   Mycophenolate Mofetil -12.4453 -11.7175
    2      4              Metaxalone -8.65572 -17.7753
    3      5           Terazosin·HCl -11.5048  16.0825
    4      6          Fluvastatin·Na -11.1354  -14.553

%% Cell type:markdown id: tags:

This formatting is closer to what we need.

Now, let's take the drug names and get smiles strings for them (format needed for DeepChem).

%% Cell type:code id: tags:

``` python
drugs = raw_data['drug'].values
```

%% Cell type:markdown id: tags:

For many of these, we can retrieve the SMILES string via the `canonical_smiles` attribute of the `Compound` objects returned by `get_compounds` (using `pubchempy`).

%% Cell type:code id: tags:

``` python
get_compounds(drugs[1], 'name')
```

%% Output

    [Compound(5281078)]

%% Cell type:code id: tags:

``` python
get_compounds(drugs[1], 'name')[0].canonical_smiles
```

%% Output

    'CC1=C2COC(=O)C2=C(C(=C1OC)CC=C(C)CCC(=O)OCCN3CCOCC3)O'

%% Cell type:markdown id: tags:

However, some of these drug names have variable spacing and symbols (·, (±), etc.), and some names may not be readable by pubchempy.

For this task, we will do a bit of hacking via regular expressions.  Also, we notice that all ions are written in a shortened form that will need to be expanded.  For this reason we use a dictionary, mapping the shortened ion names to versions recognizable to pubchempy.

Unfortunately you may have several corner cases that will require more hacking.
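The cleanup idea can be sketched offline as follows. This is a variation for illustration rather than the notebook's exact function: it expands an abbreviation only when it appears at the end of the cleaned name, which avoids rewriting letters inside an ordinary name (for example the "Na" in "Naproxen"). The names `replacements` and `normalize_name` are hypothetical.

```python
import re

# Abbreviation -> expanded name. Longer keys come first so that a short key
# (e.g. 'Br') is never applied where a longer one (e.g. 'HBr') matches.
replacements = [
    ('HBr', ' hydrobromide'),
    ('HCl', ' hydrochloride'),
    ('Br', ' bromide'),
    ('Na', ' sodium'),
]

def normalize_name(name):
    """Strip symbols, then expand a salt abbreviation found at the end of the name."""
    # Remove everything that is not a letter, digit, or whitespace (and underscores).
    cleaned = re.sub(r'([^\s\w]|_)+', '', name)
    for short, expanded in replacements:
        if cleaned.endswith(short):
            return cleaned[:-len(short)] + expanded
    return cleaned

print(normalize_name('Terazosin·HCl'))   # Terazosin hydrochloride
print(normalize_name('Fluvastatin·Na'))  # Fluvastatin sodium
print(normalize_name('Naproxen'))        # Naproxen
```

Whether suffix-only matching fits a given spreadsheet depends on how its names are written, so treat this as one option among several.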

%% Cell type:code id: tags:

``` python
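# maps abbreviated ion names to names PubChem recognizes
ion_replacements = {
    'HBr': ' hydrobromide',
    '2Br': ' dibromide',
    'Br': ' bromide',
    'HCl': ' hydrochloride',
    '2H2O': ' dihydrate',
    'H2O': ' hydrate',  # assumes water is spelled with the letter O, as in '2H2O'
    'Na': ' sodium'
}

# order matters: longer keys must precede the shorter keys they contain
ion_keys = ['HBr', 'HCl', '2Br', '2H2O', 'H2O', 'Br', 'Na']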
```

%% Cell type:code id: tags:

``` python
import re
```

%% Cell type:code id: tags:

``` python
def compound_to_smiles(cmpd):
    # strip symbols and underscores; letters, digits and whitespace are kept
    compound = re.sub(r'([^\s\w]|_)+', '', cmpd)

    # expand abbreviated ion names if needed
    for ion in ion_keys:
        if ion in compound:
            compound = compound.replace(ion, ion_replacements[ion])

    # query for the CID first in order to avoid a TimeoutError
    cid = get_cids(compound, 'name')[0]
    smiles = get_compounds(cid)[0].canonical_smiles

    return smiles
```

%% Cell type:markdown id: tags:

Now let's actually convert all these compounds to smiles. This conversion will take a few minutes so might not be a bad spot to go grab a coffee or tea and take a break while this is running! Note that this conversion will sometimes fail so we've added some error handling to catch these cases below.
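One possible refinement of the error handling mentioned above is to record which names failed and why, so they can be fixed by hand afterwards. The sketch below shows the pattern with a stand-in lookup so that it runs offline; `convert_all`, `fake_lookup`, and `fake_table` are hypothetical names and do not call PubChem.

```python
def convert_all(names, lookup):
    """Apply `lookup` to each name and return (results, failures)."""
    results, failures = {}, {}
    for name in names:
        try:
            results[name] = lookup(name)
        except Exception as err:  # narrower than a bare `except:`, so Ctrl-C still works
            failures[name] = repr(err)
    return results, failures

# Hypothetical stand-in for the PubChem query, so the sketch runs offline.
fake_table = {'Metaxalone': 'CC1=CC(=CC(=C1)OCC2CNC(=O)O2)C'}

def fake_lookup(name):
    return fake_table[name]  # raises KeyError for unknown names

smiles_map, failed = convert_all(['Metaxalone', 'Not A Drug'], fake_lookup)
print(smiles_map)
print(failed)
```

Keeping the failures in a dictionary gives you a short to-do list of corner cases instead of a log to scroll through.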

%% Cell type:code id: tags:

``` python
smiles_map = {}
for i, compound in enumerate(drugs):
    print("Converting %s to smiles" % i)
    try:
        smiles_map[compound] = compound_to_smiles(compound)
    except Exception:
        print("Errored on %s" % i)
        continue
```

%% Output

    Converting 0 to smiles
    Converting 1 to smiles
    Converting 2 to smiles
    Converting 3 to smiles
    Converting 4 to smiles
    Converting 5 to smiles
    Converting 6 to smiles
    Converting 7 to smiles
    Converting 8 to smiles
    Converting 9 to smiles
    Converting 10 to smiles
    Converting 11 to smiles
    Converting 12 to smiles
    Converting 13 to smiles
    Converting 14 to smiles
    Converting 15 to smiles
    Converting 16 to smiles
    Converting 17 to smiles
    Converting 18 to smiles
    Converting 19 to smiles
    Converting 20 to smiles
    Converting 21 to smiles
    Converting 22 to smiles
    Converting 23 to smiles
    Converting 24 to smiles
    Converting 25 to smiles
    Converting 26 to smiles
    Converting 27 to smiles
    Converting 28 to smiles
    Converting 29 to smiles
    Converting 30 to smiles
    Converting 31 to smiles
    Converting 32 to smiles
    Converting 33 to smiles
    Converting 34 to smiles
    Converting 35 to smiles
    Converting 36 to smiles
    Converting 37 to smiles
    Converting 38 to smiles
    Converting 39 to smiles
    Converting 40 to smiles
    Converting 41 to smiles
    Converting 42 to smiles
    Converting 43 to smiles
    Converting 44 to smiles
    Converting 45 to smiles
    Converting 46 to smiles
    Converting 47 to smiles
    Converting 48 to smiles
    Converting 49 to smiles
    Converting 50 to smiles
    Converting 51 to smiles
    Converting 52 to smiles
    Converting 53 to smiles
    Converting 54 to smiles
    Converting 55 to smiles
    Converting 56 to smiles
    Converting 57 to smiles
    Converting 58 to smiles
    Converting 59 to smiles
    Converting 60 to smiles
    Converting 61 to smiles
    Converting 62 to smiles
    Converting 63 to smiles
    Converting 64 to smiles
    Converting 65 to smiles
    Converting 66 to smiles
    Converting 67 to smiles
    Converting 68 to smiles
    Converting 69 to smiles
    Converting 70 to smiles
    Converting 71 to smiles
    Converting 72 to smiles
    Converting 73 to smiles
    Converting 74 to smiles
    Converting 75 to smiles
    Converting 76 to smiles
    Converting 77 to smiles
    Converting 78 to smiles
    Converting 79 to smiles
    Converting 80 to smiles
    Converting 81 to smiles
    Converting 82 to smiles
    Converting 83 to smiles
    Converting 84 to smiles
    Converting 85 to smiles
    Converting 86 to smiles
    Converting 87 to smiles
    Converting 88 to smiles
    Converting 89 to smiles
    Converting 90 to smiles
    Converting 91 to smiles
    Converting 92 to smiles
    Converting 93 to smiles
    Converting 94 to smiles
    Converting 95 to smiles
    Converting 96 to smiles
    Converting 97 to smiles
    Converting 98 to smiles
    Converting 99 to smiles
    Converting 100 to smiles
    Converting 101 to smiles
    Converting 102 to smiles
    Converting 103 to smiles
    Converting 104 to smiles
    Converting 105 to smiles
    Converting 106 to smiles
    Converting 107 to smiles
    Converting 108 to smiles
    Converting 109 to smiles
    Converting 110 to smiles
    Converting 111 to smiles
    Converting 112 to smiles
    Converting 113 to smiles
    Converting 114 to smiles
    Converting 115 to smiles
    Converting 116 to smiles
    Converting 117 to smiles
    Converting 118 to smiles
    Converting 119 to smiles
    Converting 120 to smiles
    Converting 121 to smiles
    Converting 122 to smiles
    Converting 123 to smiles
    Converting 124 to smiles
    Converting 125 to smiles
    Converting 126 to smiles
    Converting 127 to smiles
    Converting 128 to smiles
    Converting 129 to smiles
    Converting 130 to smiles
    Converting 131 to smiles
    Converting 132 to smiles
    Converting 133 to smiles
    Converting 134 to smiles
    Converting 135 to smiles
    Converting 136 to smiles
    Converting 137 to smiles
    Converting 138 to smiles
    Converting 139 to smiles
    Converting 140 to smiles
    Converting 141 to smiles
    Converting 142 to smiles
    Converting 143 to smiles
    Converting 144 to smiles
    Converting 145 to smiles
    Converting 146 to smiles
    Converting 147 to smiles
    Converting 148 to smiles
    Converting 149 to smiles
    Converting 150 to smiles
    Converting 151 to smiles
    Converting 152 to smiles
    Converting 153 to smiles
    Converting 154 to smiles
    Converting 155 to smiles
    Converting 156 to smiles
    Converting 157 to smiles
    Converting 158 to smiles
    Converting 159 to smiles
    Converting 160 to smiles
    Converting 161 to smiles
    Converting 162 to smiles
    Errored on 162
    Converting 163 to smiles
    Converting 164 to smiles
    Converting 165 to smiles
    Converting 166 to smiles
    Converting 167 to smiles
    Converting 168 to smiles
    Converting 169 to smiles
    Converting 170 to smiles
    Converting 171 to smiles
    Converting 172 to smiles
    Converting 173 to smiles
    Converting 174 to smiles
    Converting 175 to smiles
    Converting 176 to smiles
    Converting 177 to smiles
    Converting 178 to smiles
    Converting 179 to smiles
    Converting 180 to smiles
    Converting 181 to smiles
    Converting 182 to smiles
    Converting 183 to smiles
    Converting 184 to smiles
    Converting 185 to smiles
    Converting 186 to smiles
    Converting 187 to smiles
    Converting 188 to smiles
    Converting 189 to smiles
    Converting 190 to smiles
    Converting 191 to smiles
    Converting 192 to smiles
    Converting 193 to smiles
    Converting 194 to smiles
    Converting 195 to smiles
    Converting 196 to smiles
    Converting 197 to smiles
    Converting 198 to smiles
    Converting 199 to smiles
    Converting 200 to smiles
    Converting 201 to smiles
    Converting 202 to smiles
    Converting 203 to smiles
    Converting 204 to smiles
    Converting 205 to smiles
    Converting 206 to smiles
    Converting 207 to smiles
    Converting 208 to smiles
    Converting 209 to smiles
    Converting 210 to smiles
    Converting 211 to smiles
    Converting 212 to smiles
    Converting 213 to smiles
    Converting 214 to smiles
    Converting 215 to smiles
    Converting 216 to smiles
    Converting 217 to smiles
    Converting 218 to smiles
    Converting 219 to smiles
    Converting 220 to smiles
    Converting 221 to smiles
    Converting 222 to smiles
    Converting 223 to smiles
    Converting 224 to smiles
    Converting 225 to smiles
    Converting 226 to smiles
    Converting 227 to smiles
    Converting 228 to smiles
    Converting 229 to smiles
    Converting 230 to smiles
    Converting 231 to smiles
    Converting 232 to smiles
    Converting 233 to smiles
    Converting 234 to smiles
    Converting 235 to smiles
    Converting 236 to smiles
    Converting 237 to smiles
    Converting 238 to smiles
    Converting 239 to smiles
    Converting 240 to smiles
    Converting 241 to smiles
    Converting 242 to smiles
    Converting 243 to smiles
    Converting 244 to smiles
    Converting 245 to smiles
    Converting 246 to smiles
    Converting 247 to smiles
    Converting 248 to smiles
    Converting 249 to smiles
    Converting 250 to smiles
    Converting 251 to smiles
    Converting 252 to smiles
    Converting 253 to smiles
    Converting 254 to smiles
    Converting 255 to smiles
    Converting 256 to smiles
    Converting 257 to smiles
    Converting 258 to smiles
    Converting 259 to smiles
    Converting 260 to smiles
    Converting 261 to smiles
    Converting 262 to smiles
    Converting 263 to smiles
    Converting 264 to smiles
    Converting 265 to smiles
    Converting 266 to smiles
    Converting 267 to smiles
    Converting 268 to smiles
    Converting 269 to smiles
    Converting 270 to smiles
    Converting 271 to smiles
    Converting 272 to smiles
    Converting 273 to smiles
    Converting 274 to smiles
    Converting 275 to smiles
    Converting 276 to smiles
    Converting 277 to smiles
    Converting 278 to smiles
    Converting 279 to smiles
    Converting 280 to smiles
    Converting 281 to smiles
    Converting 282 to smiles
    Converting 283 to smiles
    Converting 284 to smiles
    Converting 285 to smiles
    Converting 286 to smiles
    Converting 287 to smiles
    Converting 288 to smiles
    Converting 289 to smiles
    Converting 290 to smiles
    Converting 291 to smiles
    Converting 292 to smiles
    Converting 293 to smiles
    Converting 294 to smiles
    Converting 295 to smiles
    Converting 296 to smiles
    Converting 297 to smiles
    Converting 298 to smiles
    Converting 299 to smiles
    Converting 300 to smiles
    Converting 301 to smiles
    Converting 302 to smiles
    Converting 303 to smiles
    Errored on 303
    Converting 304 to smiles
    Converting 305 to smiles
    Converting 306 to smiles
    Converting 307 to smiles
    Converting 308 to smiles
    Converting 309 to smiles
    Converting 310 to smiles
    Converting 311 to smiles
    Converting 312 to smiles
    Converting 313 to smiles
    Converting 314 to smiles
    Converting 315 to smiles
    Converting 316 to smiles
    Converting 317 to smiles
    Converting 318 to smiles
    Converting 319 to smiles
    Converting 320 to smiles
    Converting 321 to smiles
    Converting 322 to smiles
    Converting 323 to smiles
    Converting 324 to smiles
    Converting 325 to smiles
    Converting 326 to smiles
    Converting 327 to smiles
    Converting 328 to smiles
    Converting 329 to smiles
    Converting 330 to smiles

    Converting 331 to smiles
    Converting 332 to smiles
    Converting 333 to smiles
    Converting 334 to smiles
    Converting 335 to smiles
    Converting 336 to smiles
    Converting 337 to smiles
    Converting 338 to smiles
    Converting 339 to smiles
    Converting 340 to smiles
    Converting 341 to smiles
    Converting 342 to smiles
    Converting 343 to smiles
    Converting 344 to smiles
    Converting 345 to smiles
    Converting 346 to smiles
    Converting 347 to smiles
    Converting 348 to smiles
    Converting 349 to smiles
    Converting 350 to smiles
    Converting 351 to smiles
    Converting 352 to smiles
    Converting 353 to smiles
    Converting 354 to smiles
    Converting 355 to smiles
    Converting 356 to smiles
    Converting 357 to smiles
    Converting 358 to smiles
    Converting 359 to smiles
    Converting 360 to smiles
    Converting 361 to smiles
    Converting 362 to smiles
    Converting 363 to smiles
    Converting 364 to smiles
    Converting 365 to smiles
    Converting 366 to smiles
    Converting 367 to smiles
    Converting 368 to smiles
    Converting 369 to smiles
    Converting 370 to smiles
    Converting 371 to smiles
    Converting 372 to smiles
    Converting 373 to smiles
    Converting 374 to smiles
    Converting 375 to smiles
    Converting 376 to smiles
    Converting 377 to smiles
    Converting 378 to smiles
    Converting 379 to smiles
    Converting 380 to smiles
    Converting 381 to smiles
    Converting 382 to smiles
    Converting 383 to smiles
    Converting 384 to smiles
    Converting 385 to smiles
    Converting 386 to smiles
    Converting 387 to smiles
    Converting 388 to smiles
    Converting 389 to smiles
    Converting 390 to smiles
    Converting 391 to smiles
    Converting 392 to smiles
    Converting 393 to smiles
    Converting 394 to smiles
    Converting 395 to smiles
    Converting 396 to smiles
    Converting 397 to smiles
    Converting 398 to smiles
    Converting 399 to smiles
    Converting 400 to smiles
    Converting 401 to smiles
    Converting 402 to smiles
    Converting 403 to smiles
    Converting 404 to smiles
    Converting 405 to smiles
    Converting 406 to smiles
    Converting 407 to smiles
    Converting 408 to smiles
    Converting 409 to smiles
    Converting 410 to smiles
    Converting 411 to smiles
    Converting 412 to smiles
    Converting 413 to smiles
    Converting 414 to smiles
    Converting 415 to smiles
    Converting 416 to smiles
    Converting 417 to smiles
    Converting 418 to smiles
    Converting 419 to smiles
    Converting 420 to smiles
    Converting 421 to smiles
    Converting 422 to smiles
    Converting 423 to smiles
    Converting 424 to smiles
    Converting 425 to smiles
    Converting 426 to smiles
    Converting 427 to smiles
    Converting 428 to smiles
    Converting 429 to smiles

%% Cell type:code id: tags:

``` python
smiles_data = raw_data
# map drug name to smiles string
smiles_data['drug'] = smiles_data['drug'].apply(lambda x: smiles_map[x] if x in smiles_map else None)
```

%% Cell type:code id: tags:

``` python
# preview smiles data
smiles_data.loc[smiles_data.index[:5]]
```

%% Output

       label                                               drug       n1       n2
    0      2  CC1(C(N2C(S1)C(C2=O)NC(=O)COC3=CC=CC=C3)C(=O)[...  -10.404 -18.1929
    1      3  CC1=C2COC(=O)C2=C(C(=C1OC)CC=C(C)CCC(=O)OCCN3C... -12.4453 -11.7175
    2      4                     CC1=CC(=CC(=C1)OCC2CNC(=O)O2)C -8.65572 -17.7753
    3      5  COC1=C(C=C2C(=C1)C(=NC(=N2)N3CCN(CC3)C(=O)C4CC... -11.5048  16.0825
    4      6  CC(C)N1C2=CC=CC=C2C(=C1C=CC(CC(CC(=O)[O-])O)O)... -11.1354  -14.553

%% Cell type:markdown id: tags:

Hooray, we have mapped the drug names to their corresponding SMILES strings (the two names that errored above are stored as `None`).

Now, we need to look at the data and remove as much noise as possible.

%% Cell type:markdown id: tags:

## De-noising data

%% Cell type:markdown id: tags:

In machine learning, we know that there is no free lunch.  You will need to spend time analyzing and understanding your data in order to frame your problem and determine the appropriate model framework.  Treatment of your data will depend on the conclusions you gather from this process.

Questions to ask yourself:
* What are you trying to accomplish?
* What is your assay?
* What is the structure of the data?
* Does the data make sense?
* What has been tried previously?

For this project (respectively):
* I would like to build a model capable of predicting the affinity of an arbitrary small molecule drug to a particular ion channel protein
* For an input drug, data describing channel inhibition
* A few hundred drugs, with n=2
* Will need to look more closely at the dataset*
* Nothing on this particular protein

%% Cell type:markdown id: tags:

*This will involve plotting, so we will import matplotlib and seaborn. We will also need to look at molecular structures, so we will import rdkit. You can install seaborn with `pip install seaborn`.

%% Cell type:code id: tags:

``` python
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_style('white')
```

%% Cell type:code id: tags:

``` python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw, PyMol, rdFMCS
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
```

%% Output

    RDKit WARNING: [17:10:53] Enabling RDKit 2019.09.3 jupyter extensions

%% Cell type:code id: tags:

``` python
# I will use numpy on occasion for manipulating arrays
import numpy as np
```

%% Cell type:markdown id: tags:

Our goal is to build a small molecule model, so let's make sure our molecules are all small.  This can be approximated by the length of each smiles string.
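As a toy illustration of this length heuristic, the sketch below filters a short list of SMILES strings by character count. The strings are taken from the preview earlier in this notebook, and the cutoff of 40 is chosen only to make this tiny example interesting. SMILES length also counts bond symbols, ring digits, and parentheses, so it is only a rough proxy for molecular size.

```python
# SMILES strings taken from the preview above; None marks a failed lookup.
smiles = [
    'CC1=CC(=CC(=C1)OCC2CNC(=O)O2)C',
    'CC1=C2COC(=O)C2=C(C(=C1OC)CC=C(C)CCC(=O)OCCN3CCOCC3)O',
    None,
]

def smiles_length(s):
    """Length of a SMILES string, with 0 for a missing entry."""
    return len(s) if s is not None else 0

max_len = 40  # illustrative cutoff for this toy list only
lengths = [smiles_length(s) for s in smiles]

# Keep entries that exist and are no longer than the cutoff.
kept = [s for s in smiles if s is not None and smiles_length(s) <= max_len]

print(lengths)
print(kept)
```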

%% Cell type:code id: tags:

``` python
smiles_data['len'] = [len(i) if i is not None else 0 for i in smiles_data['drug']]
```

%% Cell type:code id: tags:

``` python
smiles_lens = [len(i) if i is not None else 0 for i in smiles_data['drug']]
sns.distplot(smiles_lens)
plt.xlabel('len(smiles)')
plt.ylabel('probability')
```

%% Output

    Text(0,0.5,'probability')


%% Cell type:markdown id: tags:

Some of these look rather large, len(smiles) > 150.  Let's see what they look like.

%% Cell type:code id: tags:

``` python
# indices of large looking molecules
suspiciously_large = np.where(np.array(smiles_lens) > 150)[0]

# corresponding smiles string
long_smiles = smiles_data.loc[smiles_data.index[suspiciously_large]]['drug'].values
```

%% Cell type:code id: tags:

``` python
# look
Draw._MolsToGridImage([Chem.MolFromSmiles(i) for i in long_smiles], molsPerRow=6)
```

%% Output

    <PIL.PngImagePlugin.PngImageFile image mode=RGB size=1200x200 at 0x11B19C7B8>

%% Cell type:markdown id: tags:

As suspected, these are not small molecules, so we will remove them from the dataset. The argument here is that these molecules could register as inhibitors simply because they are large. They are more likely to sterically block the channel than to diffuse inside and bind (which is what we are interested in).

The lesson here is to remove data that does not fit your use case.

%% Cell type:code id: tags:

``` python
# drop large molecules
smiles_data = smiles_data[~smiles_data['drug'].isin(long_smiles)]
```

%% Cell type:markdown id: tags:

Now, let's look at the numerical structure of the dataset.

First, check for NaNs.

%% Cell type:code id: tags:

``` python
nan_rows = smiles_data[smiles_data.isnull().any(axis=1)]
nan_rows[['n1', 'n2']]
```

%% Output

              n1       n2
    62       NaN  -7.8266
    162 -12.8456 -11.4627
    175      NaN -6.61225
    187      NaN -8.23326
    233 -8.21781      NaN
    262      NaN -12.8788
    288      NaN -2.34264
    300      NaN -8.19936
    301      NaN -10.4633
    303 -5.61374  8.42267
    311      NaN -8.78722

%% Cell type:markdown id: tags:

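I don't trust n=1, so I will throw these out. Rows 162 and 303 appear here even though both measurements are present because their SMILES lookup failed, leaving `drug` empty; they will be dropped as well.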

Then, let's examine the distribution of n1 and n2.

%% Cell type:code id: tags:

``` python
df = smiles_data.dropna(axis=0, how='any')
```

%% Cell type:code id: tags:

``` python
# seaborn jointplot will allow us to compare n1 and n2, and plot each marginal
sns.jointplot(x='n1', y='n2', data=df)
```

%% Output

    <seaborn.axisgrid.JointGrid at 0x1a20d49ef0>


%% Cell type:markdown id: tags:

We see that most of the data is contained in the gaussian-ish blob centered a bit below zero.  We see that there are a few clearly active datapoints located in the bottom left, and one on the top right.  These are all distinguished from the majority of the data.  How do we handle the data in the blob?

Because n1 and n2 represent the same measurement, ideally they would have the same value. The plot would then be tightly aligned to the diagonal, and the Pearson correlation coefficient would be 1. We see this is not the case. This helps give us an idea of the error of our assay.
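For readers who want to see the correlation coefficient spelled out, here is a minimal pure-Python sketch. The helper `pearson_r` is written for this example, and the five replicate pairs are the ones shown in the dataframe preview earlier, so the second number printed describes only those five rows, not the whole dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The five replicate pairs shown in the preview above.
n1 = [-10.404, -12.4453, -8.65572, -11.5048, -11.1354]
n2 = [-18.1929, -11.7175, -17.7753, 16.0825, -14.553]

print(pearson_r(n1, n1))  # identical replicates give 1 (up to rounding)
print(pearson_r(n1, n2))
```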

Let's look at the error more closely by plotting the distribution of (n1-n2).

%% Cell type:code id: tags:

``` python
diff_df = df['n1'] - df['n2']

sns.distplot(diff_df)
plt.xlabel('difference in n')
plt.ylabel('probability')
```

%% Output

    Text(0,0.5,'probability')


%% Cell type:markdown id: tags:

This looks pretty Gaussian. Let's estimate a 95% interval by fitting a Gaussian via scipy and taking 2 times the standard deviation (about 95% of a Gaussian's mass lies within two standard deviations of its mean).
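The same quantity can be sketched without scipy. To my understanding, a maximum-likelihood Gaussian fit returns the sample mean and the population standard deviation (dividing by N rather than N-1), which is what `statistics.pstdev` computes. The helper `two_sigma` is a name made up for this example, and the five differences come from the rows previewed earlier, so the numbers printed are illustrative only.

```python
import statistics

def two_sigma(differences):
    """Return (mean, std, 2*std) using the population standard deviation."""
    mean = statistics.mean(differences)
    std = statistics.pstdev(differences)
    return mean, std, 2 * std

# Replicate pairs from the preview above (an illustrative subset of the data).
n1 = [-10.404, -12.4453, -8.65572, -11.5048, -11.1354]
n2 = [-18.1929, -11.7175, -17.7753, 16.0825, -14.553]
diffs = [a - b for a, b in zip(n1, n2)]

mean, std, cutoff = two_sigma(diffs)

# Flag pairs whose replicates disagree by more than the cutoff.
noisy = [d for d in diffs if abs(d) > cutoff]
print(mean, std, cutoff)
print(noisy)
```

Note that the cutoff is applied to the absolute difference, which treats the fitted mean as approximately zero.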

%% Cell type:code id: tags:

``` python
from scipy import stats
```

%% Cell type:code id: tags:

``` python
mean, std = stats.norm.fit(np.asarray(diff_df, dtype=np.float32))
```

%% Cell type:code id: tags:

``` python
ci_95 = std*2
ci_95
```

%% Output

    17.75387954711914

%% Cell type:markdown id: tags:

Now, I don't trust the data outside of the confidence interval, and will therefore drop these datapoints from df.

For example, in the plot above, at least one datapoint has n1-n2 > 60.  This is disconcerting.

%% Cell type:code id: tags:

``` python
noisy = diff_df[abs(diff_df) > ci_95]
df = df.drop(noisy.index)
```

%% Cell type:code id: tags:

``` python
sns.jointplot(x='n1', y='n2', data=df)
```

%% Output

    <seaborn.axisgrid.JointGrid at 0x1a211eb5f8>


%% Cell type:markdown id: tags:

Now that data looks much better!

So, let's average n1 and n2, and take the error bar to be ci_95.

%% Cell type:code id: tags:

``` python
# copy the slice so that adding a column does not trigger SettingWithCopyWarning
avg_df = df[['label', 'drug']].copy()
n_avg = df[['n1', 'n2']].mean(axis=1)
avg_df['n'] = n_avg
avg_df.sort_values('n', inplace=True)
```

%% Output

    /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    
    See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
      This is separate from the ipykernel package so we can avoid doing imports until
    /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame
    
    See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
      after removing the cwd from sys.path.

%% Cell type:markdown id: tags:

Now, let's look at the sorted data with error bars.

%% Cell type:code id: tags:

``` python
plt.errorbar(np.arange(avg_df.shape[0]), avg_df['n'], yerr=ci_95, fmt='o')
plt.xlabel('drug, sorted')
plt.ylabel('activity')
```

%% Output

    Text(0,0.5,'activity')


%% Cell type:markdown id: tags:

Now, let's identify our active compounds.

In my case, this required domain knowledge. Having worked in this area, and having consulted with professors specializing in this channel, I am interested in compounds where the absolute value of the activity is greater than 25. This relates to the desired drug potency we would like to model.

If you are not certain how to draw the line between active and inactive, this cutoff could potentially be treated as a hyperparameter.
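
One way to see how sensitive the active set is to this choice is to count the actives at several candidate cutoffs. The sketch below uses made-up activity values and a made-up `ci_95`; the helper `count_actives` is hypothetical and simply mirrors the `abs(n) - ci_95 > cutoff` rule used in this tutorial.

```python
# Hypothetical averaged activities and error bar; not values from the real dataset.
activities = [-60.0, -31.0, -12.0, 0.5, 8.0, 27.0, 45.0]
ci_95 = 5.0


def count_actives(values, ci, cutoff):
    """Count values whose magnitude exceeds the cutoff even after subtracting the error bar."""
    return sum(1 for v in values if abs(v) - ci > cutoff)


# Scan a few candidate cutoffs, as one would when treating the cutoff as a hyperparameter.
counts = {cutoff: count_actives(activities, ci_95, cutoff) for cutoff in (15, 25, 35)}
print(counts)
```

A stricter cutoff leaves fewer actives, which matters a great deal when the positive class is already tiny.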

%% Cell type:code id: tags:

``` python
actives = avg_df[abs(avg_df['n'])-ci_95 > 25]['n']

plt.errorbar(np.arange(actives.shape[0]), actives, yerr=ci_95, fmt='o')
```

%% Output

    <ErrorbarContainer object of 3 artists>


%% Cell type:code id: tags:

``` python
# summary
print(raw_data.shape, avg_df.shape, len(actives.index))
```

%% Output

    (430, 5) (392, 3) 6

%% Cell type:markdown id: tags:

In summary, we have:
* Removed data that did not address the question we hope to answer (small molecules only)
* Dropped NaNs
* Determined the noise of our measurements
* Removed exceptionally noisy datapoints
* Identified actives (using domain knowledge to determine a threshold)
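
As a rough illustration of the last four steps, here is a minimal sketch on toy numbers. The replicate values, `ci_95`, and the `max_spread` tolerance are all assumptions made for this example; the criterion used earlier in this tutorial to discard noisy datapoints may differ, and the first step (keeping small molecules only) is omitted.

```python
import numpy as np

# Hypothetical replicate measurements for five compounds; NaN marks a missing value.
n1 = np.array([10.0, -40.0, 5.0, np.nan, 30.0])
n2 = np.array([12.0, -38.0, 6.0, 1.0, -30.0])
ci_95 = 5.0        # assumed error bar
max_spread = 10.0  # assumed tolerance for replicate disagreement

# Drop rows where either replicate is missing.
keep = ~(np.isnan(n1) | np.isnan(n2))
n1, n2 = n1[keep], n2[keep]

# Remove exceptionally noisy points: replicates that disagree by more than the tolerance.
consistent = np.abs(n1 - n2) <= max_spread
n_avg = (n1[consistent] + n2[consistent]) / 2.0

# Flag actives using the same rule as the tutorial.
active = (np.abs(n_avg) - ci_95 > 25).astype(int)
print(n_avg, active)
```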

%% Cell type:markdown id: tags:

## Determine model type, final form of dataset, and sanity load

%% Cell type:markdown id: tags:

Now, what model framework should we use?

Given that we have 392 datapoints and 6 actives, this data will be used to build a low data one-shot classifier (10.1021/acscentsci.6b00367).  If there were datasets of similar character, transfer learning could potentially be used, but this is not the case at the moment.


Let's apply logic to our dataframe in order to cast it into a binary format, suitable for classification.

%% Cell type:code id: tags:

``` python
# 1 if condition for active is met, 0 otherwise
avg_df['active'] = (abs(avg_df['n'])-ci_95 > 25).astype(int)
```


%% Cell type:markdown id: tags:

Now, save this to file.

%% Cell type:code id: tags:

``` python
avg_df.to_csv('modulators.csv', index=False)
```

%% Cell type:markdown id: tags:

Now, we will convert this dataframe to a DeepChem dataset.

%% Cell type:code id: tags:

``` python
import deepchem as dc
```


%% Cell type:code id: tags:

``` python
dataset_file = 'modulators.csv'
task = ['active']
featurizer_func = dc.feat.ConvMolFeaturizer()

loader = dc.data.CSVLoader(tasks=task, smiles_field='drug', featurizer=featurizer_func)
dataset = loader.featurize(dataset_file)
```

%% Output

    Loading raw samples now.
    shard_size: 8192
    About to start loading CSV from modulators.csv
    Loading shard 1 of size 8192.
    Featurizing sample 0
    TIMING: featurizing shard 0 took 0.689 s
    TIMING: dataset construction took 0.825 s
    Loading dataset from disk.

%% Cell type:markdown id: tags:

Lastly, it is often advantageous to numerically transform the data in some way.  For example, sometimes it is useful to normalize the data, or to center it at zero mean.  This depends on the task at hand.

DeepChem ships many useful transformers in the `dc.trans` module.

Because this is a classification model, and the number of actives is low, I will apply a balancing transformer.  I treated this transformer as a hyperparameter when I began training models, and it clearly improved model performance.
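
To make the idea of balancing concrete, here is a minimal NumPy sketch of class re-weighting. It illustrates the general technique only and is not DeepChem's implementation, whose exact weighting scheme may differ; the label array simply mimics the 392 compounds and 6 actives found above.

```python
import numpy as np

# Labels shaped like this tutorial's dataset: 392 compounds, 6 of them active.
y = np.zeros(392, dtype=int)
y[:6] = 1

# Start from uniform weights, then up-weight the rare class so that
# both classes contribute the same total weight.
w = np.ones(len(y), dtype=float)
n_pos = int(y.sum())
n_neg = len(y) - n_pos
w[y == 1] = n_neg / n_pos

print(w[y == 1].sum(), w[y == 0].sum())
```

With weights like these, a loss that sums over samples no longer lets the 386 inactives drown out the 6 actives.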

%% Cell type:code id: tags:

``` python
transformer = dc.trans.BalancingTransformer(transform_w=True, dataset=dataset)
dataset = transformer.transform(dataset)
```

%% Output

    TIMING: dataset construction took 0.160 s
    Loading dataset from disk.

%% Cell type:markdown id: tags:

Now let's save the balanced dataset object to disk, and then reload it as a sanity check.
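
The same round-trip check can be sketched without DeepChem. The example below is an assumption-laden stand-in: it uses plain NumPy arrays and `np.savez` rather than a DeepChem dataset, and the array contents are invented.

```python
import os
import tempfile

import numpy as np

# Hypothetical arrays standing in for a dataset's features and labels.
X = np.arange(12, dtype=float).reshape(4, 3)
y = np.array([0, 0, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.npz")
    np.savez(path, X=X, y=y)         # write to disk
    with np.load(path) as reloaded:  # read it back
        X_back, y_back = reloaded["X"], reloaded["y"]

# The sanity check: what came back must equal what was saved.
round_trip_ok = np.array_equal(X, X_back) and np.array_equal(y, y_back)
print(round_trip_ok)
```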

%% Cell type:code id: tags:

``` python
dc.utils.save.save_to_disk(dataset, 'balanced_dataset.joblib')
balanced_dataset = dc.utils.save.load_from_disk('balanced_dataset.joblib')
```

%% Cell type:markdown id: tags:

Tutorial written by Keri McKiernan (github.com/kmckiern) on September 8, 2016
%% Cell type:markdown id: tags:

### Input Formats
DeepChem supports a wide range of input files. Accepted input formats include .csv, .sdf, .fasta, .png, .tif and others. Loading of a particular file format is governed by the `Loader` class associated with that format. For example, for a csv input we use the `CSVLoader` class under the hood. A .csv file that fits the requirements of `CSVLoader` contains:

1. A column containing SMILES strings [1].
2. A column containing an experimental measurement.
3. (Optional) A column containing a unique compound identifier.

Here's an example of a potential input file.

|Compound ID    | measured log solubility in mols per litre | smiles         |
|---------------|-------------------------------------------|----------------|
| benzothiazole | -1.5                                      | c2ccc1scnc1c2  |


Here the "smiles" column contains the SMILES string, the "measured log
solubility in mols per litre" column contains the experimental measurement and
the "Compound ID" column contains the unique compound identifier.

[1] Anderson, Eric, Gilman D. Veith, and David Weininger. "SMILES, a line
notation and computerized interpreter for chemical structures." US
Environmental Protection Agency, Environmental Research Laboratory, 1987.
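
To make the three-column layout described above concrete, the following sketch parses the example row with Python's standard `csv` module. It only shows how the columns map to ids, measurements and SMILES strings; it does not use `CSVLoader`, and the in-memory `csv_text` is an assumption made so the example is self-contained.

```python
import csv
import io

# The example table from above, written as CSV text.
csv_text = (
    "Compound ID,measured log solubility in mols per litre,smiles\n"
    "benzothiazole,-1.5,c2ccc1scnc1c2\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

ids = [row["Compound ID"] for row in rows]                                         # optional identifier
labels = [float(row["measured log solubility in mols per litre"]) for row in rows]  # measurement
smiles = [row["smiles"] for row in rows]                                           # structure as SMILES
print(ids, labels, smiles)
```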

### Data Featurization

Most machine learning algorithms require that input data form vectors. However, input data for drug-discovery datasets routinely come as lists of molecules and associated experimental readouts. To transform lists of molecules into vectors, we use subclasses of the DeepChem loader class `dc.data.DataLoader`, such as `dc.data.CSVLoader` or `dc.data.SDFLoader`. Users can subclass `dc.data.DataLoader` to load arbitrary file formats. All loaders must be passed a `dc.feat.Featurizer` object. DeepChem provides a number of different subclasses of `dc.feat.Featurizer` for convenience.
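
To illustrate what "turning molecules into vectors" means, here is a deliberately crude toy featurizer. It is not a DeepChem class: `toy_featurize` and `ATOM_SYMBOLS` are hypothetical names, and counting characters in a SMILES string ignores real chemistry (for example, it would count the `C` in `Cl` as a carbon). Real featurizers are far more careful.

```python
import numpy as np

ATOM_SYMBOLS = ("c", "n", "o", "s")  # hypothetical, deliberately tiny vocabulary


def toy_featurize(smiles_list):
    """Map each SMILES string to a vector of character counts (illustration only)."""
    rows = [
        [smiles.lower().count(symbol) for symbol in ATOM_SYMBOLS]
        for smiles in smiles_list
    ]
    return np.array(rows, dtype=float)


X = toy_featurize(["c2ccc1scnc1c2"])  # benzothiazole
print(X.shape, X)
```

The point is the shape of the result: one fixed-length row per molecule, which is what a downstream model consumes.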

%% Cell type:markdown id: tags:

# Using Deepchem Datasets
In this tutorial we will look at the dataset classes and methods available in `deepchem.data.datasets`.

%% Cell type:code id: tags:

``` python
import deepchem as dc
import numpy as np
import random
```


%% Cell type:markdown id: tags:

# Using NumpyDatasets

The `dc.data.NumpyDataset` class is used when your data is in numpy arrays. It provides a simple wrapper around a collection of numpy arrays.

%% Cell type:code id: tags:

``` python
# data is your dataset, here a numpy array of shape (4, 4)
data = np.random.random((4, 4))
labels = np.random.random((4,)) # labels of shape (4,)
```

%% Cell type:code id: tags:

``` python
from deepchem.data.datasets import NumpyDataset # import NumpyDataset
```

%% Cell type:code id: tags:

``` python
dataset = NumpyDataset(data, labels) # creates numpy dataset object
dataset
```

%% Output

    <deepchem.data.datasets.NumpyDataset at 0x1a3a047a20>
    <deepchem.data.datasets.NumpyDataset at 0x1a38c2beb8>

%% Cell type:markdown id: tags:

## Extracting X, y from NumpyDataset Object
Extracting the data and labels from the NumpyDataset is very easy.

%% Cell type:code id: tags:

``` python
dataset.X # Extracts the data (X) from the NumpyDataset Object
```

%% Output

    array([[0.01109257, 0.62277556, 0.02058281, 0.83395641],
           [0.29591717, 0.69673525, 0.67001604, 0.68693823],
           [0.84156563, 0.02039639, 0.83506678, 0.4422977 ],
           [0.39966698, 0.57210768, 0.4434791 , 0.16909073]])

%% Cell type:code id: tags:

``` python
dataset.y # Extracts the labels (y) from the NumpyDataset Object
```

%% Output

    array([0.90696188, 0.45977404, 0.96922696, 0.24167064])

%% Cell type:markdown id: tags:

## Weights of a dataset - w
Apart from `X` and `y`, which are the data and the labels, you can also assign a weight `w` to each data instance. `w` has the same shape as `y`: one entry per data instance, so `N` entries for `N` instances.

**NOTE:** By default `w` is a vector initialized with equal weights (all being 1).
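
One common use of per-sample weights is in a weighted loss. The sketch below is a generic NumPy illustration, not DeepChem code; `weighted_mse` is a hypothetical helper and the arrays are invented. It shows that all-ones weights reduce to the ordinary mean squared error, while up-weighting a sample makes its error count for more.

```python
import numpy as np


def weighted_mse(y_true, y_pred, w):
    """Mean squared error where each sample's contribution is scaled by its weight."""
    return float(np.sum(w * (y_true - y_pred) ** 2) / np.sum(w))


y_true = np.array([1.0, 0.0, 0.0, 0.0])
y_pred = np.array([0.0, 0.0, 0.0, 0.0])  # only the first prediction is wrong

uniform = weighted_mse(y_true, y_pred, np.ones(4))                           # default weights
upweighted = weighted_mse(y_true, y_pred, np.array([3.0, 1.0, 1.0, 1.0]))    # first sample counts 3x
print(uniform, upweighted)
```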

%% Cell type:code id: tags:

``` python
dataset.w # printing the weights that are assigned by default. Notice that they are a vector of 1's
```

%% Output

    array([1., 1., 1., 1.], dtype=float32)

%% Cell type:code id: tags:

``` python
w = np.random.random((4,)) # initializing weights with a random vector of shape (4,)
dataset_with_weights = NumpyDataset(data, labels, w) # creates numpy dataset object
```

%% Cell type:code id: tags:

``` python
dataset_with_weights.w
```

%% Output

    array([0.76645723, 0.44698502, 0.34730918, 0.40243847])

%% Cell type:markdown id: tags:

## Iterating over NumpyDataset
To iterate over a `NumpyDataset`, we use the `itersamples` method. Each iteration yields four quantities: `X`, `y`, `w` and `ids`. The first three are the same as discussed above, and `ids` is the id of the data instance. By default, ids are assigned in order starting from `0`.
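
The iteration pattern can be pictured with a small stand-in generator. This is a sketch of the idea only, assuming default weights of 1 and default ids counting up from 0; `iter_samples` is a hypothetical function, not DeepChem's implementation.

```python
import numpy as np


def iter_samples(X, y, w=None, ids=None):
    """Yield one (x, y, w, id) tuple per data instance."""
    n = len(X)
    w = np.ones(n) if w is None else w          # default: equal weights of 1
    ids = np.arange(n) if ids is None else ids  # default: ids 0, 1, 2, ...
    for i in range(n):
        yield X[i], y[i], w[i], ids[i]


X = np.random.random((4, 4))
y = np.random.random((4,))
samples = list(iter_samples(X, y))
print([sample_id for _, _, _, sample_id in samples])
```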

%% Cell type:code id: tags:

``` python
for x, y, w, id in dataset.itersamples():
    print(x, y, w, id)
```

%% Output

    [0.01109257 0.62277556 0.02058281 0.83395641] 0.9069618791084421 1.0 0
    [0.29591717 0.69673525 0.67001604 0.68693823] 0.45977403789888005 1.0 1
    [0.84156563 0.02039639 0.83506678 0.4422977 ] 0.9692269600648693 1.0 2
    [0.39966698 0.57210768 0.4434791  0.16909073] 0.24167063686752832 1.0 3

%% Cell type:markdown id: tags:

You can also extract the ids by `dataset.ids`. This would return a numpy array consisting of the ids of the data instances.

%% Cell type:code id: tags:

``` python
dataset.ids
```

%% Output

    array([0, 1, 2, 3], dtype=object)

%% Cell type:markdown id: tags:

## MNIST Example
Just to get a better understanding, let's read the MNIST data and use `NumpyDataset` to store it.

%% Cell type:code id: tags:

``` python
from tensorflow.examples.tutorials.mnist import input_data
```

%% Cell type:code id: tags:

``` python
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
```

%% Output

    WARNING:tensorflow:From <ipython-input-13-a839aeb82f4b>:1: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
    WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please write your own downloading logic.
    WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use tf.data to implement this functionality.
    Extracting MNIST_data/train-images-idx3-ubyte.gz
    WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use tf.data to implement this functionality.
    Extracting MNIST_data/train-labels-idx1-ubyte.gz
    WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use tf.one_hot on tensors.
    Extracting MNIST_data/t10k-images-idx3-ubyte.gz
    Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
    WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use alternatives such as official/mnist/dataset.py from tensorflow/models.

%% Cell type:code id: tags:

``` python
# Load the numpy data of MNIST into NumpyDataset
train = NumpyDataset(mnist.train.images, mnist.train.labels)
valid = NumpyDataset(mnist.validation.images, mnist.validation.labels)
```

%% Cell type:code id: tags:

``` python
import matplotlib.pyplot as plt
```

%% Cell type:code id: tags:

``` python
# Visualize one sample
sample = np.reshape(train.X[5], (28, 28))
plt.imshow(sample)
plt.show()
```

%% Output


%% Cell type:markdown id: tags:

## Converting a Numpy Array to tf.data.Dataset


Let's say you want to use the `tf.data` module instead of DeepChem's data handling library. Doing this is straightforward and is quite similar to getting a `NumpyDataset` object from numpy arrays.

%% Cell type:code id: tags:

``` python
import tensorflow as tf
data_small = np.random.random((4,5))
label_small = np.random.random((4,))
dataset = tf.data.Dataset.from_tensor_slices((data_small, label_small))
print("Data\n")
print(data_small)
print("\n Labels")
print(label_small)
```

%% Output

    Data
    
    [[0.78113293 0.01674453 0.48489516 0.69356293 0.91605677]
     [0.70025413 0.66522493 0.03279785 0.1810656  0.34951665]
     [0.8357952  0.68600992 0.19022591 0.6087858  0.61117143]
     [0.02318132 0.85849407 0.31825101 0.83070808 0.13985736]]
    
     Labels
    [0.41737946 0.83331863 0.89246031 0.21424502]

%% Cell type:markdown id: tags:

## Extracting the numpy dataset from tf.data

In order to extract the numpy array from the `tf.data`, you first need to define an `iterator` to iterate over the `tf.data.Dataset` object and then in the tensorflow session, run over the iterator to get the data instances. Let's have a look at how it's done.

%% Cell type:code id: tags:

``` python
iterator = dataset.make_one_shot_iterator() # iterator
next_element = iterator.get_next()
numpy_data = np.zeros((4, 5))
numpy_label = np.zeros((4,))
sess = tf.Session() # tensorflow session
for i in range(4):
    data_, label_ = sess.run(next_element) # data_ contains the data and label_ contains the labels that we fed in the previous step
    numpy_data[i, :] = data_
    numpy_label[i] = label_

print ("Numpy Data")
print(numpy_data)
print ("\n Numpy Label")
print(numpy_label)
```

%% Output

    WARNING:tensorflow:From <ipython-input-18-f67e6d094179>:1: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.
    Numpy Data
    [[0.78113293 0.01674453 0.48489516 0.69356293 0.91605677]
     [0.70025413 0.66522493 0.03279785 0.1810656  0.34951665]
     [0.8357952  0.68600992 0.19022591 0.6087858  0.61117143]
     [0.02318132 0.85849407 0.31825101 0.83070808 0.13985736]]
    
     Numpy Label
    [0.41737946 0.83331863 0.89246031 0.21424502]

%% Cell type:markdown id: tags:

Now that you have the numpy arrays of `data` and `labels`, you can convert them to a `NumpyDataset`.

%% Cell type:code id: tags:

``` python
dataset_ = NumpyDataset(numpy_data, numpy_label) # convert to NumpyDataset
dataset_.X  # printing just to check if the data is same!!
```

%% Output

    array([[0.78113293, 0.01674453, 0.48489516, 0.69356293, 0.91605677],
           [0.70025413, 0.66522493, 0.03279785, 0.1810656 , 0.34951665],
           [0.8357952 , 0.68600992, 0.19022591, 0.6087858 , 0.61117143],
           [0.02318132, 0.85849407, 0.31825101, 0.83070808, 0.13985736]])

%% Cell type:markdown id: tags:

## Converting NumpyDataset to `tf.data`
This can be done easily with the `make_iterator()` method of `NumpyDataset`, which converts the `NumpyDataset` to `tf.data`. Let's look at how it's done! Note that in the output below the rows come back in a different order than they were stored.

%% Cell type:code id: tags:

``` python
iterator_ = dataset_.make_iterator() # Using make_iterator for converting NumpyDataset to tf.data
next_element_ = iterator_.get_next()

sess = tf.Session() # tensorflow session
data_and_labels = sess.run(next_element_) # tuple with the data at index 0 and the labels at index 1


print ("Numpy Data")
print(data_and_labels[0])  # Data in the first index
print ("\n Numpy Label")
print(data_and_labels[1])  # Labels in the second index
```

%% Output

    WARNING:tensorflow:From /Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:494: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    tf.py_func is deprecated in TF V2. Instead, there are two
        options available in V2.
        - tf.py_function takes a python function which manipulates tf eager
        tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
        an ndarray (just call tensor.numpy()) but having access to eager tensors
        means `tf.py_function`s can use accelerators such as GPUs as well as
        being differentiable using a gradient tape.
        - tf.numpy_function maintains the semantics of the deprecated tf.py_func
        (it is not differentiable, and manipulates numpy arrays). It drops the
        stateful argument making all functions stateful.
    
    Numpy Data
    [[0.02318132 0.85849407 0.31825101 0.83070808 0.13985736]
     [0.78113293 0.01674453 0.48489516 0.69356293 0.91605677]
     [0.8357952  0.68600992 0.19022591 0.6087858  0.61117143]
     [0.70025413 0.66522493 0.03279785 0.1810656  0.34951665]]
    
     Numpy Label
    [0.21424502 0.41737946 0.89246031 0.83331863]