Commit 91c692b2 authored by pvskand

added how to convert tf.data to NumpyDataset

parent 9c191325
%% Cell type:markdown id: tags:

# Using Deepchem Datasets
In this tutorial we will look at the various deepchem `Dataset` classes and methods available in `deepchem.data.datasets`.

%% Cell type:code id: tags:

``` python
import deepchem as dc
import numpy as np
import random
```

%% Output

    /home/skand/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
      from ._conv import register_converters as _register_converters

%% Cell type:markdown id: tags:

# Using NumpyDatasets
`NumpyDataset` is used when your data is already in numpy arrays.

%% Cell type:code id: tags:

``` python
# data is your dataset as a numpy array of size 4x4 (4 samples, 4 features each).
data = np.random.random((4, 4))
labels = np.random.random((4,)) # one label per sample (length 4)
```

%% Cell type:code id: tags:

``` python
from deepchem.data.datasets import NumpyDataset # import NumpyDataset
```

%% Cell type:code id: tags:

``` python
dataset = NumpyDataset(data, labels) # creates numpy dataset object
```

%% Cell type:markdown id: tags:

## Extracting X, y from NumpyDataset Object
Extracting the data and labels from the NumpyDataset is very easy.

%% Cell type:code id: tags:

``` python
dataset.X # Extracts the data (X) from the NumpyDataset Object
```

%% Output

    array([[0.77618534, 0.76896038, 0.43433514, 0.69623474],
           [0.23229041, 0.40810229, 0.28852268, 0.83997671],
           [0.11555096, 0.94556341, 0.04440153, 0.49396037],
           [0.71786872, 0.13169183, 0.28161187, 0.789942  ]])

%% Cell type:code id: tags:

``` python
dataset.y # Extracts the labels (y) from the NumpyDataset Object
```

%% Output

    array([[0.52468291],
           [0.45188867],
           [0.16465562],
           [0.57194239]])

%% Cell type:markdown id: tags:

## Weights of a dataset - w
Apart from `X` and `y`, which are the data and the labels, you can also assign a weight `w` to each data instance. `w` has the same shape as `y` (Nx1, where N is the number of data instances).

**NOTE:** By default `w` is a vector initialized with equal weights (all being 1).

%% Cell type:code id: tags:

``` python
dataset.w # printing the weights that are assigned by default. Notice that they are a vector of 1's
```

%% Output

    array([[1.],
           [1.],
           [1.],
           [1.]])

%% Cell type:code id: tags:

``` python
w = np.random.random((4,)) # initializing weights with a random vector, one weight per sample (length 4)
dataset_with_weights = NumpyDataset(data, labels, w) # creates numpy dataset object with custom weights
```

%% Cell type:code id: tags:

``` python
dataset_with_weights.w
```

%% Output

    array([[0.48623774],
           [0.45697711],
           [0.73580925],
           [0.17499485]])
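
%% Cell type:markdown id: tags:

To make the role of `w` concrete, here is a minimal sketch of one common use of per-sample weights: scaling each sample's contribution to a loss. It uses plain numpy only. `weighted_mse` is a hypothetical helper written for this illustration, not a DeepChem function, and DeepChem's own models may apply weights differently.

%% Cell type:code id: tags:

```python
import numpy as np

def weighted_mse(y_true, y_pred, w):
    """Weighted mean squared error: sum(w * err**2) / sum(w).

    Hypothetical helper for illustration; not part of DeepChem.
    """
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    w = np.asarray(w, dtype=float).reshape(-1)
    return float(np.sum(w * (y_true - y_pred) ** 2) / np.sum(w))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 0.0])  # only the last prediction is wrong

# Equal weights of 1 (like the default `w`): every sample counts the same.
uniform = weighted_mse(y_true, y_pred, np.ones(4))
# A weight of 0 removes the last sample from the loss entirely.
masked = weighted_mse(y_true, y_pred, [1.0, 1.0, 1.0, 0.0])
print(uniform, masked)
```

With equal weights the single bad prediction dominates the loss. Setting its weight to 0 excludes it, which is why weights are often used to mask missing labels or to rebalance classes.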

%% Cell type:markdown id: tags:

## Iterating over NumpyDataset
To iterate over a `NumpyDataset`, use the `itersamples` method. Each iteration yields four quantities: `X`, `y`, `w` and `ids`. The first three are the same as discussed above, and `ids` is the id of the data instance. By default, ids are assigned in order starting from `0`.

%% Cell type:code id: tags:

``` python
for x, y, w, id in dataset.itersamples():
    print(x, y, w, id)
```

%% Output

    (array([0.77618534, 0.76896038, 0.43433514, 0.69623474]), array([0.52468291]), array([1.]), 0)
    (array([0.23229041, 0.40810229, 0.28852268, 0.83997671]), array([0.45188867]), array([1.]), 1)
    (array([0.11555096, 0.94556341, 0.04440153, 0.49396037]), array([0.16465562]), array([1.]), 2)
    (array([0.71786872, 0.13169183, 0.28161187, 0.789942  ]), array([0.57194239]), array([1.]), 3)
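
%% Cell type:markdown id: tags:

Conceptually, sample-wise iteration walks the four arrays in lockstep and yields one row from each. The numpy-only sketch below mimics that behaviour. `iter_samples` is a hypothetical stand-in written for this illustration and is not DeepChem's implementation.

%% Cell type:code id: tags:

```python
import numpy as np

def iter_samples(X, y, w, ids):
    """Yield (X[i], y[i], w[i], ids[i]) for each sample i.

    Hypothetical stand-in for illustration; not DeepChem code.
    """
    for i in range(len(X)):
        yield X[i], y[i], w[i], ids[i]

X_demo = np.arange(8, dtype=float).reshape(4, 2)  # 4 samples, 2 features
y_demo = np.arange(4, dtype=float).reshape(4, 1)  # labels, shape Nx1
w_demo = np.ones((4, 1))                          # weights of 1, shape Nx1
ids_demo = np.arange(4)                           # ids 0, 1, 2, 3

rows = list(iter_samples(X_demo, y_demo, w_demo, ids_demo))
for x_i, y_i, w_i, id_i in rows:
    print(x_i, y_i, w_i, id_i)
```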

%% Cell type:markdown id: tags:

You can also extract the ids with `dataset.ids`, which returns a numpy array of the ids of the data instances.

%% Cell type:code id: tags:

``` python
dataset.ids
```

%% Output

    array([0, 1, 2, 3], dtype=object)

%% Cell type:markdown id: tags:

## MNIST Example
To get a better understanding, let's read the MNIST data and use `NumpyDataset` to store it.

%% Cell type:code id: tags:

``` python
from tensorflow.examples.tutorials.mnist import input_data
```

%% Cell type:code id: tags:

``` python
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
```

%% Output

    Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
    Extracting MNIST_data/train-images-idx3-ubyte.gz
    Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
    Extracting MNIST_data/train-labels-idx1-ubyte.gz
    Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
    Extracting MNIST_data/t10k-images-idx3-ubyte.gz
    Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
    Extracting MNIST_data/t10k-labels-idx1-ubyte.gz

%% Cell type:code id: tags:

``` python
# Load the numpy data of MNIST into NumpyDataset
train = NumpyDataset(mnist.train.images, mnist.train.labels)
valid = NumpyDataset(mnist.validation.images, mnist.validation.labels)
```

%% Cell type:code id: tags:

``` python
import matplotlib.pyplot as plt
```

%% Output

    /home/skand/anaconda2/lib/python2.7/site-packages/matplotlib/font_manager.py:281: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
      'Matplotlib is building the font cache using fc-list. '

%% Cell type:code id: tags:

``` python
# Visualize one sample
sample = np.reshape(train.X[5], (28, 28))
plt.imshow(sample)
plt.show()
```

%% Output


%% Cell type:markdown id: tags:

## Numpy Array to tf.data.Dataset
This is quite similar to getting a `NumpyDataset` object from numpy arrays.

%% Cell type:code id: tags:

``` python
import tensorflow as tf

data_small = np.random.random((4, 5))
label_small = np.random.random((4,))
# Each element of the dataset is one (row of data_small, label) pair
dataset = tf.data.Dataset.from_tensor_slices((data_small, label_small))
print("Data\n")
print(data_small)
print("\n Labels")
print(label_small)

%% Output

    Data
    
    [[0.23625013 0.95037018 0.45849741 0.0319606  0.86096336]
     [0.86099849 0.85659142 0.59881206 0.08273143 0.59373341]
     [0.45409348 0.2513604  0.78284138 0.70201287 0.6632621 ]
     [0.6320499  0.49423553 0.24832246 0.85058743 0.98125345]]
    
     Labels
    [0.00552675 0.65700502 0.17774361 0.39469537]

%% Cell type:markdown id: tags:

## Extracting the numpy dataset from tf.data
To extract numpy arrays from a `tf.data.Dataset`, first define an `iterator` over the dataset object, then evaluate the iterator's next element repeatedly inside a tensorflow session to get the data instances one at a time. Let's have a look at how it's done.

%% Cell type:code id: tags:

``` python
iterator = dataset.make_one_shot_iterator()  # iterates once over the tf.data.Dataset
next_element = iterator.get_next()  # tensors that yield the next (data, label) pair
numpy_data = np.zeros((4, 5))
numpy_label = np.zeros((4,))
sess = tf.Session()  # tensorflow session
for i in range(4):
    # data_ holds one row of data and label_ its label, as fed in the previous step
    data_, label_ = sess.run(next_element)
    numpy_data[i, :] = data_
    numpy_label[i] = label_
sess.close()  # release the session's resources

print("Numpy Data")
print(numpy_data)
print("\n Numpy Label")
print(numpy_label)
```

%% Output

    Numpy Data
    [[0.23625013 0.95037018 0.45849741 0.0319606  0.86096336]
     [0.86099849 0.85659142 0.59881206 0.08273143 0.59373341]
     [0.45409348 0.2513604  0.78284138 0.70201287 0.6632621 ]
     [0.6320499  0.49423553 0.24832246 0.85058743 0.98125345]]
    
     Numpy Label
    [0.00552675 0.65700502 0.17774361 0.39469537]
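
%% Cell type:markdown id: tags:

The loop above preallocates `numpy_data` and `numpy_label`, so it needs to know the number of elements (4) in advance. When the size is not known, one option is to collect the elements in lists and stack them at the end. The sketch below shows that pattern with numpy only. `pairs_to_numpy` is a hypothetical helper, and a plain Python generator stands in for the values that `sess.run(next_element)` would return.

%% Cell type:code id: tags:

```python
import numpy as np

def pairs_to_numpy(pairs):
    """Stack an iterable of (features, label) pairs into two arrays.

    Hypothetical helper for illustration; it only assumes that each
    element is a (features, label) pair.
    """
    features, labels = [], []
    for x, label in pairs:
        features.append(np.asarray(x))
        labels.append(label)
    return np.stack(features), np.asarray(labels)

# Stand-in for the tf.data iterator: a generator of (row, label) pairs.
source_data = np.arange(20, dtype=float).reshape(4, 5)
source_labels = np.array([0.0, 1.0, 0.0, 1.0])
pairs = ((source_data[i], source_labels[i]) for i in range(4))

collected_data, collected_labels = pairs_to_numpy(pairs)
print(collected_data.shape, collected_labels.shape)
```

The resulting arrays can be passed to `NumpyDataset` in the same way as `numpy_data` and `numpy_label`.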

%% Cell type:markdown id: tags:

Now that you have the numpy arrays of `data` and `labels`, you can convert them to a `NumpyDataset`.

%% Cell type:code id: tags:

``` python
dataset_ = NumpyDataset(numpy_data, numpy_label) # convert to NumpyDataset
```