Merge pull request #1116 from pvskand/master (5fec173e) · Commits · 钟慕尧 / deepchem

examples/notebooks/Deepchem_NumpyDataset_tutorial.ipynb

0 → 100644

+325 −0

		%% Cell type:markdown id: tags:

		# Using Deepchem Datasets
		In this tutorial we will look at the `Dataset` classes and methods that DeepChem provides in `deepchem.data.datasets`.

		%% Cell type:code id: tags:

		``` python
		import deepchem as dc
		import numpy as np
		import random
		```

		%% Output

		/home/skand/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
		from ._conv import register_converters as _register_converters

		%% Cell type:markdown id: tags:

		# Using NumpyDatasets
		`NumpyDataset` is used when your data is already in numpy arrays.

		%% Cell type:code id: tags:

		``` python
		# data is your dataset as a numpy array of shape (4, 4).
		data = np.random.random((4, 4))
		labels = np.random.random((4,)) # labels of shape (4,)
		```

		%% Cell type:code id: tags:

		``` python
		from deepchem.data.datasets import NumpyDataset # import NumpyDataset
		```

		%% Cell type:code id: tags:

		``` python
		dataset = NumpyDataset(data, labels) # creates numpy dataset object
		```

		%% Cell type:markdown id: tags:

		## Extracting X, y from NumpyDataset Object
		Extracting the data and labels from the NumpyDataset is very easy.

		%% Cell type:code id: tags:

		``` python
		dataset.X # Extracts the data (X) from the NumpyDataset Object
		```

		%% Output

		array([[0.63188616, 0.24690483, 0.85294168, 0.15512774],
		[0.62009111, 0.00525149, 0.56082693, 0.0649767 ],
		[0.57476389, 0.92047762, 0.36311505, 0.53421993],
		[0.5768823 , 0.51945064, 0.9655427 , 0.82099216]])

		%% Cell type:code id: tags:

		``` python
		dataset.y # Extracts the labels (y) from the NumpyDataset Object
		```

		%% Output

		array([[0.5102078 ],
		[0.76199464],
		[0.77398379],
		[0.09498917]])
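
%% Cell type:markdown id: tags:

Note that `labels` was created with shape `(4,)`, while `dataset.y` above is printed with shape `(4, 1)`. The following is a minimal NumPy-only sketch of that reshaping. It is not DeepChem's own code; it only illustrates how a 1-D label vector maps to a column vector.

%% Cell type:code id: tags:

```python
import numpy as np

labels = np.random.random((4,))      # 1-D labels, shape (4,)
y_column = labels.reshape(-1, 1)     # column vector, shape (4, 1)

print(labels.shape)    # (4,)
print(y_column.shape)  # (4, 1)

# The values are unchanged; only the shape differs.
assert np.array_equal(y_column[:, 0], labels)
```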

		%% Cell type:markdown id: tags:

		## Weights of a dataset - w
		Apart from `X` and `y`, which are the data and the labels, you can also assign a weight `w` to each data instance. `w` has the same shape as `y` (N x 1, where N is the number of data instances).

		NOTE: By default `w` is a vector initialized with equal weights (all being 1).

		%% Cell type:code id: tags:

		``` python
		dataset.w # printing the weights that are assigned by default. Notice that they are a vector of 1's
		```

		%% Output

		array([[1.],
		[1.],
		[1.],
		[1.]])

		%% Cell type:code id: tags:

		``` python
		w = np.random.random((4,)) # initializing weights with a random vector of shape (4,)
		dataset_with_weights = NumpyDataset(data, labels, w) # creates numpy dataset object with custom weights
		```

		%% Cell type:code id: tags:

		``` python
		dataset_with_weights.w
		```

		%% Output

		array([[0.85432113],
		[0.91847254],
		[0.59774769],
		[0.36659207]])
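
%% Cell type:markdown id: tags:

To give an intuition for what per-sample weights are typically used for, here is a small sketch of a weighted mean squared error in plain NumPy. The helper `weighted_mse` is hypothetical and written only for this illustration; it is not part of DeepChem and may differ from how DeepChem models actually consume `w`.

%% Cell type:code id: tags:

```python
import numpy as np

def weighted_mse(y_true, y_pred, w):
    """Weighted mean squared error (hypothetical helper, not a DeepChem API)."""
    y_true, y_pred, w = (np.asarray(a, dtype=float).reshape(-1)
                         for a in (y_true, y_pred, w))
    return float(np.sum(w * (y_true - y_pred) ** 2) / np.sum(w))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])          # only the last prediction is wrong

uniform = np.ones(4)                             # equal weights, like the default
down_weighted = np.array([1.0, 1.0, 1.0, 0.0])   # weight 0 ignores the last sample

print(weighted_mse(y_true, y_pred, uniform))        # 1.0
print(weighted_mse(y_true, y_pred, down_weighted))  # 0.0
```

A weight of 0 removes a sample's contribution entirely, while equal weights reduce to the ordinary mean squared error.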

		%% Cell type:markdown id: tags:

		## Iterating over NumpyDataset
		To iterate over a `NumpyDataset`, use the `itersamples` method. Each iteration yields four quantities: `X`, `y`, `w` and `ids`. The first three are the same as discussed above, and `ids` is the id of the data instance. By default, ids are assigned in order starting from `0`.

		%% Cell type:code id: tags:

		``` python
		for x, y, w, id in dataset.itersamples():
		    print(x, y, w, id)
		```

		%% Output

		(array([0.63188616, 0.24690483, 0.85294168, 0.15512774]), array([0.5102078]), array([1.]), 0)
		(array([0.62009111, 0.00525149, 0.56082693, 0.0649767 ]), array([0.76199464]), array([1.]), 1)
		(array([0.57476389, 0.92047762, 0.36311505, 0.53421993]), array([0.77398379]), array([1.]), 2)
		(array([0.5768823 , 0.51945064, 0.9655427 , 0.82099216]), array([0.09498917]), array([1.]), 3)

		%% Cell type:markdown id: tags:

		You can also extract the ids by `dataset.ids`. This would return a numpy array consisting of the ids of the data instances.

		%% Cell type:code id: tags:

		``` python
		dataset.ids
		```

		%% Output

		array([0, 1, 2, 3], dtype=object)
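
%% Cell type:markdown id: tags:

The per-sample iteration shown above can be mimicked with plain NumPy. The generator `iter_samples` below is a hypothetical stand-in, not DeepChem's implementation; its defaults (weights of 1, ids counting from 0) are assumptions taken from the outputs printed in this notebook.

%% Cell type:code id: tags:

```python
import numpy as np

def iter_samples(X, y, w=None, ids=None):
    """Yield one (x, y, w, id) tuple per sample (hypothetical stand-in)."""
    n = len(X)
    if w is None:
        w = np.ones(n)        # assumed default: equal weights
    if ids is None:
        ids = np.arange(n)    # assumed default: 0, 1, ..., n-1
    for sample in zip(X, y, w, ids):
        yield sample

X = np.random.random((4, 4))
y = np.random.random((4,))

samples = list(iter_samples(X, y))
for x_i, y_i, w_i, id_i in samples:
    print(id_i, x_i.shape, y_i, w_i)
```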

		%% Cell type:markdown id: tags:

		## MNIST Example
		To get a better understanding, let's read the MNIST data and use `NumpyDataset` to store it.

		%% Cell type:code id: tags:

		``` python
		from tensorflow.examples.tutorials.mnist import input_data
		```

		%% Cell type:code id: tags:

		``` python
		mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
		```

		%% Output

		Extracting MNIST_data/train-images-idx3-ubyte.gz
		Extracting MNIST_data/train-labels-idx1-ubyte.gz
		Extracting MNIST_data/t10k-images-idx3-ubyte.gz
		Extracting MNIST_data/t10k-labels-idx1-ubyte.gz

		%% Cell type:code id: tags:

		``` python
		# Load the numpy data of MNIST into NumpyDataset
		train = NumpyDataset(mnist.train.images, mnist.train.labels)
		valid = NumpyDataset(mnist.validation.images, mnist.validation.labels)
		```
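
%% Cell type:markdown id: tags:

Because the data was read with `one_hot=True`, each label is a one-hot row rather than a single digit, and each image is a flat vector. The sketch below uses synthetic arrays with the same assumed layout (784 values per image, 10 entries per label) to show how to recover the digit and the 28x28 image; it does not touch the real MNIST files.

%% Cell type:code id: tags:

```python
import numpy as np

# Synthetic stand-ins with the assumed MNIST layout:
# flat images of 28*28 = 784 values, one-hot labels of length 10.
images = np.random.random((3, 784))
digits = np.array([7, 0, 3])
one_hot = np.eye(10)[digits]            # shape (3, 10), one 1 per row

decoded = np.argmax(one_hot, axis=1)    # class index of each one-hot row
first_image = images[0].reshape(28, 28) # back to a 2-D image

print(one_hot.shape, decoded, first_image.shape)
```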

		%% Cell type:code id: tags:

		``` python
		import matplotlib.pyplot as plt
		```

		%% Cell type:code id: tags:

		``` python
		# Visualize one sample
		sample = np.reshape(train.X[5], (28, 28))
		plt.imshow(sample)
		plt.show()
		```

		%% Output



		%% Cell type:markdown id: tags:

		## Numpy Array to `tf.data.Dataset`
		This is quite similar to getting a `NumpyDataset` object from numpy arrays.

		%% Cell type:code id: tags:

		``` python
		import tensorflow as tf
		data_small = np.random.random((4,5))
		label_small = np.random.random((4,))
		dataset = tf.data.Dataset.from_tensor_slices((data_small, label_small))
		print ("Data\n")
		print (data_small)
		print ("\n Labels")
		print (label_small)
		```

		%% Output

		Data

		[[0.78574579 0.79398959 0.64737371 0.20447343 0.55009141]
		[0.39201333 0.12299678 0.69700424 0.57494847 0.59895521]
		[0.711899 0.22786574 0.6436164 0.49713391 0.31487844]
		[0.95354154 0.67493395 0.84554228 0.15894518 0.0154379 ]]

		Labels
		[0.61605796 0.07695742 0.1084755 0.30322915]

		%% Cell type:markdown id: tags:

		## Extracting the numpy dataset from tf.data
		To extract numpy arrays from a `tf.data.Dataset`, first create an iterator over the dataset, then evaluate the iterator's `get_next()` op inside a TensorFlow session to fetch one data instance at a time. Let's have a look at how it's done.

		%% Cell type:code id: tags:

		``` python
		iterator = dataset.make_one_shot_iterator() # iterator
		next_element = iterator.get_next()
		numpy_data = np.zeros((4, 5))
		numpy_label = np.zeros((4,))
		sess = tf.Session() # tensorflow session
		for i in range(4):
		    # each run returns one (data, label) pair from the dataset built in the previous step
		    data_, label_ = sess.run(next_element)
		    numpy_data[i, :] = data_
		    numpy_label[i] = label_

		print ("Numpy Data")
		print(numpy_data)
		print ("\n Numpy Label")
		print(numpy_label)
		```

		%% Output

		Numpy Data
		[[0.78574579 0.79398959 0.64737371 0.20447343 0.55009141]
		[0.39201333 0.12299678 0.69700424 0.57494847 0.59895521]
		[0.711899 0.22786574 0.6436164 0.49713391 0.31487844]
		[0.95354154 0.67493395 0.84554228 0.15894518 0.0154379 ]]

		Numpy Label
		[0.61605796 0.07695742 0.1084755 0.30322915]
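
%% Cell type:markdown id: tags:

The loop above preallocates `numpy_data` and `numpy_label` because the number of elements is known. When it is not, one alternative is to collect the elements in lists and stack them afterwards. In this sketch the generator `element_stream` is a hypothetical stand-in for the repeated `sess.run(next_element)` calls, so the example runs without TensorFlow.

%% Cell type:code id: tags:

```python
import numpy as np

def element_stream(data, labels):
    """Hypothetical stand-in for repeated `sess.run(next_element)` calls."""
    for row, label in zip(data, labels):
        yield row, label

data_small = np.random.random((4, 5))
label_small = np.random.random((4,))

rows, labs = [], []
for row, label in element_stream(data_small, label_small):
    rows.append(row)      # one feature vector per element
    labs.append(label)    # one scalar label per element

numpy_data = np.stack(rows)     # shape (4, 5)
numpy_label = np.array(labs)    # shape (4,)

print(numpy_data.shape, numpy_label.shape)
```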

		%% Cell type:markdown id: tags:

		Now that you have the numpy arrays `numpy_data` and `numpy_label`, you can convert them to a `NumpyDataset`.

		%% Cell type:code id: tags:

		``` python
		dataset_ = NumpyDataset(numpy_data, numpy_label) # convert to NumpyDataset
		dataset_.X # print X to check that the data is unchanged
		```

		%% Output

		array([[0.78574579, 0.79398959, 0.64737371, 0.20447343, 0.55009141],
		[0.39201333, 0.12299678, 0.69700424, 0.57494847, 0.59895521],
		[0.711899 , 0.22786574, 0.6436164 , 0.49713391, 0.31487844],
		[0.95354154, 0.67493395, 0.84554228, 0.15894518, 0.0154379 ]])

		%% Cell type:markdown id: tags:

		## Converting NumpyDataset to `tf.data`
		This can be done with the `make_iterator()` method of `NumpyDataset`, which converts the `NumpyDataset` to `tf.data`. Let's look at how it's done!

		%% Cell type:code id: tags:

		``` python
		iterator_ = dataset_.make_iterator() # Using make_iterator for converting NumpyDataset to tf.data
		next_element_ = iterator_.get_next()

		sess = tf.Session() # tensorflow session
		data_and_labels = sess.run(next_element_) # data_and_labels holds the data and the labels fetched from the iterator


		print ("Numpy Data")
		print(data_and_labels[0]) # Data in the first index
		print ("\n Numpy Label")
		print(data_and_labels[1]) # Labels in the second index
		```

		%% Output

		Numpy Data
		[[0.78574579 0.79398959 0.64737371 0.20447343 0.55009141]
		[0.95354154 0.67493395 0.84554228 0.15894518 0.0154379 ]
		[0.711899 0.22786574 0.6436164 0.49713391 0.31487844]
		[0.39201333 0.12299678 0.69700424 0.57494847 0.59895521]]

		Numpy Label
		[[0.61605796]
		[0.30322915]
		[0.1084755 ]
		[0.07695742]]
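
%% Cell type:markdown id: tags:

In the output above the rows come back in a different order than they are stored in `dataset_.X`, yet every label still sits next to its own row. The notebook does not say whether `make_iterator()` shuffles by default, so the following is only a NumPy sketch of the general idea: applying one shared permutation to data and labels keeps the pairs aligned.

%% Cell type:code id: tags:

```python
import numpy as np

rng = np.random.RandomState(0)        # fixed seed so the run is repeatable
X = np.arange(20, dtype=float).reshape(4, 5)        # row i starts with 5*i
y = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)   # label i belongs to row i

perm = rng.permutation(len(X))        # one permutation shared by X and y
X_shuffled, y_shuffled = X[perm], y[perm]

# Each shuffled row still carries its original label.
assert np.array_equal(X_shuffled[:, 0] / 5, y_shuffled[:, 0])
print(perm)
```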

