added how to convert tf.data to NumpyDataset (91c692b2) · Commits · 钟慕尧 / deepchem

examples/notebooks/Deepchem_NumpyDataset_tutorial.ipynb

+77 −32

		%% Cell type:markdown id: tags:

		# Using Deepchem Datasets
		In this tutorial we look at the dataset classes and methods provided by `deepchem.data.datasets`.

		%% Cell type:code id: tags:

		``` python
		import deepchem as dc
		import numpy as np
		import random
		```

		%% Output

		/home/skand/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
		from ._conv import register_converters as _register_converters

		%% Cell type:markdown id: tags:

		# Using NumpyDatasets
		Use `NumpyDataset` when your data is already held in numpy arrays.

		%% Cell type:code id: tags:

		``` python
		# data is your dataset as a numpy array of shape (4, 4): 4 samples with 4 features each.
		data = np.random.random((4, 4))
		labels = np.random.random((4,)) # one label per sample, shape (4,)
		```

		%% Cell type:code id: tags:

		``` python
		from deepchem.data.datasets import NumpyDataset # import NumpyDataset
		```

		%% Cell type:code id: tags:

		``` python
		dataset = NumpyDataset(data, labels) # creates numpy dataset object
		```

		%% Cell type:markdown id: tags:

		## Extracting X, y from NumpyDataset Object
		Extracting the data and labels from the NumpyDataset is very easy.

		%% Cell type:code id: tags:

		``` python
		dataset.X # Extracts the data (X) from the NumpyDataset Object
		```

		%% Output

		array([[0.77618534, 0.76896038, 0.43433514, 0.69623474],
		[0.23229041, 0.40810229, 0.28852268, 0.83997671],
		[0.11555096, 0.94556341, 0.04440153, 0.49396037],
		[0.71786872, 0.13169183, 0.28161187, 0.789942 ]])

		%% Cell type:code id: tags:

		``` python
		dataset.y # Extracts the labels (y) from the NumpyDataset Object
		```

		%% Output

		array([[0.52468291],
		[0.45188867],
		[0.16465562],
		[0.57194239]])

		%% Cell type:markdown id: tags:

		## Weights of a dataset - w
		Apart from `X` and `y`, which hold the data and the labels, you can also assign a weight `w` to each data instance. `w` has the same shape as `y` (Nx1, where N is the number of data instances).

		NOTE: By default `w` is a vector initialized with equal weights (all being 1).

		%% Cell type:code id: tags:

		``` python
		dataset.w # printing the weights that are assigned by default. Notice that they are a vector of 1's
		```

		%% Output

		array([[1.],
		[1.],
		[1.],
		[1.]])

		%% Cell type:code id: tags:

		``` python
		w = np.random.random((4,)) # one random weight per sample, shape (4,)
		dataset_with_weights = NumpyDataset(data, labels, w) # creates a NumpyDataset with explicit weights
		```

		%% Cell type:code id: tags:

		``` python
		dataset_with_weights.w
		```

		%% Output

		array([[0.48623774],
		[0.45697711],
		[0.73580925],
		[0.17499485]])
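
To make the role of `w` concrete, here is a minimal numpy-only sketch of how per-sample weights can enter a loss. The function `weighted_mse` is a hypothetical helper written for this illustration; it is not part of DeepChem, and DeepChem's own models may apply the weights differently.

```python
import numpy as np

def weighted_mse(y_true, y_pred, w):
    """Mean squared error where each sample's error is scaled by its weight."""
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    w = np.asarray(w, dtype=float).reshape(-1)
    # Normalize by the total weight so that all-ones weights give the plain MSE.
    return float(np.sum(w * (y_true - y_pred) ** 2) / np.sum(w))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

# All-ones weights, like the default shown above: every sample counts equally.
uniform = weighted_mse(y_true, y_pred, np.ones(4))
# A weight of 0 removes the last (badly predicted) sample from the loss.
masked = weighted_mse(y_true, y_pred, np.array([1.0, 1.0, 1.0, 0.0]))
print(uniform, masked)
```

Setting a weight to zero is a common way to mask out a sample with a missing label without changing the array shapes.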

		%% Cell type:markdown id: tags:

		## Iterating over NumpyDataset
		To iterate over a `NumpyDataset`, use the `itersamples` method. Each step yields four quantities: `X`, `y`, `w` and `ids`. The first three are the same as discussed above, and `ids` is the id of the data instance. By default the ids are assigned in order, starting from `0`.

		%% Cell type:code id: tags:

		``` python
		# id_ avoids shadowing the built-in id()
		for x, y, w, id_ in dataset.itersamples():
		    print(x, y, w, id_)
		```

		%% Output

		(array([0.77618534, 0.76896038, 0.43433514, 0.69623474]), array([0.52468291]), array([1.]), 0)
		(array([0.23229041, 0.40810229, 0.28852268, 0.83997671]), array([0.45188867]), array([1.]), 1)
		(array([0.11555096, 0.94556341, 0.04440153, 0.49396037]), array([0.16465562]), array([1.]), 2)
		(array([0.71786872, 0.13169183, 0.28161187, 0.789942 ]), array([0.57194239]), array([1.]), 3)
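
The iteration pattern above can be sketched with plain numpy. The generator `iter_samples` below is a hypothetical stand-in written for this tutorial, not DeepChem's implementation; its defaults (weights of 1, ids counting up from 0) are assumptions chosen to match the output shown above.

```python
import numpy as np

def iter_samples(X, y, w=None, ids=None):
    """Yield one (x, y, w, id) tuple per row (hypothetical helper, not DeepChem code)."""
    n = len(X)
    if w is None:
        w = np.ones(n)      # assumed default: every sample has weight 1
    if ids is None:
        ids = np.arange(n)  # assumed default: ids count up from 0
    for i in range(n):
        yield X[i], y[i], w[i], ids[i]

X = np.arange(8.0).reshape(4, 2)
y = np.array([10.0, 11.0, 12.0, 13.0])
samples = list(iter_samples(X, y))
for x_i, y_i, w_i, id_i in samples:
    print(x_i, y_i, w_i, id_i)
```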

		%% Cell type:markdown id: tags:

		You can also get the ids directly through `dataset.ids`, which returns a numpy array holding the id of every data instance.

		%% Cell type:code id: tags:

		``` python
		dataset.ids
		```

		%% Output

		array([0, 1, 2, 3], dtype=object)

		%% Cell type:markdown id: tags:

		## MNIST Example
		To get a better understanding, let's read the MNIST data and use `NumpyDataset` to store it.

		%% Cell type:code id: tags:

		``` python
		from tensorflow.examples.tutorials.mnist import input_data
		```

		%% Cell type:code id: tags:

		``` python
		mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
		```

		%% Output

		Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
		Extracting MNIST_data/train-images-idx3-ubyte.gz
		Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
		Extracting MNIST_data/train-labels-idx1-ubyte.gz
		Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
		Extracting MNIST_data/t10k-images-idx3-ubyte.gz
		Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
		Extracting MNIST_data/t10k-labels-idx1-ubyte.gz

		%% Cell type:code id: tags:

		``` python
		# Load the numpy data of MNIST into NumpyDataset
		train = NumpyDataset(mnist.train.images, mnist.train.labels)
		valid = NumpyDataset(mnist.validation.images, mnist.validation.labels)
		```
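
The arrays loaded above store each image as a flat vector of 784 pixel values and, because of `one_hot=True`, each label as a 10-element indicator vector. The sketch below uses synthetic stand-in values (not real MNIST data) to show how such a row maps back to a 28x28 image and a digit.

```python
import numpy as np

# Synthetic stand-in for one MNIST row: 784 pixel values in row-major order.
flat_image = np.arange(784, dtype=np.float32) / 783.0
# One-hot label for the digit 7: a 10-element vector with a single 1.
one_hot_label = np.zeros(10)
one_hot_label[7] = 1.0

image = flat_image.reshape(28, 28)     # back to a 2-D grid that an image plot can draw
digit = int(np.argmax(one_hot_label))  # the position of the 1 is the class label

print(image.shape, digit)
```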

		%% Cell type:code id: tags:

		``` python
		import matplotlib.pyplot as plt
		```

		%% Output

		/home/skand/anaconda2/lib/python2.7/site-packages/matplotlib/font_manager.py:281: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
		'Matplotlib is building the font cache using fc-list. '

		%% Cell type:code id: tags:

		``` python
		# Visualize one sample
		sample = np.reshape(train.X[5], (28, 28))
		plt.imshow(sample)
		plt.show()
		```

		%% Output



		%% Cell type:markdown id: tags:

		## Numpy Array to tf.data.Dataset
		This is quite similar to getting a `NumpyDataset` object from numpy arrays.

		%% Cell type:code id: tags:

		``` python
		import tensorflow as tf  # needed for tf.data below

		data_small = np.random.random((4,5))
		label_small = np.random.random((4,))
		dataset = tf.data.Dataset.from_tensor_slices((data_small, label_small))
		print ("Data\n")
		print (data_small)
		print ("\n Labels")
		print (label_small)
		```

		%% Output

		Data

		[[0.23625013 0.95037018 0.45849741 0.0319606 0.86096336]
		[0.86099849 0.85659142 0.59881206 0.08273143 0.59373341]
		[0.45409348 0.2513604 0.78284138 0.70201287 0.6632621 ]
		[0.6320499 0.49423553 0.24832246 0.85058743 0.98125345]]

		Labels
		[0.00552675 0.65700502 0.17774361 0.39469537]

		%% Cell type:markdown id: tags:

		## Extracting the numpy dataset from tf.data
		To get the numpy arrays back out of a `tf.data.Dataset`, first create an `iterator` over the dataset, then evaluate the iterator's next element inside a TensorFlow session, once per data instance. Let's have a look at how it's done.

		%% Cell type:code id: tags:

		``` python
		iterator = dataset.make_one_shot_iterator() # iterator over the tf.data.Dataset
		next_element = iterator.get_next()
		numpy_data = np.zeros((4, 5))
		numpy_label = np.zeros((4,))
		sess = tf.Session() # tensorflow session
		for i in range(4):
		    # data_ and label_ hold one (data, label) pair that we fed in the previous step
		    data_, label_ = sess.run(next_element)
		    numpy_data[i, :] = data_
		    numpy_label[i] = label_

		print ("Numpy Data")
		print(numpy_data)
		print ("\n Numpy Label")
		print(numpy_label)
		```

		%% Output

		Numpy Data
		[[0.23625013 0.95037018 0.45849741 0.0319606 0.86096336]
		[0.86099849 0.85659142 0.59881206 0.08273143 0.59373341]
		[0.45409348 0.2513604 0.78284138 0.70201287 0.6632621 ]
		[0.6320499 0.49423553 0.24832246 0.85058743 0.98125345]]

		Numpy Label
		[0.00552675 0.65700502 0.17774361 0.39469537]
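
The loop above preallocates arrays with a hard-coded number of elements. When the length is not known in advance, one alternative is to collect the elements in lists and stack them at the end. The helper `collect_pairs` below is a hypothetical, numpy-only sketch of that idea; a plain `zip` over two arrays stands in for the elements a session would return, so no TensorFlow is involved.

```python
import numpy as np

def collect_pairs(pairs):
    """Stack an iterable of (features, label) pairs into two numpy arrays."""
    features, labels = [], []
    for x, y in pairs:
        features.append(np.asarray(x))
        labels.append(y)
    return np.stack(features), np.asarray(labels)

# Stand-in for the elements produced by the iterator: one (row, label) per step.
source_data = np.random.random((4, 5))
source_labels = np.random.random((4,))
stacked_data, stacked_labels = collect_pairs(zip(source_data, source_labels))
print(stacked_data.shape, stacked_labels.shape)
```

The stacked arrays can then be passed to `NumpyDataset` in the same way as the preallocated ones.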

		%% Cell type:markdown id: tags:

		Now that you have the `data` and `labels` as numpy arrays, you can wrap them in a `NumpyDataset`.

		%% Cell type:code id: tags:

		``` python
		dataset_ = NumpyDataset(numpy_data, numpy_label) # convert to NumpyDataset
		```
