Merge pull request #1763 from deepchem/expand_tutorial (762b7f58)

examples/notebooks/Estimators.ipynb

+0 −0

examples/notebooks/Large_Scale_Chemical_Screens.ipynb

+193 −11

File changed.

examples/notebooks/Multitask_Networks_on_MUV.ipynb

deleted 100644 → 0

+0 −392

File deleted.

examples/notebooks/Splitters_Tutorial.ipynb

deleted 100644 → 0

+0 −371

		%% Cell type:markdown id: tags:

		# Using Splitters

		%% Cell type:markdown id: tags:

		In this tutorial we look at the various splitters available in the DeepChem library and how each of them can be used.

		%% Cell type:code id: tags:

		``` python
		import deepchem as dc
		import pandas as pd
		import os
		```

		%% Output

		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
		warnings.warn(msg, category=FutureWarning)
		RDKit WARNING: [18:47:51] Enabling RDKit 2019.09.3 jupyter extensions
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint8 = np.dtype([("qint8", np.int8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint16 = np.dtype([("qint16", np.int16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint32 = np.dtype([("qint32", np.int32, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		np_resource = np.dtype([("resource", np.ubyte, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint8 = np.dtype([("qint8", np.int8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint16 = np.dtype([("qint16", np.int16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		_np_qint32 = np.dtype([("qint32", np.int32, 1)])
		/Users/bharath/opt/anaconda3/envs/deepchem/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
		np_resource = np.dtype([("resource", np.ubyte, 1)])

		%% Cell type:markdown id: tags:

		## Index Splitter

		%% Cell type:markdown id: tags:

		We start with the `IndexSplitter`. Its `split` method returns three range objects of indices, sized according to the fractions provided by the user. These can then be used to iterate over the dataset as train, valid and test.


		Every splitter used here inherits two methods from the base class: `train_test_split`, which splits the data into training and test sets, and `train_valid_test_split`, which splits the data into training, validation and test sets.

		Note: All the splitters default to an 80/10/10 split for train, valid and test respectively. This can be changed by passing `frac_train`, `frac_valid` and `frac_test` with the fractions we want.

		%% Cell type:code id: tags:

		``` python
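		# `__file__` is undefined in a notebook, so the literal string '__file__' is
		# resolved against the working directory; its dirname is that directory.
		current_dir = os.path.dirname(os.path.realpath('__file__'))
		input_data = os.path.join(current_dir, '../../deepchem/models/tests/example.csv')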
		```

		%% Cell type:markdown id: tags:

		We then featurize the data. Any of the available featurizers would do; here we use `CircularFingerprint`.

		%% Cell type:code id: tags:

		``` python
		tasks = ['log-solubility']
		# 1024-bit circular fingerprints computed from each SMILES string
		featurizer = dc.feat.CircularFingerprint(size=1024)
		loader = dc.data.CSVLoader(tasks=tasks, smiles_field="smiles", featurizer=featurizer)
		dataset = loader.featurize(input_data)
		```

		%% Output

		Loading raw samples now.
		shard_size: 8192
		About to start loading CSV from /Users/bharath/Code/deepchem/examples/notebooks/../../deepchem/models/tests/example.csv
		Loading shard 1 of size 8192.
		Featurizing sample 0
		TIMING: featurizing shard 0 took 0.024 s
		TIMING: dataset construction took 0.039 s
		Loading dataset from disk.

		%% Cell type:code id: tags:

		``` python
		from deepchem.splits.splitters import IndexSplitter
		```

		%% Cell type:code id: tags:

		``` python
		splitter=IndexSplitter()
		train_data,valid_data,test_data=splitter.split(dataset)
		```

		%% Cell type:code id: tags:

		``` python
		train_data = list(train_data)
		valid_data = list(valid_data)
		test_data = list(test_data)
		```

		%% Cell type:code id: tags:

		``` python
		len(train_data),len(valid_data),len(test_data)
		```

		%% Output

		(8, 1, 1)

		%% Cell type:markdown id: tags:

		As we can see, when no fractions are specified the data is split with the default of 80/10/10.

		When we pass the fractions explicitly, the dataset is split according to our specifications.

		%% Cell type:code id: tags:

		``` python
		train_data,valid_data,test_data=splitter.split(dataset,frac_train=0.7,frac_valid=0.2,frac_test=0.1)
		```

		%% Cell type:code id: tags:

		``` python
		train_data = list(train_data)
		valid_data = list(valid_data)
		test_data = list(test_data)
		```

		%% Cell type:code id: tags:

		``` python
		len(train_data),len(valid_data),len(test_data)
		```

		%% Output

		(7, 2, 1)
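
%% Cell type:markdown id: tags:

To make the fraction arithmetic concrete, here is a minimal plain-Python sketch of an order-based split. It is an illustration only, not DeepChem's implementation: the function name `index_split` and the use of `round` to size the splits are assumptions made for this example.

%% Cell type:code id: tags:

```python
def index_split(n, frac_train=0.8, frac_valid=0.1, frac_test=0.1):
    """Return three contiguous index ranges sized by the given fractions.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    # round() avoids off-by-one sizes caused by floating point error
    n_train = round(frac_train * n)
    n_valid = round(frac_valid * n)
    train = range(0, n_train)
    valid = range(n_train, n_train + n_valid)
    test = range(n_train + n_valid, n)  # whatever is left goes to test
    return train, valid, test


default_split = index_split(10)
custom_split = index_split(10, frac_train=0.7, frac_valid=0.2, frac_test=0.1)
print([len(r) for r in default_split])
print([len(r) for r in custom_split])
```

Because the ranges are contiguous, the first rows always go to training and the last rows to test, so this kind of split only makes sense when the dataset order carries no bias.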

		%% Cell type:markdown id: tags:

		## Specified Splitter

		%% Cell type:markdown id: tags:

		The next splitter in the library is the `SpecifiedSplitter`. It needs a column in the input file that states, for each row, whether that row is used for training, validation or testing.

		%% Cell type:code id: tags:

		``` python
		from deepchem.splits.splitters import SpecifiedSplitter
		# Path relative to the notebook directory
		input_file = '../../deepchem/models/tests/user_specified_example.csv'

		tasks=['log-solubility']
		featurizer=dc.feat.CircularFingerprint(size=1024)
		loader = dc.data.CSVLoader(tasks=tasks, smiles_field="smiles",featurizer=featurizer)
		dataset=loader.featurize(input_file)

		split_field='split'

		splitter=SpecifiedSplitter(input_file,split_field)
		```

		%% Output

		Loading raw samples now.
		shard_size: 8192
		About to start loading CSV from ../../deepchem/models/tests/user_specified_example.csv
		Loading shard 1 of size 8192.
		Featurizing sample 0
		TIMING: featurizing shard 0 took 0.017 s
		TIMING: dataset construction took 0.028 s
		Loading dataset from disk.

		%% Cell type:code id: tags:

		``` python
		train_data,valid_data,test_data=splitter.split(dataset)
		```

		%% Cell type:markdown id: tags:

		When we split the data with the specified splitter, it reads the `split_field` value of each row to decide whether that row is used as training, validation or test data. The user has to label each row as `train`, `valid` or `test` in the `split_field` column.
		Note: The labels are case insensitive.

		%% Cell type:code id: tags:

		``` python
		train_data,valid_data,test_data
		```

		%% Output

		([0, 1, 2, 3, 4, 5], [6, 7], [8, 9])
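
%% Cell type:markdown id: tags:

The row-by-row lookup described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not DeepChem's code: the helper `specified_split` and the CSV contents (including the solubility values) are hypothetical, chosen only so that the labels mix upper and lower case.

%% Cell type:code id: tags:

```python
import csv
import io

# Hypothetical CSV contents; only the 'split' column matters here.
CSV_TEXT = """smiles,log-solubility,split
C,0.1,train
CC,0.2,Train
CCC,0.3,TRAIN
CCO,0.4,train
CCN,0.5,train
CO,0.6,train
CN,0.7,valid
CCCC,0.8,Valid
CCCO,0.9,test
CCCN,1.0,TEST
"""


def specified_split(csv_file, split_field="split"):
    """Group row indices by the lower-cased label in `split_field`."""
    buckets = {"train": [], "valid": [], "test": []}
    for row_index, row in enumerate(csv.DictReader(csv_file)):
        label = row[split_field].strip().lower()  # case-insensitive match
        if label not in buckets:
            raise ValueError(f"unknown split label: {row[split_field]!r}")
        buckets[label].append(row_index)
    return buckets["train"], buckets["valid"], buckets["test"]


train_idx, valid_idx, test_idx = specified_split(io.StringIO(CSV_TEXT))
print(train_idx, valid_idx, test_idx)
```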

		%% Cell type:markdown id: tags:

		## Indice Splitter

		%% Cell type:markdown id: tags:

		Another splitter present in the framework is `IndiceSplitter`. It takes `valid_indices` and `test_indices`, which are lists of the indices of the validation data and test data in the dataset respectively. The remaining indices are used for training.

		%% Cell type:code id: tags:

		``` python
		from deepchem.splits.splitters import IndiceSplitter

		splitter=IndiceSplitter(valid_indices=[7],test_indices=[9])
		```

		%% Cell type:code id: tags:

		``` python
		splitter.split(dataset)
		```

		%% Output

		([0, 1, 2, 3, 4, 5, 6, 8], [7], [9])
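
%% Cell type:markdown id: tags:

The output above shows that the training indices are simply everything not claimed by the validation or test lists. A minimal sketch of that complement logic follows; `indice_split` is a hypothetical helper written for this example, not a DeepChem function.

%% Cell type:code id: tags:

```python
def indice_split(n, valid_indices=(), test_indices=()):
    """Return (train, valid, test) where train is every index not held out."""
    held_out = set(valid_indices) | set(test_indices)
    train = [i for i in range(n) if i not in held_out]
    return train, list(valid_indices), list(test_indices)


result = indice_split(10, valid_indices=[7], test_indices=[9])
print(result)
```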

		%% Cell type:markdown id: tags:

		## RandomGroupSplitter

		%% Cell type:markdown id: tags:

		The `RandomGroupSplitter` splits the data on the basis of groupings.

		An example use case is when there are multiple conformations of the same molecule that share the same topology. The splitter guarantees that the resulting splits preserve groupings, so all members of a group land in the same split.

		Note that it doesn't use dynamic programming or anything similarly sophisticated to make the split sizes match `frac_train`, `frac_valid`, or `frac_test` as closely as possible. It simply permutes the groups themselves. As such, use with caution if the number of elements per group varies significantly.

		The splitter takes one parameter, `groups`: an array-like list of hashables with the same length as the dataset.

		%% Cell type:code id: tags:

		``` python
		from deepchem.splits.splitters import RandomGroupSplitter

		groups = [0, 4, 1, 2, 3, 7, 0, 3, 1, 0]
		solubility_dataset=dc.data.tests.load_solubility_data()


		splitter=RandomGroupSplitter(groups=groups)


		train_idxs, valid_idxs, test_idxs = splitter.split(solubility_dataset)
		```

		%% Output

		Loading raw samples now.
		shard_size: 8192
		About to start loading CSV from /Users/bharath/Code/deepchem/deepchem/data/tests/../../models/tests/example.csv
		Loading shard 1 of size 8192.
		Featurizing sample 0
		TIMING: featurizing shard 0 took 0.017 s
		TIMING: dataset construction took 0.025 s
		Loading dataset from disk.

		%% Cell type:code id: tags:

		``` python
		train_idxs,valid_idxs,test_idxs
		```

		%% Output

		([5, 1, 2, 8, 0, 6, 9], [4, 7], [3])

		%% Cell type:code id: tags:

		``` python
		# Map each split's indices back to the group label of every sample
		train_data = [groups[i] for i in train_idxs]
		valid_data = [groups[i] for i in valid_idxs]
		test_data = [groups[i] for i in test_idxs]
		```

		%% Cell type:code id: tags:

		``` python
		print("Groups present in the training data =",train_data)
		print("Groups present in the validation data = ",valid_data)
		print("Groups present in the testing data = ", test_data)
		```

		%% Output

		Groups present in the training data = [4, 1, 1, 3, 3, 0, 0, 0]
		Groups present in the validation data = [2]
		Groups present in the testing data = [7]

		%% Cell type:markdown id: tags:

		So when the groups are assigned properly, the `RandomGroupSplitter` splits the data accordingly and keeps all members of a group in the same split.
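
%% Cell type:markdown id: tags:

The idea of permuting the groups rather than the samples can be sketched in plain Python. This is a hedged illustration, not DeepChem's implementation: the function `random_group_split`, its `seed` argument, and the choice to apply the fractions to the number of groups are assumptions made for this example. Applying fractions to groups is also why split sizes can drift from the requested fractions when group sizes vary.

%% Cell type:code id: tags:

```python
import random


def random_group_split(groups, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle whole groups, then cut the shuffled group list by fraction."""
    members = {}
    for index, group in enumerate(groups):
        members.setdefault(group, []).append(index)

    group_ids = list(members)
    random.Random(seed).shuffle(group_ids)  # permute groups, not samples

    # Fractions are applied to the number of groups, not the number of samples
    train_cutoff = int(frac_train * len(group_ids))
    valid_cutoff = int((frac_train + frac_valid) * len(group_ids))

    def gather(ids):
        return [i for g in ids for i in members[g]]

    return (gather(group_ids[:train_cutoff]),
            gather(group_ids[train_cutoff:valid_cutoff]),
            gather(group_ids[valid_cutoff:]))


groups = [0, 4, 1, 2, 3, 7, 0, 3, 1, 0]
train_idxs, valid_idxs, test_idxs = random_group_split(groups)
for name, idxs in [("train", train_idxs), ("valid", valid_idxs), ("test", test_idxs)]:
    print(name, sorted({groups[i] for i in idxs}))
```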

		%% Cell type:markdown id: tags:

		## Scaffold Splitter

		%% Cell type:markdown id: tags:

		The `ScaffoldSplitter` splits the data based on the scaffolds of small molecules. It generates a scaffold for each molecule from its SMILES string and then sorts the molecules into scaffold sets, so that molecules sharing a scaffold stay together.

		%% Cell type:code id: tags:

		``` python
		from deepchem.splits.splitters import ScaffoldSplitter
		```

		%% Cell type:code id: tags:

		``` python
		splitter=ScaffoldSplitter()
		solubility_dataset=dc.data.tests.load_solubility_data()
		```

		%% Output

		Loading raw samples now.
		shard_size: 8192
		About to start loading CSV from /Users/bharath/Code/deepchem/deepchem/data/tests/../../models/tests/example.csv
		Loading shard 1 of size 8192.
		Featurizing sample 0
		TIMING: featurizing shard 0 took 0.016 s
		TIMING: dataset construction took 0.025 s
		Loading dataset from disk.

		%% Cell type:code id: tags:

		``` python
		train_data,valid_data,test_data = splitter.split(solubility_dataset,frac_train=0.7,frac_valid=0.2,frac_test=0.1)
		```

		%% Cell type:code id: tags:

		``` python
		len(train_data),len(valid_data),len(test_data)
		```

		%% Output

		(7, 2, 1)
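
%% Cell type:markdown id: tags:

Computing real scaffolds requires RDKit, so the sketch below assumes the scaffold of each molecule has already been computed and only illustrates how scaffold sets might be assigned to splits. The function `scaffold_style_split`, the largest-set-first ordering, and the placeholder keys `"A"` to `"D"` are assumptions for this example; they are not real scaffolds and this is not DeepChem's code.

%% Cell type:code id: tags:

```python
def scaffold_style_split(scaffold_keys, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold sets to train, valid or test, largest sets first."""
    n = len(scaffold_keys)
    scaffold_sets = {}
    for index, key in enumerate(scaffold_keys):
        scaffold_sets.setdefault(key, []).append(index)

    # Largest sets first; ties keep their order of first appearance
    ordered_sets = sorted(scaffold_sets.values(), key=len, reverse=True)

    train_cutoff = round(frac_train * n)
    valid_cutoff = round((frac_train + frac_valid) * n)

    train, valid, test = [], [], []
    for members in ordered_sets:
        if len(train) + len(members) <= train_cutoff:
            train.extend(members)
        elif len(train) + len(valid) + len(members) <= valid_cutoff:
            valid.extend(members)
        else:
            test.extend(members)
    return train, valid, test


# Placeholder scaffold keys for ten molecules (hypothetical)
keys = ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"]
train, valid, test = scaffold_style_split(keys, frac_train=0.7, frac_valid=0.2)
print(len(train), len(valid), len(test))
```

Since a set is never divided between splits, the resulting sizes only approximate the requested fractions when the sets do not fit the cutoffs exactly.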

		%% Cell type:code id: tags:

		``` python
		```

examples/notebooks/Train_a_model_on_MNIST.ipynb

deleted 100644 → 0

+0 −295

File deleted.
