Unverified commit e5f0359e, authored by Bharath Ramsundar, committed by GitHub

Merge pull request #2203 from peastman/tutorials

More updates to tutorial sequence
parents 0d3f2285 dd5dc438
%% Cell type:markdown id: tags:

# Tutorial Part 2: Learning MNIST Digit Classifiers

In the previous tutorial, we learned some basics of how to load data into DeepChem and how to use the basic DeepChem objects to load and manipulate this data. In this tutorial, you'll put those pieces together and learn how to train a basic image classification model in DeepChem. Why learn this material in DeepChem? Image processing is an increasingly important part of AI for the life sciences, so knowing how to train image processing models will be very useful when you move on to some of DeepChem's more advanced features.

The MNIST dataset contains handwritten digits along with their human annotated labels. The learning challenge for this dataset is to train a model that maps the digit image to its true label. MNIST has been a standard benchmark for machine learning for decades at this point.

![MNIST](https://github.com/deepchem/deepchem/blob/master/examples/tutorials/mnist_examples.png?raw=1)

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/02_Learning_MNIST_Digit_Classifiers.ipynb)

## Setup

We recommend running this tutorial on Google Colab. You'll need to run the following cell of installation commands on Colab to get your environment set up. If you'd rather run the tutorial locally, make sure you don't run these commands, since they'll download and install a new Anaconda Python setup.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```

%% Cell type:markdown id: tags:

First let's import the libraries we will be using and load the data (which comes bundled with TensorFlow).

%% Cell type:code id: tags:

``` python
import deepchem as dc
import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Conv2D, Flatten, Dense

# Load MNIST, scale pixel values to [0, 1], and reshape each image to
# (28, 28, 1): height, width, and a single grayscale channel.
mnist = tf.keras.datasets.mnist.load_data(path='mnist.npz')
train_images = mnist[0][0].reshape((-1, 28, 28, 1))/255
valid_images = mnist[1][0].reshape((-1, 28, 28, 1))/255
train = dc.data.NumpyDataset(train_images, mnist[0][1])
valid = dc.data.NumpyDataset(valid_images, mnist[1][1])
```

%% Cell type:markdown id: tags:

Now create the model.  We use two convolutional layers followed by two dense layers.  The final layer outputs ten numbers for each sample.  These correspond to the ten possible digits.

How does the model know how to interpret the output?  That is determined by the loss function.  We specify `SparseSoftmaxCrossEntropy`.  This is a very convenient class that implements a common case:

1. Each label is an integer which is interpreted as a class index (i.e. which of the ten digits this sample is a drawing of).
2. The outputs are passed through a softmax function, and the result is interpreted as a probability distribution over those same classes.

The model learns to produce a large output for the correct class, and small outputs for all other classes.
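
As a quick aside, here is a tiny standalone numpy sketch (not part of the tutorial's DeepChem code, and using made-up logits) of how softmax converts raw outputs into a probability distribution over the ten classes:

%% Cell type:code id: tags:

``` python
import numpy as np

# Hypothetical raw outputs ("logits") for one sample: one number per digit class.
logits = np.array([1.0, 2.0, 0.5, 0.1, 0.0, -1.0, 0.3, 3.0, 0.2, 0.4])

# Softmax: exponentiate, then normalize so the values sum to 1.
probs = np.exp(logits) / np.sum(np.exp(logits))

print(probs.round(3))   # ten probabilities that sum to 1
print(probs.argmax())   # 7: the class with the largest raw output wins
```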

%% Cell type:code id: tags:

``` python
keras_model = tf.keras.Sequential([
    Conv2D(filters=32, kernel_size=5, activation=tf.nn.relu),
    Conv2D(filters=64, kernel_size=5, activation=tf.nn.relu),
    Flatten(),
    Dense(1024, activation=tf.nn.relu),
    Dense(10),
])
model = dc.models.KerasModel(keras_model, dc.models.losses.SparseSoftmaxCrossEntropy())
```

%% Cell type:markdown id: tags:

Fit the model on the training set.

%% Cell type:code id: tags:

``` python
model.fit(train, nb_epoch=2)
```

%% Output

    0.031744494438171386

%% Cell type:markdown id: tags:

Let's see how well it works.  We ask the model to predict the class of every sample in the validation set.  Remember there are ten outputs for each sample.  We use `argmax()` to identify the largest one, which corresponds to the predicted class.

%% Cell type:code id: tags:

``` python
prediction = np.argmax(model.predict_on_batch(valid.X), axis=1)
score = dc.metrics.accuracy_score(prediction, valid.y)
print('Validation set accuracy: ', score)
```

%% Output

    Validation set accuracy:  0.9891

%% Cell type:markdown id: tags:

It gets about 99% of samples correct.  Not too bad for such a simple model!

%% Cell type:markdown id: tags:

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

%% Cell type:markdown id: tags:

# Tutorial Part 11: Putting Multitask Learning to Work

This notebook walks through the creation of multitask models on MUV [1]. The goal is to demonstrate how multitask methods can provide improved performance in situations with little or very unbalanced data.

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/11_Putting_Multitask_Learning_to_Work.ipynb)


## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```

%% Cell type:markdown id: tags:

The MUV dataset is a challenging benchmark in molecular design that consists of 17 different "targets" with only a few "active" compounds per target. There are 93,087 compounds in total, yet no task has more than 30 active compounds, and many have even fewer. Training a model with such a small number of positive examples is very challenging.  Multitask models address this by training a single model that predicts all the different targets at once.  If a feature is useful for predicting one task, it is often useful for predicting several other tasks as well.  Each added task makes it easier to learn important features, which improves performance on the other tasks [2].

To get started, let's load the MUV dataset.  The MoleculeNet loader function automatically splits it into training, validation, and test sets.  Because there are so few positive examples, we use stratified splitting to ensure the test set has enough of them to evaluate.

%% Cell type:code id: tags:

``` python
import deepchem as dc
import numpy as np

tasks, datasets, transformers = dc.molnet.load_muv(split='stratified')
train_dataset, valid_dataset, test_dataset = datasets
```
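
%% Cell type:markdown id: tags:

To see why stratification matters, here is a small self-contained sketch (using made-up labels, not MUV itself) comparing how many positives land in the test set under a purely random split versus a stratified one that splits positives and negatives separately:

%% Cell type:code id: tags:

``` python
import numpy as np

rng = np.random.default_rng(0)
# Made-up labels: 10,000 samples with only 20 positives (imbalanced like MUV).
y = np.zeros(10000, dtype=int)
y[rng.choice(10000, size=20, replace=False)] = 1

# Purely random 80/20 split: the test set may end up with very few positives.
perm = rng.permutation(10000)
random_test = y[perm[:2000]]

# Stratified 80/20 split: take 20% of positives and 20% of negatives separately.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
strat_test = y[np.concatenate([pos[:4], neg[:1996]])]

print('random split positives in test:    ', random_test.sum())
print('stratified split positives in test:', strat_test.sum())  # always 4
```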

%% Cell type:markdown id: tags:

Now let's train a model on it.  We'll use a MultitaskClassifier, which is a simple stack of fully connected layers.

%% Cell type:code id: tags:

``` python
n_tasks = len(tasks)
n_features = train_dataset.get_data_shape()[0]
model = dc.models.MultitaskClassifier(n_tasks, n_features)
model.fit(train_dataset)
```

%% Output

    0.0004961589723825455

%% Cell type:markdown id: tags:

Let's see how well it does on the test set.  We loop over the 17 tasks and compute the ROC AUC for each one.

%% Cell type:code id: tags:

``` python
y_true = test_dataset.y
y_pred = model.predict(test_dataset)
metric = dc.metrics.roc_auc_score
for i in range(n_tasks):
    score = metric(dc.metrics.to_one_hot(y_true[:,i]), y_pred[:,i])
    print(tasks[i], score)
```

%% Output

    MUV-466 0.9207684040838259
    MUV-548 0.7480655561526062
    MUV-600 0.9927995701235895
    MUV-644 0.9974207415368082
    MUV-652 0.7823481998925309
    MUV-689 0.6636843990686011
    MUV-692 0.6319093677234462
    MUV-712 0.7787838079885365
    MUV-713 0.7910711087229088
    MUV-733 0.4401307540748701
    MUV-737 0.34679383843811573
    MUV-810 0.9564571019165323
    MUV-832 0.9991044241447251
    MUV-846 0.7519881783987103
    MUV-852 0.8516747268493642
    MUV-858 0.5906591438294824
    MUV-859 0.5962954008166774

%% Cell type:markdown id: tags:

Not bad!  Recall that random guessing would produce a ROC AUC score of 0.5, and a perfect predictor would score 1.0.  Most of the tasks did much better than random guessing, and many of them are above 0.9.
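
As a sanity check on those baselines (a standalone sketch, assuming scikit-learn is installed; not part of the tutorial's DeepChem code), you can verify that uninformative random scores give an ROC AUC near 0.5 while scores that match the labels give 1.0:

%% Cell type:code id: tags:

``` python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=100000)  # random binary labels
y_random = rng.random(100000)             # scores carrying no information
y_perfect = y_true.astype(float)          # scores that match the labels exactly

print('random predictor AUC: ', roc_auc_score(y_true, y_random))   # ~0.5
print('perfect predictor AUC:', roc_auc_score(y_true, y_perfect))  # 1.0
```

%% Cell type:markdown id: tags: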

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

# Bibliography

[1] https://pubs.acs.org/doi/10.1021/ci8002649

[2] https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00146
