Commit fea4c40a authored by peastman's avatar peastman

More updates to tutorial sequence

parent 0d3f2285
+0 −1334

File deleted. (Preview size limit exceeded, changes collapsed.)

+25 −28
%% Cell type:markdown id: tags:

# Tutorial Part 11: Putting Multitask Learning to Work

This notebook walks through the creation of multitask models on MUV [1]. The goal is to demonstrate how multitask methods can provide improved performance in situations with little or very unbalanced data.

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/11_Putting_Multitask_Learning_to_Work.ipynb)


## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```

%% Cell type:markdown id: tags:

The MUV dataset is a challenging benchmark in molecular design that consists of 17 different "targets" where there are only a few "active" compounds per target. There are 93,087 compounds in total, yet no task has more than 30 active compounds, and many have even fewer. Training a model with such a small number of positive examples is very challenging.  Multitask models address this by training a single model that predicts all the different targets at once. If a feature is useful for predicting one task, it is often useful for predicting several other tasks as well. Each added task makes it easier to learn important features, which improves performance on other tasks [2].

To get started, let's load the MUV dataset.  The MoleculeNet loader function automatically splits it into training, validation, and test sets.  Because there are so few positive examples, we use stratified splitting to ensure the test set has enough of them to evaluate.
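To see why this matters, here is a minimal sketch of stratified splitting. It uses scikit-learn's `train_test_split` purely for illustration, not DeepChem's own splitter, on toy labels mimicking MUV's imbalance:

``` python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels mimicking MUV's imbalance: 1000 compounds, only 5 actives.
y = np.zeros(1000, dtype=int)
y[:5] = 1
X = np.arange(1000).reshape(-1, 1)

# A purely random 80/20 split could easily leave 0 actives in the test set.
# Stratifying on y forces the split to preserve the class ratio, so the
# test set is guaranteed to contain some positives to evaluate on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(y_train.sum(), y_test.sum())  # 4 1
```

DeepChem's stratified splitter works on the full multitask label matrix rather than a single label vector, but the principle is the same.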

%% Cell type:code id: tags:

``` python
import deepchem as dc
import numpy as np

tasks, datasets, transformers = dc.molnet.load_muv(split='stratified')
train_dataset, valid_dataset, test_dataset = datasets
```

%% Cell type:markdown id: tags:

Now let's train a model on it.  We'll use a MultitaskClassifier, which is a simple stack of fully connected layers.

%% Cell type:code id: tags:

``` python
n_tasks = len(tasks)
n_features = train_dataset.get_data_shape()[0]
model = dc.models.MultitaskClassifier(n_tasks, n_features)
model.fit(train_dataset)
```

%% Output

    0.0005275170505046844
    0.0004961589723825455

%% Cell type:markdown id: tags:

Let's see how well it does on the test set.  We loop over the 17 tasks and compute the ROC AUC for each one.

%% Cell type:code id: tags:

``` python
y_true = test_dataset.y
y_pred = model.predict(test_dataset)
metric = dc.metrics.roc_auc_score
for i in range(n_tasks):
    score = metric(dc.metrics.to_one_hot(y_true[:,i]), y_pred[:,i])
    print(tasks[i], score)
```

%% Output

    MUV-466 0.9207684040838259
    MUV-548 0.7480655561526062
    MUV-600 0.9927995701235895
    MUV-644 0.9974207415368082
    MUV-652 0.7823481998925309
    MUV-689 0.6636843990686011
    MUV-692 0.6319093677234462
    MUV-712 0.7787838079885365
    MUV-713 0.7910711087229088
    MUV-733 0.4401307540748701
    MUV-737 0.34679383843811573
    MUV-810 0.9564571019165323
    MUV-832 0.9991044241447251
    MUV-846 0.7519881783987103
    MUV-852 0.8516747268493642
    MUV-858 0.5906591438294824
    MUV-859 0.5962954008166774

%% Cell type:markdown id: tags:

Not bad!  Recall that random guessing would produce a ROC AUC score of 0.5, and a perfect predictor would score 1.0.  Most of the tasks did much better than random guessing, and many of them are above 0.9.
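As a quick sanity check on what these numbers mean, here is a toy illustration using the standalone scikit-learn `roc_auc_score` metric (shown here just for illustration):

``` python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
print(roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9]))  # 1.0: every active ranked above every inactive
print(roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5]))  # 0.5: no information, equivalent to random guessing
print(roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1]))  # 0.0: a perfectly wrong ranking
```

ROC AUC depends only on how the predictions rank the samples, not on their absolute values, which makes it well suited to heavily imbalanced tasks like these.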

# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

# Bibliography

[1] https://pubs.acs.org/doi/10.1021/ci8002649

[2] https://pubs.acs.org/doi/abs/10.1021/acs.jcim.7b00146

+0 −189
%% Cell type:markdown id: tags:

# Tutorial Part 20: Converting DeepChem models to TensorFlow Estimators

So far, we've walked through a lot of the scientific details of molecular machine learning, but we haven't discussed much about how to use tools like DeepChem in production settings. This tutorial (and the last) focus on those practical matters.

When DeepChem was first created, TensorFlow had no standard interface for datasets or models.  We created the Dataset and Model classes to fill this hole.  More recently, TensorFlow has added the `tf.data` module as a standard interface for datasets, and the `tf.estimator` module as a standard interface for models.  To enable easy interoperability with other tools, we have added features to Dataset and Model to support these new standards. Using the Estimator interface may make it easier to deploy DeepChem models in production environments.

This example demonstrates how to use these features.  Let's begin by loading a dataset and creating a model to analyze it.  We'll use a simple MultitaskClassifier with one hidden layer.

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/20_Converting_DeepChem_Models_to_TensorFlow_Estimators.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

%% Cell type:code id: tags:

``` python
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
```

%% Output

      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  3489  100  3489    0     0  28834      0 --:--:-- --:--:-- --:--:-- 28834

    add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
    all packages is already installed

    # conda environments:
    #
    base                  *  /root/miniconda
    

%% Cell type:code id: tags:

``` python
!pip install --pre deepchem
import deepchem
deepchem.__version__
```

%% Output

    Requirement already satisfied: deepchem in /usr/local/lib/python3.6/dist-packages (2.4.0rc1.dev20200805145942)
    Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.16.0)
    Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.4.1)
    Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.22.2.post1)
    Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.18.5)
    Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.0.5)
    Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2018.9)
    Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2.8.1)
    Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->deepchem) (1.15.0)

    '2.4.0-rc1.dev'

%% Cell type:code id: tags:

``` python
import deepchem as dc
import tensorflow as tf
import numpy as np

tasks, datasets, transformers = dc.molnet.load_tox21(reload=False)
train_dataset, valid_dataset, test_dataset = datasets
n_tasks = len(tasks)
n_features = train_dataset.X.shape[1]

model = dc.models.MultitaskClassifier(n_tasks, n_features, layer_sizes=[1000], dropouts=0.25)
```

%% Output

    smiles_field is deprecated and will be removed in a future version of DeepChem. Use feature_field instead.
    /usr/local/lib/python3.6/dist-packages/deepchem/data/data_loader.py:198: FutureWarning: featurize() is deprecated and has been renamed to create_dataset(). featurize() will be removed in DeepChem 3.0
      FutureWarning)

%% Cell type:markdown id: tags:

We want to train the model using the training set, then evaluate it on the test set.  As our evaluation metric we will use the ROC AUC, averaged over the 12 tasks included in the dataset.  First let's see how to do this with the DeepChem API.

%% Cell type:code id: tags:

``` python
model.fit(train_dataset, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(test_dataset, [metric]))
```

%% Output

    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.
    n_samples is a deprecated argument which is ignored.

    {'mean-roc_auc_score': 0.7669682534913908}

%% Cell type:markdown id: tags:

Simple enough.  Now let's see how to do the same thing with the TensorFlow APIs.  Fair warning: this is going to take a lot more code!

To begin with, TensorFlow doesn't allow a dataset to be passed directly to a model.  Instead, you need to write an "input function" to construct a particular set of tensors and return them in a particular format.  Fortunately, Dataset's `make_iterator()` method provides exactly the tensors we need in the form of a `tf.data.Iterator`.  This allows our input function to be very simple.

%% Cell type:code id: tags:

``` python
def input_fn(dataset, epochs):
    x, y, weights = dataset.make_iterator(batch_size=100, epochs=epochs).get_next()
    return {'x': x, 'weights': weights}, y
```

%% Cell type:markdown id: tags:

Next, you have to use the functions in the `tf.feature_column` module to create an object representing each feature and weight column (but curiously, *not* the label column—don't ask me why!).  These objects describe the data type and shape of each column, and give each one a name.  The names must match the keys in the dict returned by the input function.

%% Cell type:code id: tags:

``` python
x_col = tf.feature_column.numeric_column('x', shape=(n_features,))
weight_col = tf.feature_column.numeric_column('weights', shape=(n_tasks,))
```

%% Cell type:markdown id: tags:

Unlike DeepChem models, which allow arbitrary metrics to be passed to `evaluate()`, estimators require all metrics to be defined up front when you create the estimator.  Unfortunately, TensorFlow doesn't have very good support for multitask models.  It provides an AUC metric, but no easy way to average this metric over tasks.  We therefore must create a separate metric for every task, then define our own metric function to compute the average of them.

%% Cell type:code id: tags:

``` python
def mean_auc(labels, predictions, weights):
    metric_ops = []
    update_ops = []
    for i in range(n_tasks):
        metric, update = tf.metrics.auc(labels[:,i], predictions[:,i], weights[:,i])
        metric_ops.append(metric)
        update_ops.append(update)
    mean_metric = tf.reduce_mean(tf.stack(metric_ops))
    update_all = tf.group(*update_ops)
    return mean_metric, update_all
```

%% Cell type:markdown id: tags:

Now we create our `Estimator` by calling `make_estimator()` on the DeepChem model.  We provide as arguments the objects created above to represent the feature and weight columns, as well as our metric function.

%% Cell type:code id: tags:

``` python
estimator = model.make_estimator(feature_columns=[x_col],
                                 weight_column=weight_col,
                                 metrics={'mean_auc': mean_auc},
                                 model_dir='estimator')

%% Cell type:markdown id: tags:

We are finally ready to train and evaluate it!  Notice how the input function passed to each method is actually a lambda.  This allows us to write a single function, then use it with different datasets and numbers of epochs.

%% Cell type:code id: tags:

``` python
estimator.train(input_fn=lambda: input_fn(train_dataset, 100))
print(estimator.evaluate(input_fn=lambda: input_fn(test_dataset, 1)))
```

%% Cell type:markdown id: tags:

That's a lot of code for something DeepChem can do in three lines.  The TensorFlow API is verbose and somewhat confusing.  It has seemingly arbitrary limitations, like assuming a model will only ever have one output, and therefore only allowing one label.  But for better or worse, it's a standard.

Of course, if you just want to use a DeepChem model with a DeepChem dataset, there is no need for any of this.  Just use the DeepChem API.  But perhaps you want to use a DeepChem dataset with a model that has been implemented as an estimator.  In that case, `Dataset.make_iterator()` allows you to easily do that.  Or perhaps you have higher level workflow code that is written to work with estimators.  In that case, `make_estimator()` allows DeepChem models to easily fit into that workflow.