Merge pull request #1136 from peastman/tutorial (38e109de) · Commits · 钟慕尧 / deepchem

examples/notebooks/Estimators.ipynb

0 → 100644

+262 −0

Original line number	Diff line number	Diff line
		%% Cell type:markdown id: tags:

		Using DeepChem with Tensorflow Data and Estimators
		-----------------------------------------------

		When DeepChem was first created, Tensorflow had no standard interface for datasets or models. We created the Dataset and Model classes to fill this hole. More recently, Tensorflow has added the `tf.data` module as a standard interface for datasets, and the `tf.estimator` module as a standard interface for models. To enable easy interoperability with other tools, we have added features to Dataset and Model to support these new standards.

		This example demonstrates how to use these features. Let's begin by loading a dataset and creating a model to analyze it. We'll use a simple MultiTaskClassifier with one hidden layer.

		%% Cell type:code id: tags:

		``` python
		import deepchem as dc
		import tensorflow as tf
		import numpy as np

		tasks, datasets, transformers = dc.molnet.load_tox21()
		train_dataset, valid_dataset, test_dataset = datasets
		n_tasks = len(tasks)
		n_features = train_dataset.X.shape[1]

		model = dc.models.MultiTaskClassifier(n_tasks, n_features, layer_sizes=[1000], dropouts=0.25)
		```

		%% Output

		Loading dataset from disk.
		Loading dataset from disk.
		Loading dataset from disk.

		%% Cell type:markdown id: tags:

		We want to train the model using the training set, then evaluate it on the test set. As our evaluation metric we will use the ROC AUC, averaged over the 12 tasks included in the dataset. First let's see how to do this with the DeepChem API.

		%% Cell type:code id: tags:

		``` python
		model.fit(train_dataset, nb_epoch=100)
		metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
		print(model.evaluate(test_dataset, [metric]))
		```

		%% Output

		computed_metrics: [0.7081719239992621, 0.865129545326814, 0.836842947009276, 0.7701224617961064, 0.7081635765485087, 0.7911911911911912, 0.7207671300893743, 0.6592307518932563, 0.7976869777868352, 0.7409154581410679, 0.8243317675424011, 0.7112435328898743]
		{'mean-roc_auc_score': 0.7611497720178306}

		%% Cell type:markdown id: tags:

		Simple enough. Now let's see how to do the same thing with the Tensorflow APIs. Fair warning: this is going to take a lot more code!

		To begin with, Tensorflow doesn't allow a dataset to be passed directly to a model. Instead, you need to write an "input function" to construct a particular set of tensors and return them in a particular format. Fortunately, Dataset's `make_iterator()` method provides exactly the tensors we need in the form of a `tf.data.Iterator`. This allows our input function to be very simple.

		%% Cell type:code id: tags:

		``` python
		def input_fn(dataset, epochs):
		x, y, weights = dataset.make_iterator(batch_size=100, epochs=epochs).get_next()
		return {'x': x, 'weights': weights}, y
		```

		%% Cell type:markdown id: tags:

		Next, you have to use the functions in the `tf.feature_column` module to create an object representing each feature and weight column (but curiously, not the label column—don't ask me why!). These objects describe the data type and shape of each column, and give each one a name. The names must match the keys in the dict returned by the input function.

		%% Cell type:code id: tags:

		``` python
		x_col = tf.feature_column.numeric_column('x', shape=(n_features,))
		weight_col = tf.feature_column.numeric_column('weights', shape=(n_tasks,))
		```

		%% Cell type:markdown id: tags:

		Unlike DeepChem models, which allow arbitrary metrics to be passed to `evaluate()`, estimators require all metrics to be defined up front when you create the estimator. Unfortunately, Tensorflow doesn't have very good support for multitask models. It provides an AUC metric, but no easy way to average this metric over tasks. We therefore must create a separate metric for every task, then define our own metric function to compute the average of them.

		%% Cell type:code id: tags:

		``` python
		def mean_auc(labels, predictions, weights):
		metric_ops = []
		update_ops = []
		for i in range(n_tasks):
		metric, update = tf.metrics.auc(labels[:,i], predictions[:,i], weights[:,i])
		metric_ops.append(metric)
		update_ops.append(update)
		mean_metric = tf.reduce_mean(tf.stack(metric_ops))
		update_all = tf.group(*update_ops)
		return mean_metric, update_all
		```

		%% Cell type:markdown id: tags:

		Now we create our `Estimator` by calling `make_estimator()` on the DeepChem model. We provide as arguments the objects created above to represent the feature and weight columns, as well as our metric function.

		%% Cell type:code id: tags:

		``` python
		estimator = model.make_estimator(feature_columns=[x_col],
		weight_column=weight_col,
		metrics={'mean_auc': mean_auc},
		model_dir='estimator')
		```

		%% Output

		INFO:tensorflow:Using default config.
		INFO:tensorflow:Using config: {'_model_dir': 'estimator', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x12d39bef0>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

		%% Cell type:markdown id: tags:

		We are finally ready to train and evaluate it! Notice how the input function passed to each method is actually a lambda. This allows us to write a single function, then use it with different datasets and numbers of epochs.

		%% Cell type:code id: tags:

		``` python
		estimator.train(input_fn=lambda: input_fn(train_dataset, 100))
		print(estimator.evaluate(input_fn=lambda: input_fn(test_dataset, 1)))
		```

		%% Output

		INFO:tensorflow:Create CheckpointSaverHook.
		INFO:tensorflow:Saving checkpoints for 1 into estimator/model.ckpt.
		INFO:tensorflow:loss = 1716.8385, step = 1
		INFO:tensorflow:global_step/sec: 63.5117
		INFO:tensorflow:loss = 804.13416, step = 101 (1.576 sec)
		INFO:tensorflow:global_step/sec: 41.9688
		INFO:tensorflow:loss = 682.90265, step = 201 (2.383 sec)
		INFO:tensorflow:global_step/sec: 65.7841
		INFO:tensorflow:loss = 805.00336, step = 301 (1.520 sec)
		INFO:tensorflow:global_step/sec: 42.4989
		INFO:tensorflow:loss = 466.05975, step = 401 (2.353 sec)
		INFO:tensorflow:global_step/sec: 65.1989
		INFO:tensorflow:loss = 704.69446, step = 501 (1.534 sec)
		INFO:tensorflow:global_step/sec: 40.9395
		INFO:tensorflow:loss = 450.0899, step = 601 (2.443 sec)
		INFO:tensorflow:global_step/sec: 41.2419
		INFO:tensorflow:loss = 349.05804, step = 701 (2.424 sec)
		INFO:tensorflow:global_step/sec: 60.7104
		INFO:tensorflow:loss = 338.63837, step = 801 (1.647 sec)
		INFO:tensorflow:global_step/sec: 40.4742
		INFO:tensorflow:loss = 351.22452, step = 901 (2.471 sec)
		INFO:tensorflow:global_step/sec: 63.2702
		INFO:tensorflow:loss = 325.0889, step = 1001 (1.581 sec)
		INFO:tensorflow:global_step/sec: 42.8089
		INFO:tensorflow:loss = 334.04944, step = 1101 (2.336 sec)
		INFO:tensorflow:global_step/sec: 41.0803
		INFO:tensorflow:loss = 299.88806, step = 1201 (2.434 sec)
		INFO:tensorflow:global_step/sec: 62.1056
		INFO:tensorflow:loss = 301.3775, step = 1301 (1.610 sec)
		INFO:tensorflow:global_step/sec: 41.056
		INFO:tensorflow:loss = 345.18347, step = 1401 (2.436 sec)
		INFO:tensorflow:global_step/sec: 61.8764
		INFO:tensorflow:loss = 220.67397, step = 1501 (1.616 sec)
		INFO:tensorflow:global_step/sec: 39.5263
		INFO:tensorflow:loss = 232.79745, step = 1601 (2.529 sec)
		INFO:tensorflow:global_step/sec: 63.3011
		INFO:tensorflow:loss = 185.26181, step = 1701 (1.580 sec)
		INFO:tensorflow:global_step/sec: 40.9611
		INFO:tensorflow:loss = 188.3253, step = 1801 (2.441 sec)
		INFO:tensorflow:global_step/sec: 41.6845
		INFO:tensorflow:loss = 190.70108, step = 1901 (2.399 sec)
		INFO:tensorflow:global_step/sec: 64.629
		INFO:tensorflow:loss = 162.53293, step = 2001 (1.547 sec)
		INFO:tensorflow:global_step/sec: 42.2537
		INFO:tensorflow:loss = 161.35915, step = 2101 (2.367 sec)
		INFO:tensorflow:global_step/sec: 64.5807
		INFO:tensorflow:loss = 145.4078, step = 2201 (1.548 sec)
		INFO:tensorflow:global_step/sec: 41.9227
		INFO:tensorflow:loss = 120.111115, step = 2301 (2.385 sec)
		INFO:tensorflow:global_step/sec: 42.6787
		INFO:tensorflow:loss = 151.61371, step = 2401 (2.343 sec)
		INFO:tensorflow:global_step/sec: 63.3021
		INFO:tensorflow:loss = 136.98262, step = 2501 (1.580 sec)
		INFO:tensorflow:global_step/sec: 42.6588
		INFO:tensorflow:loss = 89.13097, step = 2601 (2.344 sec)
		INFO:tensorflow:global_step/sec: 64.7405
		INFO:tensorflow:loss = 101.06474, step = 2701 (1.545 sec)
		INFO:tensorflow:global_step/sec: 43.2237
		INFO:tensorflow:loss = 116.96815, step = 2801 (2.314 sec)
		INFO:tensorflow:global_step/sec: 43.4182
		INFO:tensorflow:loss = 84.83482, step = 2901 (2.303 sec)
		INFO:tensorflow:global_step/sec: 65.1042
		INFO:tensorflow:loss = 145.16194, step = 3001 (1.536 sec)
		INFO:tensorflow:global_step/sec: 42.3864
		INFO:tensorflow:loss = 92.99321, step = 3101 (2.359 sec)
		INFO:tensorflow:global_step/sec: 64.7556
		INFO:tensorflow:loss = 65.05712, step = 3201 (1.544 sec)
		INFO:tensorflow:global_step/sec: 42.6498
		INFO:tensorflow:loss = 78.92055, step = 3301 (2.345 sec)
		INFO:tensorflow:global_step/sec: 65.4527
		INFO:tensorflow:loss = 77.93735, step = 3401 (1.528 sec)
		INFO:tensorflow:global_step/sec: 42.8958
		INFO:tensorflow:loss = 57.02035, step = 3501 (2.332 sec)
		INFO:tensorflow:global_step/sec: 43.3849
		INFO:tensorflow:loss = 95.91443, step = 3601 (2.305 sec)
		INFO:tensorflow:global_step/sec: 65.1448
		INFO:tensorflow:loss = 75.03122, step = 3701 (1.535 sec)
		INFO:tensorflow:global_step/sec: 42.6941
		INFO:tensorflow:loss = 62.8435, step = 3801 (2.342 sec)
		INFO:tensorflow:global_step/sec: 65.2233
		INFO:tensorflow:loss = 45.883224, step = 3901 (1.533 sec)
		INFO:tensorflow:global_step/sec: 43.4815
		INFO:tensorflow:loss = 57.56656, step = 4001 (2.300 sec)
		INFO:tensorflow:global_step/sec: 41.5674
		INFO:tensorflow:loss = 70.33858, step = 4101 (2.406 sec)
		INFO:tensorflow:global_step/sec: 58.5978
		INFO:tensorflow:loss = 67.34745, step = 4201 (1.707 sec)
		INFO:tensorflow:global_step/sec: 39.8156
		INFO:tensorflow:loss = 46.03079, step = 4301 (2.511 sec)
		INFO:tensorflow:global_step/sec: 60.5059
		INFO:tensorflow:loss = 40.959454, step = 4401 (1.653 sec)
		INFO:tensorflow:global_step/sec: 41.7228
		INFO:tensorflow:loss = 36.393044, step = 4501 (2.397 sec)
		INFO:tensorflow:global_step/sec: 42.4976
		INFO:tensorflow:loss = 46.14415, step = 4601 (2.353 sec)
		INFO:tensorflow:global_step/sec: 66.1396
		INFO:tensorflow:loss = 41.93784, step = 4701 (1.512 sec)
		INFO:tensorflow:global_step/sec: 42.5402
		INFO:tensorflow:loss = 29.39001, step = 4801 (2.351 sec)
		INFO:tensorflow:global_step/sec: 65.0227
		INFO:tensorflow:loss = 29.608704, step = 4901 (1.538 sec)
		INFO:tensorflow:global_step/sec: 43.692
		INFO:tensorflow:loss = 43.265915, step = 5001 (2.289 sec)
		INFO:tensorflow:global_step/sec: 65.8827
		INFO:tensorflow:loss = 41.69668, step = 5101 (1.518 sec)
		INFO:tensorflow:global_step/sec: 42.4384
		INFO:tensorflow:loss = 28.208687, step = 5201 (2.356 sec)
		INFO:tensorflow:global_step/sec: 42.4864
		INFO:tensorflow:loss = 34.643417, step = 5301 (2.354 sec)
		INFO:tensorflow:global_step/sec: 66.223
		INFO:tensorflow:loss = 46.616447, step = 5401 (1.510 sec)
		INFO:tensorflow:global_step/sec: 42.0575
		INFO:tensorflow:loss = 42.339645, step = 5501 (2.378 sec)
		INFO:tensorflow:global_step/sec: 65.1812
		INFO:tensorflow:loss = 94.012146, step = 5601 (1.534 sec)
		INFO:tensorflow:global_step/sec: 43.1405
		INFO:tensorflow:loss = 25.879742, step = 5701 (2.318 sec)
		INFO:tensorflow:global_step/sec: 43.209
		INFO:tensorflow:loss = 35.351685, step = 5801 (2.314 sec)
		INFO:tensorflow:global_step/sec: 65.5692
		INFO:tensorflow:loss = 12.110611, step = 5901 (1.525 sec)
		INFO:tensorflow:global_step/sec: 42.3864
		INFO:tensorflow:loss = 19.612688, step = 6001 (2.359 sec)
		INFO:tensorflow:global_step/sec: 65.1961
		INFO:tensorflow:loss = 31.003126, step = 6101 (1.534 sec)
		INFO:tensorflow:global_step/sec: 42.9087
		INFO:tensorflow:loss = 21.030697, step = 6201 (2.330 sec)
		INFO:tensorflow:Saving checkpoints for 6300 into estimator/model.ckpt.
		INFO:tensorflow:Loss for final step: 19.216248.
		INFO:tensorflow:Starting evaluation at 2018-03-06-22:55:47
		INFO:tensorflow:Restoring parameters from estimator/model.ckpt-6300
		INFO:tensorflow:Finished evaluation at 2018-03-06-22:55:49
		INFO:tensorflow:Saving dict for global step 6300: global_step = 6300, loss = 6348.7153, mean_auc = 0.7047531
		{'loss': 6348.7153, 'mean_auc': 0.7047531, 'global_step': 6300}

		%% Cell type:markdown id: tags:

		That's a lot of code for something DeepChem can do in three lines. The Tensorflow API is verbose and somewhat confusing. It has seemingly arbitrary limitations, like assuming a model will only ever have one output, and therefore only allowing one label. But for better or worse, it's a standard.

		Of course, if you just want to use a DeepChem model with a DeepChem dataset, there is no need for any of this. Just use the DeepChem API. But perhaps you want to use a DeepChem dataset with a model that has been implemented as an estimator. In that case, `Dataset.make_iterator()` allows you to easily do that. Or perhaps you have higher level workflow code that is written to work with estimators. In that case, `make_estimator()` allows DeepChem models to easily fit into that workflow.

Admin message