Ran tutorial for full number of epochs on a fast GPU (871d65c4) · Commits · 钟慕尧 / deepchem

examples/tutorials/16_Learning_Unsupervised_Embeddings_for_Molecules.ipynb

+7 −37

		%% Cell type:markdown id: tags:

		# Tutorial Part 16: Learning Unsupervised Embeddings for Molecules

		In this tutorial, we will use a `SeqToSeq` model to generate fingerprints for classifying molecules. This is based on the following paper, although some of the implementation details are different: Xu et al., "Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery" (https://doi.org/10.1145/3107411.3107424).

		## Colab

		This tutorial and the rest in this sequence can be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

		[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/16_Learning_Unsupervised_Embeddings_for_Molecules.ipynb)

		## Setup

-		To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine. This notebook will take a few hours to run on a GPU, so we encourage you to run it on Google Colab unless you have a good GPU machine available.
+		To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine. This notebook can take up to a few hours to run on a GPU, so we encourage you to run it on Google Colab unless you have a good GPU machine available.

		%% Cell type:code id: tags:

		``` python
		!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
		import conda_installer
		conda_installer.install()
		!/root/miniconda/bin/conda info -e
		```

		%% Output

		% Total % Received % Xferd Average Speed Time Time Time Current
		Dload Upload Total Spent Left Speed

		0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
		100 3489 100 3489 0 0 8209 0 --:--:-- --:--:-- --:--:-- 8209

		add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
		all packages is already installed

		# conda environments:
		#
		base * /root/miniconda


		%% Cell type:code id: tags:

		``` python
		!pip install --pre deepchem
		import deepchem
		deepchem.__version__
		```

		%% Output

		Requirement already satisfied: deepchem in /usr/local/lib/python3.6/dist-packages (2.4.0rc1.dev20200805143219)
		Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.22.2.post1)
		Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.0.5)
		Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.16.0)
		Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.18.5)
		Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.4.1)
		Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2018.9)
		Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2.8.1)
		Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->deepchem) (1.15.0)

		'2.4.0-rc1.dev'

		%% Cell type:markdown id: tags:

		# Learning Embeddings with SeqToSeq

		Many types of models require their inputs to have a fixed shape. Since molecules can vary widely in the numbers of atoms and bonds they contain, this makes it hard to apply those models to them. We need a way of generating a fixed length "fingerprint" for each molecule. Various ways of doing this have been designed, such as the Extended-Connectivity Fingerprints (ECFPs) we used in earlier tutorials. But in this example, instead of designing a fingerprint by hand, we will let a `SeqToSeq` model learn its own method of creating fingerprints.

		A `SeqToSeq` model performs sequence-to-sequence translation. For example, such models are often used to translate text from one language to another. The model consists of two parts called the "encoder" and "decoder". The encoder is a stack of recurrent layers. The input sequence is fed into it, one token at a time, and it generates a fixed-length vector called the "embedding vector". The decoder is another stack of recurrent layers that performs the inverse operation: it takes the embedding vector as input and generates the output sequence. By training it on appropriately chosen input/output pairs, you can create a model that performs many sorts of transformations.

		In this case, we will use SMILES strings describing molecules as the input sequences. We will train the model as an autoencoder, so it tries to make the output sequences identical to the input sequences. For that to work, the encoder must create embedding vectors that contain all information from the original sequence. That's exactly what we want in a fingerprint, so perhaps those embedding vectors will then be useful as a way to represent molecules in other models!

		Let's start by loading the data. We will use the MUV dataset. It includes 74,501 molecules in the training set and 9,313 molecules in the validation set, so it gives us plenty of SMILES strings to work with.

		%% Cell type:code id: tags:

		``` python
		import deepchem as dc
		tasks, datasets, transformers = dc.molnet.load_muv(split='stratified')
		train_dataset, valid_dataset, test_dataset = datasets
		train_smiles = train_dataset.ids
		valid_smiles = valid_dataset.ids
		```

		%% Cell type:markdown id: tags:

		We need to define the "alphabet" for our `SeqToSeq` model, the list of all tokens that can appear in sequences. (It's also possible for input and output sequences to have different alphabets, but since we're training it as an autoencoder, they're identical in this case.) Make a list of every character that appears in any training sequence.

		%% Cell type:code id: tags:

		``` python
		tokens = set()
		for s in train_smiles:
		    tokens = tokens.union(set(c for c in s))
		tokens = sorted(list(tokens))
		```
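
%% Cell type:markdown id: tags:

As a minimal sketch of what the alphabet looks like, the cell below builds one from three toy SMILES strings. The strings and the names `toy_smiles` and `toy_tokens` are illustrative assumptions, not part of the MUV data. Note that character-level tokens split two-letter element symbols such as `Cl` into two separate tokens, which is a simplification this approach accepts.

%% Cell type:code id: tags:

```python
# Toy SMILES strings (illustrative only, not taken from MUV).
toy_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]

# Collect every distinct character across all strings, then sort so the
# token order is reproducible from run to run.
toy_tokens = sorted(set().union(*toy_smiles))
print(toy_tokens)
```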

		%% Cell type:markdown id: tags:

		Create the model and define the optimization method to use. In this case, learning works much better if we gradually decrease the learning rate. We use an `ExponentialDecay` to multiply the learning rate by 0.9 after each epoch.

		%% Cell type:code id: tags:

		``` python
		from deepchem.models.optimizers import Adam, ExponentialDecay
		max_length = max(len(s) for s in train_smiles)
		batch_size = 100
		batches_per_epoch = len(train_smiles)/batch_size
		model = dc.models.SeqToSeq(tokens,
		                           tokens,
		                           max_length,
		                           encoder_layers=2,
		                           decoder_layers=2,
		                           embedding_dimension=256,
		                           model_dir='fingerprint',
		                           batch_size=batch_size,
-		                           learning_rate=ExponentialDecay(0.004, 0.9, batches_per_epoch))
+		                           learning_rate=ExponentialDecay(0.001, 0.9, batches_per_epoch))
		```
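
%% Cell type:markdown id: tags:

To make the schedule concrete, here is a rough pure-Python sketch of exponential decay, written independently of DeepChem. The function `decayed_rate` and its `staircase` flag are this sketch's own names, and it assumes the rate drops in discrete steps once every `decay_steps` batches; the library's actual behaviour may differ in detail. The value 745 approximates 74,501 molecules divided by a batch size of 100.

%% Cell type:code id: tags:

```python
def decayed_rate(initial_rate, decay_rate, decay_steps, step, staircase=True):
    """Return the exponentially decayed learning rate at an optimizer step."""
    if staircase:
        # Decay only at whole multiples of decay_steps (once per epoch here).
        exponent = step // decay_steps
    else:
        # Decay smoothly with every step.
        exponent = step / decay_steps
    return initial_rate * decay_rate ** exponent


batches_per_epoch = 745  # assumed: about 74,501 molecules / batch size 100
for epoch in (0, 1, 10, 39):
    step = epoch * batches_per_epoch
    print(epoch, decayed_rate(0.001, 0.9, batches_per_epoch, step))
```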

		%% Cell type:markdown id: tags:

		Let's train it! The input to `fit_sequences()` is a generator that produces input/output pairs. On a good GPU, this should take a few hours or less.

		%% Cell type:code id: tags:

		``` python
		def generate_sequences(epochs):
		    for i in range(epochs):
		        for s in train_smiles:
		            yield (s, s)

-		model.fit_sequences(generate_sequences(1))#40
+		model.fit_sequences(generate_sequences(40))
		```
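
%% Cell type:markdown id: tags:

The generator pattern can be checked in isolation on a toy list. The sketch below assumes a hypothetical helper `generate_pairs` that takes the SMILES list as an argument instead of reading a global; it shows that each epoch yields every string once, paired with itself as the autoencoder target.

%% Cell type:code id: tags:

```python
def generate_pairs(smiles, epochs):
    """Yield (input, output) pairs; identical halves give an autoencoder target."""
    for _ in range(epochs):
        for s in smiles:
            yield (s, s)


# Toy data (illustrative only).
toy_smiles = ["CCO", "CC(=O)O"]
pairs = list(generate_pairs(toy_smiles, epochs=2))
print(pairs)
```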

		%% Cell type:markdown id: tags:

		Let's see how well it works as an autoencoder. We'll run the first 500 molecules from the validation set through it, and see how many of them are exactly reproduced.

		%% Cell type:code id: tags:

		``` python
		predicted = model.predict_from_sequences(valid_smiles[:500])
		count = 0
		for s,p in zip(valid_smiles[:500], predicted):
		    if ''.join(p) == s:
		        count += 1
		print('reproduced', count, 'of 500 validation SMILES strings')
		```

		%% Output

-		reproduced 0 of 500 validation SMILES strings
+		reproduced 161 of 500 validation SMILES strings
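
%% Cell type:markdown id: tags:

The exact-match check is strict: a single wrong token anywhere in the string counts as a failure. A small sketch with made-up predictions illustrates this. The helper `exact_match_count` and the toy data are assumptions for illustration, and each prediction is assumed to be a list of tokens, as the `''.join(p)` call in the cell above suggests.

%% Cell type:code id: tags:

```python
def exact_match_count(originals, predictions):
    """Count predictions whose joined tokens reproduce the original string."""
    return sum(1 for s, p in zip(originals, predictions) if "".join(p) == s)


# Toy data (illustrative only): the second prediction has one wrong token.
originals = ["CCO", "CCN", "c1ccccc1"]
predictions = [["C", "C", "O"], ["C", "C", "O"], list("c1ccccc1")]
matches = exact_match_count(originals, predictions)
print("reproduced", matches, "of", len(originals))
```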

		%% Cell type:markdown id: tags:

		Now we'll try using the encoder as a way to generate molecular fingerprints. We compute the embedding vectors for all molecules in the training and validation datasets, and create new datasets that have those as their feature vectors. The amount of data is small enough that we can just store everything in memory.

		%% Cell type:code id: tags:

		``` python
		import numpy as np
		train_embeddings = model.predict_embeddings(train_smiles)
		train_embeddings_dataset = dc.data.NumpyDataset(train_embeddings,
		                                                train_dataset.y,
		                                                train_dataset.w.astype(np.float32),
		                                                train_dataset.ids)

		valid_embeddings = model.predict_embeddings(valid_smiles)
		valid_embeddings_dataset = dc.data.NumpyDataset(valid_embeddings,
		                                                valid_dataset.y,
		                                                valid_dataset.w.astype(np.float32),
		                                                valid_dataset.ids)
		```

		%% Cell type:markdown id: tags:

		For classification, we'll use a simple fully connected network with one hidden layer.

		%% Cell type:code id: tags:

		``` python
		classifier = dc.models.MultitaskClassifier(n_tasks=len(tasks),
		                                           n_features=256,
		                                           layer_sizes=[512])
		classifier.fit(train_embeddings_dataset, nb_epoch=10)
		```

		%% Output

-		0.002357203811407089
+		0.0014195525646209716

		%% Cell type:markdown id: tags:

		Find out how well it worked. Compute the ROC AUC for the training and validation datasets.

		%% Cell type:code id: tags:

		``` python
		metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean, mode="classification")
		train_score = classifier.evaluate(train_embeddings_dataset, [metric], transformers)
		valid_score = classifier.evaluate(valid_embeddings_dataset, [metric], transformers)
		print('Training set ROC AUC:', train_score)
		print('Validation set ROC AUC:', valid_score)
		```

		%% Output

-		Training set ROC AUC: {'mean-roc_auc_score': 0.8140473860164172}
-		Validation set ROC AUC: {'mean-roc_auc_score': 0.6620464489144489}
+		Training set ROC AUC: {'mean-roc_auc_score': 0.9598792603154332}
+		Validation set ROC AUC: {'mean-roc_auc_score': 0.7251350862464794}
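
%% Cell type:markdown id: tags:

For intuition about the metric, ROC AUC for a single task can be read as the probability that a randomly chosen active molecule is scored higher than a randomly chosen inactive one, with ties counting as half. The pure-Python sketch below computes it that way on toy data. The function `roc_auc` is a hypothetical helper written for illustration, not the implementation the notebook calls, and the notebook additionally averages the per-task scores with `np.mean`.

%% Cell type:code id: tags:

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    # A correctly ordered pair scores 1, a tie scores 0.5, otherwise 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted scores (illustrative only).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print("ROC AUC:", auc)
```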

		%% Cell type:markdown id: tags:

		# Congratulations! Time to join the Community!

		Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

		## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
		This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

		## Join the DeepChem Gitter
		The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!
