Commit 5acb2308 authored by Nathan Frey

Merge branch 'master' into json_loaders

parents d6bd5944 f985a333
+36 −29
@@ -18,7 +18,7 @@ materials science, quantum chemistry, and biology.
  - [Install latest package with conda](#install-via-conda-recommendation)
  - [Install latest package with pip (WIP)](#install-via-pip-wip)
  - [Install from source](#install-from-source)
  - [Install using a Docker (WIP)](#install-using-a-docker-wip)
  - [Install using a Docker](#install-using-a-docker)
- [FAQ and Troubleshooting](#faq-and-troubleshooting)
- [Getting Started](#getting-started)
- [Contributing to DeepChem](/CONTRIBUTING.md)
@@ -47,12 +47,14 @@ DeepChem has a number of "soft" requirements. These are packages which are neede

- [BioPython](https://biopython.org/wiki/Documentation)
- [OpenAI Gym](https://gym.openai.com/)
- [matminer](https://hackingmaterials.lbl.gov/matminer/)
- [MDTraj](http://mdtraj.org/)
- [NetworkX](https://networkx.github.io/documentation/stable/index.html)
- [OpenMM](http://openmm.org/)
- [PDBFixer](https://github.com/pandegroup/pdbfixer)
- [Pillow](https://pypi.org/project/Pillow/)
- [pyGPGO](https://pygpgo.readthedocs.io/en/latest/)
- [Pymatgen](https://pymatgen.org/)
- [PyTorch](https://pytorch.org/)
- [RDKit](http://www.rdkit.org/docs/Install.html)
- [simdna](https://github.com/kundajelab/simdna)
@@ -139,53 +141,58 @@ pytest -m "not slow" deepchem # optional

Check [this link](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) for more information about the installation of conda environments.

### Install using a Docker (WIP)
### Install using a Docker

### Build the image from Dockerfile
If you want to install using Docker, you can pull two kinds of images.  
DockerHub : https://hub.docker.com/repository/docker/deepchemio/deepchem

We created [sample Dockerfiles](https://github.com/deepchem/deepchem/tree/master/docker) based on the `nvidia/cuda:10.1-cudnn7-devel` image.  
If you want to build your own deepchem environment, these files may be helpful.  
- `docker/x.x.x` : build an image by using conda package manager (x.x.x is a version of deepchem)  
- `docker/master` : build an image from the master branch of the deepchem source code
- `deepchemio/deepchem:x.x.x`
  - Image built using the conda package manager (x.x.x is a version of deepchem)
  - The x.x.x image is built when we push the x.x.x tag
  - Dockerfile is put in `docker/conda-forge` directory
- `deepchemio/deepchem:latest`
  - Image built from the master branch of the deepchem source code
  - The latest image is built every time we commit to the master branch
  - Dockerfile is put in `docker/master` directory

### Use the official deepchem image (WIP)

We have not yet fully verified that these instructions work.

First, you pull the latest stable deepchem docker image.
First, you pull the image you want to use.

```bash
docker pull deepchemio/deepchem
docker pull deepchemio/deepchem:2.3.0
```

Then, you create a container based on our latest image.
Then, you create a container based on the image.

```bash
docker run -it deepchemio/deepchem
docker run --rm -it deepchemio/deepchem:2.3.0
```

If you want GPU support:

```bash
# If nvidia-docker is installed
nvidia-docker run -it deepchemio/deepchem
docker run --runtime nvidia -it deepchemio/deepchem
nvidia-docker run --rm -it deepchemio/deepchem:2.3.0
docker run --runtime nvidia --rm -it deepchemio/deepchem:2.3.0

# If nvidia-container-toolkit is installed
docker run --gpus all -it deepchemio/deepchem
docker run --gpus all --rm -it deepchemio/deepchem:2.3.0
```

You are now in a docker container whose python has deepchem installed.
You are now in a docker container in which deepchem is installed. You can start playing with it in the command line.

```bash
# you can start playing with it in the command line
pip install jupyter
ipython
import deepchem as dc
```
```bash
(deepchem) root@xxxxxxxxxxxxx:~/mydir# python
Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import deepchem as dc
```

If you want to check the tox21 benchmark:

# you can run our tox21 benchmark
cd /deepchem/examples
python benchmark.py -d tox21
```bash
(deepchem) root@xxxxxxxxxxxxx:~/mydir# wget https://raw.githubusercontent.com/deepchem/deepchem/master/examples/benchmark.py
(deepchem) root@xxxxxxxxxxxxx:~/mydir# python benchmark.py -d tox21 -m graphconv -s random
```

## FAQ and Troubleshooting
@@ -209,7 +216,7 @@ sudo apt-get install -y libxrender-dev

## Getting Started

The DeepChem project maintains an extensive colelction of [tutorials](https://github.com/deepchem/deepchem/tree/master/examples/tutorials). All tutorials are designed to be run on Google colab (or locally if you prefer). Tutorials are arranged in a suggested learning sequence which will take you from beginner to proficient at molecular machine learning and computational biology more broadly.
The DeepChem project maintains an extensive collection of [tutorials](https://github.com/deepchem/deepchem/tree/master/examples/tutorials). All tutorials are designed to be run on Google colab (or locally if you prefer). Tutorials are arranged in a suggested learning sequence which will take you from beginner to proficient at molecular machine learning and computational biology more broadly.

After working through the tutorials, you can also go through other [examples](https://github.com/deepchem/deepchem/tree/master/examples). To apply `deepchem` to a new problem, try starting from one of the existing examples or tutorials and modifying it step by step to work with your new use-case. If you have questions or comments you can raise them on our [gitter](https://gitter.im/deepchem/Lobby).

+141 −140
@@ -336,7 +336,7 @@ class Dataset(object):

  def iterbatches(self,
                  batch_size=None,
                  epoch=0,
                  epochs=1,
                  deterministic=False,
                  pad_batches=False):
    """Get an object that iterates over minibatches from the dataset.
@@ -348,7 +348,7 @@ class Dataset(object):
    ----------
    batch_size: int, optional
      Number of elements in each batch
    epoch: int, optional
    epochs: int, optional
      Number of epochs to walk over dataset
    deterministic: bool, optional
      If True, follow deterministic order.
@@ -485,8 +485,7 @@ class Dataset(object):
    # Create a Tensorflow Dataset.

    def gen_data():
      for epoch in range(epochs):
        for X, y, w, ids in self.iterbatches(batch_size, epoch, deterministic,
      for X, y, w, ids in self.iterbatches(batch_size, epochs, deterministic,
                                           pad_batches):
        yield (X, y, w)
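The diff above removes the outer epoch loop from `gen_data`: `iterbatches` now walks the dataset `epochs` times itself, so callers no longer wrap it in their own loop. A toy sketch of that contract (a hypothetical standalone `iterbatches`, not DeepChem's actual implementation):

```python
def iterbatches(data, batch_size, epochs=1):
    """Yield minibatches, walking over the data `epochs` times."""
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            yield data[start:start + batch_size]

# Ten samples with batch_size=3 over two epochs give batch sizes
# [3, 3, 3, 1] twice, the behavior the updated unit tests assert.
sizes = [len(b) for b in iterbatches(list(range(10)), batch_size=3, epochs=2)]
```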

@@ -727,7 +726,7 @@ class NumpyDataset(Dataset):

  def iterbatches(self,
                  batch_size=None,
                  epoch=0,
                  epochs=1,
                  deterministic=False,
                  pad_batches=False):
    """Get an object that iterates over minibatches from the dataset.
@@ -739,7 +738,7 @@ class NumpyDataset(Dataset):
    ----------
    batch_size: int, optional
      Number of elements in each batch
    epoch: int, optional
    epochs: int, optional
      Number of epochs to walk over dataset
    deterministic: bool, optional
      If True, follow deterministic order.
@@ -751,14 +750,15 @@ class NumpyDataset(Dataset):
    Generator which yields tuples of four numpy arrays `(X, y, w, ids)`
    """

    def iterate(dataset, batch_size, deterministic, pad_batches):
    def iterate(dataset, batch_size, epochs, deterministic, pad_batches):
      n_samples = dataset._X.shape[0]
      if not deterministic:
        sample_perm = np.random.permutation(n_samples)
      else:
      if deterministic:
        sample_perm = np.arange(n_samples)
      if batch_size is None:
        batch_size = n_samples
      for epoch in range(epochs):
        if not deterministic:
          sample_perm = np.random.permutation(n_samples)
        batch_idx = 0
        num_batches = np.math.ceil(n_samples / batch_size)
        while batch_idx < num_batches:
@@ -776,7 +776,7 @@ class NumpyDataset(Dataset):
          batch_idx += 1
          yield (X_batch, y_batch, w_batch, ids_batch)

    return iterate(self, batch_size, deterministic, pad_batches)
    return iterate(self, batch_size, epochs, deterministic, pad_batches)
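The rewritten `iterate` keeps a fixed index order when `deterministic` is True and draws a fresh permutation at the start of every epoch otherwise. A stdlib-only sketch of that reshuffle-per-epoch pattern (standing in for the NumPy version; the names are illustrative):

```python
import math
import random

def iterate(X, batch_size, epochs=1, deterministic=False):
    # Fixed index order when deterministic; otherwise a fresh
    # permutation is drawn at the start of every epoch.
    n = len(X)
    perm = list(range(n))
    for _ in range(epochs):
        if not deterministic:
            perm = random.sample(range(n), n)
        for b in range(math.ceil(n / batch_size)):
            idx = perm[b * batch_size:(b + 1) * batch_size]
            yield [X[i] for i in idx]

# With deterministic=True, each epoch replays the same order.
out = list(iterate(["a", "b", "c"], batch_size=2, epochs=2, deterministic=True))
```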

  def itersamples(self):
    """Get an object that iterates over the samples in the dataset.
@@ -1151,7 +1151,7 @@ class DiskDataset(Dataset):

  def iterbatches(self,
                  batch_size=None,
                  epoch=0,
                  epochs=1,
                  deterministic=False,
                  pad_batches=False):
    """ Get an object that iterates over minibatches from the dataset.
@@ -1166,7 +1166,7 @@ class DiskDataset(Dataset):
      Number of elements in a batch. If None, then it yields batches
      with size equal to the size of each individual shard.
    epoch: int
      Not used
      Number of epochs to walk over dataset
    deterministic: bool
      Whether or not we should shuffle each shard before
      generating the batches.  Note that this is only local in the
@@ -1176,21 +1176,20 @@ class DiskDataset(Dataset):
      it has exactly batch_size elements.
    """
    shard_indices = list(range(self.get_number_shards()))
    return self._iterbatches_from_shards(shard_indices, batch_size,
    return self._iterbatches_from_shards(shard_indices, batch_size, epochs,
                                         deterministic, pad_batches)

  def _iterbatches_from_shards(self,
                               shard_indices,
                               batch_size=None,
                               epochs=1,
                               deterministic=False,
                               pad_batches=False):
    """Get an object that iterates over batches from a restricted set of shards."""

    def iterate(dataset, batch_size):
    def iterate(dataset, batch_size, epochs):
      num_shards = len(shard_indices)
      if not deterministic:
        shard_perm = np.random.permutation(num_shards)
      else:
      if deterministic:
        shard_perm = np.arange(num_shards)

      # (ytz): Depending on the application, thread-based pools may be faster
@@ -1198,16 +1197,17 @@ class DiskDataset(Dataset):
      # objects as an extra overhead. Also, as hideously as un-thread safe this looks,
      # we're actually protected by the GIL.
      pool = Pool(1)  # mp.dummy aliases ThreadPool to Pool
      next_shard = pool.apply_async(dataset.get_shard,
                                    (shard_indices[shard_perm[0]],))

      total_yield = 0

      if batch_size is None:
        num_global_batches = num_shards
      else:
        num_global_batches = math.ceil(dataset.get_shape()[0][0] / batch_size)

      for epoch in range(epochs):
        if not deterministic:
          shard_perm = np.random.permutation(num_shards)
        next_shard = pool.apply_async(dataset.get_shard,
                                      (shard_indices[shard_perm[0]],))
        cur_global_batch = 0
        cur_shard = 0
        carry = None
@@ -1218,7 +1218,7 @@ class DiskDataset(Dataset):
          if cur_shard < num_shards - 1:
            next_shard = pool.apply_async(
                dataset.get_shard, (shard_indices[shard_perm[cur_shard + 1]],))
        else:
          elif epoch == epochs - 1:
            pool.close()

          if carry is not None:
@@ -1285,7 +1285,7 @@ class DiskDataset(Dataset):
            cur_local_batch += 1
          cur_shard += 1

    return iterate(self, batch_size)
    return iterate(self, batch_size, epochs)
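The surrounding code overlaps I/O with computation: while one shard is being consumed, the next is already loading on a worker thread via `multiprocessing.dummy`. A stripped-down, single-epoch sketch of that prefetch loop (`load_shard` is a hypothetical loader):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, as in the diff

def iter_shards(load_shard, shard_ids):
    # Kick off the first load, then always schedule shard i+1
    # while shard i is handed to the caller.
    pool = Pool(1)
    nxt = pool.apply_async(load_shard, (shard_ids[0],))
    for i in range(len(shard_ids)):
        shard = nxt.get()
        if i < len(shard_ids) - 1:
            nxt = pool.apply_async(load_shard, (shard_ids[i + 1],))
        else:
            pool.close()
        yield shard

shards = list(iter_shards(lambda i: i * 10, [1, 2, 3]))
```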

  def itersamples(self):
    """Get an object that iterates over the samples in the dataset.
@@ -1922,7 +1922,7 @@ class ImageDataset(Dataset):

  def iterbatches(self,
                  batch_size=None,
                  epoch=0,
                  epochs=1,
                  deterministic=False,
                  pad_batches=False):
    """Get an object that iterates over minibatches from the dataset.
@@ -1931,14 +1931,15 @@ class ImageDataset(Dataset):
    w, ids).
    """

    def iterate(dataset, batch_size, deterministic, pad_batches):
    def iterate(dataset, batch_size, epochs, deterministic, pad_batches):
      n_samples = dataset._X_shape[0]
      if not deterministic:
        sample_perm = np.random.permutation(n_samples)
      else:
      if deterministic:
        sample_perm = np.arange(n_samples)
      if batch_size is None:
        batch_size = n_samples
      for epoch in range(epochs):
        if not deterministic:
          sample_perm = np.random.permutation(n_samples)
        batch_idx = 0
        num_batches = np.math.ceil(n_samples / batch_size)
        while batch_idx < num_batches:
@@ -1964,7 +1965,7 @@ class ImageDataset(Dataset):
          batch_idx += 1
          yield (X_batch, y_batch, w_batch, ids_batch)

    return iterate(self, batch_size, deterministic, pad_batches)
    return iterate(self, batch_size, epochs, deterministic, pad_batches)

  def itersamples(self):
    """Get an object that iterates over the samples in the dataset.
@@ -2143,7 +2144,7 @@ class Databag(object):
    ----------
    batch_size: int
      Number of samples from each dataset to return
    epoch: int
    epochs: int
      Number of times to loop through the datasets
    pad_batches: boolean
      Should all batches==batch_size
+4 −4
@@ -436,9 +436,9 @@ class TestDatasets(test_util.TensorFlowTestCase):
                  solubility_dataset.w, solubility_dataset.ids)
    batch_sizes = []
    for X, y, _, _ in solubility_dataset.iterbatches(
        3, pad_batches=False, deterministic=True):
        3, epochs=2, pad_batches=False, deterministic=True):
      batch_sizes.append(len(X))
    self.assertEqual([3, 3, 3, 1], batch_sizes)
    self.assertEqual([3, 3, 3, 1, 3, 3, 3, 1], batch_sizes)

  def test_disk_pad_batches(self):
    shard_sizes = [21, 11, 41, 21, 51]
@@ -663,9 +663,9 @@ class TestDatasets(test_util.TensorFlowTestCase):
        solubility_dataset)
    batch_sizes = []
    for X, y, _, _ in solubility_dataset.iterbatches(
        3, pad_batches=False, deterministic=True):
        3, epochs=2, pad_batches=False, deterministic=True):
      batch_sizes.append(len(X))
    self.assertEqual([3, 3, 3, 1], batch_sizes)
    self.assertEqual([3, 3, 3, 1, 3, 3, 3, 1], batch_sizes)

  def test_merge(self):
    """Test that dataset merge works."""
+6 −3
@@ -74,15 +74,18 @@ class TestImageDataset(test_util.TensorFlowTestCase):
    ds = dc.data.ImageDataset(files, np.random.random(10))
    X = ds.X
    iterated_ids = set()
    for x, y, w, ids in ds.iterbatches(2):
    for x, y, w, ids in ds.iterbatches(2, epochs=2):
      np.testing.assert_array_equal([2, 28, 28], x.shape)
      np.testing.assert_array_equal([2], y.shape)
      np.testing.assert_array_equal([2], w.shape)
      np.testing.assert_array_equal([2], ids.shape)
      for i in (0, 1):
        assert ids[i] in files
        if len(iterated_ids) < 10:
          assert ids[i] not in iterated_ids
          iterated_ids.add(ids[i])
        else:
          assert ids[i] in iterated_ids
        index = files.index(ids[i])
        np.testing.assert_array_equal(x[i], X[index])
    assert len(iterated_ids) == 10
+38 −23
@@ -7,6 +7,7 @@ import logging
import numpy as np
import os
import tempfile
import tarfile
from subprocess import call
from deepchem.utils.rdkit_util import add_hydrogens_to_mol
from subprocess import check_output
@@ -14,6 +15,7 @@ from deepchem.utils import rdkit_util
from deepchem.utils import mol_xyz_util
from deepchem.utils import geometry_utils
from deepchem.utils import vina_utils
from deepchem.utils import download_url

logger = logging.getLogger(__name__)

@@ -105,6 +107,8 @@ class VinaPoseGenerator(PoseGenerator):
      url = "http://vina.scripps.edu/download/autodock_vina_1_1_2_linux_x86.tgz"
      filename = "autodock_vina_1_1_2_linux_x86.tgz"
      dirname = "autodock_vina_1_1_2_linux_x86"
      self.vina_dir = os.path.join(data_dir, dirname)
      self.vina_cmd = os.path.join(self.vina_dir, "bin/vina")
    elif platform.system() == 'Darwin':
      if sixty_four_bits:
        url = "http://vina.scripps.edu/download/autodock_vina_1_1_2_mac_64bit.tar.gz"
@@ -114,26 +118,31 @@ class VinaPoseGenerator(PoseGenerator):
        url = "http://vina.scripps.edu/download/autodock_vina_1_1_2_mac.tgz"
        filename = "autodock_vina_1_1_2_mac.tgz"
        dirname = "autodock_vina_1_1_2_mac"
      self.vina_dir = os.path.join(data_dir, dirname)
      self.vina_cmd = os.path.join(self.vina_dir, "bin/vina")
    elif platform.system() == 'Windows':
      url = "http://vina.scripps.edu/download/autodock_vina_1_1_2_win32.msi"
      filename = "autodock_vina_1_1_2_win32.msi"
      self.vina_dir = "\\Program Files (x86)\\The Scripps Research Institute\\Vina"
      self.vina_cmd = os.path.join(self.vina_dir, "vina.exe")
    else:
      raise ValueError(
          "This class can only run on Linux or Mac. If you are on Windows, please try using a cloud platform to run this code instead."
          "Unknown operating system.  Try using a cloud platform to run this code instead."
      )
    self.vina_dir = os.path.join(data_dir, dirname)
    self.pocket_finder = pocket_finder
    if not os.path.exists(self.vina_dir):
      logger.info("Vina not available. Downloading")
      wget_cmd = "wget -nv -c -T 15 %s" % url
      check_output(wget_cmd.split())
      download_url(url, data_dir)
      downloaded_file = os.path.join(data_dir, filename)
      logger.info("Downloaded Vina. Extracting")
      untar_cmd = "tar -xzvf %s" % filename
      check_output(untar_cmd.split())
      logger.info("Moving to final location")
      mv_cmd = "mv %s %s" % (dirname, data_dir)
      check_output(mv_cmd.split())
      if platform.system() == 'Windows':
        msi_cmd = "msiexec /i %s" % downloaded_file
        check_output(msi_cmd.split())
      else:
        with tarfile.open(downloaded_file) as tar:
          tar.extractall(data_dir)
      logger.info("Cleanup: removing downloaded vina tar.gz")
      rm_cmd = "rm %s" % filename
      call(rm_cmd.split())
    self.vina_cmd = os.path.join(self.vina_dir, "bin/vina")
      os.remove(downloaded_file)

  def generate_poses(self,
                     molecular_complex,
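The hunk above replaces the external `wget`/`tar`/`rm` subprocesses with deepchem's `download_url` helper and Python's `tarfile` module. A minimal sketch of the extract-and-cleanup half (the download itself is assumed already done; the function name is illustrative):

```python
import os
import tarfile

def extract_and_cleanup(archive_path, dest_dir):
    # Stdlib replacement for the old `tar -xzvf` + `rm` subprocess calls;
    # download_url(url, data_dir) is assumed to have fetched archive_path.
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest_dir)
    os.remove(archive_path)
```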
@@ -207,6 +216,8 @@ class VinaPoseGenerator(PoseGenerator):
    protein_pdbqt = os.path.join(out_dir, "%s.pdbqt" % protein_name)
    protein_mol = rdkit_util.load_molecule(
        protein_file, calc_charges=True, add_hydrogens=True)
    rdkit_util.write_molecule(protein_mol[1], protein_hyd, is_protein=True)
    rdkit_util.write_molecule(protein_mol[1], protein_pdbqt, is_protein=True)

    # Get protein centroid and range
    if centroid is not None and box_dims is not None:
@@ -215,9 +226,6 @@ class VinaPoseGenerator(PoseGenerator):
    else:
      if self.pocket_finder is None:
        logger.info("Pockets not specified. Will use whole protein to dock")
        rdkit_util.write_molecule(protein_mol[1], protein_hyd, is_protein=True)
        rdkit_util.write_molecule(
            protein_mol[1], protein_pdbqt, is_protein=True)
        protein_centroid = geometry_utils.compute_centroid(protein_mol[0])
        protein_range = mol_xyz_util.get_molecule_range(protein_mol[0])
        box_dims = protein_range + 5.0
@@ -276,10 +284,17 @@ class VinaPoseGenerator(PoseGenerator):
      log_file = os.path.join(out_dir, "%s_log.txt" % ligand_name)
      out_pdbqt = os.path.join(out_dir, "%s_docked.pdbqt" % ligand_name)
      logger.info("About to call Vina")
      call(
          "%s --config %s --log %s --out %s" % (self.vina_cmd, conf_file,
                                                log_file, out_pdbqt),
          shell=True)
      if platform.system() == 'Windows':
        args = [
            self.vina_cmd, "--config", conf_file, "--log", log_file, "--out",
            out_pdbqt
        ]
      else:
        # I'm not sure why specifying the args as a list fails on other platforms,
        # but for some reason it only works if I pass it as a string.
        args = "%s --config %s --log %s --out %s" % (self.vina_cmd, conf_file,
                                                     log_file, out_pdbqt)
      call(args, shell=True)
      ligands, scores = vina_utils.load_docked_ligands(out_pdbqt)
      docked_complexes += [(protein_mol[1], ligand) for ligand in ligands]
      all_scores += scores
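The in-code comment notes that the list form fails off-Windows; the likely reason is that with `shell=True` POSIX expects a single command string (in list form only `args[0]` reaches the shell as the command, and the rest become arguments to the shell itself). A sketch of the same platform switch (`run_vina` is an illustrative name, not DeepChem's API):

```python
import platform
from subprocess import call

def run_vina(vina_cmd, conf_file, log_file, out_pdbqt):
    # With shell=True, POSIX wants one command string; Windows joins a
    # list into a single command line, so either form works there.
    args = [vina_cmd, "--config", conf_file, "--log", log_file, "--out", out_pdbqt]
    if platform.system() != "Windows":
        args = " ".join(args)
    return call(args, shell=True)
```

On a POSIX box, `run_vina("echo", "c", "l", "o")` simply echoes the flags and returns exit code 0.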