Commit 5acb2308 authored by Nathan Frey

Merge branch 'master' into json_loaders

parents d6bd5944 f985a333
+36 −29
@@ -18,7 +18,7 @@ materials science, quantum chemistry, and biology.
  - [Install latest package with conda](#install-via-conda-recommendation)
  - [Install latest package with pip (WIP)](#install-via-pip-wip)
  - [Install from source](#install-from-source)
  - [Install using a Docker (WIP)](#install-using-a-docker-wip)
  - [Install using a Docker](#install-using-a-docker)
- [FAQ and Troubleshooting](#faq-and-troubleshooting)
- [Getting Started](#getting-started)
- [Contributing to DeepChem](/CONTRIBUTING.md)
@@ -47,12 +47,14 @@ DeepChem has a number of "soft" requirements. These are packages which are neede

- [BioPython](https://biopython.org/wiki/Documentation)
- [OpenAI Gym](https://gym.openai.com/)
- [matminer](https://hackingmaterials.lbl.gov/matminer/)
- [MDTraj](http://mdtraj.org/)
- [NetworkX](https://networkx.github.io/documentation/stable/index.html)
- [OpenMM](http://openmm.org/)
- [PDBFixer](https://github.com/pandegroup/pdbfixer)
- [Pillow](https://pypi.org/project/Pillow/)
- [pyGPGO](https://pygpgo.readthedocs.io/en/latest/)
- [Pymatgen](https://pymatgen.org/)
- [PyTorch](https://pytorch.org/)
- [RDKit](http://www.rdkit.org/docs/Install.html)
- [simdna](https://github.com/kundajelab/simdna)
@@ -139,53 +141,58 @@ pytest -m "not slow" deepchem # optional

Check [this link](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) for more information about the installation of conda environments.

### Install using a Docker (WIP)
### Install using a Docker

### Build the image from Dockerfile
If you want to install using Docker, you can pull two kinds of images.  
DockerHub : https://hub.docker.com/repository/docker/deepchemio/deepchem

We created [sample Dockerfiles](https://github.com/deepchem/deepchem/tree/master/docker) based on the `nvidia/cuda:10.1-cudnn7-devel` image.  
If you want to build your own deepchem environment, these files may be helpful.  
- `docker/x.x.x` : build an image by using conda package manager (x.x.x is a version of deepchem)  
- `docker/master` : build an image from the master branch of the deepchem source code
- `deepchemio/deepchem:x.x.x`
  - Image built using the conda package manager (x.x.x is a version of deepchem)
  - The x.x.x image is built when we push the x.x.x tag
  - Dockerfile is put in `docker/conda-forge` directory
- `deepchemio/deepchem:latest`
  - Image built from the master branch of the deepchem source code
  - The latest image is built every time we commit to the master branch
  - Dockerfile is put in `docker/master` directory

### Use the official deepchem image (WIP)

We have not yet fully verified that these instructions work.

First, you pull the latest stable deepchem docker image.
First, you pull the image you want to use.

```bash
docker pull deepchemio/deepchem
docker pull deepchemio/deepchem:2.3.0
```

Then, you create a container based on our latest image.
Then, you create a container based on the image.

```bash
docker run -it deepchemio/deepchem
docker run --rm -it deepchemio/deepchem:2.3.0
```

If you want GPU support:

```bash
# If nvidia-docker is installed
nvidia-docker run -it deepchemio/deepchem
docker run --runtime nvidia -it deepchemio/deepchem
nvidia-docker run --rm -it deepchemio/deepchem:2.3.0
docker run --runtime nvidia --rm -it deepchemio/deepchem:2.3.0

# If nvidia-container-toolkit is installed
docker run --gpus all -it deepchemio/deepchem
docker run --gpus all --rm -it deepchemio/deepchem:2.3.0
```

You are now in a docker container whose python has deepchem installed.
You are now in a docker container in which deepchem is installed. You can start playing with it in the command line.

```bash
# you can start playing with it in the command line
pip install jupyter
ipython
import deepchem as dc
```
```bash
(deepchem) root@xxxxxxxxxxxxx:~/mydir# python
Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import deepchem as dc
```

If you want to check the tox21 benchmark:

# you can run our tox21 benchmark
cd /deepchem/examples
python benchmark.py -d tox21
```bash
(deepchem) root@xxxxxxxxxxxxx:~/mydir# wget https://raw.githubusercontent.com/deepchem/deepchem/master/examples/benchmark.py
(deepchem) root@xxxxxxxxxxxxx:~/mydir# python benchmark.py -d tox21 -m graphconv -s random
```

## FAQ and Troubleshooting
@@ -209,7 +216,7 @@ sudo apt-get install -y libxrender-dev

## Getting Started

The DeepChem project maintains an extensive colelction of [tutorials](https://github.com/deepchem/deepchem/tree/master/examples/tutorials). All tutorials are designed to be run on Google colab (or locally if you prefer). Tutorials are arranged in a suggested learning sequence which will take you from beginner to proficient at molecular machine learning and computational biology more broadly.
The DeepChem project maintains an extensive collection of [tutorials](https://github.com/deepchem/deepchem/tree/master/examples/tutorials). All tutorials are designed to be run on Google colab (or locally if you prefer). Tutorials are arranged in a suggested learning sequence which will take you from beginner to proficient at molecular machine learning and computational biology more broadly.

After working through the tutorials, you can also go through other [examples](https://github.com/deepchem/deepchem/tree/master/examples). To apply `deepchem` to a new problem, try starting from one of the existing examples or tutorials and modifying it step by step to work with your new use-case. If you have questions or comments you can raise them on our [gitter](https://gitter.im/deepchem/Lobby).

+141 −140
@@ -336,7 +336,7 @@ class Dataset(object):

  def iterbatches(self,
                  batch_size=None,
                  epoch=0,
                  epochs=1,
                  deterministic=False,
                  pad_batches=False):
    """Get an object that iterates over minibatches from the dataset.
@@ -348,7 +348,7 @@ class Dataset(object):
    ----------
    batch_size: int, optional
      Number of elements in each batch
    epoch: int, optional
    epochs: int, optional
      Number of epochs to walk over dataset
    deterministic: bool, optional
      If True, follow deterministic order.
@@ -485,8 +485,7 @@ class Dataset(object):
    # Create a Tensorflow Dataset.

    def gen_data():
      for epoch in range(epochs):
        for X, y, w, ids in self.iterbatches(batch_size, epoch, deterministic,
      for X, y, w, ids in self.iterbatches(batch_size, epochs, deterministic,
                                           pad_batches):
        yield (X, y, w)
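The diff above removes the outer epoch loop from `gen_data`: `iterbatches` now walks the dataset `epochs` times itself, so callers no longer wrap it in their own loop. A toy sketch of that contract (a hypothetical standalone `iterbatches`, not DeepChem's actual implementation):

```python
def iterbatches(data, batch_size, epochs=1):
    """Yield minibatches, walking over the data `epochs` times."""
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            yield data[start:start + batch_size]

# Ten samples with batch_size=3 over two epochs give batch sizes
# [3, 3, 3, 1] twice, the behavior the updated unit tests assert.
sizes = [len(b) for b in iterbatches(list(range(10)), batch_size=3, epochs=2)]
```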

@@ -727,7 +726,7 @@ class NumpyDataset(Dataset):

  def iterbatches(self,
                  batch_size=None,
                  epoch=0,
                  epochs=1,
                  deterministic=False,
                  pad_batches=False):
    """Get an object that iterates over minibatches from the dataset.
@@ -739,7 +738,7 @@ class NumpyDataset(Dataset):
    ----------
    batch_size: int, optional
      Number of elements in each batch
    epoch: int, optional
    epochs: int, optional
      Number of epochs to walk over dataset
    deterministic: bool, optional
      If True, follow deterministic order.
@@ -751,14 +750,15 @@ class NumpyDataset(Dataset):
    Generator which yields tuples of four numpy arrays `(X, y, w, ids)`
    """

    def iterate(dataset, batch_size, deterministic, pad_batches):
    def iterate(dataset, batch_size, epochs, deterministic, pad_batches):
      n_samples = dataset._X.shape[0]
      if not deterministic:
        sample_perm = np.random.permutation(n_samples)
      else:
      if deterministic:
        sample_perm = np.arange(n_samples)
      if batch_size is None:
        batch_size = n_samples
      for epoch in range(epochs):
        if not deterministic:
          sample_perm = np.random.permutation(n_samples)
        batch_idx = 0
        num_batches = np.math.ceil(n_samples / batch_size)
        while batch_idx < num_batches:
@@ -776,7 +776,7 @@ class NumpyDataset(Dataset):
          batch_idx += 1
          yield (X_batch, y_batch, w_batch, ids_batch)

    return iterate(self, batch_size, deterministic, pad_batches)
    return iterate(self, batch_size, epochs, deterministic, pad_batches)
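The rewritten `iterate` keeps a fixed index order when `deterministic` is True and draws a fresh permutation at the start of every epoch otherwise. A stdlib-only sketch of that reshuffle-per-epoch pattern (standing in for the NumPy version; the names are illustrative):

```python
import math
import random

def iterate(X, batch_size, epochs=1, deterministic=False):
    # Fixed index order when deterministic; otherwise a fresh
    # permutation is drawn at the start of every epoch.
    n = len(X)
    perm = list(range(n))
    for _ in range(epochs):
        if not deterministic:
            perm = random.sample(range(n), n)
        for b in range(math.ceil(n / batch_size)):
            idx = perm[b * batch_size:(b + 1) * batch_size]
            yield [X[i] for i in idx]

# With deterministic=True, each epoch replays the same order.
out = list(iterate(["a", "b", "c"], batch_size=2, epochs=2, deterministic=True))
```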

  def itersamples(self):
    """Get an object that iterates over the samples in the dataset.
@@ -1151,7 +1151,7 @@ class DiskDataset(Dataset):

  def iterbatches(self,
                  batch_size=None,
                  epoch=0,
                  epochs=1,
                  deterministic=False,
                  pad_batches=False):
    """ Get an object that iterates over minibatches from the dataset.
@@ -1166,7 +1166,7 @@ class DiskDataset(Dataset):
      Number of elements in a batch. If None, then it yields batches
      with size equal to the size of each individual shard.
    epoch: int
      Not used
      Number of epochs to walk over dataset
    deterministic: bool
      Whether or not we should shuffle each shard before
      generating the batches.  Note that this is only local in the
@@ -1176,21 +1176,20 @@ class DiskDataset(Dataset):
      it has exactly batch_size elements.
    """
    shard_indices = list(range(self.get_number_shards()))
    return self._iterbatches_from_shards(shard_indices, batch_size,
    return self._iterbatches_from_shards(shard_indices, batch_size, epochs,
                                         deterministic, pad_batches)

  def _iterbatches_from_shards(self,
                               shard_indices,
                               batch_size=None,
                               epochs=1,
                               deterministic=False,
                               pad_batches=False):
    """Get an object that iterates over batches from a restricted set of shards."""

    def iterate(dataset, batch_size):
    def iterate(dataset, batch_size, epochs):
      num_shards = len(shard_indices)
      if not deterministic:
        shard_perm = np.random.permutation(num_shards)
      else:
      if deterministic:
        shard_perm = np.arange(num_shards)

      # (ytz): Depending on the application, thread-based pools may be faster
@@ -1198,16 +1197,17 @@ class DiskDataset(Dataset):
      # objects as an extra overhead. Also, as hideously as un-thread safe this looks,
      # we're actually protected by the GIL.
      pool = Pool(1)  # mp.dummy aliases ThreadPool to Pool
      next_shard = pool.apply_async(dataset.get_shard,
                                    (shard_indices[shard_perm[0]],))

      total_yield = 0

      if batch_size is None:
        num_global_batches = num_shards
      else:
        num_global_batches = math.ceil(dataset.get_shape()[0][0] / batch_size)

      for epoch in range(epochs):
        if not deterministic:
          shard_perm = np.random.permutation(num_shards)
        next_shard = pool.apply_async(dataset.get_shard,
                                      (shard_indices[shard_perm[0]],))
        cur_global_batch = 0
        cur_shard = 0
        carry = None
@@ -1218,7 +1218,7 @@ class DiskDataset(Dataset):
          if cur_shard < num_shards - 1:
            next_shard = pool.apply_async(
                dataset.get_shard, (shard_indices[shard_perm[cur_shard + 1]],))
        else:
          elif epoch == epochs - 1:
            pool.close()

          if carry is not None:
@@ -1285,7 +1285,7 @@ class DiskDataset(Dataset):
            cur_local_batch += 1
          cur_shard += 1

    return iterate(self, batch_size)
    return iterate(self, batch_size, epochs)
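The surrounding code overlaps I/O with computation: while one shard is being consumed, the next is already loading on a worker thread via `multiprocessing.dummy`. A stripped-down, single-epoch sketch of that prefetch loop (`load_shard` is a hypothetical loader):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, as in the diff

def iter_shards(load_shard, shard_ids):
    # Kick off the first load, then always schedule shard i+1
    # while shard i is handed to the caller.
    pool = Pool(1)
    nxt = pool.apply_async(load_shard, (shard_ids[0],))
    for i in range(len(shard_ids)):
        shard = nxt.get()
        if i < len(shard_ids) - 1:
            nxt = pool.apply_async(load_shard, (shard_ids[i + 1],))
        else:
            pool.close()
        yield shard

shards = list(iter_shards(lambda i: i * 10, [1, 2, 3]))
```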

  def itersamples(self):
    """Get an object that iterates over the samples in the dataset.
@@ -1922,7 +1922,7 @@ class ImageDataset(Dataset):

  def iterbatches(self,
                  batch_size=None,
                  epoch=0,
                  epochs=1,
                  deterministic=False,
                  pad_batches=False):
    """Get an object that iterates over minibatches from the dataset.
@@ -1931,14 +1931,15 @@ class ImageDataset(Dataset):
    w, ids).
    """

    def iterate(dataset, batch_size, deterministic, pad_batches):
    def iterate(dataset, batch_size, epochs, deterministic, pad_batches):
      n_samples = dataset._X_shape[0]
      if not deterministic:
        sample_perm = np.random.permutation(n_samples)
      else:
      if deterministic:
        sample_perm = np.arange(n_samples)
      if batch_size is None:
        batch_size = n_samples
      for epoch in range(epochs):
        if not deterministic:
          sample_perm = np.random.permutation(n_samples)
        batch_idx = 0
        num_batches = np.math.ceil(n_samples / batch_size)
        while batch_idx < num_batches:
@@ -1964,7 +1965,7 @@ class ImageDataset(Dataset):
          batch_idx += 1
          yield (X_batch, y_batch, w_batch, ids_batch)

    return iterate(self, batch_size, deterministic, pad_batches)
    return iterate(self, batch_size, epochs, deterministic, pad_batches)

  def itersamples(self):
    """Get an object that iterates over the samples in the dataset.
@@ -2143,7 +2144,7 @@ class Databag(object):
    ----------
    batch_size: int
      Number of samples from each dataset to return
    epoch: int
    epochs: int
      Number of times to loop through the datasets
    pad_batches: boolean
      Should all batches==batch_size
+4 −4
@@ -436,9 +436,9 @@ class TestDatasets(test_util.TensorFlowTestCase):
                  solubility_dataset.w, solubility_dataset.ids)
    batch_sizes = []
    for X, y, _, _ in solubility_dataset.iterbatches(
        3, pad_batches=False, deterministic=True):
        3, epochs=2, pad_batches=False, deterministic=True):
      batch_sizes.append(len(X))
    self.assertEqual([3, 3, 3, 1], batch_sizes)
    self.assertEqual([3, 3, 3, 1, 3, 3, 3, 1], batch_sizes)

  def test_disk_pad_batches(self):
    shard_sizes = [21, 11, 41, 21, 51]
@@ -663,9 +663,9 @@ class TestDatasets(test_util.TensorFlowTestCase):
        solubility_dataset)
    batch_sizes = []
    for X, y, _, _ in solubility_dataset.iterbatches(
        3, pad_batches=False, deterministic=True):
        3, epochs=2, pad_batches=False, deterministic=True):
      batch_sizes.append(len(X))
    self.assertEqual([3, 3, 3, 1], batch_sizes)
    self.assertEqual([3, 3, 3, 1, 3, 3, 3, 1], batch_sizes)

  def test_merge(self):
    """Test that dataset merge works."""
+6 −3
@@ -74,15 +74,18 @@ class TestImageDataset(test_util.TensorFlowTestCase):
    ds = dc.data.ImageDataset(files, np.random.random(10))
    X = ds.X
    iterated_ids = set()
    for x, y, w, ids in ds.iterbatches(2):
    for x, y, w, ids in ds.iterbatches(2, epochs=2):
      np.testing.assert_array_equal([2, 28, 28], x.shape)
      np.testing.assert_array_equal([2], y.shape)
      np.testing.assert_array_equal([2], w.shape)
      np.testing.assert_array_equal([2], ids.shape)
      for i in (0, 1):
        assert ids[i] in files
        if len(iterated_ids) < 10:
          assert ids[i] not in iterated_ids
          iterated_ids.add(ids[i])
        else:
          assert ids[i] in iterated_ids
        index = files.index(ids[i])
        np.testing.assert_array_equal(x[i], X[index])
    assert len(iterated_ids) == 10
+38 −23
@@ -7,6 +7,7 @@ import logging
import numpy as np
import os
import tempfile
import tarfile
from subprocess import call
from deepchem.utils.rdkit_util import add_hydrogens_to_mol
from subprocess import check_output
@@ -14,6 +15,7 @@ from deepchem.utils import rdkit_util
from deepchem.utils import mol_xyz_util
from deepchem.utils import geometry_utils
from deepchem.utils import vina_utils
from deepchem.utils import download_url

logger = logging.getLogger(__name__)

@@ -105,6 +107,8 @@ class VinaPoseGenerator(PoseGenerator):
      url = "http://vina.scripps.edu/download/autodock_vina_1_1_2_linux_x86.tgz"
      filename = "autodock_vina_1_1_2_linux_x86.tgz"
      dirname = "autodock_vina_1_1_2_linux_x86"
      self.vina_dir = os.path.join(data_dir, dirname)
      self.vina_cmd = os.path.join(self.vina_dir, "bin/vina")
    elif platform.system() == 'Darwin':
      if sixty_four_bits:
        url = "http://vina.scripps.edu/download/autodock_vina_1_1_2_mac_64bit.tar.gz"
@@ -114,26 +118,31 @@ class VinaPoseGenerator(PoseGenerator):
        url = "http://vina.scripps.edu/download/autodock_vina_1_1_2_mac.tgz"
        filename = "autodock_vina_1_1_2_mac.tgz"
        dirname = "autodock_vina_1_1_2_mac"
      self.vina_dir = os.path.join(data_dir, dirname)
      self.vina_cmd = os.path.join(self.vina_dir, "bin/vina")
    elif platform.system() == 'Windows':
      url = "http://vina.scripps.edu/download/autodock_vina_1_1_2_win32.msi"
      filename = "autodock_vina_1_1_2_win32.msi"
      self.vina_dir = "\\Program Files (x86)\\The Scripps Research Institute\\Vina"
      self.vina_cmd = os.path.join(self.vina_dir, "vina.exe")
    else:
      raise ValueError(
          "This class can only run on Linux or Mac. If you are on Windows, please try using a cloud platform to run this code instead."
          "Unknown operating system.  Try using a cloud platform to run this code instead."
      )
    self.vina_dir = os.path.join(data_dir, dirname)
    self.pocket_finder = pocket_finder
    if not os.path.exists(self.vina_dir):
      logger.info("Vina not available. Downloading")
      wget_cmd = "wget -nv -c -T 15 %s" % url
      check_output(wget_cmd.split())
      download_url(url, data_dir)
      downloaded_file = os.path.join(data_dir, filename)
      logger.info("Downloaded Vina. Extracting")
      untar_cmd = "tar -xzvf %s" % filename
      check_output(untar_cmd.split())
      logger.info("Moving to final location")
      mv_cmd = "mv %s %s" % (dirname, data_dir)
      check_output(mv_cmd.split())
      if platform.system() == 'Windows':
        msi_cmd = "msiexec /i %s" % downloaded_file
        check_output(msi_cmd.split())
      else:
        with tarfile.open(downloaded_file) as tar:
          tar.extractall(data_dir)
      logger.info("Cleanup: removing downloaded vina tar.gz")
      rm_cmd = "rm %s" % filename
      call(rm_cmd.split())
    self.vina_cmd = os.path.join(self.vina_dir, "bin/vina")
      os.remove(downloaded_file)

  def generate_poses(self,
                     molecular_complex,
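The hunk above replaces the external `wget`/`tar`/`rm` subprocesses with deepchem's `download_url` helper and Python's `tarfile` module. A minimal sketch of the extract-and-cleanup half (the download itself is assumed already done; the function name is illustrative):

```python
import os
import tarfile

def extract_and_cleanup(archive_path, dest_dir):
    # Stdlib replacement for the old `tar -xzvf` + `rm` subprocess calls;
    # download_url(url, data_dir) is assumed to have fetched archive_path.
    with tarfile.open(archive_path) as tar:
        tar.extractall(dest_dir)
    os.remove(archive_path)
```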
@@ -207,6 +216,8 @@ class VinaPoseGenerator(PoseGenerator):
    protein_pdbqt = os.path.join(out_dir, "%s.pdbqt" % protein_name)
    protein_mol = rdkit_util.load_molecule(
        protein_file, calc_charges=True, add_hydrogens=True)
    rdkit_util.write_molecule(protein_mol[1], protein_hyd, is_protein=True)
    rdkit_util.write_molecule(protein_mol[1], protein_pdbqt, is_protein=True)

    # Get protein centroid and range
    if centroid is not None and box_dims is not None:
@@ -215,9 +226,6 @@ class VinaPoseGenerator(PoseGenerator):
    else:
      if self.pocket_finder is None:
        logger.info("Pockets not specified. Will use whole protein to dock")
        rdkit_util.write_molecule(protein_mol[1], protein_hyd, is_protein=True)
        rdkit_util.write_molecule(
            protein_mol[1], protein_pdbqt, is_protein=True)
        protein_centroid = geometry_utils.compute_centroid(protein_mol[0])
        protein_range = mol_xyz_util.get_molecule_range(protein_mol[0])
        box_dims = protein_range + 5.0
@@ -276,10 +284,17 @@ class VinaPoseGenerator(PoseGenerator):
      log_file = os.path.join(out_dir, "%s_log.txt" % ligand_name)
      out_pdbqt = os.path.join(out_dir, "%s_docked.pdbqt" % ligand_name)
      logger.info("About to call Vina")
      call(
          "%s --config %s --log %s --out %s" % (self.vina_cmd, conf_file,
                                                log_file, out_pdbqt),
          shell=True)
      if platform.system() == 'Windows':
        args = [
            self.vina_cmd, "--config", conf_file, "--log", log_file, "--out",
            out_pdbqt
        ]
      else:
        # I'm not sure why specifying the args as a list fails on other platforms,
        # but for some reason it only works if I pass it as a string.
        args = "%s --config %s --log %s --out %s" % (self.vina_cmd, conf_file,
                                                     log_file, out_pdbqt)
      call(args, shell=True)
      ligands, scores = vina_utils.load_docked_ligands(out_pdbqt)
      docked_complexes += [(protein_mol[1], ligand) for ligand in ligands]
      all_scores += scores
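The in-code comment notes that the list form fails off-Windows; the likely reason is that with `shell=True` POSIX expects a single command string (in list form only `args[0]` reaches the shell as the command, and the rest become arguments to the shell itself). A sketch of the same platform switch (`run_vina` is an illustrative name, not DeepChem's API):

```python
import platform
from subprocess import call

def run_vina(vina_cmd, conf_file, log_file, out_pdbqt):
    # With shell=True, POSIX wants one command string; Windows joins a
    # list into a single command line, so either form works there.
    args = [vina_cmd, "--config", conf_file, "--log", log_file, "--out", out_pdbqt]
    if platform.system() != "Windows":
        args = " ".join(args)
    return call(args, shell=True)
```

On a POSIX box, `run_vina("echo", "c", "l", "o")` simply echoes the flags and returns exit code 0.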