Commit afe44833 authored by Bharath Ramsundar

Merge branch 'master' of https://github.com/deepchem/deepchem into binding_pocket_feat

parents 399f650e c0a17207
+62 −79
@@ -46,48 +46,28 @@ Installation from source is the only currently supported format. ```deepchem```
   conda install -c omnia openbabel=2.4.0
   ``` 

3. `pandas`
   ```bash
   conda install pandas 
   ```

4. `rdkit`
3. `rdkit`
   ```bash
   conda install -c omnia rdkit
   ```

5. `boost`
   ```bash
   conda install -c omnia boost=1.59.0
   ```

6. `joblib`
4. `joblib`
   ```bash
   conda install joblib 
   ```

7. `keras`
   ```bash
   pip install keras --user
   ```
   `deepchem` only supports the `tensorflow` backend for keras. To set the backend to `tensorflow`,
   add the following line to your `~/.bashrc`
5. `keras`
   ```bash
   export KERAS_BACKEND=tensorflow
   pip install keras
   ```
   See [keras docs](https://keras.io/backend/) for more details and alternate methods of setting backend.
   `deepchem` only supports the `tensorflow` (default) backend for keras.
   
8. `mdtraj`
6. `mdtraj`
   ```bash
   conda install -c omnia mdtraj
   ```

9. `scikit-learn`
   ```bash
   conda install scikit-learn 
   ```

10. `tensorflow`: Installing `tensorflow` on older versions of Linux (which
7. `tensorflow`: Installing `tensorflow` on older versions of Linux (which
    have glibc < 2.17) can be very challenging. For these older Linux versions,
    contact your local sysadmin to work out a custom installation. If your
    version of Linux is recent, then the following command will work:
@@ -95,12 +75,7 @@ Installation from source is the only currently supported format. ```deepchem```
    conda install -c https://conda.anaconda.org/jjhelmus tensorflow
    ```

11. `h5py`:
    ```
    conda install h5py
    ```

12. `deepchem`: Clone the `deepchem` github repo:
8. `deepchem`: Clone the `deepchem` github repo:
   ```bash
   git clone https://github.com/deepchem/deepchem.git
   ```
@@ -109,9 +84,9 @@ Installation from source is the only currently supported format. ```deepchem```
   python setup.py install
   ```

13. To run test suite, install `nosetests`:
9. To run test suite, install `nosetests`:
   ```bash
    pip install nose --user
   pip install nose
   ```
   Make sure that the correct version of `nosetests` is active by running
   ```bash
@@ -120,7 +95,7 @@ Installation from source is the only currently supported format. ```deepchem```
   You might need to uninstall a system install of `nosetests` if
   there is a conflict.

14. If installation has been successful, all tests in test suite should pass:
10. If installation has been successful, all tests in test suite should pass:
    ```bash
    nosetests -v deepchem --nologcapture 
    ```
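Before running the test suite, it can help to confirm that the packages from the steps above are importable. The following is a minimal sketch for Python 3. The helper `check_imports` is hypothetical (not part of `deepchem`), and the import names in `REQUIRED_MODULES` are assumed to match the package names.

```python
import importlib.util

# Assumption: these top-level import names match the packages installed above.
REQUIRED_MODULES = ["rdkit", "joblib", "keras", "mdtraj", "tensorflow",
                    "deepchem", "nose"]


def check_imports(module_names):
    """Return a dict mapping each top-level module name to True if it is importable."""
    status = {}
    for name in module_names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for modules left in an unusual state.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, found in check_imports(REQUIRED_MODULES).items():
        print("%-12s %s" % (name, "ok" if found else "MISSING"))
```

A module reported as `MISSING` points to the installation step to revisit. The check only locates each module and does not verify versions.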
@@ -181,10 +156,12 @@ Environmental Protection Agency, Environmental Research Laboratory, 1987.
Most machine learning algorithms require that input data form vectors.
However, input data for drug-discovery datasets routinely come in the
format of lists of molecules and associated experimental readouts. To
transform lists of molecules into vectors, we need to use the DeepChem
loader class ``dc.load.DataLoader``. Instances of this class must be
passed a ``Featurizer`` object. DeepChem provides a number of
different subclasses of ``Featurizer`` for convenience:
transform lists of molecules into vectors, we need to use subclasses of the DeepChem
loader class ```dc.data.DataLoader``` such as ```dc.data.CSVLoader``` or
```dc.data.SDFLoader```. Users can subclass ```dc.data.DataLoader``` to
load arbitrary file formats. All loaders must be
passed a ```dc.feat.Featurizer``` object. DeepChem provides a number of
different subclasses of ```dc.feat.Featurizer``` for convenience.
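The loader/featurizer split described above can be illustrated with a small standalone sketch. It does not use DeepChem. `ToyFeaturizer`, `AtomCountFeaturizer` and `ToyCSVLoader` are hypothetical stand-ins, and the CSV rows are made-up toy values. The point is only that the loader handles the file format and hands each molecule string to whatever featurizer it was given.

```python
import csv
import io


class ToyFeaturizer(object):
    """Hypothetical base class: turns one molecule string into a vector."""

    def featurize(self, smiles):
        raise NotImplementedError


class AtomCountFeaturizer(ToyFeaturizer):
    """Toy featurizer: counts a few characters in a SMILES string.

    This is a crude character count (e.g. 'Cl' would count as a 'C'),
    used only to produce fixed-length vectors for the illustration.
    """

    def __init__(self, symbols=("C", "N", "O")):
        self.symbols = symbols

    def featurize(self, smiles):
        return [smiles.count(s) for s in self.symbols]


class ToyCSVLoader(object):
    """Hypothetical loader: reads rows from CSV and delegates to a featurizer."""

    def __init__(self, tasks, smiles_field, featurizer):
        self.tasks = tasks
        self.smiles_field = smiles_field
        self.featurizer = featurizer

    def featurize(self, csv_file):
        X, y = [], []
        for row in csv.DictReader(csv_file):
            X.append(self.featurizer.featurize(row[self.smiles_field]))
            y.append([float(row[task]) for task in self.tasks])
        return X, y


# Made-up toy data: two molecules with one experimental readout each.
data = io.StringIO("smiles,readout\nCCO,-0.77\nCC(=O)N,0.5\n")
loader = ToyCSVLoader(tasks=["readout"], smiles_field="smiles",
                      featurizer=AtomCountFeaturizer())
X, y = loader.featurize(data)
print(X)
print(y)
```

Supporting a new file format means writing another loader, and supporting a new molecular representation means writing another featurizer. Neither change touches the other class.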

### Performances
* Classification
@@ -218,26 +195,26 @@ Random splitting

|Dataset    |Model               |Train score/ROC-AUC|Valid score/ROC-AUC|
|-----------|--------------------|-------------------|-------------------|
|tox21      |logistic regression |0.903              |0.741              |
|           |Multitask network   |0.846              |0.812              |
|           |robust MT-NN        |0.844              |0.793              |
|           |graph convolution   |0.872              |0.816              |
|muv        |logistic regression |0.961              |0.696              |
|           |Multitask network   |0.895              |0.740              |
|           |robust MT-NN        |0.914              |0.667              |
|           |graph convolution   |0.846              |0.776              |
|pcba       |logistic regression |0.807              |0.772              |
|           |Multitask network   |0.811              |0.787              |
|           |robust MT-NN        |0.809              |0.778              |
|           |graph convolution   |0.875              |0.844              |
|sider      |logistic regression |0.932              |0.628              |
|           |Multitask network   |0.779              |0.665              |
|           |robust MT-NN        |0.761              |0.621              |
|           |graph convolution   |0.706              |0.638              |
|toxcast    |logistic regression |0.737              |0.543              |
|           |Multitask network   |0.831              |0.684              |
|           |robust MT-NN        |0.814              |0.692              |
|           |graph convolution   |0.820              |0.692              |
|tox21      |logistic regression |0.903              |0.735              |
|           |Multitask network   |0.856              |0.783              |
|           |robust MT-NN        |0.855              |0.773              |
|           |graph convolution   |0.865              |0.827              |
|muv        |logistic regression |0.957              |0.719              |
|           |Multitask network   |0.902              |0.734              |
|           |robust MT-NN        |0.933              |0.732              |
|           |graph convolution   |0.860              |0.730              |
|pcba       |logistic regression |0.808              |0.776              |
|           |Multitask network   |0.811              |0.778              |
|           |robust MT-NN        |0.811              |0.771              |
|           |graph convolution   |0.872              |0.844              |
|sider      |logistic regression |0.929              |0.656              |
|           |Multitask network   |0.777              |0.655              |
|           |robust MT-NN        |0.804              |0.630              |
|           |graph convolution   |0.705              |0.618              |
|toxcast    |logistic regression |0.725              |0.586              |
|           |Multitask network   |0.836              |0.684              |
|           |robust MT-NN        |0.822              |0.681              |
|           |graph convolution   |0.820              |0.717              |

Scaffold splitting

@@ -269,11 +246,14 @@ Scaffold splitting
|Dataset    |Model               |Splitting   |Train score/R2|Valid score/R2|
|-----------|--------------------|------------|--------------|--------------|
|delaney    |MT-NN regression    |Index       |0.773         |0.574         |
|           |graphconv regression|Index       |0.964         |0.829         |
|           |graphconv regression|Index       |0.991         |0.825         |
|           |MT-NN regression    |Random      |0.769         |0.591         |
|           |graphconv regression|Random      |0.959         |0.821         |
|           |graphconv regression|Random      |0.996         |0.873         |
|           |MT-NN regression    |Scaffold    |0.782         |0.426         |
|           |graphconv regression|Scaffold    |0.976         |0.581         |
|           |graphconv regression|Scaffold    |0.994         |0.606         |
|nci        |MT-NN regression    |Index       |0.890         |0.890         |
|           |MT-NN regression    |Random      |0.891         |0.888         |
|           |MT-NN regression    |Scaffold    |0.912         |0.020         |
|kaggle     |MT-NN regression    |User-defined|0.748         |0.452         |

* General features
@@ -289,6 +269,7 @@ Number of tasks and examples in the datasets
|toxcast    |617        |8615       |
|delaney    |1          |1128       |
|kaggle     |15         |173065     |
|nci        |60         |1057371    |

Time needed for benchmark test (~20h in total)

@@ -315,6 +296,8 @@ Time needed for benchmark test(~20h in total)
|           |robust MT-NN        |80              |4000           |
|           |graph convolution   |80              |900            |
|delaney    |MT-NN regression    |10              |40             |
|           |graphconv regression|10              |40             |
|nci        |MT-NN regression    |2000            |30000          |
|kaggle     |MT-NN regression    |2200            |3200           |


+2 −2
@@ -34,7 +34,7 @@ class VinaGridRFDocker(Docker):
    """Builds model."""
    self.base_dir = tempfile.mkdtemp()
    print("About to download trained model.")
    call(("wget http://deepchem.io.s3-website-us-west-1.amazonaws.com/trained_models/random_full_RF.tar.gz").split())
    call(("wget -c http://deepchem.io.s3-website-us-west-1.amazonaws.com/trained_models/random_full_RF.tar.gz").split())
    call(("tar -zxvf random_full_RF.tar.gz").split())
    call(("mv random_full_RF %s" % (self.base_dir)).split())
    self.model_dir = os.path.join(self.base_dir, "random_full_RF")
@@ -60,7 +60,7 @@ class VinaGridDNNDocker(object):
    """Builds model."""
    self.base_dir = tempfile.mkdtemp()
    print("About to download trained model.")
    call(("wget http://deepchem.io.s3-website-us-west-1.amazonaws.com/trained_models/random_full_DNN.tar.gz").split())
    call(("wget -c http://deepchem.io.s3-website-us-west-1.amazonaws.com/trained_models/random_full_DNN.tar.gz").split())
    call(("tar -zxvf random_full_DNN.tar.gz").split())
    call(("mv random_full_DNN %s" % (self.base_dir)).split())
    self.model_dir = os.path.join(self.base_dir, "random_full_DNN")
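Review note on the `wget -c` change: the `-c` (continue) flag asks `wget` to resume a partially downloaded file instead of starting over. With `-c`, a rerun should also reuse an existing copy of the archive, whereas plain `wget` would typically save a second copy under a new name. The sketch below shows how such a command string becomes the argument list handed to `call`. The helper `build_wget_cmd` is hypothetical, and `shlex.split` is swapped in for the plain `.split()` used in the diff.

```python
import shlex


def build_wget_cmd(url, resume=True):
    """Build the argv list for a wget download, optionally with -c (continue)."""
    cmd = "wget -c %s" % url if resume else "wget %s" % url
    # shlex.split tokenizes like a shell would; for these simple strings it
    # gives the same result as str.split().
    return shlex.split(cmd)


argv = build_wget_cmd(
    "http://deepchem.io.s3-website-us-west-1.amazonaws.com/"
    "trained_models/random_full_RF.tar.gz")
print(argv)

# To actually run the download (requires wget on the PATH):
#   from subprocess import call
#   call(argv)
```

Passing a list to `call` avoids invoking a shell, which is the behavior the original `.split()` calls rely on as well.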
+1 −1
@@ -67,7 +67,7 @@ class VinaPoseGenerator(PoseGenerator):
      print("Vina not available. Downloading")
      # TODO(rbharath): May want to move this file to S3 so we can ensure it's
      # always available.
      wget_cmd = "wget http://vina.scripps.edu/download/autodock_vina_1_1_2_linux_x86.tgz"
      wget_cmd = "wget -c http://vina.scripps.edu/download/autodock_vina_1_1_2_linux_x86.tgz"
      call(wget_cmd.split())
      print("Downloaded Vina. Extracting")
      download_cmd = "tar xzvf autodock_vina_1_1_2_linux_x86.tgz"
+1 −1
@@ -25,7 +25,7 @@ class TestPoseScoring(unittest.TestCase):
  """
  def setUp(self):
    """Downloads dataset."""
    call("wget http://deepchem.io.s3-website-us-west-1.amazonaws.com/featurized_datasets/core_grid.tar.gz".split())
    call("wget -c http://deepchem.io.s3-website-us-west-1.amazonaws.com/featurized_datasets/core_grid.tar.gz".split())
    call("tar -zxvf core_grid.tar.gz".split())
    self.core_dataset = dc.data.DiskDataset("core_grid/")

+1 −1
@@ -110,7 +110,7 @@ def atom_features(atom, bool_id_feat=False):
         'Sb', 'Sn', 'Ag', 'Pd', 'Co', 'Se', 'Ti', 'Zn', 'H',    # H?
         'Li', 'Ge', 'Cu', 'Au', 'Ni', 'Cd', 'In', 'Mn', 'Zr',
         'Cr', 'Pt', 'Hg', 'Pb', 'Unknown']) +
        one_of_k_encoding(atom.GetDegree(), [0, 1, 2, 3, 4, 5, 6]) +
        one_of_k_encoding(atom.GetDegree(), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) +
        one_of_k_encoding_unk(atom.GetTotalNumHs(), [0, 1, 2, 3, 4]) +
        one_of_k_encoding_unk(atom.GetImplicitValence(), [0, 1, 2, 3, 4, 5, 6]) +
        [atom.GetFormalCharge(), atom.GetNumRadicalElectrons()] +
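Review note on the degree change: the hunk widens the allowable set passed to `one_of_k_encoding` for `atom.GetDegree()` from `[0, ..., 6]` to `[0, ..., 10]`. The bodies of the two helpers are not shown in this diff, so the sketch below is an assumption about how a strict and a lenient one-hot encoder might behave. It is not the repository's code. Under that assumption, the strict variant rejects a degree outside the list, which is what a wider list would avoid.

```python
def one_of_k_encoding(x, allowable_set):
    """Strict one-hot (assumed behavior): raise if x is not an allowed value."""
    if x not in allowable_set:
        raise ValueError("input %r not in allowable set %r" % (x, allowable_set))
    return [x == s for s in allowable_set]


def one_of_k_encoding_unk(x, allowable_set):
    """Lenient one-hot (assumed behavior): map unseen values to the last slot."""
    if x not in allowable_set:
        x = allowable_set[-1]
    return [x == s for s in allowable_set]


old_degrees = [0, 1, 2, 3, 4, 5, 6]
new_degrees = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# A degree-7 atom is rejected by the old list but encoded by the new one.
try:
    one_of_k_encoding(7, old_degrees)
except ValueError as err:
    print("rejected:", err)

encoded = one_of_k_encoding(7, new_degrees)
print(len(encoded), encoded.index(True))
```

The wider list also lengthens this part of the atom feature vector from 7 to 11 entries, so features computed before and after the change would not line up.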