Commit 2b1381cd authored by Bharath Ramsundar's avatar Bharath Ramsundar

Merge

parents 2350e182 0ccfdf41
+56 −10
@@ -38,6 +38,8 @@ Stanford and originally created by [Bharath Ramsundar](http://rbharath.github.io

Installation from source is the only currently supported method. ```deepchem``` currently supports both Python 2.7 and Python 3.5, but is not supported on any OS other than 64-bit Linux. Please follow the directions below precisely. While you may already have system versions of some of these packages, there is no guarantee that `deepchem` will work with versions other than those specified below.

### Full Anaconda distribution

1. Download the **64-bit** Python 2.7 or Python 3.5 version of Anaconda for Linux [here](https://www.continuum.io/downloads#_unix).
   
   Follow the [installation instructions](http://docs.continuum.io/anaconda/install#linux-install)
@@ -102,6 +104,26 @@ Installation from source is the only currently supported format. ```deepchem```
    Note that the full test-suite uses up a fair amount of memory. 
    Try running tests for one submodule at a time if memory proves an issue.

### Using a conda environment
Alternatively, you can install deepchem in a new conda environment using the following bash commands:

```bash
conda create -n deepchem python=3.5 -y                  # Create new env
source activate deepchem                                # Activate it
conda install -c omnia openbabel=2.4.0 rdkit mdtraj -y  # Installs from omnia channel
conda install joblib scikit-learn -y                    # Installs from default channel
pip install six tensorflow-gpu nose                     # Pip installs
git clone https://github.com/deepchem/deepchem.git      # Clone deepchem source code from GitHub
cd deepchem
python setup.py install                                 # Manual install
nosetests -v deepchem --nologcapture                    # Run tests
```
This creates a new conda environment named `deepchem` and installs the required dependencies into it. To use the environment later, run `source activate deepchem`.
See [this link](https://conda.io/docs/using/envs.html) for more information about
the benefits and usage of conda environments. **Warning**: Segmentation faults can [still happen](https://github.com/deepchem/deepchem/pull/379#issuecomment-277013514)
with this installation procedure.

## FAQ
1. Question: I'm seeing some failures in my test suite having to do with MKL
   ```Intel MKL FATAL ERROR: Cannot load libmkl_avx.so or libmkl_def.so.```
@@ -190,6 +212,10 @@ Index splitting
|           |Multitask network   |0.830              |0.678              |
|           |robust MT-NN        |0.825              |0.680              |
|           |graph convolution   |0.821              |0.720              |
|clintox    |logistic regression |0.967              |0.676              |
|           |Multitask network   |0.934              |0.830              |
|           |robust MT-NN        |0.949              |0.827              |
|           |graph convolution   |0.946              |0.860              |

Random splitting

@@ -215,6 +241,10 @@ Random splitting
|           |Multitask network   |0.836              |0.684              |
|           |robust MT-NN        |0.822              |0.681              |
|           |graph convolution   |0.820              |0.717              |
|clintox    |logistic regression |0.972              |0.725              |
|           |Multitask network   |0.951              |0.834              |
|           |robust MT-NN        |0.959              |0.830              |
|           |graph convolution   |0.975              |0.876              |

Scaffold splitting

@@ -240,17 +270,27 @@ Scaffold splitting
|           |Multitask network   |0.828              |0.617              |
|           |robust MT-NN        |0.830              |0.614              |
|           |graph convolution   |0.832              |0.638              |
|clintox    |logistic regression |0.960              |0.803              |
|           |Multitask network   |0.947              |0.862              |
|           |robust MT-NN        |0.953              |0.890              |
|           |graph convolution   |0.957              |0.823              |

* Regression

|Dataset         |Model               |Splitting   |Train score/R2|Valid score/R2|
|----------------|--------------------|------------|--------------|--------------|
-|delaney         |MT-NN regression    |Index       |0.773         |0.574         |
-|                |graphconv regression|Index       |0.991         |0.825         |
-|                |MT-NN regression    |Random      |0.769         |0.591         |
-|                |graphconv regression|Random      |0.996         |0.873         |
-|                |MT-NN regression    |Scaffold    |0.782         |0.426         |
-|                |graphconv regression|Scaffold    |0.994         |0.606         |
+|delaney         |MT-NN regression    |Index       |0.868         |0.578         |
+|                |graphconv regression|Index       |0.967         |0.790         |
+|                |MT-NN regression    |Random      |0.865         |0.574         |
+|                |graphconv regression|Random      |0.964         |0.782         |
+|                |MT-NN regression    |Scaffold    |0.866         |0.342         |
+|                |graphconv regression|Scaffold    |0.967         |0.606         |
+|sampl           |MT-NN regression    |Index       |0.917         |0.764         |
+|                |graphconv regression|Index       |0.982         |0.864         |
+|                |MT-NN regression    |Random      |0.908         |0.830         |
+|                |graphconv regression|Random      |0.987         |0.868         |
+|                |MT-NN regression    |Scaffold    |0.891         |0.217         |
+|                |graphconv regression|Scaffold    |0.985         |0.666         |
|nci             |MT-NN regression    |Index       |0.171         |0.062         |
|                |graphconv regression|Index       |0.123         |0.048         |
|                |MT-NN regression    |Random      |0.168         |0.085         |
@@ -263,14 +303,16 @@ Scaffold splitting
|chembl          |MT-NN regression    |Index       |0.443         |0.427         |
|                |MT-NN regression    |Random      |0.464         |0.434         |
|                |MT-NN regression    |Scaffold    |0.484         |0.361         |
-|gdb7            |MT-NN regression    |Index       |0.961         |0.011         |
-|                |MT-NN regression    |Random      |0.742         |0.732         |
+|gdb7            |MT-NN regression    |Index       |0.994         |0.010         |
+|                |MT-NN regression    |Random      |0.860         |0.773         |
|                |MT-NN regression    |User-defined|0.996         |0.996         | 
|kaggle          |MT-NN regression    |User-defined|0.748         |0.452         |

|Dataset         |Model               |Splitting   |Train score/MAE(kcal/mol)|Valid score/MAE(kcal/mol)|
|----------------|--------------------|------------|-------------------------|-------------------------|
-|gdb7            |MT-NN regression    |Index       |44.5                     |185.6                    |
-|                |MT-NN regression    |Random      |86.1                     |92.2                     |
+|gdb7            |MT-NN regression    |Index       |18.3                     |172.0                    |
+|                |MT-NN regression    |Random      |44.3                     |59.1                     |
|                |MT-NN regression    |User-defined|9.0                      |9.5                      |

* General features

@@ -283,7 +325,9 @@ Number of tasks and examples in the datasets
|pcba            |128        |439863     |
|sider           |27         |1427       |
|toxcast         |617        |8615       |
|clintox         |2          |1491       |
|delaney         |1          |1128       |
|sampl           |1          |643        |
|kaggle          |15         |173065     |
|nci             |60         |19127      |
|pdbbind(core)   |1          |195        |
@@ -320,6 +364,8 @@ Time needed for benchmark test(~20h in total)
|                |graph convolution   |80              |900            |
|delaney         |MT-NN regression    |10              |40             |
|                |graphconv regression|10              |40             |
|sampl           |MT-NN regression    |10              |30             |
|                |graphconv regression|10              |40             |
|nci             |MT-NN regression    |400             |1200           |
|                |graphconv regression|400             |2500           |
|pdbbind(core)   |MT-NN regression    |0(featurized)   |30             |
+17 −7
@@ -20,7 +20,7 @@ from deepchem.utils.save import load_sdf_files
from deepchem.feat import UserDefinedFeaturizer
from deepchem.data import DiskDataset

-def convert_df_to_numpy(df, tasks, id_field, verbose=False):
+def convert_df_to_numpy(df, tasks, verbose=False):
  """Transforms a dataframe containing deepchem input into numpy arrays"""
  n_samples = df.shape[0]
  n_tasks = len(tasks)
@@ -39,7 +39,7 @@ def convert_df_to_numpy(df, tasks, id_field, verbose=False):
      if y[ind, task] == "":
        missing[ind, task] = 1

-  ids = df[id_field].values
+  # ids = df[id_field].values
  # Set missing data to have weight zero
  for ind in range(n_samples):
    for task in range(n_tasks):
@@ -47,7 +47,7 @@ def convert_df_to_numpy(df, tasks, id_field, verbose=False):
        y[ind, task] = 0.
        w[ind, task] = 0.

-  return ids, y.astype(float), w.astype(float)
+  return y.astype(float), w.astype(float)

def featurize_smiles_df(df, featurizer, field, log_every_N=1000, verbose=True):
  """Featurize individual compounds in dataframe.
@@ -152,10 +152,20 @@ class DataLoader(object):
      for shard_num, shard in enumerate(self.get_shards(input_files, shard_size)):
        time1 = time.time()
        X, valid_inds = self.featurize_shard(shard)
-        ids, y, w = convert_df_to_numpy(shard, self.tasks, self.id_field)
-        # Filter out examples where featurization failed.
-        ids, y, w = (ids[valid_inds], y[valid_inds], w[valid_inds])
+        ids = shard[self.id_field].values
+        ids = ids[valid_inds]
+        if len(self.tasks) > 0:
+          # Featurize task results iff they exist.
+          y, w = convert_df_to_numpy(shard, self.tasks)
+          # Filter out examples where featurization failed.
+          y, w = (y[valid_inds], w[valid_inds])
+          assert len(X) == len(ids) == len(y) == len(w)
+        else:
+          # For prospective data where results are unknown, it makes
+          # no sense to have y values or weights.
+          y, w = (None, None)
+          assert len(X) == len(ids)

        time2 = time.time()
        log("TIMING: featurizing shard %d took %0.3f s" % (shard_num, time2-time1),
            self.verbose)
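The featurize change above branches on whether the loader has any tasks: labelled shards get filtered `y`/`w` arrays, while prospective (unlabelled) shards carry `None` for both. A minimal standalone sketch of that branching, using a hypothetical helper over plain arrays rather than deepchem's actual `DataLoader` API (here `X` is assumed to already contain only the successfully featurized rows):

```python
import numpy as np

def split_shard(X, ids, valid_inds, y=None, w=None):
    """Filter ids (and labels/weights, if present) down to the rows
    where featurization succeeded. y and w may be None for
    unlabelled, prospective data."""
    ids = np.asarray(ids)[valid_inds]
    if y is not None:
        # Labelled data: filter labels and weights alongside the ids.
        y, w = y[valid_inds], w[valid_inds]
        assert len(X) == len(ids) == len(y) == len(w)
    else:
        # Prospective data: there are no labels or weights to carry.
        y, w = None, None
        assert len(X) == len(ids)
    return X, ids, y, w
```

Calling it with `y=None` exercises the unlabelled path, mirroring the `len(self.tasks) == 0` branch in the diff.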
+79 −28
@@ -410,27 +410,40 @@ class DiskDataset(Dataset):
    metadata_entries should have elements returned by write_data_to_disk
    above.
    """
+    columns=('basename','task_names', 'ids', 'X', 'y', 'w')
     metadata_df = pd.DataFrame(
         metadata_entries,
-        columns=('basename','task_names', 'ids', 'X', 'y', 'w'))
+        columns=columns)
    return metadata_df

  @staticmethod
  def write_data_to_disk(data_dir, basename, tasks, X=None, y=None, w=None,
                         ids=None):
-    out_X = "%s-X.joblib" % basename
-    out_y = "%s-y.joblib" % basename
-    out_w = "%s-w.joblib" % basename
-    out_ids = "%s-ids.joblib" % basename
-
     if X is not None:
+      out_X = "%s-X.joblib" % basename
       save_to_disk(X, os.path.join(data_dir, out_X))
+    else:
+      out_X = None

     if y is not None:
+      out_y = "%s-y.joblib" % basename
       save_to_disk(y, os.path.join(data_dir, out_y))
+    else:
+      out_y = None

     if w is not None:
+      out_w = "%s-w.joblib" % basename
       save_to_disk(w, os.path.join(data_dir, out_w))
+    else:
+      out_w = None

     if ids is not None:
+      out_ids = "%s-ids.joblib" % basename
       save_to_disk(ids, os.path.join(data_dir, out_ids))
+    else:
+      out_ids = None

    # note that this corresponds to the _construct_metadata column order
    return [basename, tasks, out_ids, out_X, out_y, out_w]

  def save_to_disk(self):
@@ -531,15 +544,22 @@ class DiskDataset(Dataset):
      for _, row in dataset.metadata_df.iterrows():
        X = np.array(load_from_disk(
            os.path.join(dataset.data_dir, row['X'])))
-        ids = np.array(load_from_disk(
-            os.path.join(dataset.data_dir, row['ids'])), dtype=object)
+        # These columns may be missing if the dataset is unlabelled.
+        if row['y'] is not None:
           y = np.array(load_from_disk(
             os.path.join(dataset.data_dir, row['y'])))
+        else:
+          y = None
+        if row['w'] is not None:
           w_filename = os.path.join(dataset.data_dir, row['w'])
           if os.path.exists(w_filename):
               w = np.array(load_from_disk(w_filename))
           else:
               w = np.ones(y.shape)
+        else:
+          w = None
+        ids = np.array(load_from_disk(
+            os.path.join(dataset.data_dir, row['ids'])), dtype=object)
        yield (X, y, w, ids)
    return iterate(self)

@@ -576,8 +596,17 @@ class DiskDataset(Dataset):
          indices = range(interval_points[j], interval_points[j+1])
          perm_indices = sample_perm[indices]
          X_batch = X[perm_indices]

+          if y is not None:
             y_batch = y[perm_indices]
+          else:
+            y_batch = None
+
+          if w is not None:
             w_batch = w[perm_indices]
+          else:
+            w_batch = None

          ids_batch = ids[perm_indices]
          if pad_batches:
            (X_batch, y_batch, w_batch, ids_batch) = pad_batch(
@@ -597,7 +626,12 @@ class DiskDataset(Dataset):
        for (X_shard, y_shard, w_shard, ids_shard) in dataset.itershards():
            n_samples = X_shard.shape[0]
            for i in range(n_samples):
-                yield (X_shard[i], y_shard[i], w_shard[i], ids_shard[i])
+                def sanitize(elem):
+                  if elem is None:
+                    return None
+                  else:
+                    return elem[i]
+                yield map(sanitize, [X_shard, y_shard, w_shard, ids_shard])
    return iterate(self)
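The `sanitize` closure above indexes each shard component only when it is not None. The None-tolerant per-sample iteration can be sketched on its own (a hypothetical generator, assuming each shard is an `(X, y, w, ids)` tuple where `y` and `w` may be None):

```python
def itersamples(shards):
    """Yield one (X, y, w, ids) sample at a time; y or w stays None
    for unlabelled shards instead of being indexed."""
    for X, y, w, ids in shards:
        for i in range(len(X)):
            # Index only the components that exist for this shard.
            yield tuple(None if part is None else part[i]
                        for part in (X, y, w, ids))
```

Yielding a tuple here is a deliberate choice: in Python 3 the `map(sanitize, ...)` form in the diff produces a lazy, one-shot iterator, whereas a tuple keeps each sample indexable and reusable by callers.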

  def transform(self, fn, **args):
@@ -755,13 +789,23 @@ class DiskDataset(Dataset):
    row = self.metadata_df.iloc[i]
    X = np.array(load_from_disk(
        os.path.join(self.data_dir, row['X'])))

+    if row['y'] is not None:
       y = np.array(load_from_disk(
         os.path.join(self.data_dir, row['y'])))
+    else:
+      y = None
+
+    if row['w'] is not None:
+      # TODO (ytz): Under what condition does this exist but the file itself doesn't?
       w_filename = os.path.join(self.data_dir, row['w'])
       if os.path.exists(w_filename):
           w = np.array(load_from_disk(w_filename))
       else:
           w = np.ones(y.shape)
+    else:
+      w = None
+
    ids = np.array(load_from_disk(
        os.path.join(self.data_dir, row['ids'])), dtype=object)
    return (X, y, w, ids)
@@ -876,7 +920,7 @@ class DiskDataset(Dataset):
    """
    total = 0
    for _, row in self.metadata_df.iterrows():
-      y = load_from_disk(os.path.join(self.data_dir, row['y']))
+      y = load_from_disk(os.path.join(self.data_dir, row['ids']))
      total += len(y)
    return total

@@ -884,17 +928,24 @@ class DiskDataset(Dataset):
    """Finds shape of dataset."""
    n_tasks = len(self.get_task_names())
    X_shape = np.array((0,) + (0,) * len(self.get_data_shape())) 
+    ids_shape = np.array((0,))
+    if n_tasks > 0:
       y_shape = np.array((0,) + (0,))
       w_shape = np.array((0,) + (0,))
-    ids_shape = np.array((0,))
+    else:
+      y_shape = tuple()
+      w_shape = tuple()

    for shard_num, (X, y, w, ids) in enumerate(self.itershards()):
      if shard_num == 0:
        X_shape += np.array(X.shape)
+        if n_tasks > 0:
          y_shape += np.array(y.shape)
          w_shape += np.array(w.shape)
        ids_shape += np.array(ids.shape)
      else:
        X_shape[0] += np.array(X.shape)[0]
+        if n_tasks > 0:
          y_shape[0] += np.array(y.shape)[0]
          w_shape[0] += np.array(w.shape)[0]
        ids_shape[0] += np.array(ids.shape)[0]
+9 −0
@@ -91,3 +91,12 @@ def load_gaussian_cdf_data():
  loader = dc.data.UserCSVLoader(
      tasks=tasks, featurizer=featurizer, id_field="id")
  return loader.featurize(input_file)

+def load_unlabelled_data():
+  current_dir = os.path.dirname(os.path.abspath(__file__))
+  featurizer = dc.feat.CircularFingerprint(size=1024)
+  tasks = []
+  input_file = os.path.join(current_dir, "../../data/tests/no_labels.csv")
+  loader = dc.data.CSVLoader(
+      tasks=tasks, smiles_field="smiles", featurizer=featurizer)
+  return loader.featurize(input_file)
\ No newline at end of file
+26 −0
smiles,id
O=C1CCc2c(N1)[c-]c([c-][c-]2)OCCCC[N+]1([O-])CCN(CC1)c1[c-][c-][c-]c(c1Cl)Cl,48866084_50429806
O=C1CCc2c(N1)[c-]c([c-][c-]2)OCCCCN1CC[N+](CC1)([O-])c1[c-][c-][c-]c(c1Cl)Cl,48866086_50429808
CO[C@H]1O[C@H]2O[C@]3(C)CC[C@H]4[C@@]2([C@@H]([C@H]1C)CC[C@@H]4C)OO3,48866088_48866087
O=C1O[C@@H]2O[C@]3(C)CC[C@H]4[C@@]2([C@H]([C@@H]1C)CC[C@@H]4C)OO3,48866090_48866089
O=C1O[C@@H]2O[C@]3(C)CC[C@H]4[C@@]2([C@H](C1=C)CC[C@@H]4C)OO3,48866092_48866091
OCC1O[C@@H](O[C@@H]2C[C@@H](C(=O)O)[C@@H]3[C@](C2)(C)[C@@H]2CC[C@@H]4C[C@@]2(CC3)[C@@H](O)C4=C)C(C([C@@H]1OS(=O)(=O)[O-])OS(=O)(=O)[O-])OC(=O)CC(C)C.[Na+].[Na+],48866104_48866103
OC1C[C@@H](O[C@@H]1COP(=O)(O)O)n1cnc(nc1=O)N,48866106_48866105
C/C=C(/C(=O)OC1C[C@H](OC(=O)C)C2([C@@H]3[C@@]41CO[C@@]([C@H]4[C@@](C)([C@H]([C@H]3OC2)O)[C@@]12OC2(C)C2CC1O[C@@H]1C2(O)C=CO1)(O)C(=O)OC)C(=O)OC)\C,48866108_48866107
CN1CCC(=C2c3[c-][c-][c-][c-]c3CCc3c2n[c-][c-][c-]3)CC1.OC(=O)/C=C\C(=O)O,48866111_33542275
Clc1[c-][c-]c([c-][c-]1)Cc1nn(C2CCC[N+](CC2)([O-])C)c(=O)c2c1[c-][c-][c-][c-]2,48866115_48866114
CC[C@@H]1OC(=O)[C@H](C)[C@H](OC2OC(C)C(C(C2)(C)OC)O)[C@@H](C)[C@H](OC2OC(C)CC(C2O)[N+](C)(C)[O-])[C@](C[C@@H](CN([C@@H]([C@H](C1(C)O)O)C)C)C)(C)O,48866130_48866129
CO/C=C(\c1[c-][c-][c-][c-]c1Oc1n[c-]nc([c-]1)Oc1[c-][c-][c-][c-]c1C#N)/C(=O)OC,48866134_207297540
COC(=O)C1=C(C)NC(=C([C@@H]1c1cccc(c1)[N+](=O)[O-])C(=O)O[C@H]1CCN(C1)Cc1ccccc1)C.Cl,48866140_48866139
O=S1(=O)N[C@H](Cc2[c-][c-][c-][c-][c-]2)Nc2c1[c-]c(c([c-]2)C(F)(F)F)S(=O)(=O)N,48866148_48866147
O=S1(=O)N[C@@H](Cc2[c-][c-][c-][c-][c-]2)Nc2c1[c-]c(c([c-]2)C(F)(F)F)S(=O)(=O)N,48866150_48866149
[c-]1[c-][c-]c([c-][c-]1)/C=N/N=C/c1[c-][c-][c-][c-][c-]1,48866152_48866151
O=C(c1[c-][c-][c-][c-][c-]1)NOCC(=O)O,48866154_48866153
CC(CC(c1[c-][c-]c([c-][c-]1)OCCOCC[N+](Cc1[c-][c-][c-][c-][c-]1)(C)C)(C)C)(C)C.[Cl-],48866156_515814
O=C1CN(C1)C(c1[c-][c-][c-][c-][c-]1)c1[c-][c-][c-][c-][c-]1,48866158_48866157
OC(=O)c1[c-][c-]c2c([c-]1)n[c-]n2,48866160_48866159
Cc1c(OCC(F)(F)F)[c-][c-]n2c1c(Sc1nc3c(n1)[c-][c-][c-][c-]3)n1c2nc2c1[c-][c-][c-][c-]2,48866162_48866161
CCc1oc2c(c1C(=O)c1[c-]c(I)c(c([c-]1)I)O)[c-][c-][c-][c-]2,48866164_48866163
[c-]1[c-]c2[c-]c3c4[c-][c-][c-][c-]c4[c-][c-]c3c3c2c([c-]1)[C-]=[C-]3.[c-]1[c-][c-]c2c([c-]1)[c-]c1c3c2[C-]=[C-]c3[c-]c2c1[c-][c-][c-][c-]2,48866166_48866165
O=C1CC(=O)Nc2c(N1)[c-][c-][c-][c-]2,48866168_48866167
ClCC(=O)N1[C@@H](Cc2c([C@H]1c1[c-][c-]c3c([c-]1)OCO3)nc1c2[c-][c-][c-][c-]1)C(=O)OC,48866170_207350992