Commit b21fc642 authored by Karl Leswing, committed by GitHub

Merge branch 'master' into rdkit-upgrade

parents 367e6f42 4f4d72fb
+1 −1
@@ -15,7 +15,7 @@ install:
 - conda config --add channels http://conda.binstar.org/omnia
 - bash scripts/install_deepchem_conda.sh deepchem
 - source activate deepchem
-- pip install yapf==0.17.0
+- pip install yapf==0.19.0
 - pip install coveralls
 - python setup.py install
 script:
+2 −2
@@ -2,7 +2,7 @@ FROM nvidia/cuda

 # Install some utilities
 RUN apt-get update && \
-    apt-get install -y -q wget git libxrender1 && \
+    apt-get install -y -q wget git libxrender1 libsm6 && \
     apt-get clean

# Install miniconda
@@ -21,7 +21,7 @@ ENV PATH /miniconda/bin:$PATH
 # TODO: Get rid of this when there is a stable release of deepchem.
 RUN git clone https://github.com/deepchem/deepchem.git && \
     cd deepchem && \
-    git checkout tags/1.3.0 && \
+    git checkout tags/1.3.1 && \
     sed -i -- 's/tensorflow$/tensorflow-gpu/g' scripts/install_deepchem_conda.sh && \
     bash scripts/install_deepchem_conda.sh root && \
     python setup.py develop
+5 −544

File changed.

Preview size limit exceeded, changes collapsed.

+30 −0
# mol2vec implementation

In the recent mol2vec [paper](https://chemrxiv.org/articles/Mol2vec_Unsupervised_Machine_Learning_Approach_with_Chemical_Intuition/5513581), Jaeger et al. treat the features returned by the RDKit Morgan fingerprint as "words" and a compound as a "sentence", and use them to generate fixed-length embeddings. Here we reproduce 200-element embeddings from a download of all SDF files in the PubChem compound database.
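The "words" and "sentence" idea can be sketched without RDKit installed. The snippet below is only an illustration: it assumes the dictionary layout that RDKit fills in for `bitInfo` (identifier mapped to a tuple of `(atom_index, radius)` pairs), the helper name `sentence_from_bit_info` is hypothetical, the identifiers are made up, and the ordering (by atom index, then radius) is an assumption rather than something taken from `train_mol2vec.sh`.

```python
def sentence_from_bit_info(bit_info):
    """Turn a bitInfo-style dict into an ordered list of "words".

    bit_info maps identifier -> tuple of (atom_index, radius) pairs.
    """
    occurrences = []
    for identifier, hits in bit_info.items():
        for atom_index, radius in hits:
            occurrences.append((atom_index, radius, str(identifier)))
    # Assumed ordering: by atom, then by radius, so each atom contributes
    # its radius-0 word before its radius-1 word.
    occurrences.sort(key=lambda item: (item[0], item[1]))
    return [word for _, _, word in occurrences]


# Hypothetical identifiers, for illustration only (not real RDKit output).
example_info = {
    111: ((0, 0), (2, 0)),
    222: ((1, 0),),
    333: ((0, 1),),
}
sentence = sentence_from_bit_info(example_info)
print(sentence)
```

Each string in the resulting list plays the role of one word; a corpus is then one such sentence per compound.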

## Setup

Ensure that gensim is installed via:

```bash
pip install gensim
```

## Creating training corpus

First, download the PubChem compound SDF corpus by running:

```bash
bash ../pubchem_dataset/download_pubchem_ftp.sh
```
Note: the script assumes that a /media/data/pubchem directory exists for this large download (approximately 19 GB as of November 2017).
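Before starting the download, it may be worth checking that the target directory exists and has enough free space. The following is a minimal sketch using only the standard library; the helper name `has_room_for_download` is hypothetical, and the 19 GB threshold is just the approximate figure quoted above.

```python
import os
import shutil


def has_room_for_download(path, required_bytes):
    """Return True if path is an existing directory with enough free space."""
    if not os.path.isdir(path):
        return False
    return shutil.disk_usage(path).free >= required_bytes


# Approximate corpus size quoted above; adjust as the corpus grows.
REQUIRED_BYTES = 19 * 10**9
print(has_room_for_download("/media/data/pubchem", REQUIRED_BYTES))
```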

Then generate the embeddings file via:

```bash
./train_mol2vec.sh
```

You can then use these embeddings as a fixed-length alternative to fingerprints derived directly from RDKit. A full implementation as a featurizer for deepchem is a work in progress.

Example code for using the vec.txt file created by the above script can be found in eval_mol2vec_results.
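For readers without gensim at hand, the way a compound embedding is assembled from vec.txt can be sketched in plain Python. This assumes vec.txt is in the word2vec text format (a header line with the word count and vector size, then one word and its values per line), which is what `load_word2vec_format` reads by default. The helper names and the tiny three-element sample are made up for illustration, and skipping unknown words is an assumption, not necessarily what the paper does.

```python
import io


def load_word2vec_text(handle):
    """Parse word2vec text format: 'count dim' header, then 'word v1 ... v_dim'."""
    _count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def embed(words, vectors, dim):
    """Sum the vectors of all known words; unknown words are skipped."""
    total = [0.0] * dim
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


# Tiny made-up file with 3-element vectors (the real vec.txt has 200 elements).
sample = io.StringIO("2 3\n111 1.0 0.0 2.0\n222 0.5 0.5 0.5\n")
vectors, dim = load_word2vec_text(sample)
print(embed(["111", "222", "999"], vectors, dim))
```

The sum is the same aggregation that eval_mol2vec_results performs with NumPy over the identifiers of a single molecule.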
+29 −0
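from gensim import models
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
import numpy as np


def main():
    # vec.txt is the word2vec-format text file produced by train_mol2vec.sh
    model = models.KeyedVectors.load_word2vec_format("vec.txt")
    embeddings = list()

    # Using canonical smiles for glycine, as in original research paper
    mol = Chem.MolFromSmiles("C(C(=O)O)N")
    try:
        # bitInfo is filled with {identifier: ((atom_index, radius), ...)}
        info = {}
        rdMolDescriptors.GetMorganFingerprint(mol, 0, bitInfo=info)
        # Sum the vectors of all identifiers into one 200-element embedding
        totalvec = np.zeros(200)
        for k in info.keys():
            # Index the KeyedVectors object directly rather than through .wv
            wordvec = model[str(k)]
            totalvec = np.add(totalvec, wordvec)
        embeddings.append(totalvec)
    except Exception as e:
        # For example, a KeyError when an identifier is missing from vec.txt
        print(e)

    # Only print if the embedding was actually computed
    if embeddings:
        print(embeddings[0])


if __name__ == "__main__":
    main()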
