Commit c9eaf1b6 authored by Bharath Ramsundar, committed by GitHub

Merge pull request #1916 from ncfrey/materials_featurizers

[WIP] Adding inorganic crystal featurizers
parents 5a442e5e d3f25875
+3 −1
@@ -47,12 +47,14 @@ DeepChem has a number of "soft" requirements. These are packages which are neede

- [BioPython](https://biopython.org/wiki/Documentation)
- [OpenAI Gym](https://gym.openai.com/)
- [matminer](https://hackingmaterials.lbl.gov/matminer/)
- [MDTraj](http://mdtraj.org/)
- [NetworkX](https://networkx.github.io/documentation/stable/index.html)
- [OpenMM](http://openmm.org/)
- [PDBFixer](https://github.com/pandegroup/pdbfixer)
- [Pillow](https://pypi.org/project/Pillow/)
- [pyGPGO](https://pygpgo.readthedocs.io/en/latest/)
- [Pymatgen](https://pymatgen.org/)
- [PyTorch](https://pytorch.org/)
- [RDKit](http://www.rdkit.org/docs/Install.html)
- [simdna](https://github.com/kundajelab/simdna)
@@ -214,7 +216,7 @@ sudo apt-get install -y libxrender-dev

## Getting Started

The DeepChem project maintains an extensive colelction of [tutorials](https://github.com/deepchem/deepchem/tree/master/examples/tutorials). All tutorials are designed to be run on Google colab (or locally if you prefer). Tutorials are arranged in a suggested learning sequence which will take you from beginner to proficient at molecular machine learning and computational biology more broadly.
The DeepChem project maintains an extensive collection of [tutorials](https://github.com/deepchem/deepchem/tree/master/examples/tutorials). All tutorials are designed to be run on Google colab (or locally if you prefer). Tutorials are arranged in a suggested learning sequence which will take you from beginner to proficient at molecular machine learning and computational biology more broadly.

After working through the tutorials, you can also go through other [examples](https://github.com/deepchem/deepchem/tree/master/examples). To apply `deepchem` to a new problem, try starting from one of the existing examples or tutorials and modifying it step by step to work with your new use-case. If you have questions or comments you can raise them on our [gitter](https://gitter.im/deepchem/Lobby).

+1 −0
@@ -23,3 +23,4 @@ from deepchem.feat.atomic_coordinates import AtomicCoordinates
from deepchem.feat.atomic_coordinates import NeighborListComplexAtomicCoordinates
from deepchem.feat.adjacency_fingerprints import AdjacencyFingerprint
from deepchem.feat.smiles_featurizers import SmilesToSeq, SmilesToImage
from deepchem.feat.materials_featurizers import ElementPropertyFingerprint, SineCoulombMatrix, StructureGraphFeaturizer
+299 −0
"""
Featurizers for inorganic crystals.
"""

import numpy as np

from deepchem.feat import Featurizer
from deepchem.utils import pad_array


class ElementPropertyFingerprint(Featurizer):
  """
  Fingerprint of elemental properties from composition.

  Based on the data source chosen, returns properties and statistics
  (min, max, range, mean, standard deviation, mode) for a compound
  based on elemental stoichiometry. E.g., the average electronegativity
  of atoms in a crystal structure. The chemical fingerprint is a 
  vector of these statistics. For a full list of properties and statistics,
  see ``matminer.featurizers.composition.ElementProperty(data_source).feature_labels()``.

  This featurizer requires the optional dependencies pymatgen and
  matminer. It may be useful when only crystal compositions are available
  (and not 3D coordinates).

  References
  ----------
  MagPie data: Ward, L. et al. npj Comput Mater 2, 16028 (2016).
    https://doi.org/10.1038/npjcompumats.2016.28

  Deml data: Deml, A. et al. Physical Review B 93, 085142 (2016).
    https://doi.org/10.1103/PhysRevB.93.085142

  Matminer: Ward, L. et al. Comput. Mater. Sci. 152, 60-69 (2018).

  Pymatgen: Ong, S.P. et al. Comput. Mater. Sci. 68, 314-319 (2013). 

  """

  def __init__(self, data_source='matminer'):
    """
    Parameters
    ----------
    data_source : {"matminer", "magpie", "deml"}
      Source for element property data.

    """

    self.data_source = data_source

  def _featurize(self, comp):
    """
    Calculate chemical fingerprint from crystal composition.

    Parameters
    ----------
    comp : str
      Reduced formula of crystal.

    Returns
    -------
    feats: np.ndarray
      Vector of properties and statistics derived from chemical
      stoichiometry. Some values may be NaN.

    """

    from pymatgen import Composition
    from matminer.featurizers.composition import ElementProperty

    # Get pymatgen Composition object
    c = Composition(comp)

    ep = ElementProperty.from_preset(self.data_source)

    try:
      feats = ep.featurize(c)
    except Exception:
      # Featurization failed for this composition; return an empty vector.
      feats = []

    return np.array(feats)


class SineCoulombMatrix(Featurizer):
  """
  Calculate sine Coulomb matrix for crystals.

  A variant of Coulomb matrix for periodic crystals.

  The sine Coulomb matrix is identical to the Coulomb matrix, except
  that the inverse distance function is replaced by the inverse of
  sin**2 of the vector between sites which are periodic in the 
  dimensions of the crystal lattice.

  Features are flattened into a vector of matrix eigenvalues by default
  for ML-readiness. To ensure that all feature vectors are equal
  length, the maximum number of atoms (eigenvalues) in the input
  dataset must be specified.

  This featurizer requires the optional dependencies pymatgen and
  matminer. It may be useful when crystal structures with 3D coordinates 
  are available.

  References
  ----------
  Faber et al. Inter. J. Quantum Chem. 115, 16, 2015.

  """

  def __init__(self, max_atoms, flatten=True):
    """
    Parameters
    ----------
    max_atoms : int
      Maximum number of atoms for any crystal in the dataset. Used to
      pad the Coulomb matrix.
    flatten : bool (default True)
      Return flattened vector of matrix eigenvalues.

    """

    self.max_atoms = int(max_atoms)
    self.flatten = flatten

  def _featurize(self, struct):
    """
    Calculate sine Coulomb matrix from pymatgen structure.

    Parameters
    ----------
    struct : dict
      JSON-serializable dictionary representation of a
      pymatgen.core.structure.Structure object.
      https://pymatgen.org/pymatgen.core.structure.html

    Returns
    -------
    features: np.ndarray
      2D sine Coulomb matrix with shape (max_atoms, max_atoms),
      or 1D matrix eigenvalues with shape (max_atoms,). 

    """

    from pymatgen import Structure
    from matminer.featurizers.structure import SineCoulombMatrix as SCM

    s = Structure.from_dict(struct)

    # Get full N x N SCM
    scm = SCM(flatten=False)
    sine_mat = scm.featurize(s)

    if self.flatten:
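      eigs, _ = np.linalg.eig(sine_mat)
      # Flatten in case the eigenvalues carry a leading batch axis
      # (e.g. when the featurizer wraps the matrix in a list).
      eigs = np.ravel(eigs)
      zeros = np.zeros((self.max_atoms,))
      zeros[:len(eigs)] = eigs
      features = zeros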
    else:
      features = pad_array(sine_mat, self.max_atoms)

    features = np.asarray(features)

    return features


class StructureGraphFeaturizer(Featurizer):
  """
  Calculate structure graph features for crystals.

  Based on the implementation in Crystal Graph Convolutional
  Neural Networks (CGCNN). The method constructs a crystal graph
  representation including atom features (atomic numbers) and bond
  features (neighbor distances). Neighbors are determined by searching
  in a sphere around atoms in the unit cell. A Gaussian filter is
  applied to neighbor distances. All units are in angstrom.  

  This featurizer requires the optional dependency pymatgen. It may
  be useful when 3D coordinates are available and when using graph 
  network models and crystal graph convolutional networks.

  References
  ----------
  T. Xie and J. C. Grossman, Phys. Rev. Lett. 120, 2018.

  """

  def __init__(self, radius=8.0, max_neighbors=12, step=0.2):
    """
    Parameters
    ----------
    radius : float (default 8.0)
      Radius of sphere for finding neighbors of atoms in unit cell.
    max_neighbors : int (default 12)
      Maximum number of neighbors to consider when constructing graph.
    step : float (default 0.2)
      Step size for Gaussian filter.

    """

    self.radius = radius
    self.max_neighbors = int(max_neighbors)
    self.step = step

  def _featurize(self, struct):
    """
    Calculate crystal graph features from pymatgen structure.

    Parameters
    ----------
    struct : dict
      JSON-serializable dictionary representation of a
      pymatgen.core.structure.Structure object.
      https://pymatgen.org/pymatgen.core.structure.html

    Returns
    -------
    feats: np.array
      Atomic and bond features. Atomic features are atomic numbers 
      and bond features are Gaussian filtered interatomic distances.

    """

    from pymatgen import Structure

    # Get pymatgen structure object
    s = Structure.from_dict(struct)

    features = self._get_structure_graph_features(s)
    features = np.array(features)

    return features

  def _get_structure_graph_features(self, struct):
    """
    Calculate structure graph features from pymatgen structure.

    Parameters
    ----------
    struct : pymatgen.core.structure.Structure
      A periodic crystal composed of a lattice and a sequence of atomic
      sites with 3D coordinates and elements.

    Returns
    -------
    feats: tuple[np.array]
      atomic numbers, filtered interatomic distance tensor, and neighbor ids
    
    """

    atom_features = np.array([site.specie.Z for site in struct], dtype='int32')

    neighbors = struct.get_all_neighbors(self.radius, include_index=True)
    neighbors = [sorted(n, key=lambda x: x[1]) for n in neighbors]

    # Get list of lists of neighbor distances
    neighbor_features, neighbor_idx = [], []
    for neighbor in neighbors:
      if len(neighbor) < self.max_neighbors:
        neighbor_idx.append(
            list(map(lambda x: x[2], neighbor)) +
            [0] * (self.max_neighbors - len(neighbor)))
        neighbor_features.append(
            list(map(lambda x: x[1], neighbor)) +
            [self.radius + 1.] * (self.max_neighbors - len(neighbor)))
      else:
        neighbor_idx.append(
            list(map(lambda x: x[2], neighbor[:self.max_neighbors])))
        neighbor_features.append(
            list(map(lambda x: x[1], neighbor[:self.max_neighbors])))

    neighbor_features = np.array(neighbor_features)
    neighbor_idx = np.array(neighbor_idx)
    neighbor_features = self._gaussian_filter(neighbor_features)
    neighbor_features = np.vstack(neighbor_features)

    return (atom_features, neighbor_features, neighbor_idx)

  def _gaussian_filter(self, distances):
    """
    Apply Gaussian filter to an array of interatomic distances.

    Parameters
    ----------
    distances : np.array
      Matrix of distances of dimension (num atoms) x (max neighbors). 

    Returns
    -------
    expanded_distances: np.array 
      Expanded distance tensor after Gaussian filtering. Dimensionality
      is (num atoms) x (max neighbors) x (len(filt))
    
    """

    filt = np.arange(0, self.radius + self.step, self.step)

    # Increase dimension of distance tensor and apply filter
    expanded_distances = np.exp(
        -(distances[..., np.newaxis] - filt)**2 / self.step**2)

    return expanded_distances
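
The Gaussian expansion in `_gaussian_filter` can be tried in isolation. The following is a minimal sketch, assuming only NumPy; the function name `gaussian_expand` and the toy distances are hypothetical and not part of this commit. It shows how each scalar distance becomes a vector of responses, one per filter center.

```python
import numpy as np


def gaussian_expand(distances, radius=8.0, step=0.2):
    """Expand distances on a grid of Gaussian centers (illustrative only)."""
    # Filter centers: 0, step, 2 * step, ... up to roughly `radius`.
    centers = np.arange(0, radius + step, step)
    # Broadcasting adds a trailing axis with one entry per center.
    return np.exp(-(distances[..., np.newaxis] - centers) ** 2 / step ** 2)


# One atom with three neighbor slots. The last entry mimics a padded slot,
# which the featurizer fills with radius + 1.
distances = np.array([[1.0, 1.0, 4.0]])
expanded = gaussian_expand(distances, radius=3.0, step=0.2)

print(expanded.shape)             # (atoms, neighbor slots, filter centers)
print(np.argmax(expanded[0, 0]))  # index of the center closest to 1.0
print(expanded[0, 2].max())       # padded slots give near-zero responses
```

Because padded slots sit one full unit beyond the search radius, their response on every center is negligible, which is presumably why `radius + 1.` is used as the fill value.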
+82 −0
"""
Test featurizers for inorganic crystals.
"""
import numpy as np
import unittest

from deepchem.feat.materials_featurizers import ElementPropertyFingerprint, SineCoulombMatrix, StructureGraphFeaturizer


class TestMaterialFeaturizers(unittest.TestCase):
  """
  Test material featurizers.
  """

  def setUp(self):
    """
    Set up tests.
    """
    self.formula = 'MoS2'
    self.struct_dict = {
        '@module':
        'pymatgen.core.structure',
        '@class':
        'Structure',
        'charge':
        None,
        'lattice': {
            'matrix': [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
            'a': 1.0,
            'b': 1.0,
            'c': 1.0,
            'alpha': 90.0,
            'beta': 90.0,
            'gamma': 90.0,
            'volume': 1.0
        },
        'sites': [{
            'species': [{
                'element': 'Fe',
                'occu': 1
            }],
            'abc': [0.0, 0.0, 0.0],
            'xyz': [0.0, 0.0, 0.0],
            'label': 'Fe',
            'properties': {}
        }]
    }

  def test_element_property_fingerprint(self):
    """
    Test Element Property featurizer.
    """

    featurizer = ElementPropertyFingerprint(data_source='matminer')
    features = featurizer.featurize([self.formula])

    assert len(features[0]) == 65
    assert np.allclose(
        features[0][:5], [2.16, 2.58, 0.42, 2.44, 0.29698485], atol=0.1)

  def test_sine_coulomb_matrix(self):
    """
    Test SCM featurizer.
    """

    featurizer = SineCoulombMatrix(max_atoms=1)
    features = featurizer.featurize([self.struct_dict])

    assert len(features) == 1
    assert np.isclose(features[0], 1244, atol=.5)

  def test_structure_graph_featurizer(self):
    """
    Test StructureGraphFeaturizer.
    """

    featurizer = StructureGraphFeaturizer(radius=3.0, max_neighbors=6)
    features = featurizer.featurize([self.struct_dict])

    assert len(features[0]) == 3
    assert features[0][0] == 26
    assert features[0][1].shape == (6, 16)
+26 −0
@@ -116,6 +116,32 @@ AtomConvFeaturizer
.. autoclass:: deepchem.feat.NeighborListComplexAtomicCoordinates
  :members:

MaterialsFeaturizers
--------------------

Materials featurizers work with datasets of inorganic crystals. They operate
either on chemical compositions (e.g. "MoS2") or on a lattice and 3D
coordinates that specify a periodic crystal structure. Apply them only to
systems with periodic boundary conditions; they are not designed to work
with molecules.

ElementPropertyFingerprint
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. autoclass:: deepchem.feat.ElementPropertyFingerprint
  :members:
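
To make the idea of a composition fingerprint concrete, here is a hand-computed sketch for one property of MoS2, using only the standard library. The electronegativity values are assumed Pauling values, and the statistic conventions (stoichiometry-weighted mean, sample standard deviation over the distinct elements) are inferred from the expected values in this commit's unit test, not taken from matminer's documentation. The variable names are hypothetical.

```python
import statistics

# Assumed Pauling electronegativities for the two elements in MoS2.
electronegativity = {"Mo": 2.16, "S": 2.58}
stoichiometry = {"Mo": 1, "S": 2}

values = [electronegativity[el] for el in stoichiometry]
n_atoms = sum(stoichiometry.values())

stats = {
    "minimum": min(values),
    "maximum": max(values),
    "range": max(values) - min(values),
    # Mean weighted by the number of atoms of each element.
    "mean": sum(electronegativity[el] * n
                for el, n in stoichiometry.items()) / n_atoms,
    # Sample standard deviation over the distinct elements (assumption).
    "std_dev": statistics.stdev(values),
}

for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

The real featurizer repeats this kind of calculation for many elemental properties and concatenates the results into a single vector.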

SineCoulombMatrix
^^^^^^^^^^^^^^^^^

.. autoclass:: deepchem.feat.SineCoulombMatrix
  :members:
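
The role of `max_atoms` is easiest to see on a toy matrix. The sketch below assumes only NumPy and uses a hypothetical helper, `pad_eigenvalues`; it substitutes `np.linalg.eigvalsh` (valid for symmetric matrices) for the general eigensolver and `np.pad` for DeepChem's own padding utility, so it illustrates the padding idea rather than the library's exact code path.

```python
import numpy as np


def pad_eigenvalues(matrix, max_atoms):
    """Return the eigenvalues of a symmetric matrix, zero-padded to max_atoms."""
    eigs = np.linalg.eigvalsh(matrix)  # ascending order for symmetric input
    padded = np.zeros(max_atoms)
    padded[:len(eigs)] = eigs
    return padded


# Stand-in for the sine Coulomb matrix of a two-atom crystal.
toy_matrix = np.array([[2.0, 1.0],
                       [1.0, 2.0]])

# Flattened representation: a fixed-length vector of eigenvalues.
padded_eigs = pad_eigenvalues(toy_matrix, max_atoms=4)

# Unflattened representation: the matrix itself, zero-padded to 4 x 4.
padded_matrix = np.pad(toy_matrix, ((0, 2), (0, 2)), mode="constant")

print(padded_eigs)
print(padded_matrix.shape)
```

Either way, every crystal in a dataset ends up with features of the same shape, which is what downstream models need.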

StructureGraphFeaturizer
^^^^^^^^^^^^^^^^^^^^^^^^

.. autoclass:: deepchem.feat.StructureGraphFeaturizer
  :members:
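
Each atom's neighbor list is forced to a fixed length before the distances are filtered. The following dependency-free sketch mirrors that step under simplifying assumptions: neighbors are plain `(distance, index)` tuples rather than pymatgen neighbor objects, and `pad_neighbors` is a hypothetical helper name.

```python
def pad_neighbors(neighbors, max_neighbors, radius):
    """Sort neighbors by distance, then truncate or pad to max_neighbors."""
    ordered = sorted(neighbors, key=lambda pair: pair[0])[:max_neighbors]
    n_missing = max_neighbors - len(ordered)
    # Missing slots get a distance just outside the search sphere and index 0.
    distances = [d for d, _ in ordered] + [radius + 1.0] * n_missing
    indices = [i for _, i in ordered] + [0] * n_missing
    return distances, indices


# Fewer neighbors than slots: the list is padded.
few_d, few_i = pad_neighbors([(2.5, 3), (1.0, 1)], max_neighbors=4, radius=3.0)
print(few_d, few_i)

# More neighbors than slots: only the closest ones are kept.
many_d, many_i = pad_neighbors(
    [(2.0, 7), (0.5, 2), (1.5, 4)], max_neighbors=2, radius=3.0)
print(many_d, many_i)
```

Note that padded slots reuse index 0, so a model consuming these features has to rely on the out-of-range distance to tell real neighbors from padding.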

BindingPocketFeaturizer
-----------------------