Unverified Commit 9421466f authored by Axel Kohlmeyer

Merge branch 'master' into enh-ext-reaxc

Resolved Merge Conflict in src/KOKKOS/kokkos.cpp
parents fa764721 8d985e53
+16 −14
Original line number Diff line number Diff line
@@ -46,7 +46,7 @@ software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.

-NOTE: Kokkos with CUDA currently implicitly assumes, that the MPI
+NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
library is CUDA-aware and has support for GPU-direct. This is not
always the case, especially when using pre-compiled MPI libraries
provided by a Linux distribution. This is not a problem when using
@@ -207,19 +207,21 @@ supports.

[Running on GPUs:]

-Use the "-k" "command-line switch"_Run_options.html to
-specify the number of GPUs per node. Typically the -np setting of the
-mpirun command should set the number of MPI tasks/node to be equal to
-the number of physical GPUs on the node.  You can assign multiple MPI
-tasks to the same GPU with the KOKKOS package, but this is usually
-only faster if significant portions of the input script have not
-been ported to use Kokkos. Using CUDA MPS is recommended in this
-scenario. Using a CUDA-aware MPI library with support for GPU-direct
-is highly recommended. GPU-direct use can be avoided by using
-"-pk kokkos gpu/direct no"_package.html.
-As above for multi-core CPUs (and no GPU), if N is the number of
-physical cores/node, then the number of MPI tasks/node should not
-exceed N.
+Use the "-k" "command-line switch"_Run_options.html to specify the
+number of GPUs per node. Typically the -np setting of the mpirun command
+should set the number of MPI tasks/node to be equal to the number of
+physical GPUs on the node. You can assign multiple MPI tasks to the same
+GPU with the KOKKOS package, but this is usually only faster if some
+portions of the input script have not been ported to use Kokkos. In this
+case, packing/unpacking communication buffers on the host may also give
+a speedup (see the KOKKOS "package"_package.html command). Using CUDA
+MPS is recommended in this scenario.
+
+Using a CUDA-aware MPI library with
+support for GPU-direct is highly recommended. GPU-direct use can be
+avoided by using "-pk kokkos gpu/direct no"_package.html. As above for
+multi-core CPUs (and no GPU), if N is the number of physical cores/node,
+then the number of MPI tasks/node should not exceed N.

-k on g Ng :pre
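As a concrete illustration of the paragraph above, a hypothetical launch on a node with 2 GPUs might look as follows (the executable name lmp and input file in.lj are placeholders; the -k, -sf, and -pk switches are the ones documented here):

```shell
# One MPI task per physical GPU: -np matches the "g 2" setting.
mpirun -np 2 lmp -k on g 2 -sf kk -in in.lj

# If the MPI library is not CUDA-aware, disable GPU-direct:
mpirun -np 2 lmp -k on g 2 -sf kk -pk kokkos gpu/direct no -in in.lj
```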

+37 −20
@@ -64,13 +64,16 @@ args = arguments specific to the style :l
      {no_affinity} values = none
  {kokkos} args = keyword value ...
    zero or more keyword/value pairs may be appended
    keywords = {neigh} or {neigh/qeq} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
+    keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
      {neigh} value = {full} or {half}
        full = full neighbor list
        half = half neighbor list built in thread-safe manner
      {neigh/qeq} value = {full} or {half}
        full = full neighbor list
        half = half neighbor list built in thread-safe manner
+      {neigh/thread} value = {off} or {on}
+        off = thread only over atoms
+        on = thread over both atoms and neighbors
      {newton} = {off} or {on}
        off = set Newton pairwise and bonded flags off
        on = set Newton pairwise and bonded flags on
@@ -444,6 +447,18 @@ the {neigh/qeq} keyword determines how neighbor lists are built for "fix
qeq/reax/kk"_fix_qeq_reax.html. If not explicitly set, the value of 
{neigh/qeq} will match {neigh}.

+If the {neigh/thread} keyword is set to {off}, then the KOKKOS package
+threads only over atoms. However, for small systems, this may not expose
+enough parallelism to keep a GPU busy. When this keyword is set to {on},
+the KOKKOS package threads over both atoms and neighbors of atoms. When
+using {neigh/thread} {on}, a full neighbor list must also be used. Using
+{neigh/thread} {on} may be slower for large systems, so this option is
+turned on by default only when there are 16K atoms or less owned by an
+MPI rank and when using a full neighbor list. Not all KOKKOS-enabled
+potentials support this keyword yet; those that do not thread only over
+atoms. Many simple pair-wise potentials such as Lennard-Jones do support
+threading over both atoms and neighbors.
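The two threading modes described above can be sketched in plain Python (a hand-written illustration, not LAMMPS code; the names `neighbors` and `pair_force` are made up for this sketch):

```python
def force_atom_threaded(neighbors, pair_force):
    # neigh/thread off: one unit of parallel work per atom; each
    # iteration serially walks that atom's neighbor list.
    f = [0.0] * len(neighbors)
    for i in range(len(neighbors)):          # parallel over atoms
        for j in neighbors[i]:               # serial inner loop
            f[i] += pair_force(i, j)
    return f

def force_pair_threaded(neighbors, pair_force):
    # neigh/thread on: flatten to (atom, neighbor) pairs so a GPU
    # can thread over both; this needs a full neighbor list, and on
    # a real GPU the accumulation would be an atomic add.
    pairs = [(i, j) for i in range(len(neighbors)) for j in neighbors[i]]
    f = [0.0] * len(neighbors)
    for i, j in pairs:                       # parallel over pairs
        f[i] += pair_force(i, j)
    return f
```

Both variants compute the same forces; the second simply exposes `len(pairs)` units of work instead of the number of atoms, which is what keeps a GPU busy for small systems.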

The {newton} keyword sets the Newton flags for pairwise and bonded 
interactions to {off} or {on}, the same as the "newton"_newton.html 
command allows. The default for GPUs is {off} because this will almost 
@@ -498,14 +513,14 @@ identically. When using GPUs, the {device} value is the default since it
will typically be optimal if all of your styles used in your input 
script are supported by the KOKKOS package. In this case data can stay 
on the GPU for many timesteps without being moved between the host and 
-GPU, if you use the {device} value. This requires that your MPI is able
-to access GPU memory directly. Currently that is true for OpenMPI 1.8
-(or later versions), Mvapich2 1.9 (or later), and CrayMPI. If your
-script uses styles (e.g. fixes) which are not yet supported by the
-KOKKOS package, then data has to be move between the host and device
-anyway, so it is typically faster to let the host handle communication,
-by using the {host} value. Using {host} instead of {no} will enable use
-of multiple threads to pack/unpack communicated data.
+GPU, if you use the {device} value. If your script uses styles (e.g.
+fixes) which are not yet supported by the KOKKOS package, then data has
+to be moved between the host and device anyway, so it is typically
+faster to let the host handle communication by using the {host} value.
+Using {host} instead of {no} will enable use of multiple threads to
+pack/unpack communicated data. When running small systems on a GPU,
+performing the exchange pack/unpack on the host CPU can give a speedup
+since it reduces the number of CUDA kernel launches.

The {gpu/direct} keyword chooses whether GPU-direct will be used. When 
this keyword is set to {on}, buffers in GPU memory are passed directly 
@@ -518,7 +533,8 @@ the {gpu/direct} keyword is automatically set to {off} by default. When
the {gpu/direct} keyword is set to {off} while any of the {comm} 
keywords are set to {device}, the value for these {comm} keywords will 
be automatically changed to {host}. This setting has no effect if not 
-running on GPUs.
+running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
+versions), Mvapich2 1.9 (or later), and CrayMPI.
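The interaction between {gpu/direct} and the {comm} keywords described above can be captured in a small sketch (a hypothetical helper mirroring the documented rule, not the actual LAMMPS implementation):

```python
def resolve_comm_settings(gpu_direct, comm):
    # Documented rule: with gpu/direct off, any comm keyword set to
    # "device" is automatically changed to "host", since buffers can
    # no longer be handed to MPI while resident in GPU memory.
    # comm maps keyword name -> "no" | "host" | "device".
    if gpu_direct == "off":
        return {k: ("host" if v == "device" else v) for k, v in comm.items()}
    return dict(comm)
```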

:line

@@ -630,11 +646,12 @@ neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default
value, comm = device, gpu/direct = on. When LAMMPS can safely detect 
that GPU-direct is not available, the default value of gpu/direct 
becomes "off". For CPUs or Xeon Phis, the option defaults are neigh = 
-half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. These
-settings are made automatically by the required "-k on" "command-line
-switch"_Run_options.html. You can change them by using the package
-kokkos command in your input script or via the "-pk kokkos command-line
-switch"_Run_options.html.
+half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The
+option neigh/thread defaults to on when there are 16K atoms or less on
+an MPI rank, otherwise it is off. These settings are made automatically
+by the required "-k on" "command-line switch"_Run_options.html. You can
+change them by using the package kokkos command in your input script or
+via the "-pk kokkos command-line switch"_Run_options.html.
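The default selection spelled out above (neigh/thread turned on only for small per-rank atom counts with a full neighbor list) amounts to the following sketch; taking "16K" as exactly 16384 is an assumption about the cutoff:

```python
def default_neigh_thread(atoms_per_rank, neigh):
    # On by default only when an MPI rank owns at most 16K atoms
    # and a full neighbor list is in use; off in every other case.
    return "on" if (atoms_per_rank <= 16384 and neigh == "full") else "off"
```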

For the OMP package, the default is Nthreads = 0 and the option
defaults are neigh = yes.  These settings are made automatically if
+3 −0
@@ -22,6 +22,7 @@
#include "memory_kokkos.h"
#include "error.h"
#include "kokkos.h"
+#include "atom_masks.h"

using namespace LAMMPS_NS;

@@ -270,8 +271,10 @@ int AtomKokkos::add_custom(const char *name, int flag)
    int n = strlen(name) + 1;
    dname[index] = new char[n];
    strcpy(dname[index],name);
+    this->sync(Device,DVECTOR_MASK);
    memoryKK->grow_kokkos(k_dvector,dvector,ndvector,nmax,
                        "atom:dvector");
+    this->modified(Device,DVECTOR_MASK);
  }

  return index;
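The sync-before-grow / modified-after pattern added to add_custom follows the usual Kokkos dual-view discipline: make the device copy current, reallocate there, then mark it as the newest copy. A toy model of that discipline (not the Kokkos API; all names here are invented):

```python
class DualView:
    # Toy stand-in for a Kokkos DualView: two copies of the data
    # plus a flag recording which side was modified last.
    def __init__(self, data):
        self.host = list(data)
        self.device = list(data)
        self.dirty = None          # None, "host", or "device"

    def sync(self, space):
        # Copy the up-to-date side onto `space` before using it.
        if self.dirty and self.dirty != space:
            setattr(self, space, list(getattr(self, self.dirty)))
            self.dirty = None

    def modified(self, space):
        # Record that `space` now holds the newest data.
        self.dirty = space

def grow_on_device(view, extra):
    view.sync("device")            # like this->sync(Device,DVECTOR_MASK)
    view.device += [0.0] * extra   # like memoryKK->grow_kokkos(...)
    view.modified("device")        # like this->modified(Device,...)
```

Skipping the sync would grow a stale device copy; skipping the modified call would let a later sync overwrite the grown data.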
+25 −24
@@ -24,7 +24,7 @@

using namespace LAMMPS_NS;

-#define DELTA 10000
+#define DELTA 10

/* ---------------------------------------------------------------------- */

@@ -59,14 +59,15 @@ AtomVecAngleKokkos::AtomVecAngleKokkos(LAMMPS *lmp) : AtomVecKokkos(lmp)

void AtomVecAngleKokkos::grow(int n)
{
-  if (n == 0) nmax += DELTA;
+  int step = MAX(DELTA,nmax*0.01);
+  if (n == 0) nmax += step;
  else nmax = n;
  atomKK->nmax = nmax;
  if (nmax < 0 || nmax > MAXSMALLINT)
    error->one(FLERR,"Per-processor system is too big");

-  sync(Device,ALL_MASK);
-  modified(Device,ALL_MASK);
+  atomKK->sync(Device,ALL_MASK);
+  atomKK->modified(Device,ALL_MASK);

  memoryKK->grow_kokkos(atomKK->k_tag,atomKK->tag,nmax,"atom:tag");
  memoryKK->grow_kokkos(atomKK->k_type,atomKK->type,nmax,"atom:type");
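The grow() change above replaces the fixed 10000-atom increment with a proportional one. In plain Python the new step computation is (MAX becomes max, and the truncation to int that the C++ assignment performs is made explicit):

```python
DELTA = 10  # new minimum growth increment, per the diff

def next_nmax(nmax, n):
    # Grow by 1% of the current size, but never by less than DELTA,
    # mirroring: int step = MAX(DELTA,nmax*0.01); if (n == 0) nmax += step;
    if n == 0:
        return nmax + int(max(DELTA, nmax * 0.01))
    return n  # else nmax = n;
```

So small arrays grow in steps of 10, while a 10000-atom array grows by 100, avoiding the old behavior of over-allocating 10000 slots at a time.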
@@ -98,7 +99,7 @@ void AtomVecAngleKokkos::grow(int n)
                      "atom:angle_atom3");

  grow_reset();
-  sync(Host,ALL_MASK);
+  atomKK->sync(Host,ALL_MASK);

  if (atom->nextra_grow)
    for (int iextra = 0; iextra < atom->nextra_grow; iextra++)
@@ -282,7 +283,7 @@ int AtomVecAngleKokkos::pack_comm_kokkos(const int &n,
  // Choose correct forward PackComm kernel

  if(commKK->forward_comm_on_host) {
-    sync(Host,X_MASK);
+    atomKK->sync(Host,X_MASK);
    if(pbc_flag) {
      if(domain->triclinic) {
        struct AtomVecAngleKokkos_PackComm<LMPHostType,1,1> f(atomKK->k_x,buf,list,iswap,
@@ -309,7 +310,7 @@ int AtomVecAngleKokkos::pack_comm_kokkos(const int &n,
      }
    }
  } else {
-    sync(Device,X_MASK);
+    atomKK->sync(Device,X_MASK);
    if(pbc_flag) {
      if(domain->triclinic) {
        struct AtomVecAngleKokkos_PackComm<LMPDeviceType,1,1> f(atomKK->k_x,buf,list,iswap,
@@ -397,8 +398,8 @@ int AtomVecAngleKokkos::pack_comm_self(const int &n, const DAT::tdual_int_2d &li
                                       const int nfirst, const int &pbc_flag,
                                       const int* const pbc) {
  if(commKK->forward_comm_on_host) {
-    sync(Host,X_MASK);
-    modified(Host,X_MASK);
+    atomKK->sync(Host,X_MASK);
+    atomKK->modified(Host,X_MASK);
    if(pbc_flag) {
      if(domain->triclinic) {
      struct AtomVecAngleKokkos_PackCommSelf<LMPHostType,1,1>
@@ -429,8 +430,8 @@ int AtomVecAngleKokkos::pack_comm_self(const int &n, const DAT::tdual_int_2d &li
      }
    }
  } else {
-    sync(Device,X_MASK);
-    modified(Device,X_MASK);
+    atomKK->sync(Device,X_MASK);
+    atomKK->modified(Device,X_MASK);
    if(pbc_flag) {
      if(domain->triclinic) {
      struct AtomVecAngleKokkos_PackCommSelf<LMPDeviceType,1,1>
@@ -493,13 +494,13 @@ struct AtomVecAngleKokkos_UnpackComm {
void AtomVecAngleKokkos::unpack_comm_kokkos(const int &n, const int &first,
    const DAT::tdual_xfloat_2d &buf ) {
  if(commKK->forward_comm_on_host) {
-    sync(Host,X_MASK);
-    modified(Host,X_MASK);
+    atomKK->sync(Host,X_MASK);
+    atomKK->modified(Host,X_MASK);
    struct AtomVecAngleKokkos_UnpackComm<LMPHostType> f(atomKK->k_x,buf,first);
    Kokkos::parallel_for(n,f);
  } else {
-    sync(Device,X_MASK);
-    modified(Device,X_MASK);
+    atomKK->sync(Device,X_MASK);
+    atomKK->modified(Device,X_MASK);
    struct AtomVecAngleKokkos_UnpackComm<LMPDeviceType> f(atomKK->k_x,buf,first);
    Kokkos::parallel_for(n,f);
  }
@@ -642,7 +643,7 @@ void AtomVecAngleKokkos::unpack_comm_vel(int n, int first, double *buf)
int AtomVecAngleKokkos::pack_reverse(int n, int first, double *buf)
{
  if(n > 0)
-    sync(Host,F_MASK);
+    atomKK->sync(Host,F_MASK);

  int m = 0;
  const int last = first + n;
@@ -659,7 +660,7 @@ int AtomVecAngleKokkos::pack_reverse(int n, int first, double *buf)
void AtomVecAngleKokkos::unpack_reverse(int n, int *list, double *buf)
{
  if(n > 0)
-    modified(Host,F_MASK);
+    atomKK->modified(Host,F_MASK);

  int m = 0;
  for (int i = 0; i < n; i++) {
@@ -960,9 +961,9 @@ struct AtomVecAngleKokkos_UnpackBorder {
void AtomVecAngleKokkos::unpack_border_kokkos(const int &n, const int &first,
                                             const DAT::tdual_xfloat_2d &buf,
                                             ExecutionSpace space) {
-  modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
+  atomKK->modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
  while (first+n >= nmax) grow(0);
-  modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
+  atomKK->modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
  if(space==Host) {
    struct AtomVecAngleKokkos_UnpackBorder<LMPHostType>
      f(buf.view<LMPHostType>(),h_x,h_tag,h_type,h_mask,h_molecule,first);
@@ -984,7 +985,7 @@ void AtomVecAngleKokkos::unpack_border(int n, int first, double *buf)
  last = first + n;
  for (i = first; i < last; i++) {
    if (i == nmax) grow(0);
-    modified(Host,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
+    atomKK->modified(Host,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
    h_x(i,0) = buf[m++];
    h_x(i,1) = buf[m++];
    h_x(i,2) = buf[m++];
@@ -1010,7 +1011,7 @@ void AtomVecAngleKokkos::unpack_border_vel(int n, int first, double *buf)
  last = first + n;
  for (i = first; i < last; i++) {
    if (i == nmax) grow(0);
-    modified(Host,X_MASK|V_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
+    atomKK->modified(Host,X_MASK|V_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
    h_x(i,0) = buf[m++];
    h_x(i,1) = buf[m++];
    h_x(i,2) = buf[m++];
@@ -1412,7 +1413,7 @@ int AtomVecAngleKokkos::unpack_exchange(double *buf)
{
  int nlocal = atom->nlocal;
  if (nlocal == nmax) grow(0);
-  modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
           MASK_MASK | IMAGE_MASK | MOLECULE_MASK | BOND_MASK |
           ANGLE_MASK | SPECIAL_MASK);

@@ -1487,7 +1488,7 @@ int AtomVecAngleKokkos::size_restart()

int AtomVecAngleKokkos::pack_restart(int i, double *buf)
{
-  sync(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->sync(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
            MASK_MASK | IMAGE_MASK | MOLECULE_MASK | BOND_MASK |
            ANGLE_MASK | SPECIAL_MASK);

@@ -1541,7 +1542,7 @@ int AtomVecAngleKokkos::unpack_restart(double *buf)
    if (atom->nextra_store)
      memory->grow(atom->extra,nmax,atom->nextra_store,"atom:extra");
  }
-  modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
                MASK_MASK | IMAGE_MASK | MOLECULE_MASK | BOND_MASK |
                ANGLE_MASK | SPECIAL_MASK);

+13 −12
@@ -24,7 +24,7 @@

using namespace LAMMPS_NS;

-#define DELTA 10000
+#define DELTA 10

/* ---------------------------------------------------------------------- */

@@ -55,14 +55,15 @@ AtomVecAtomicKokkos::AtomVecAtomicKokkos(LAMMPS *lmp) : AtomVecKokkos(lmp)

void AtomVecAtomicKokkos::grow(int n)
{
-  if (n == 0) nmax += DELTA;
+  int step = MAX(DELTA,nmax*0.01);
+  if (n == 0) nmax += step;
  else nmax = n;
  atomKK->nmax = nmax;
  if (nmax < 0 || nmax > MAXSMALLINT)
    error->one(FLERR,"Per-processor system is too big");

-  sync(Device,ALL_MASK);
-  modified(Device,ALL_MASK);
+  atomKK->sync(Device,ALL_MASK);
+  atomKK->modified(Device,ALL_MASK);

  memoryKK->grow_kokkos(atomKK->k_tag,atomKK->tag,nmax,"atom:tag");
  memoryKK->grow_kokkos(atomKK->k_type,atomKK->type,nmax,"atom:type");
@@ -74,7 +75,7 @@ void AtomVecAtomicKokkos::grow(int n)
  memoryKK->grow_kokkos(atomKK->k_f,atomKK->f,nmax,3,"atom:f");

  grow_reset();
-  sync(Host,ALL_MASK);
+  atomKK->sync(Host,ALL_MASK);

  if (atom->nextra_grow)
    for (int iextra = 0; iextra < atom->nextra_grow; iextra++)
@@ -393,9 +394,9 @@ struct AtomVecAtomicKokkos_UnpackBorder {

void AtomVecAtomicKokkos::unpack_border_kokkos(const int &n, const int &first,
                     const DAT::tdual_xfloat_2d &buf,ExecutionSpace space) {
-  modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
+  atomKK->modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
  while (first+n >= nmax) grow(0);
-  modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
+  atomKK->modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
  if(space==Host) {
    struct AtomVecAtomicKokkos_UnpackBorder<LMPHostType> f(buf.view<LMPHostType>(),h_x,h_tag,h_type,h_mask,first);
    Kokkos::parallel_for(n,f);
@@ -415,7 +416,7 @@ void AtomVecAtomicKokkos::unpack_border(int n, int first, double *buf)
  last = first + n;
  for (i = first; i < last; i++) {
    if (i == nmax) grow(0);
-    modified(Host,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
+    atomKK->modified(Host,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
    h_x(i,0) = buf[m++];
    h_x(i,1) = buf[m++];
    h_x(i,2) = buf[m++];
@@ -440,7 +441,7 @@ void AtomVecAtomicKokkos::unpack_border_vel(int n, int first, double *buf)
  last = first + n;
  for (i = first; i < last; i++) {
    if (i == nmax) grow(0);
-    modified(Host,X_MASK|V_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
+    atomKK->modified(Host,X_MASK|V_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
    h_x(i,0) = buf[m++];
    h_x(i,1) = buf[m++];
    h_x(i,2) = buf[m++];
@@ -668,7 +669,7 @@ int AtomVecAtomicKokkos::unpack_exchange(double *buf)
{
  int nlocal = atom->nlocal;
  if (nlocal == nmax) grow(0);
-  modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
           MASK_MASK | IMAGE_MASK);

  int m = 1;
@@ -720,7 +721,7 @@ int AtomVecAtomicKokkos::size_restart()

int AtomVecAtomicKokkos::pack_restart(int i, double *buf)
{
-  sync(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->sync(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
            MASK_MASK | IMAGE_MASK );

  int m = 1;
@@ -755,7 +756,7 @@ int AtomVecAtomicKokkos::unpack_restart(double *buf)
    if (atom->nextra_store)
      memory->grow(atom->extra,nmax,atom->nextra_store,"atom:extra");
  }
-  modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
                MASK_MASK | IMAGE_MASK );

  int m = 1;