Commit 4760cf86 authored by Stan Moore

Update docs to change GPU-direct to CUDA-aware MPI

parent 44002577
+11 −13
@@ -46,16 +46,15 @@ software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.

-NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
-library is CUDA-aware and has support for GPU-direct. This is not
-always the case, especially when using pre-compiled MPI libraries
-provided by a Linux distribution. This is not a problem when using
-only a single GPU and a single MPI rank on a desktop. When running
-with multiple MPI ranks, you may see segmentation faults without
-GPU-direct support.  These can be avoided by adding the flags "-pk
-kokkos gpu/direct off"_Run_options.html to the LAMMPS command line or
-by using the command "package kokkos gpu/direct off"_package.html in
-the input file.
+NOTE: Kokkos with CUDA currently implicitly assumes that the MPI library
+is CUDA-aware. This is not always the case, especially when using
+pre-compiled MPI libraries provided by a Linux distribution. This is not
+a problem when using only a single GPU with a single MPI rank. When
+running with multiple MPI ranks, you may see segmentation faults without
+CUDA-aware MPI support. These can be avoided by adding the flags "-pk
+kokkos cuda/aware off"_Run_options.html to the LAMMPS command line or by
+using the command "package kokkos cuda/aware off"_package.html in the
+input file.

[Building LAMMPS with the KOKKOS package:]

@@ -217,9 +216,8 @@ case, also packing/unpacking communication buffers on the host may give
speedup (see the KOKKOS "package"_package.html command). Using CUDA MPS 
is recommended in this scenario.

-Using a CUDA-aware MPI library with
-support for GPU-direct is highly recommended. GPU-direct use can be
-avoided by using "-pk kokkos gpu/direct no"_package.html. As above for
+Using a CUDA-aware MPI library is highly recommended. CUDA-aware MPI use can be
+avoided by using "-pk kokkos cuda/aware no"_package.html. As above for
multi-core CPUs (and no GPU), if N is the number of physical cores/node, 
then the number of MPI tasks/node should not exceed N.
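The renamed switch changed in this hunk can be set either on the command line or in the input script. A minimal usage sketch of the post-commit form, assuming a LAMMPS binary named `lmp` and a placeholder input script `in.lj` (both hypothetical names; the `-k`, `-sf`, and `-pk` run switches are documented in Run_options.html):

```
# Command-line form: enable KOKKOS on 1 GPU, disable CUDA-aware MPI
lmp -k on g 1 -sf kk -pk kokkos cuda/aware off -in in.lj

# Equivalent input-script form
package kokkos cuda/aware off
```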

+15 −13
@@ -64,7 +64,7 @@ args = arguments specific to the style :l
      {no_affinity} values = none
  {kokkos} args = keyword value ...
    zero or more keyword/value pairs may be appended
-    keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
+    keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {cuda/aware}
      {neigh} value = {full} or {half}
        full = full neighbor list
        half = half neighbor list built in thread-safe manner
@@ -87,9 +87,9 @@ args = arguments specific to the style :l
        no = perform communication pack/unpack in non-KOKKOS mode
        host = perform pack/unpack on host (e.g. with OpenMP threading)
        device = perform pack/unpack on device (e.g. on GPU)
-      {gpu/direct} = {off} or {on}
-        off = do not use GPU-direct
-        on = use GPU-direct (default)
+      {cuda/aware} = {off} or {on}
+        off = do not use CUDA-aware MPI
+        on = use CUDA-aware MPI (default)
  {omp} args = Nthreads keyword value ...
    Nthread = # of OpenMP threads to associate with each MPI process
    zero or more keyword/value pairs may be appended
@@ -520,19 +520,21 @@ pack/unpack communicated data. When running small systems on a GPU,
performing the exchange pack/unpack on the host CPU can give speedup 
since it reduces the number of CUDA kernel launches.

-The {gpu/direct} keyword chooses whether GPU-direct will be used. When
+The {cuda/aware} keyword chooses whether CUDA-aware MPI will be used. When
this keyword is set to {on}, buffers in GPU memory are passed directly 
through MPI send/receive calls. This reduces overhead of first copying 
-the data to the host CPU. However GPU-direct is not supported on all
+the data to the host CPU. However CUDA-aware MPI is not supported on all
systems, which can lead to segmentation faults and would require using a 
-value of {off}. If LAMMPS can safely detect that GPU-direct is not
+value of {off}. If LAMMPS can safely detect that CUDA-aware MPI is not
available (currently only possible with OpenMPI v2.0.0 or later), then 
-the {gpu/direct} keyword is automatically set to {off} by default. When
-the {gpu/direct} keyword is set to {off} while any of the {comm}
+the {cuda/aware} keyword is automatically set to {off} by default. When
+the {cuda/aware} keyword is set to {off} while any of the {comm}
keywords are set to {device}, the value for these {comm} keywords will 
be automatically changed to {host}. This setting has no effect if not 
-running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
-versions), Mvapich2 1.9 (or later), and CrayMPI.
+running on GPUs. CUDA-aware MPI is available for OpenMPI 1.8 (or later
+versions), Mvapich2 1.9 (or later) when the "MV2_USE_CUDA" environment
+variable is set to "1", CrayMPI, and IBM Spectrum MPI when the "-gpu"
+flag is used.

:line

@@ -641,8 +643,8 @@ switch"_Run_options.html.

For the KOKKOS package, the option defaults for GPUs are neigh = full, 
neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default 
-value, comm = device, gpu/direct = on. When LAMMPS can safely detect 
-that GPU-direct is not available, the default value of gpu/direct 
+value, comm = device, cuda/aware = on. When LAMMPS can safely detect 
+that CUDA-aware MPI is not available, the default value of cuda/aware 
becomes "off". For CPUs or Xeon Phis, the option defaults are neigh = 
half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The 
option neigh/thread = on when there are 16K atoms or less on an MPI
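The automatic comm fallback described in the hunks above can be seen by combining both keywords in a single package command. A hypothetical input-script fragment, using the post-commit keyword names:

```
# Request device-side buffer pack/unpack but disable CUDA-aware MPI;
# per the doc text above, LAMMPS then changes comm to host automatically.
package kokkos comm device cuda/aware off
```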