Commit 073f0034 authored by Stan Moore

Doc tweak

parent 618547b7
+16 −14
@@ -46,7 +46,7 @@ software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.

NOTE: Kokkos with CUDA currently implicitly assumes, that the MPI
NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
library is CUDA-aware and has support for GPU-direct. This is not
always the case, especially when using pre-compiled MPI libraries
provided by a Linux distribution. This is not a problem when using
@@ -207,19 +207,21 @@ supports.

[Running on GPUs:]

Use the "-k" "command-line switch"_Run_options.html to
specify the number of GPUs per node. Typically the -np setting of the
mpirun command should set the number of MPI tasks/node to be equal to
the number of physical GPUs on the node.  You can assign multiple MPI
tasks to the same GPU with the KOKKOS package, but this is usually
only faster if significant portions of the input script have not
been ported to use Kokkos. Using CUDA MPS is recommended in this
scenario. Using a CUDA-aware MPI library with support for GPU-direct
is highly recommended. GPU-direct use can be avoided by using
"-pk kokkos gpu/direct no"_package.html.
As above for multi-core CPUs (and no GPU), if N is the number of
physical cores/node, then the number of MPI tasks/node should not
exceed N.
Use the "-k" "command-line switch"_Run_options.html to specify the 
number of GPUs per node. Typically the -np setting of the mpirun command 
should set the number of MPI tasks/node to be equal to the number of 
physical GPUs on the node. You can assign multiple MPI tasks to the same 
GPU with the KOKKOS package, but this is usually only faster if some 
portions of the input script have not been ported to use Kokkos. In this 
case, packing/unpacking communication buffers on the host may also give a 
speedup (see the KOKKOS "package"_package.html command). Using CUDA MPS 
is recommended in this scenario.

Using a CUDA-aware MPI library with 
support for GPU-direct is highly recommended. GPU-direct use can be 
avoided by using "-pk kokkos gpu/direct no"_package.html. As above for 
multi-core CPUs (and no GPU), if N is the number of physical cores/node, 
then the number of MPI tasks/node should not exceed N.

-k on g Ng :pre
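As a concrete sketch (the executable name lmp_kokkos_cuda_mpi and the input 
script in.lj are assumptions for illustration, not part of the documentation 
above), a run on a single node with 2 GPUs and one MPI task per GPU might 
look like:

mpirun -np 2 lmp_kokkos_cuda_mpi -k on g 2 -sf kk -in in.lj :pre

Here -sf kk applies the KOKKOS suffix so that supported styles use their 
Kokkos variants (see the "command-line switch"_Run_options.html doc).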

+14 −13
@@ -513,14 +513,14 @@ identically. When using GPUs, the {device} value is the default since it
will typically be optimal if all of your styles used in your input 
script are supported by the KOKKOS package. In this case data can stay 
on the GPU for many timesteps without being moved between the host and 
GPU, if you use the {device} value. This requires that your MPI is able 
to access GPU memory directly. Currently that is true for OpenMPI 1.8 
(or later versions), Mvapich2 1.9 (or later), and CrayMPI. If your 
script uses styles (e.g. fixes) which are not yet supported by the 
KOKKOS package, then data has to be move between the host and device 
anyway, so it is typically faster to let the host handle communication, 
by using the {host} value. Using {host} instead of {no} will enable use 
of multiple threads to pack/unpack communicated data. 
GPU, if you use the {device} value. If your script uses styles (e.g. 
fixes) which are not yet supported by the KOKKOS package, then data has 
to be moved between the host and device anyway, so it is typically faster 
to let the host handle communication by using the {host} value. Using 
{host} instead of {no} will enable the use of multiple threads to 
pack/unpack communicated data. When running small systems on a GPU, 
performing the exchange pack/unpack on the host CPU can give a speedup 
since it reduces the number of CUDA kernel launches.
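
A minimal sketch of selecting host-side packing/unpacking (placement near 
the top of the input script, before the simulation box is defined, is an 
assumption here):

package kokkos comm host :pre

The same setting can be made from the command line with the "-pk kokkos 
comm host"_package.html switch.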

The {gpu/direct} keyword chooses whether GPU-direct will be used. When 
this keyword is set to {on}, buffers in GPU memory are passed directly 
@@ -533,7 +533,8 @@ the {gpu/direct} keyword is automatically set to {off} by default. When
the {gpu/direct} keyword is set to {off} while any of the {comm} 
keywords are set to {device}, the value for these {comm} keywords will 
be automatically changed to {host}. This setting has no effect if not 
running on GPUs.
running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later 
versions), Mvapich2 1.9 (or later), and CrayMPI.
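
For example, if the MPI library in use is not CUDA-aware, a possible 
work-around (a sketch based on the keywords above) is to disable GPU-direct 
explicitly:

package kokkos gpu/direct off :pre

With this setting, any {comm} keywords set to {device} are changed to 
{host} automatically, as described above.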

:line