Commit 4760cf86 authored by Stan Moore

Update docs to change GPU-direct to CUDA-aware MPI

parent 44002577
+11 −13
@@ -46,16 +46,15 @@ software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.

-NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
-library is CUDA-aware and has support for GPU-direct. This is not
-always the case, especially when using pre-compiled MPI libraries
-provided by a Linux distribution. This is not a problem when using
-only a single GPU and a single MPI rank on a desktop. When running
-with multiple MPI ranks, you may see segmentation faults without
-GPU-direct support.  These can be avoided by adding the flags "-pk
-kokkos gpu/direct off"_Run_options.html to the LAMMPS command line or
-by using the command "package kokkos gpu/direct off"_package.html in
-the input file.
+NOTE: Kokkos with CUDA currently implicitly assumes that the MPI library
+is CUDA-aware. This is not always the case, especially when using
+pre-compiled MPI libraries provided by a Linux distribution. This is not
+a problem when using only a single GPU with a single MPI rank. When
+running with multiple MPI ranks, you may see segmentation faults without
+CUDA-aware MPI support. These can be avoided by adding the flags "-pk
+kokkos cuda/aware off"_Run_options.html to the LAMMPS command line or by
+using the command "package kokkos cuda/aware off"_package.html in the
+input file.

[Building LAMMPS with the KOKKOS package:]

@@ -217,9 +216,8 @@ case, also packing/unpacking communication buffers on the host may give
speedup (see the KOKKOS "package"_package.html command). Using CUDA MPS 
is recommended in this scenario.

-Using a CUDA-aware MPI library with
-support for GPU-direct is highly recommended. GPU-direct use can be
-avoided by using "-pk kokkos gpu/direct no"_package.html. As above for
+Using a CUDA-aware MPI library is highly recommended. CUDA-aware MPI use can be
+avoided by using "-pk kokkos cuda/aware no"_package.html. As above for
multi-core CPUs (and no GPU), if N is the number of physical cores/node, 
then the number of MPI tasks/node should not exceed N.
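The renamed switch changed in this hunk can be set either on the command line or in the input script. A minimal usage sketch of the post-commit form, assuming a LAMMPS binary named `lmp` and a placeholder input script `in.lj` (both hypothetical names; the `-k`, `-sf`, and `-pk` run switches are documented in Run_options.html):

```
# Command-line form: enable KOKKOS on 1 GPU, disable CUDA-aware MPI
lmp -k on g 1 -sf kk -pk kokkos cuda/aware off -in in.lj

# Equivalent input-script form
package kokkos cuda/aware off
```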

+15 −13
@@ -64,7 +64,7 @@ args = arguments specific to the style :l
      {no_affinity} values = none
  {kokkos} args = keyword value ...
    zero or more keyword/value pairs may be appended
-    keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
+    keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {cuda/aware}
      {neigh} value = {full} or {half}
        full = full neighbor list
        half = half neighbor list built in thread-safe manner
@@ -87,9 +87,9 @@ args = arguments specific to the style :l
        no = perform communication pack/unpack in non-KOKKOS mode
        host = perform pack/unpack on host (e.g. with OpenMP threading)
        device = perform pack/unpack on device (e.g. on GPU)
-      {gpu/direct} = {off} or {on}
-        off = do not use GPU-direct
-        on = use GPU-direct (default)
+      {cuda/aware} = {off} or {on}
+        off = do not use CUDA-aware MPI
+        on = use CUDA-aware MPI (default)
  {omp} args = Nthreads keyword value ...
    Nthread = # of OpenMP threads to associate with each MPI process
    zero or more keyword/value pairs may be appended
@@ -520,19 +520,21 @@ pack/unpack communicated data. When running small systems on a GPU,
performing the exchange pack/unpack on the host CPU can give speedup 
since it reduces the number of CUDA kernel launches.

-The {gpu/direct} keyword chooses whether GPU-direct will be used. When
+The {cuda/aware} keyword chooses whether CUDA-aware MPI will be used. When
this keyword is set to {on}, buffers in GPU memory are passed directly 
through MPI send/receive calls. This reduces overhead of first copying 
-the data to the host CPU. However GPU-direct is not supported on all
+the data to the host CPU. However CUDA-aware MPI is not supported on all
systems, which can lead to segmentation faults and would require using a 
-value of {off}. If LAMMPS can safely detect that GPU-direct is not
+value of {off}. If LAMMPS can safely detect that CUDA-aware MPI is not
available (currently only possible with OpenMPI v2.0.0 or later), then 
-the {gpu/direct} keyword is automatically set to {off} by default. When
-the {gpu/direct} keyword is set to {off} while any of the {comm}
+the {cuda/aware} keyword is automatically set to {off} by default. When
+the {cuda/aware} keyword is set to {off} while any of the {comm}
keywords are set to {device}, the value for these {comm} keywords will 
be automatically changed to {host}. This setting has no effect if not 
-running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
-versions), Mvapich2 1.9 (or later), and CrayMPI.
+running on GPUs. CUDA-aware MPI is available for OpenMPI 1.8 (or later
+versions), Mvapich2 1.9 (or later) when the "MV2_USE_CUDA" environment
+variable is set to "1", CrayMPI, and IBM Spectrum MPI when the "-gpu"
+flag is used.

:line

@@ -641,8 +643,8 @@ switch"_Run_options.html.

For the KOKKOS package, the option defaults for GPUs are neigh = full, 
neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default 
-value, comm = device, gpu/direct = on. When LAMMPS can safely detect 
-that GPU-direct is not available, the default value of gpu/direct 
+value, comm = device, cuda/aware = on. When LAMMPS can safely detect 
+that CUDA-aware MPI is not available, the default value of cuda/aware 
becomes "off". For CPUs or Xeon Phis, the option defaults are neigh = 
half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The 
option neigh/thread = on when there are 16K atoms or less on an MPI
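The automatic comm fallback described in the hunks above can be seen by combining both keywords in a single package command. A hypothetical input-script fragment, using the post-commit keyword names:

```
# Request device-side buffer pack/unpack but disable CUDA-aware MPI;
# per the doc text above, LAMMPS then changes comm to host automatically.
package kokkos comm device cuda/aware off
```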