update Kokkos related documentation for Kokkos 3.1 and refactor build info a bit (3c144600) · Commits · 郑智淋 / lammps

doc/src/Build_extras.rst

+50 −7

Original line number	Diff line number	Diff line
		@@ -440,8 +440,8 @@ be specified in uppercase.
		- GPU
		- AMD GPU MI50/MI60 GFX906

		CMake build settings:
		^^^^^^^^^^^^^^^^^^^^^
		Basic CMake build settings:
		^^^^^^^^^^^^^^^^^^^^^^^^^^^
		For multicore CPUs using OpenMP, set these 2 variables.

		.. code-block:: bash
		@@ -470,9 +470,13 @@ For NVIDIA GPUs using CUDA, set these variables:
		-D Kokkos_ENABLE_OPENMP=yes
		-D CMAKE_CXX_COMPILER=wrapper # wrapper = full path to Cuda nvcc wrapper

		The wrapper value is the Cuda nvcc compiler wrapper provided in the
		Kokkos library: ``lib/kokkos/bin/nvcc_wrapper``\ . The setting should
		include the full path name to the wrapper, e.g.
		This will also enable executing FFTs on the GPU, either via the internal
		KISSFFT library, or - by preference - with the cuFFT library bundled
		with the CUDA toolkit, depending on whether CMake can identify its
		location. The wrapper value for the ``CMAKE_CXX_COMPILER`` variable is
		the path to the CUDA nvcc compiler wrapper provided in the Kokkos
		library: ``lib/kokkos/bin/nvcc_wrapper``\ . The setting should include
		the full path name to the wrapper, e.g.

		.. code-block:: bash

		@@ -492,8 +496,8 @@ common packages enabled, you can do the following:
		cmake -C ../cmake/presets/minimal.cmake -C ../cmake/presets/kokkos-cuda.cmake ../cmake
		cmake --build .

		Traditional make settings:
		^^^^^^^^^^^^^^^^^^^^^^^^^^
		Basic traditional make settings:
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		Choose which hardware to support in ``Makefile.machine`` via
		``KOKKOS_DEVICES`` and ``KOKKOS_ARCH`` settings. See the
		@@ -519,6 +523,7 @@ For NVIDIA GPUs using CUDA:

		KOKKOS_DEVICES = Cuda
		KOKKOS_ARCH = HOSTARCH,GPUARCH # HOSTARCH = HOST from list above that is hosting the GPU
		KOKKOS_CUDA_OPTIONS = "enable_lambda"
		# GPUARCH = GPU from list above
		FFT_INC = -DFFT_CUFFT # enable use of cuFFT (optional)
		FFT_LIB = -lcufft # link to cuFFT library
		@@ -541,6 +546,44 @@ C++ compiler for non-Kokkos, non-CUDA files.
		KOKKOS_ABSOLUTE_PATH = $(shell cd $(KOKKOS_PATH); pwd)
		CC = mpicxx -cxx=$(KOKKOS_ABSOLUTE_PATH)/config/nvcc_wrapper


		Advanced KOKKOS compilation settings
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		There are other allowed options when building with the KOKKOS package
		that can improve performance or assist in debugging or profiling. Below
		are some examples that may be useful in combination with LAMMPS. For
		the full list (which keeps changing as the Kokkos package itself evolves),
		please consult the Kokkos library documentation.

		As an alternative to using multi-threading via OpenMP
		(``-DKokkos_ENABLE_OPENMP=on`` or ``KOKKOS_DEVICES=OpenMP``) it is also
		possible to use Posix threads directly (``-DKokkos_ENABLE_PTHREAD=on``
		or ``KOKKOS_DEVICES=Pthread``). While binding of threads to individual
		or groups of CPU cores is managed in OpenMP with environment variables,
		you need assistance from either the "hwloc" or "libnuma" library for the
		Pthread thread parallelization option. To enable use with CMake:
		``-DKokkos_ENABLE_HWLOC=on`` or ``-DKokkos_ENABLE_LIBNUMA=on``; and with
		conventional make: ``KOKKOS_USE_TPLS=hwloc`` or
		``KOKKOS_USE_TPLS=libnuma``.
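
		As a minimal sketch, the Pthread back end with hwloc-based thread
		binding could be configured with CMake as shown here. This assumes
		that the command is run from a build folder next to the ``cmake``
		folder and that the hwloc library is installed where CMake can find
		it; ``-D PKG_KOKKOS=yes`` enables the KOKKOS package itself.

		.. code-block:: bash

		   cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_PTHREAD=on -D Kokkos_ENABLE_HWLOC=on ../cmake
		   cmake --build .

		The corresponding settings for the conventional make build would go
		into ``Makefile.machine``:

		.. code-block:: bash

		   KOKKOS_DEVICES = Pthread
		   KOKKOS_USE_TPLS = hwloc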

		The CMake option ``-DKokkos_ENABLE_LIBRT=on`` or the makefile setting
		``KOKKOS_USE_TPLS=librt`` enables the use of a more accurate timer
		mechanism on many Unix-like platforms for internal profiling.

		The CMake option ``-DKokkos_ENABLE_DEBUG=on`` or the makefile setting
		``KOKKOS_DEBUG=yes`` enables printing of run-time
		debugging information that can be useful. It also enables runtime
		bounds checking on Kokkos data structures. As expected, enabling
		this option reduces performance, so it is only recommended
		when developing a Kokkos-enabled style in LAMMPS.
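
		For illustration, a debug build for the OpenMP back end might be
		requested as follows. The machine name ``kokkos_omp`` in the make
		command is an assumption; substitute the name of the
		``Makefile.machine`` you are actually using.

		.. code-block:: bash

		   # CMake build, run from the build folder
		   cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_OPENMP=yes -D Kokkos_ENABLE_DEBUG=on ../cmake

		   # conventional make build, setting the variable on the make command line
		   make kokkos_omp KOKKOS_DEBUG=yes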

		The CMake option ``-DKokkos_ENABLE_CUDA_UVM=on`` or the makefile
		setting ``KOKKOS_CUDA_OPTIONS=enable_lambda,force_uvm`` enables the
		use of CUDA "Unified Virtual Memory" in Kokkos. Please note that
		the LAMMPS KOKKOS package must always be compiled with the
		enable_lambda option when using GPUs.
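
		A possible way to combine these settings is sketched below. The
		CMake line assumes that the ``kokkos-cuda.cmake`` preset is used as
		the starting point; the makefile line keeps ``enable_lambda`` and
		adds ``force_uvm`` to it.

		.. code-block:: bash

		   # CMake build, run from the build folder
		   cmake -C ../cmake/presets/kokkos-cuda.cmake -D Kokkos_ENABLE_CUDA_UVM=on ../cmake

		   # conventional make build, setting in Makefile.machine
		   KOKKOS_CUDA_OPTIONS = "enable_lambda,force_uvm"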

		----------

		.. _latte:

doc/src/Speed_kokkos.rst

+60 −66

		@@ -28,27 +28,30 @@ compatible with specific hardware.

		.. note::

		To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA
		To build with Kokkos support for NVIDIA GPUs, the NVIDIA CUDA toolkit
		software version 9.0 or later must be installed on your system. See
		the discussion for the :doc:`GPU package <Speed_gpu>` for details of how
		to check and do this.
		the discussion for the :doc:`GPU package <Speed_gpu>` for details of
		how to check and do this.

		.. note::

		Kokkos with CUDA currently implicitly assumes that the MPI library
		is CUDA-aware. This is not always the case, especially when using
		pre-compiled MPI libraries provided by a Linux distribution. This is not
		a problem when using only a single GPU with a single MPI rank. When
		running with multiple MPI ranks, you may see segmentation faults without
		CUDA-aware MPI support. These can be avoided by adding the flags :doc:`-pk kokkos cuda/aware off <Run_options>` to the LAMMPS command line or by
		using the command :doc:`package kokkos cuda/aware off <package>` in the
		input file.
		Kokkos with CUDA currently implicitly assumes that the MPI library is
		CUDA-aware. This is not always the case, especially when using
		pre-compiled MPI libraries provided by a Linux distribution. This is
		not a problem when using only a single GPU with a single MPI
		rank. When running with multiple MPI ranks, you may see segmentation
		faults without CUDA-aware MPI support. These can be avoided by adding
		the flags :doc:`-pk kokkos cuda/aware off <Run_options>` to the
		LAMMPS command line or by using the command :doc:`package kokkos
		cuda/aware off <package>` in the input file.

		Building LAMMPS with the KOKKOS package:
		Building LAMMPS with the KOKKOS package
		"""""""""""""""""""""""""""""""""""""""

		See the :ref:`Build extras <kokkos>` doc page for instructions.

		Running LAMMPS with the KOKKOS package:
		Running LAMMPS with the KOKKOS package
		""""""""""""""""""""""""""""""""""""""

		All Kokkos operations occur within the context of an individual MPI
		task running on a single node of the machine. The total number of MPI
		@@ -57,7 +60,8 @@ usual manner via the mpirun or mpiexec commands, and is independent of
		Kokkos. E.g. the mpirun command in OpenMPI does this via its -np and
		-npernode switches. Ditto for MPICH via -np and -ppn.

		Running on a multi-core CPU:
		Running on a multi-core CPU
		^^^^^^^^^^^^^^^^^^^^^^^^^^^

		Here is a quick overview of how to use the KOKKOS package
		for CPU acceleration, assuming one or more 16-core nodes.
		@@ -133,7 +137,8 @@ atom. When using the Kokkos Serial back end or the OpenMP back end with
		a single thread, no duplication or atomic operations are used. For CUDA
		and half neighbor lists, the KOKKOS package always uses atomic operations.

		Core and Thread Affinity:
		CPU Cores, Sockets and Thread Affinity
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		When using multi-threading, it is important for performance to bind
		both MPI tasks to physical cores, and threads to physical cores, so
		@@ -147,15 +152,16 @@ for your MPI installation), binding can be forced with these flags:
		OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...
		Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ...

		For binding threads with KOKKOS OpenMP, use thread affinity
		environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
		later, intel 12 or later) setting the environment variable
		OMP_PROC_BIND=true should be sufficient. In general, for best
		performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and
		OMP_PLACES=threads. For binding threads with the KOKKOS pthreads
		option, compile LAMMPS the KOKKOS HWLOC=yes option as described below.
		For binding threads with KOKKOS OpenMP, use thread affinity environment
		variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, intel 12
		or later) setting the environment variable ``OMP_PROC_BIND=true`` should
		be sufficient. In general, for best performance with OpenMP 4.0 or later
		set ``OMP_PROC_BIND=spread`` and ``OMP_PLACES=threads``. For binding
		threads with the KOKKOS pthreads option, compile LAMMPS with hwloc
		or libnuma support enabled as described in the :ref:`extra build options page <kokkos>`.
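
		As a sketch of the OpenMP 4.0 settings mentioned above, the
		environment variables can be exported in the shell before launching
		LAMMPS. The executable name, the count of 8 threads per MPI task
		(2 tasks on a 16-core node) and the input file ``in.lj`` are
		assumptions chosen for illustration only.

		.. code-block:: bash

		   # spread the OpenMP threads out and pin each one to a hardware thread
		   export OMP_PROC_BIND=spread
		   export OMP_PLACES=threads

		   # 2 MPI tasks bound to sockets, 8 Kokkos OpenMP threads per task
		   mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi -k on t 8 -sf kk -in in.lj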

		Running on Knight's Landing (KNL) Intel Xeon Phi:
		Running on Knight's Landing (KNL) Intel Xeon Phi
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		Here is a quick overview of how to use the KOKKOS package for the
		Intel Knight's Landing (KNL) Xeon Phi:
		@@ -213,7 +219,8 @@ threads/task as Nt. The product of these two values should be N, i.e.
		them in "native" mode, not "offload" mode like the USER-INTEL package
		supports.

		Running on GPUs:
		Running on GPUs
		^^^^^^^^^^^^^^^

		Use the "-k" :doc:`command-line switch <Run_options>` to specify the
		number of GPUs per node. Typically the -np setting of the mpirun command
		@@ -277,7 +284,8 @@ one or more nodes, each with two GPUs:
		kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
		However, this will reduce performance and is not recommended for production runs.

		Run with the KOKKOS package by editing an input script:
		Run with the KOKKOS package by editing an input script
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		Alternatively the effect of the "-sf" or "-pk" switches can be
		duplicated by adding the :doc:`package kokkos <package>` or :doc:`suffix kk <suffix>` commands to your input script.
		@@ -300,17 +308,24 @@ You only need to use the :doc:`package kokkos <package>` command if you
		wish to change any of its option defaults, as set by the "-k on"
		:doc:`command-line switch <Run_options>`.

		Using OpenMP threading and CUDA together (experimental):
		Using OpenMP threading and CUDA together:

		With the KOKKOS package, both OpenMP multi-threading and GPUs can be
		used together in a few special cases. In the Makefile, the
		KOKKOS_DEVICES variable must include both "Cuda" and "OpenMP", as is
		the case for /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi
		compiled and used together in a few special cases. In the makefile for
		the conventional build, the KOKKOS_DEVICES variable must include both
		"Cuda" and "OpenMP", as is the case for ``/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi``.

		.. code-block:: bash

		KOKKOS_DEVICES=Cuda,OpenMP

		When building with CMake you need to enable both features, as is done
		in the ``kokkos-cuda.cmake`` CMake preset file.

		.. code-block:: bash

		cmake ../cmake -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes

		The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
		using the "-sf kk" in the command line gives the default CUDA version
		everywhere. However, if the "/kk/host" suffix is added to a specific
		@@ -344,7 +359,8 @@ suffix for kspace and bonds, angles, etc. in the input file and the
		sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
		so CPU/GPU overlap can occur.

		Speed-ups to expect:
		Performance to expect
		"""""""""""""""""""""

		The performance of KOKKOS running in different modes is a function of
		your hardware, which KOKKOS-enabled styles are used, and the problem
		@@ -361,48 +377,26 @@ Generally speaking, the following rules of thumb apply:
		performance of a KOKKOS style is a bit slower than the USER-OMP
		package.
		* When running a large number of atoms per GPU, KOKKOS is typically faster
		than the GPU package.
		than the GPU package when compiled for double precision. The benefit
		of using single or mixed precision with the GPU package depends
		significantly on the hardware in use and the simulated system and pair
		style.
		* When running on Intel hardware, KOKKOS is not as fast as
		the USER-INTEL package, which is optimized for that hardware.
		the USER-INTEL package, which is optimized for x86 hardware (not just
		from Intel) and compilation with the Intel compilers. The USER-INTEL
		package also can increase the vector length of vector instructions
		by switching to single or mixed precision mode.

		See the `Benchmark page <https://lammps.sandia.gov/bench.html>`_ of the
		LAMMPS web site for performance of the KOKKOS package on different
		hardware.

		Advanced Kokkos options:

		There are other allowed options when building with the KOKKOS package.
		As explained on the :ref:`Build extras <kokkos>` doc page,
		they can be set either as variables on the make command line or in
		Makefile.machine, or they can be specified as CMake variables. Each
		takes a value shown below. The default value is listed, which is set
		in the lib/kokkos/Makefile.kokkos file. For a full listing of all options,
		see lib/kokkos/Makefile.kokkos.

		* KOKKOS_USE_TPLS, values = hwloc\ , librt\ , experimental_memkind, default = none
		* KOKKOS_DEBUG, values = yes\ , no\ , default = no
		* KOKKOS_CUDA_OPTIONS, values = force_uvm, use_ldg, rdc\ , enable_lambda\ , enable_constexpr, default = enable_lambda

		KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
		migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
		used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not
		necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP
		provides alternative methods via environment variables for binding
		threads to hardware cores. More info on binding threads to cores is
		given on the :doc:`Speed omp <Speed_omp>` doc page.

		KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
		on most Unix platforms. This library is not available on all
		platforms.

		KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
		within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
		debugging information that can be useful. It also enables runtime
		bounds checking on Kokkos data structures.

		KOKKOS_CUDA_OPTIONS are additional options for CUDA. The LAMMPS KOKKOS
		package must be compiled with the enable_lambda option when using
		GPUs.
		Advanced Kokkos options
		"""""""""""""""""""""""

		There are other allowed options when building with the KOKKOS package
		that can improve performance or assist in debugging or profiling.
		They are explained on the :ref:`KOKKOS section of the build extras <kokkos>` doc page.

		Restrictions
		""""""""""""

doc/utils/sphinx-config/false_positives.txt

+2 −0

		@@ -499,6 +499,7 @@ cuda
		Cuda
		CUDA
		CuH
		cuFFT
		Cummins
		Curk
		customIDs
		@@ -1544,6 +1545,7 @@ libmeam
		libmessage
		libmpi
		libmpich
		libnuma
		libplumed
		libplumedKernel
		libpng
