Unverified Commit 35cee904 authored by Axel Kohlmeyer's avatar Axel Kohlmeyer Committed by GitHub

Merge pull request #1580 from stanmoore1/kk_cuda_aware

Fix CUDA-aware MPI issues with KOKKOS package
parents f4a98417 f2dbe186
@@ -46,16 +46,15 @@ software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.

-NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
-library is CUDA-aware and has support for GPU-direct. This is not
-always the case, especially when using pre-compiled MPI libraries
-provided by a Linux distribution. This is not a problem when using
-only a single GPU and a single MPI rank on a desktop. When running
-with multiple MPI ranks, you may see segmentation faults without
-GPU-direct support. These can be avoided by adding the flags "-pk
-kokkos gpu/direct off"_Run_options.html to the LAMMPS command line or
-by using the command "package kokkos gpu/direct off"_package.html in
-the input file.
+NOTE: Kokkos with CUDA currently implicitly assumes that the MPI library
+is CUDA-aware. This is not always the case, especially when using
+pre-compiled MPI libraries provided by a Linux distribution. This is not
+a problem when using only a single GPU with a single MPI rank. When
+running with multiple MPI ranks, you may see segmentation faults without
+CUDA-aware MPI support. These can be avoided by adding the flags "-pk
+kokkos cuda/aware off"_Run_options.html to the LAMMPS command line or by
+using the command "package kokkos cuda/aware off"_package.html in the
+input file.
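For readers following along, the two ways of disabling CUDA-aware MPI described in the note look like this in practice (a sketch only; the input file name in.lj and the counts of ranks/GPUs are placeholders):

```
# on the command line:
mpirun -np 4 lmp -k on g 4 -sf kk -pk kokkos cuda/aware off -in in.lj

# or equivalently in the input script, before the first run:
package kokkos cuda/aware off
```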

[Building LAMMPS with the KOKKOS package:]

@@ -217,9 +216,8 @@ case, also packing/unpacking communication buffers on the host may give
speedup (see the KOKKOS "package"_package.html command). Using CUDA MPS 
is recommended in this scenario.

-Using a CUDA-aware MPI library with
-support for GPU-direct is highly recommended. GPU-direct use can be
-avoided by using "-pk kokkos gpu/direct no"_package.html. As above for
+Using a CUDA-aware MPI library is highly recommended. CUDA-aware MPI use can be
+avoided by using "-pk kokkos cuda/aware no"_package.html. As above for
multi-core CPUs (and no GPU), if N is the number of physical cores/node, 
then the number of MPI tasks/node should not exceed N.

@@ -64,7 +64,7 @@ args = arguments specific to the style :l
      {no_affinity} values = none
  {kokkos} args = keyword value ...
    zero or more keyword/value pairs may be appended
-    keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
+    keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {cuda/aware}
      {neigh} value = {full} or {half}
        full = full neighbor list
        half = half neighbor list built in thread-safe manner
@@ -87,9 +87,9 @@ args = arguments specific to the style :l
        no = perform communication pack/unpack in non-KOKKOS mode
        host = perform pack/unpack on host (e.g. with OpenMP threading)
        device = perform pack/unpack on device (e.g. on GPU)
-      {gpu/direct} = {off} or {on}
-        off = do not use GPU-direct
-        on = use GPU-direct (default)
+      {cuda/aware} = {off} or {on}
+        off = do not use CUDA-aware MPI
+        on = use CUDA-aware MPI (default)
  {omp} args = Nthreads keyword value ...
    Nthread = # of OpenMP threads to associate with each MPI process
    zero or more keyword/value pairs may be appended
@@ -520,19 +520,21 @@ pack/unpack communicated data. When running small systems on a GPU,
performing the exchange pack/unpack on the host CPU can give speedup 
since it reduces the number of CUDA kernel launches.

-The {gpu/direct} keyword chooses whether GPU-direct will be used. When
+The {cuda/aware} keyword chooses whether CUDA-aware MPI will be used. When
this keyword is set to {on}, buffers in GPU memory are passed directly
through MPI send/receive calls. This reduces overhead of first copying
-the data to the host CPU. However GPU-direct is not supported on all
+the data to the host CPU. However CUDA-aware MPI is not supported on all
systems, which can lead to segmentation faults and would require using a
-value of {off}. If LAMMPS can safely detect that GPU-direct is not
+value of {off}. If LAMMPS can safely detect that CUDA-aware MPI is not
available (currently only possible with OpenMPI v2.0.0 or later), then
-the {gpu/direct} keyword is automatically set to {off} by default. When
-the {gpu/direct} keyword is set to {off} while any of the {comm}
+the {cuda/aware} keyword is automatically set to {off} by default. When
+the {cuda/aware} keyword is set to {off} while any of the {comm}
keywords are set to {device}, the value for these {comm} keywords will
be automatically changed to {host}. This setting has no effect if not
-running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
-versions), Mvapich2 1.9 (or later), and CrayMPI.
+running on GPUs. CUDA-aware MPI is available for OpenMPI 1.8 (or later
+versions), Mvapich2 1.9 (or later) when the "MV2_USE_CUDA" environment
+variable is set to "1", CrayMPI, and IBM Spectrum MPI when the "-gpu"
+flag is used.

:line

@@ -641,8 +643,8 @@ switch"_Run_options.html.

For the KOKKOS package, the option defaults for GPUs are neigh = full, 
neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default 
-value, comm = device, gpu/direct = on. When LAMMPS can safely detect
-that GPU-direct is not available, the default value of gpu/direct
+value, comm = device, cuda/aware = on. When LAMMPS can safely detect
+that CUDA-aware MPI is not available, the default value of cuda/aware
becomes "off". For CPUs or Xeon Phis, the option defaults are neigh = 
half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The 
option neigh/thread = on when there are 16K atoms or less on an MPI 
@@ -418,7 +418,7 @@ void CommKokkos::forward_comm_pair_device(Pair *pair)
    if (sendproc[iswap] != me) {
      double* buf_send_pair;
      double* buf_recv_pair;
-      if (lmp->kokkos->gpu_direct_flag) {
+      if (lmp->kokkos->cuda_aware_flag) {
        buf_send_pair = k_buf_send_pair.view<DeviceType>().data();
        buf_recv_pair = k_buf_recv_pair.view<DeviceType>().data();
      } else {
@@ -436,7 +436,7 @@ void CommKokkos::forward_comm_pair_device(Pair *pair)
        MPI_Send(buf_send_pair,n,MPI_DOUBLE,sendproc[iswap],0,world);
      if (recvnum[iswap]) MPI_Wait(&request,MPI_STATUS_IGNORE);

-      if (!lmp->kokkos->gpu_direct_flag) {
+      if (!lmp->kokkos->cuda_aware_flag) {
        k_buf_recv_pair.modify<LMPHostType>();
        k_buf_recv_pair.sync<DeviceType>();
      }
@@ -529,7 +529,7 @@ void GridCommKokkos<DeviceType>::forward_comm(KSpace *kspace, int which)
    if (swap[m].sendproc != me) {
      FFT_SCALAR* buf1;
      FFT_SCALAR* buf2;
-      if (lmp->kokkos->gpu_direct_flag) {
+      if (lmp->kokkos->cuda_aware_flag) {
        buf1 = k_buf1.view<DeviceType>().data();
        buf2 = k_buf2.view<DeviceType>().data();
      } else {
@@ -545,7 +545,7 @@ void GridCommKokkos<DeviceType>::forward_comm(KSpace *kspace, int which)
               swap[m].sendproc,0,gridcomm);
      MPI_Wait(&request,MPI_STATUS_IGNORE);

-      if (!lmp->kokkos->gpu_direct_flag) {
+      if (!lmp->kokkos->cuda_aware_flag) {
        k_buf2.modify<LMPHostType>();
        k_buf2.sync<DeviceType>();
      }
@@ -579,7 +579,7 @@ void GridCommKokkos<DeviceType>::reverse_comm(KSpace *kspace, int which)
    if (swap[m].recvproc != me) {
      FFT_SCALAR* buf1;
      FFT_SCALAR* buf2;
-      if (lmp->kokkos->gpu_direct_flag) {
+      if (lmp->kokkos->cuda_aware_flag) {
        buf1 = k_buf1.view<DeviceType>().data();
        buf2 = k_buf2.view<DeviceType>().data();
      } else {
@@ -595,7 +595,7 @@ void GridCommKokkos<DeviceType>::reverse_comm(KSpace *kspace, int which)
               swap[m].recvproc,0,gridcomm);
      MPI_Wait(&request,MPI_STATUS_IGNORE);

-      if (!lmp->kokkos->gpu_direct_flag) {
+      if (!lmp->kokkos->cuda_aware_flag) {
        k_buf2.modify<LMPHostType>();
        k_buf2.sync<DeviceType>();
      }
@@ -28,32 +28,36 @@

#ifdef KOKKOS_ENABLE_CUDA

-// for detecting GPU-direct support:
-// the function  int have_gpu_direct()
-// - returns -1 if GPU-direct support is unknown
-// - returns  0 if no GPU-direct support available
-// - returns  1 if GPU-direct support is available
+// for detecting CUDA-aware MPI support:
+// the variable int have_cuda_aware
+// - is  1 if CUDA-aware MPI support is available
+// - is  0 if CUDA-aware MPI support is unavailable
+// - is -1 if CUDA-aware MPI support is unknown

-#define GPU_DIRECT_UNKNOWN static int have_gpu_direct() {return -1;}
+#define CUDA_AWARE_UNKNOWN static int have_cuda_aware = -1;

-// OpenMPI supports detecting GPU-direct as of version 2.0.0
-#if OPEN_MPI
+// OpenMPI supports detecting CUDA-aware MPI as of version 2.0.0
+
+#if (OPEN_MPI)
#if (OMPI_MAJOR_VERSION >= 2)

#include <mpi-ext.h>
-#if defined(MPIX_CUDA_AWARE_SUPPORT)
-static int have_gpu_direct() { return MPIX_Query_cuda_support(); }
+
+#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
+static int have_cuda_aware = 1;
+#elif defined(MPIX_CUDA_AWARE_SUPPORT) && !MPIX_CUDA_AWARE_SUPPORT
+static int have_cuda_aware = 0;
#else
-GPU_DIRECT_UNKNOWN
-#endif
+CUDA_AWARE_UNKNOWN
+#endif // defined(MPIX_CUDA_AWARE_SUPPORT)

#else // old OpenMPI
-GPU_DIRECT_UNKNOWN
-#endif
+CUDA_AWARE_UNKNOWN
+#endif // (OMPI_MAJOR_VERSION >=2)

#else // unknown MPI library
-GPU_DIRECT_UNKNOWN
-#endif
+CUDA_AWARE_UNKNOWN
+#endif // OPEN_MPI

#endif // KOKKOS_ENABLE_CUDA

@@ -66,6 +70,10 @@ KokkosLMP::KokkosLMP(LAMMPS *lmp, int narg, char **arg) : Pointers(lmp)
  kokkos_exists = 1;
  lmp->kokkos = this;

+  exchange_comm_changed = 0;
+  forward_comm_changed = 0;
+  reverse_comm_changed = 0;

  delete memory;
  memory = new MemoryKokkos(lmp);
  memoryKK = (MemoryKokkos*) memory;
@@ -145,29 +153,10 @@ KokkosLMP::KokkosLMP(LAMMPS *lmp, int narg, char **arg) : Pointers(lmp)
#ifdef KOKKOS_ENABLE_CUDA
  if (ngpus <= 0)
    error->all(FLERR,"Kokkos has been compiled for CUDA but no GPUs are requested");

-  // check and warn about GPU-direct availability when using multiple MPI tasks
-
-  int nmpi = 0;
-  MPI_Comm_size(world,&nmpi);
-  if ((nmpi > 1) && (me == 0)) {
-    if ( 1 == have_gpu_direct() ) {
-      ; // all good, nothing to warn about
-    } else if (-1 == have_gpu_direct() ) {
-      error->warning(FLERR,"Kokkos with CUDA assumes GPU-direct is available,"
-                     " but cannot determine if this is the case\n         try"
-                     " '-pk kokkos gpu/direct off' when getting segmentation faults");
-    } else if ( 0 == have_gpu_direct() ) {
-      error->warning(FLERR,"GPU-direct is NOT available, "
-                     "using '-pk kokkos gpu/direct off' by default");
-    } else {
-      ; // should never get here
-    }
-  }
#endif

#ifndef KOKKOS_ENABLE_SERIAL
-  if (nthreads == 1)
+  if (nthreads == 1 && me == 0)
    error->warning(FLERR,"When using a single thread, the Kokkos Serial backend "
                         "(i.e. Makefile.kokkos_mpi_only) gives better performance "
                         "than the OpenMP backend");
@@ -183,7 +172,11 @@ KokkosLMP::KokkosLMP(LAMMPS *lmp, int narg, char **arg) : Pointers(lmp)
  // default settings for package kokkos command

  binsize = 0.0;
-  gpu_direct_flag = 1;
+#ifdef KOKKOS_ENABLE_CUDA
+  cuda_aware_flag = 1;
+#else
+  cuda_aware_flag = 0;
+#endif
  neigh_thread = 0;
  neigh_thread_set = 0;
  neighflag_qeq_set = 0;
@@ -206,10 +199,66 @@ KokkosLMP::KokkosLMP(LAMMPS *lmp, int narg, char **arg) : Pointers(lmp)
    exchange_comm_on_host = forward_comm_on_host = reverse_comm_on_host = 0;
  }

-#if KOKKOS_USE_CUDA
-  // only if we can safely detect, that GPU-direct is not available, change default
-  if (0 == have_gpu_direct()) gpu_direct_flag = 0;
+#ifdef KOKKOS_ENABLE_CUDA
+
+  // check and warn about CUDA-aware MPI availability when using multiple MPI tasks
+  // change default only if we can safely detect that CUDA-aware MPI is not available
+
+  int nmpi = 0;
+  MPI_Comm_size(world,&nmpi);
+  if (nmpi > 0) {
+
+    // Check for IBM Spectrum MPI
+
+    int len;
+    char mpi_version[MPI_MAX_LIBRARY_VERSION_STRING];
+    MPI_Get_library_version(mpi_version, &len);
+    if (strstr(&mpi_version[0], "Spectrum") != NULL) {
+      cuda_aware_flag = 0;
+      char* str;
+      if (str = getenv("OMPI_MCA_pml_pami_enable_cuda"))
+        if((strcmp(str,"1") == 0)) {
+          have_cuda_aware = 1;
+          cuda_aware_flag = 1;
+        }
+
+      if (!cuda_aware_flag)
+        if (me == 0)
+          error->warning(FLERR,"The Spectrum MPI '-gpu' flag is not set. Disabling CUDA-aware MPI");
+    }
+
+    if (cuda_aware_flag == 1 && have_cuda_aware == 0) {
+      if (me == 0)
+        error->warning(FLERR,"Turning off CUDA-aware MPI since it is not detected, "
+                       "use '-pk kokkos cuda/aware on' to override");
+      cuda_aware_flag = 0;
+    } else if (have_cuda_aware == -1) { // maybe we are dealing with MPICH, MVAPICH2 or some derivative?
+    // MVAPICH2
+#if (defined MPICH) && (defined MVAPICH2_VERSION)
+      char* str;
+      cuda_aware_flag = 0;
+      if (str = getenv("MV2_ENABLE_CUDA"))
+        if ((strcmp(str,"1") == 0))
+          cuda_aware_flag = 1;
+
+      if (!cuda_aware_flag)
+        if (me == 0)
+          error->warning(FLERR,"MVAPICH2 'MV2_ENABLE_CUDA' environment variable is not set. Disabling CUDA-aware MPI");
+    // pure MPICH or some unsupported MPICH derivative
+#elif (defined MPICH) && !(defined MVAPICH2_VERSION)
+      if (me == 0)
+        error->warning(FLERR,"Detected MPICH. Disabling CUDA-aware MPI");
+      cuda_aware_flag = 0;
+#else
+      if (me == 0)
+        error->warning(FLERR,"Kokkos with CUDA assumes CUDA-aware MPI is available,"
+                       " but cannot determine if this is the case\n         try"
+                       " '-pk kokkos cuda/aware off' if getting segmentation faults");
+
+#endif
+    } // if (-1 == have_cuda_aware)
+  } // nmpi > 0
+#endif // KOKKOS_ENABLE_CUDA

#ifdef KILL_KOKKOS_ON_SIGSEGV
  signal(SIGSEGV, my_signal_handler);
@@ -290,6 +339,7 @@ void KokkosLMP::accelerator(int narg, char **arg)
        exchange_comm_classic = 0;
        exchange_comm_on_host = 0;
      } else error->all(FLERR,"Illegal package kokkos command");
+      exchange_comm_changed = 0;
      iarg += 2;
    } else if (strcmp(arg[iarg],"comm/forward") == 0) {
      if (iarg+2 > narg) error->all(FLERR,"Illegal package kokkos command");
@@ -301,6 +351,7 @@ void KokkosLMP::accelerator(int narg, char **arg)
        forward_comm_classic = 0;
        forward_comm_on_host = 0;
      } else error->all(FLERR,"Illegal package kokkos command");
+      forward_comm_changed = 0;
      iarg += 2;
    } else if (strcmp(arg[iarg],"comm/reverse") == 0) {
      if (iarg+2 > narg) error->all(FLERR,"Illegal package kokkos command");
@@ -312,11 +363,12 @@ void KokkosLMP::accelerator(int narg, char **arg)
        reverse_comm_classic = 0;
        reverse_comm_on_host = 0;
      } else error->all(FLERR,"Illegal package kokkos command");
+      reverse_comm_changed = 0;
      iarg += 2;
-    } else if (strcmp(arg[iarg],"gpu/direct") == 0) {
+    } else if (strcmp(arg[iarg],"cuda/aware") == 0) {
      if (iarg+2 > narg) error->all(FLERR,"Illegal package kokkos command");
-      if (strcmp(arg[iarg+1],"off") == 0) gpu_direct_flag = 0;
-      else if (strcmp(arg[iarg+1],"on") == 0) gpu_direct_flag = 1;
+      if (strcmp(arg[iarg+1],"off") == 0) cuda_aware_flag = 0;
+      else if (strcmp(arg[iarg+1],"on") == 0) cuda_aware_flag = 1;
      else error->all(FLERR,"Illegal package kokkos command");
      iarg += 2;
    } else if (strcmp(arg[iarg],"neigh/thread") == 0) {
@@ -329,15 +381,38 @@ void KokkosLMP::accelerator(int narg, char **arg)
    } else error->all(FLERR,"Illegal package kokkos command");
  }

-  // if "gpu/direct off" and "comm device", change to "comm host"
+  // if "cuda/aware off" and "comm device", change to "comm host"

-  if (!gpu_direct_flag) {
-   if (exchange_comm_classic == 0 && exchange_comm_on_host == 0)
+  if (!cuda_aware_flag) {
+    if (exchange_comm_classic == 0 && exchange_comm_on_host == 0) {
      exchange_comm_on_host = 1;
-   if (forward_comm_classic == 0 && forward_comm_on_host == 0)
+      exchange_comm_changed = 1;
+    }
+    if (forward_comm_classic == 0 && forward_comm_on_host == 0) {
      forward_comm_on_host = 1;
-   if (reverse_comm_classic == 0 && reverse_comm_on_host == 0)
+      forward_comm_changed = 1;
+    }
+    if (reverse_comm_classic == 0 && reverse_comm_on_host == 0) {
      reverse_comm_on_host = 1;
+      reverse_comm_changed = 1;
+    }
  }

+  // if "cuda/aware on" and comm flags were changed previously, change them back
+
+  if (cuda_aware_flag) {
+    if (exchange_comm_changed) {
+      exchange_comm_on_host = 0;
+      exchange_comm_changed = 0;
+    }
+    if (forward_comm_changed) {
+      forward_comm_on_host = 0;
+      forward_comm_changed = 0;
+    }
+    if (reverse_comm_changed) {
+      reverse_comm_on_host = 0;
+      reverse_comm_changed = 0;
+    }
+  }

  // set newton flags