Unverified Commit 9421466f authored by Axel Kohlmeyer

Merge branch 'master' into enh-ext-reaxc

Resolved Merge Conflict in src/KOKKOS/kokkos.cpp
parents fa764721 8d985e53
+16 −14
Original line number Diff line number Diff line
@@ -46,7 +46,7 @@ software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.

-NOTE: Kokkos with CUDA currently implicitly assumes, that the MPI
+NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
library is CUDA-aware and has support for GPU-direct. This is not
always the case, especially when using pre-compiled MPI libraries
provided by a Linux distribution. This is not a problem when using
@@ -207,19 +207,21 @@ supports.

[Running on GPUs:]

-Use the "-k" "command-line switch"_Run_options.html to
-specify the number of GPUs per node. Typically the -np setting of the
-mpirun command should set the number of MPI tasks/node to be equal to
-the number of physical GPUs on the node.  You can assign multiple MPI
-tasks to the same GPU with the KOKKOS package, but this is usually
-only faster if significant portions of the input script have not
-been ported to use Kokkos. Using CUDA MPS is recommended in this
-scenario. Using a CUDA-aware MPI library with support for GPU-direct
-is highly recommended. GPU-direct use can be avoided by using
-"-pk kokkos gpu/direct no"_package.html.
-As above for multi-core CPUs (and no GPU), if N is the number of
-physical cores/node, then the number of MPI tasks/node should not
-exceed N.
+Use the "-k" "command-line switch"_Run_options.html to specify the
+number of GPUs per node. Typically the -np setting of the mpirun command
+should set the number of MPI tasks/node to be equal to the number of
+physical GPUs on the node. You can assign multiple MPI tasks to the same
+GPU with the KOKKOS package, but this is usually only faster if some
+portions of the input script have not been ported to use Kokkos. In this
+case, packing/unpacking communication buffers on the host may also give
+a speedup (see the KOKKOS "package"_package.html command). Using CUDA
+MPS is recommended in this scenario.
+
+Using a CUDA-aware MPI library with
+support for GPU-direct is highly recommended. GPU-direct use can be
+avoided by using "-pk kokkos gpu/direct no"_package.html. As above for
+multi-core CPUs (and no GPU), if N is the number of physical cores/node,
+then the number of MPI tasks/node should not exceed N.

-k on g Ng :pre
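As a concrete illustration of the paragraph above, a hypothetical launch on a node with 2 GPUs might look as follows (the executable name lmp and input file in.lj are placeholders; the -k, -sf, and -pk switches are the ones documented here):

```shell
# One MPI task per physical GPU: -np matches the "g 2" setting.
mpirun -np 2 lmp -k on g 2 -sf kk -in in.lj

# If the MPI library is not CUDA-aware, disable GPU-direct:
mpirun -np 2 lmp -k on g 2 -sf kk -pk kokkos gpu/direct no -in in.lj
```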

+37 −20
@@ -64,13 +64,16 @@ args = arguments specific to the style :l
      {no_affinity} values = none
  {kokkos} args = keyword value ...
    zero or more keyword/value pairs may be appended
    keywords = {neigh} or {neigh/qeq} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
+    keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
      {neigh} value = {full} or {half}
        full = full neighbor list
        half = half neighbor list built in thread-safe manner
      {neigh/qeq} value = {full} or {half}
        full = full neighbor list
        half = half neighbor list built in thread-safe manner
+      {neigh/thread} value = {off} or {on}
+        off = thread only over atoms
+        on = thread over both atoms and neighbors
      {newton} = {off} or {on}
        off = set Newton pairwise and bonded flags off
        on = set Newton pairwise and bonded flags on
@@ -444,6 +447,18 @@ the {neigh/qeq} keyword determines how neighbor lists are built for "fix
qeq/reax/kk"_fix_qeq_reax.html. If not explicitly set, the value of 
{neigh/qeq} will match {neigh}.

+If the {neigh/thread} keyword is set to {off}, then the KOKKOS package
+threads only over atoms. However, for small systems, this may not expose
+enough parallelism to keep a GPU busy. When this keyword is set to {on},
+the KOKKOS package threads over both atoms and neighbors of atoms. When
+using {neigh/thread} {on}, a full neighbor list must also be used. Using
+{neigh/thread} {on} may be slower for large systems, so this option is
+turned on by default only when there are 16K atoms or less owned by an
+MPI rank and when using a full neighbor list. Not all KOKKOS-enabled
+potentials support this keyword yet; those that do not thread only over
+atoms. Many simple pair-wise potentials such as Lennard-Jones do support
+threading over both atoms and neighbors.
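The two threading modes described above can be sketched in plain Python (a hand-written illustration, not LAMMPS code; the names `neighbors` and `pair_force` are made up for this sketch):

```python
def force_atom_threaded(neighbors, pair_force):
    # neigh/thread off: one unit of parallel work per atom; each
    # iteration serially walks that atom's neighbor list.
    f = [0.0] * len(neighbors)
    for i in range(len(neighbors)):          # parallel over atoms
        for j in neighbors[i]:               # serial inner loop
            f[i] += pair_force(i, j)
    return f

def force_pair_threaded(neighbors, pair_force):
    # neigh/thread on: flatten to (atom, neighbor) pairs so a GPU
    # can thread over both; this needs a full neighbor list, and on
    # a real GPU the accumulation would be an atomic add.
    pairs = [(i, j) for i in range(len(neighbors)) for j in neighbors[i]]
    f = [0.0] * len(neighbors)
    for i, j in pairs:                       # parallel over pairs
        f[i] += pair_force(i, j)
    return f
```

Both variants compute the same forces; the second simply exposes `len(pairs)` units of work instead of the number of atoms, which is what keeps a GPU busy for small systems.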

The {newton} keyword sets the Newton flags for pairwise and bonded 
interactions to {off} or {on}, the same as the "newton"_newton.html 
command allows. The default for GPUs is {off} because this will almost 
@@ -498,14 +513,14 @@ identically. When using GPUs, the {device} value is the default since it
will typically be optimal if all of your styles used in your input 
script are supported by the KOKKOS package. In this case data can stay 
on the GPU for many timesteps without being moved between the host and 
-GPU, if you use the {device} value. This requires that your MPI is able
-to access GPU memory directly. Currently that is true for OpenMPI 1.8
-(or later versions), Mvapich2 1.9 (or later), and CrayMPI. If your
-script uses styles (e.g. fixes) which are not yet supported by the
-KOKKOS package, then data has to be move between the host and device
-anyway, so it is typically faster to let the host handle communication,
-by using the {host} value. Using {host} instead of {no} will enable use
-of multiple threads to pack/unpack communicated data.
+GPU, if you use the {device} value. If your script uses styles (e.g.
+fixes) which are not yet supported by the KOKKOS package, then data has
+to be moved between the host and device anyway, so it is typically
+faster to let the host handle communication by using the {host} value.
+Using {host} instead of {no} will enable use of multiple threads to
+pack/unpack communicated data. When running small systems on a GPU,
+performing the exchange pack/unpack on the host CPU can give a speedup
+since it reduces the number of CUDA kernel launches.

The {gpu/direct} keyword chooses whether GPU-direct will be used. When 
this keyword is set to {on}, buffers in GPU memory are passed directly 
@@ -518,7 +533,8 @@ the {gpu/direct} keyword is automatically set to {off} by default. When
the {gpu/direct} keyword is set to {off} while any of the {comm} 
keywords are set to {device}, the value for these {comm} keywords will 
be automatically changed to {host}. This setting has no effect if not 
-running on GPUs.
+running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
+versions), Mvapich2 1.9 (or later), and CrayMPI.
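The interaction between {gpu/direct} and the {comm} keywords described above can be captured in a small sketch (a hypothetical helper mirroring the documented rule, not the actual LAMMPS implementation):

```python
def resolve_comm_settings(gpu_direct, comm):
    # Documented rule: with gpu/direct off, any comm keyword set to
    # "device" is automatically changed to "host", since buffers can
    # no longer be handed to MPI while resident in GPU memory.
    # comm maps keyword name -> "no" | "host" | "device".
    if gpu_direct == "off":
        return {k: ("host" if v == "device" else v) for k, v in comm.items()}
    return dict(comm)
```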

:line

@@ -630,11 +646,12 @@ neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default
value, comm = device, gpu/direct = on. When LAMMPS can safely detect 
that GPU-direct is not available, the default value of gpu/direct 
becomes "off". For CPUs or Xeon Phis, the option defaults are neigh = 
-half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. These
-settings are made automatically by the required "-k on" "command-line
-switch"_Run_options.html. You can change them by using the package
-kokkos command in your input script or via the "-pk kokkos command-line
-switch"_Run_options.html.
+half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The
+option neigh/thread defaults to on when there are 16K atoms or less on
+an MPI rank, otherwise it is off. These settings are made automatically
+by the required "-k on" "command-line switch"_Run_options.html. You can
+change them by using the package kokkos command in your input script or
+via the "-pk kokkos command-line switch"_Run_options.html.
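The default selection spelled out above (neigh/thread turned on only for small per-rank atom counts with a full neighbor list) amounts to the following sketch; taking "16K" as exactly 16384 is an assumption about the cutoff:

```python
def default_neigh_thread(atoms_per_rank, neigh):
    # On by default only when an MPI rank owns at most 16K atoms
    # and a full neighbor list is in use; off in every other case.
    return "on" if (atoms_per_rank <= 16384 and neigh == "full") else "off"
```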

For the OMP package, the default is Nthreads = 0 and the option
defaults are neigh = yes.  These settings are made automatically if
+3 −0
@@ -22,6 +22,7 @@
#include "memory_kokkos.h"
#include "error.h"
#include "kokkos.h"
+#include "atom_masks.h"

using namespace LAMMPS_NS;

@@ -270,8 +271,10 @@ int AtomKokkos::add_custom(const char *name, int flag)
    int n = strlen(name) + 1;
    dname[index] = new char[n];
    strcpy(dname[index],name);
+    this->sync(Device,DVECTOR_MASK);
    memoryKK->grow_kokkos(k_dvector,dvector,ndvector,nmax,
                        "atom:dvector");
+    this->modified(Device,DVECTOR_MASK);
  }

  return index;
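The sync-before-grow / modified-after pattern added to add_custom follows the usual Kokkos dual-view discipline: make the device copy current, reallocate there, then mark it as the newest copy. A toy model of that discipline (not the Kokkos API; all names here are invented):

```python
class DualView:
    # Toy stand-in for a Kokkos DualView: two copies of the data
    # plus a flag recording which side was modified last.
    def __init__(self, data):
        self.host = list(data)
        self.device = list(data)
        self.dirty = None          # None, "host", or "device"

    def sync(self, space):
        # Copy the up-to-date side onto `space` before using it.
        if self.dirty and self.dirty != space:
            setattr(self, space, list(getattr(self, self.dirty)))
            self.dirty = None

    def modified(self, space):
        # Record that `space` now holds the newest data.
        self.dirty = space

def grow_on_device(view, extra):
    view.sync("device")            # like this->sync(Device,DVECTOR_MASK)
    view.device += [0.0] * extra   # like memoryKK->grow_kokkos(...)
    view.modified("device")        # like this->modified(Device,...)
```

Skipping the sync would grow a stale device copy; skipping the modified call would let a later sync overwrite the grown data.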
+25 −24
@@ -24,7 +24,7 @@

using namespace LAMMPS_NS;

-#define DELTA 10000
+#define DELTA 10

/* ---------------------------------------------------------------------- */

@@ -59,14 +59,15 @@ AtomVecAngleKokkos::AtomVecAngleKokkos(LAMMPS *lmp) : AtomVecKokkos(lmp)

void AtomVecAngleKokkos::grow(int n)
{
-  if (n == 0) nmax += DELTA;
+  int step = MAX(DELTA,nmax*0.01);
+  if (n == 0) nmax += step;
  else nmax = n;
  atomKK->nmax = nmax;
  if (nmax < 0 || nmax > MAXSMALLINT)
    error->one(FLERR,"Per-processor system is too big");

-  sync(Device,ALL_MASK);
-  modified(Device,ALL_MASK);
+  atomKK->sync(Device,ALL_MASK);
+  atomKK->modified(Device,ALL_MASK);

  memoryKK->grow_kokkos(atomKK->k_tag,atomKK->tag,nmax,"atom:tag");
  memoryKK->grow_kokkos(atomKK->k_type,atomKK->type,nmax,"atom:type");
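The grow() change above replaces the fixed 10000-atom increment with a proportional one. In plain Python the new step computation is (MAX becomes max, and the truncation to int that the C++ assignment performs is made explicit):

```python
DELTA = 10  # new minimum growth increment, per the diff

def next_nmax(nmax, n):
    # Grow by 1% of the current size, but never by less than DELTA,
    # mirroring: int step = MAX(DELTA,nmax*0.01); if (n == 0) nmax += step;
    if n == 0:
        return nmax + int(max(DELTA, nmax * 0.01))
    return n  # else nmax = n;
```

So small arrays grow in steps of 10, while a 10000-atom array grows by 100, avoiding the old behavior of over-allocating 10000 slots at a time.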
@@ -98,7 +99,7 @@ void AtomVecAngleKokkos::grow(int n)
                      "atom:angle_atom3");

  grow_reset();
-  sync(Host,ALL_MASK);
+  atomKK->sync(Host,ALL_MASK);

  if (atom->nextra_grow)
    for (int iextra = 0; iextra < atom->nextra_grow; iextra++)
@@ -282,7 +283,7 @@ int AtomVecAngleKokkos::pack_comm_kokkos(const int &n,
  // Choose correct forward PackComm kernel

  if(commKK->forward_comm_on_host) {
-    sync(Host,X_MASK);
+    atomKK->sync(Host,X_MASK);
    if(pbc_flag) {
      if(domain->triclinic) {
        struct AtomVecAngleKokkos_PackComm<LMPHostType,1,1> f(atomKK->k_x,buf,list,iswap,
@@ -309,7 +310,7 @@ int AtomVecAngleKokkos::pack_comm_kokkos(const int &n,
      }
    }
  } else {
-    sync(Device,X_MASK);
+    atomKK->sync(Device,X_MASK);
    if(pbc_flag) {
      if(domain->triclinic) {
        struct AtomVecAngleKokkos_PackComm<LMPDeviceType,1,1> f(atomKK->k_x,buf,list,iswap,
@@ -397,8 +398,8 @@ int AtomVecAngleKokkos::pack_comm_self(const int &n, const DAT::tdual_int_2d &li
                                       const int nfirst, const int &pbc_flag,
                                       const int* const pbc) {
  if(commKK->forward_comm_on_host) {
-    sync(Host,X_MASK);
-    modified(Host,X_MASK);
+    atomKK->sync(Host,X_MASK);
+    atomKK->modified(Host,X_MASK);
    if(pbc_flag) {
      if(domain->triclinic) {
      struct AtomVecAngleKokkos_PackCommSelf<LMPHostType,1,1>
@@ -429,8 +430,8 @@ int AtomVecAngleKokkos::pack_comm_self(const int &n, const DAT::tdual_int_2d &li
      }
    }
  } else {
-    sync(Device,X_MASK);
-    modified(Device,X_MASK);
+    atomKK->sync(Device,X_MASK);
+    atomKK->modified(Device,X_MASK);
    if(pbc_flag) {
      if(domain->triclinic) {
      struct AtomVecAngleKokkos_PackCommSelf<LMPDeviceType,1,1>
@@ -493,13 +494,13 @@ struct AtomVecAngleKokkos_UnpackComm {
void AtomVecAngleKokkos::unpack_comm_kokkos(const int &n, const int &first,
    const DAT::tdual_xfloat_2d &buf ) {
  if(commKK->forward_comm_on_host) {
-    sync(Host,X_MASK);
-    modified(Host,X_MASK);
+    atomKK->sync(Host,X_MASK);
+    atomKK->modified(Host,X_MASK);
    struct AtomVecAngleKokkos_UnpackComm<LMPHostType> f(atomKK->k_x,buf,first);
    Kokkos::parallel_for(n,f);
  } else {
-    sync(Device,X_MASK);
-    modified(Device,X_MASK);
+    atomKK->sync(Device,X_MASK);
+    atomKK->modified(Device,X_MASK);
    struct AtomVecAngleKokkos_UnpackComm<LMPDeviceType> f(atomKK->k_x,buf,first);
    Kokkos::parallel_for(n,f);
  }
@@ -642,7 +643,7 @@ void AtomVecAngleKokkos::unpack_comm_vel(int n, int first, double *buf)
int AtomVecAngleKokkos::pack_reverse(int n, int first, double *buf)
{
  if(n > 0)
-    sync(Host,F_MASK);
+    atomKK->sync(Host,F_MASK);

  int m = 0;
  const int last = first + n;
@@ -659,7 +660,7 @@ int AtomVecAngleKokkos::pack_reverse(int n, int first, double *buf)
void AtomVecAngleKokkos::unpack_reverse(int n, int *list, double *buf)
{
  if(n > 0)
-    modified(Host,F_MASK);
+    atomKK->modified(Host,F_MASK);

  int m = 0;
  for (int i = 0; i < n; i++) {
@@ -960,9 +961,9 @@ struct AtomVecAngleKokkos_UnpackBorder {
void AtomVecAngleKokkos::unpack_border_kokkos(const int &n, const int &first,
                                             const DAT::tdual_xfloat_2d &buf,
                                             ExecutionSpace space) {
-  modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
+  atomKK->modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
  while (first+n >= nmax) grow(0);
-  modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
+  atomKK->modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
  if(space==Host) {
    struct AtomVecAngleKokkos_UnpackBorder<LMPHostType>
      f(buf.view<LMPHostType>(),h_x,h_tag,h_type,h_mask,h_molecule,first);
@@ -984,7 +985,7 @@ void AtomVecAngleKokkos::unpack_border(int n, int first, double *buf)
  last = first + n;
  for (i = first; i < last; i++) {
    if (i == nmax) grow(0);
-    modified(Host,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
+    atomKK->modified(Host,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
    h_x(i,0) = buf[m++];
    h_x(i,1) = buf[m++];
    h_x(i,2) = buf[m++];
@@ -1010,7 +1011,7 @@ void AtomVecAngleKokkos::unpack_border_vel(int n, int first, double *buf)
  last = first + n;
  for (i = first; i < last; i++) {
    if (i == nmax) grow(0);
-    modified(Host,X_MASK|V_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
+    atomKK->modified(Host,X_MASK|V_MASK|TAG_MASK|TYPE_MASK|MASK_MASK|MOLECULE_MASK);
    h_x(i,0) = buf[m++];
    h_x(i,1) = buf[m++];
    h_x(i,2) = buf[m++];
@@ -1412,7 +1413,7 @@ int AtomVecAngleKokkos::unpack_exchange(double *buf)
{
  int nlocal = atom->nlocal;
  if (nlocal == nmax) grow(0);
-  modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
           MASK_MASK | IMAGE_MASK | MOLECULE_MASK | BOND_MASK |
           ANGLE_MASK | SPECIAL_MASK);

@@ -1487,7 +1488,7 @@ int AtomVecAngleKokkos::size_restart()

int AtomVecAngleKokkos::pack_restart(int i, double *buf)
{
-  sync(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->sync(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
            MASK_MASK | IMAGE_MASK | MOLECULE_MASK | BOND_MASK |
            ANGLE_MASK | SPECIAL_MASK);

@@ -1541,7 +1542,7 @@ int AtomVecAngleKokkos::unpack_restart(double *buf)
    if (atom->nextra_store)
      memory->grow(atom->extra,nmax,atom->nextra_store,"atom:extra");
  }
-  modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
                MASK_MASK | IMAGE_MASK | MOLECULE_MASK | BOND_MASK |
                ANGLE_MASK | SPECIAL_MASK);

+13 −12
@@ -24,7 +24,7 @@

using namespace LAMMPS_NS;

-#define DELTA 10000
+#define DELTA 10

/* ---------------------------------------------------------------------- */

@@ -55,14 +55,15 @@ AtomVecAtomicKokkos::AtomVecAtomicKokkos(LAMMPS *lmp) : AtomVecKokkos(lmp)

void AtomVecAtomicKokkos::grow(int n)
{
-  if (n == 0) nmax += DELTA;
+  int step = MAX(DELTA,nmax*0.01);
+  if (n == 0) nmax += step;
  else nmax = n;
  atomKK->nmax = nmax;
  if (nmax < 0 || nmax > MAXSMALLINT)
    error->one(FLERR,"Per-processor system is too big");

-  sync(Device,ALL_MASK);
-  modified(Device,ALL_MASK);
+  atomKK->sync(Device,ALL_MASK);
+  atomKK->modified(Device,ALL_MASK);

  memoryKK->grow_kokkos(atomKK->k_tag,atomKK->tag,nmax,"atom:tag");
  memoryKK->grow_kokkos(atomKK->k_type,atomKK->type,nmax,"atom:type");
@@ -74,7 +75,7 @@ void AtomVecAtomicKokkos::grow(int n)
  memoryKK->grow_kokkos(atomKK->k_f,atomKK->f,nmax,3,"atom:f");

  grow_reset();
-  sync(Host,ALL_MASK);
+  atomKK->sync(Host,ALL_MASK);

  if (atom->nextra_grow)
    for (int iextra = 0; iextra < atom->nextra_grow; iextra++)
@@ -393,9 +394,9 @@ struct AtomVecAtomicKokkos_UnpackBorder {

void AtomVecAtomicKokkos::unpack_border_kokkos(const int &n, const int &first,
                     const DAT::tdual_xfloat_2d &buf,ExecutionSpace space) {
-  modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
+  atomKK->modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
  while (first+n >= nmax) grow(0);
-  modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
+  atomKK->modified(space,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
  if(space==Host) {
    struct AtomVecAtomicKokkos_UnpackBorder<LMPHostType> f(buf.view<LMPHostType>(),h_x,h_tag,h_type,h_mask,first);
    Kokkos::parallel_for(n,f);
@@ -415,7 +416,7 @@ void AtomVecAtomicKokkos::unpack_border(int n, int first, double *buf)
  last = first + n;
  for (i = first; i < last; i++) {
    if (i == nmax) grow(0);
-    modified(Host,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
+    atomKK->modified(Host,X_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
    h_x(i,0) = buf[m++];
    h_x(i,1) = buf[m++];
    h_x(i,2) = buf[m++];
@@ -440,7 +441,7 @@ void AtomVecAtomicKokkos::unpack_border_vel(int n, int first, double *buf)
  last = first + n;
  for (i = first; i < last; i++) {
    if (i == nmax) grow(0);
-    modified(Host,X_MASK|V_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
+    atomKK->modified(Host,X_MASK|V_MASK|TAG_MASK|TYPE_MASK|MASK_MASK);
    h_x(i,0) = buf[m++];
    h_x(i,1) = buf[m++];
    h_x(i,2) = buf[m++];
@@ -668,7 +669,7 @@ int AtomVecAtomicKokkos::unpack_exchange(double *buf)
{
  int nlocal = atom->nlocal;
  if (nlocal == nmax) grow(0);
-  modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
           MASK_MASK | IMAGE_MASK);

  int m = 1;
@@ -720,7 +721,7 @@ int AtomVecAtomicKokkos::size_restart()

int AtomVecAtomicKokkos::pack_restart(int i, double *buf)
{
-  sync(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->sync(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
            MASK_MASK | IMAGE_MASK );

  int m = 1;
@@ -755,7 +756,7 @@ int AtomVecAtomicKokkos::unpack_restart(double *buf)
    if (atom->nextra_store)
      memory->grow(atom->extra,nmax,atom->nextra_store,"atom:extra");
  }
-  modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
+  atomKK->modified(Host,X_MASK | V_MASK | TAG_MASK | TYPE_MASK |
                MASK_MASK | IMAGE_MASK );

  int m = 1;