Commit 63bef48f authored by Linus Torvalds's avatar Linus Torvalds
Browse files

Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:

 - a lot more of MM, quite a bit more yet to come: (memcg, pagemap,
   vmalloc, pagealloc, migration, thp, ksm, madvise, virtio,
   userfaultfd, memory-hotplug, shmem, rmap, zswap, zsmalloc, cleanups)

 - various other subsystems (procfs, misc, MAINTAINERS, bitops, lib,
   checkpatch, epoll, binfmt, kallsyms, reiserfs, kmod, gcov, kconfig,
   ubsan, fault-injection, ipc)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (158 commits)
  ipc/shm.c: make compat_ksys_shmctl() static
  ipc/mqueue.c: fix a brace coding style issue
  lib/Kconfig.debug: fix a typo "capabilitiy" -> "capability"
  ubsan: include bug type in report header
  kasan: unset panic_on_warn before calling panic()
  ubsan: check panic_on_warn
  drivers/misc/lkdtm/bugs.c: add arithmetic overflow and array bounds checks
  ubsan: split "bounds" checker from other options
  ubsan: add trap instrumentation option
  init/Kconfig: clean up ANON_INODES and old IO schedulers options
  kernel/gcov/fs.c: replace zero-length array with flexible-array member
  gcov: gcc_3_4: replace zero-length array with flexible-array member
  gcov: gcc_4_7: replace zero-length array with flexible-array member
  kernel/kmod.c: fix a typo "assuems" -> "assumes"
  reiserfs: clean up several indentation issues
  kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()
  samples/hw_breakpoint: drop use of kallsyms_lookup_name()
  samples/hw_breakpoint: drop HW_BREAKPOINT_R when reporting writes
  fs/binfmt_elf.c: don't free interpreter's ELF pheaders on common path
  fs/binfmt_elf.c: allocate less for static executable
  ...
parents 04de788e 1cd377ba
Loading
Loading
Loading
Loading
+11 −2
Original line number Diff line number Diff line
@@ -2573,13 +2573,22 @@
			For details see: Documentation/admin-guide/hw-vuln/mds.rst

	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
			Amount of memory to be used when the kernel is not able
			to see the whole system memory or for test.
			Amount of memory to be used in cases as follows:

			1 for test;
			2 when the kernel is not able to see the whole system memory;
			3 memory that lies after 'mem=' boundary is excluded from
			 the hypervisor, then assigned to KVM guests.

			[X86] Work as limiting max address. Use together
			with memmap= to avoid physical address space collisions.
			Without memmap= PCI devices could be placed at addresses
			belonging to unused RAM.

			Note that this only takes effects during boot time since
			in above case 3, memory may need be hot added after boot
			if system memory of hypervisor is not sufficient.

	mem=nopentium	[BUGS=X86-32] Disable usage of 4MB pages for kernel
			memory.

+14 −0
Original line number Diff line number Diff line
@@ -310,6 +310,11 @@ thp_fault_fallback
	is incremented if a page fault fails to allocate
	a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
	is incremented if a page fault fails to charge a huge page and
	instead falls back to using small pages even though the
	allocation was successful.

thp_collapse_alloc_failed
	is incremented if khugepaged found a range
	of pages that should be collapsed into one huge page but failed
@@ -319,6 +324,15 @@ thp_file_alloc
	is incremented every time a file huge page is successfully
	allocated.

thp_file_fallback
	is incremented if a file huge page is attempted to be allocated
	but fails and instead falls back to using small pages.

thp_file_fallback_charge
	is incremented if a file huge page cannot be charged and instead
	falls back to using small pages even though the allocation was
	successful.

thp_file_mapped
	is incremented every time a file huge page is mapped into
	user address space.
+51 −0
Original line number Diff line number Diff line
@@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an
half copied page since it'll keep userfaulting until the copy has
finished.

Notes:

- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
  you must provide some kind of page in your thread after reading from
  the uffd.  You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
  The normal behavior of the OS automatically providing a zero page on
  an annonymous mmaping is not in place.

- None of the page-delivering ioctls default to the range that you
  registered with.  You must fill in all fields for the appropriate
  ioctl struct including the range.

- You get the address of the access that triggered the missing page
  event out of a struct uffd_msg that you read in the thread from the
  uffd.  You can supply as many pages as you want with UFFDIO_COPY or
  UFFDIO_ZEROPAGE.  Keep in mind that unless you used DONTWAKE then
  the first of any of those IOCTLs wakes up the faulting thread.

- Be sure to test for all errors including (pollfd[0].revents &
  POLLERR).  This can happen, e.g. when ranges supplied were
  incorrect.

Write Protect Notifications
---------------------------

This is equivalent to (but faster than) using mprotect and a SIGSEGV
signal handler.

Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP.
Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT,
struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP
in the struct passed in.  The range does not default to and does not
have to be identical to the range you registered with.  You can write
protect as many ranges as you like (inside the registered range).
Then, in the thread reading from uffd the struct will have
msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send
ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again
while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set.
This wakes up the thread which will continue to run with writes. This
allows you to do the bookkeeping about the write in the uffd reading
thread before the ioctl.

If you registered with both UFFDIO_REGISTER_MODE_MISSING and
UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
which you supply a page and undo write protect.  Note that there is a
difference between writes into a WP area and into a !WP area.  The
former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
UFFD_PAGEFAULT_FLAG_WRITE.  The latter did not fail on protection but
you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
used.

QEMU/KVM
========

+40 −0
Original line number Diff line number Diff line
.. _free_page_reporting:

=====================
Free Page Reporting
=====================

Free page reporting is an API by which a device can register to receive
lists of pages that are currently unused by the system. This is useful in
the case of virtualization where a guest is then able to use this data to
notify the hypervisor that it is no longer using certain pages in memory.

For the driver, typically a balloon driver, to use of this functionality
it will allocate and initialize a page_reporting_dev_info structure. The
field within the structure it will populate is the "report" function
pointer used to process the scatterlist. It must also guarantee that it can
handle at least PAGE_REPORTING_CAPACITY worth of scatterlist entries per
call to the function. A call to page_reporting_register will register the
page reporting interface with the reporting framework assuming no other
page reporting devices are already registered.

Once registered the page reporting API will begin reporting batches of
pages to the driver. The API will start reporting pages 2 seconds after
the interface is registered and will continue to do so 2 seconds after any
page of a sufficiently high order is freed.

Pages reported will be stored in the scatterlist passed to the reporting
function with the final entry having the end bit set in entry nent - 1.
While pages are being processed by the report function they will not be
accessible to the allocator. Once the report function has been completed
the pages will be returned to the free area from which they were obtained.

Prior to removing a driver that is making use of free page reporting it
is necessary to call page_reporting_unregister to have the
page_reporting_dev_info structure that is currently in use by free page
reporting removed. Doing this will prevent further reports from being
issued via the interface. If another driver or the same driver is
registered it is possible for it to resume where the previous driver had
left off in terms of reporting free pages.

Alexander Duyck, Dec 04, 2019
+12 −8
Original line number Diff line number Diff line
@@ -35,9 +35,11 @@ Zswap evicts pages from compressed cache on an LRU basis to the backing swap
device when the compressed pool reaches its size limit.  This requirement had
been identified in prior community discussions.

Zswap is disabled by default but can be enabled at boot time by setting
the ``enabled`` attribute to 1 at boot time. ie: ``zswap.enabled=1``.  Zswap
can also be enabled and disabled at runtime using the sysfs interface.
Whether Zswap is enabled at the boot time depends on whether
the ``CONFIG_ZSWAP_DEFAULT_ON`` Kconfig option is enabled or not.
This setting can then be overridden by providing the kernel command line
``zswap.enabled=`` option, for example ``zswap.enabled=0``.
Zswap can also be enabled and disabled at runtime using the sysfs interface.
An example command to enable zswap at runtime, assuming sysfs is mounted
at ``/sys``, is::

@@ -64,9 +66,10 @@ allocation in zpool is not directly accessible by address. Rather, a handle is
returned by the allocation routine and that handle must be mapped before being
accessed.  The compressed memory pool grows on demand and shrinks as compressed
pages are freed.  The pool is not preallocated.  By default, a zpool
of type zbud is created, but it can be selected at boot time by
setting the ``zpool`` attribute, e.g. ``zswap.zpool=zbud``. It can
also be changed at runtime using the sysfs ``zpool`` attribute, e.g.::
of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created,
but it can be overridden at boot time by setting the ``zpool`` attribute,
e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs
``zpool`` attribute, e.g.::

	echo zbud > /sys/module/zswap/parameters/zpool

@@ -97,8 +100,9 @@ controlled policy:
* max_pool_percent - The maximum percentage of memory that the compressed
  pool can occupy.

The default compressor is lzo, but it can be selected at boot time by
setting the ``compressor`` attribute, e.g. ``zswap.compressor=lzo``.
The default compressor is selected in ``CONFIG_ZSWAP_COMPRESSOR_DEFAULT``
Kconfig option, but it can be overridden at boot time by setting the
``compressor`` attribute, e.g. ``zswap.compressor=lzo``.
It can also be changed at runtime using the sysfs "compressor"
attribute, e.g.::

Loading