Commit 150f29f5 authored by David S. Miller


Daniel Borkmann says:

====================
pull-request: bpf-next 2020-09-01

The following pull-request contains BPF updates for your *net-next* tree.

There are two small conflicts when pulling, resolve as follows:

1) Merge conflict in tools/lib/bpf/libbpf.c between 88a82120 ("libbpf: Factor
   out common ELF operations and improve logging") in bpf-next and 1e891e51
   ("libbpf: Fix map index used in error message") in net-next. Resolve by taking
   the hunk in bpf-next:

        [...]
        scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
        data = elf_sec_data(obj, scn);
        if (!scn || !data) {
                pr_warn("elf: failed to get %s map definitions for %s\n",
                        MAPS_ELF_SEC, obj->path);
                return -EINVAL;
        }
        [...]

2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between
   9647c57b ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for
   better performance") in bpf-next and e20f0dbf ("net/mlx5e: RX, Add a prefetch
   command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining
   net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:

        [...]
        xdp_set_data_meta_invalid(xdp);
        xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
        net_prefetch(xdp->data);
        [...]

We've added 133 non-merge commits during the last 14 day(s) which contain
a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).

The main changes are:

1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper
   for tracing to reliably access user memory, from Alexei Starovoitov.

2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.

3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.

4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.

5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.

6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.

7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.

8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.

9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.

10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.

11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.

12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.

13) Various smaller cleanups and improvements all over the place.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
parents 8aa639e1 ebc4ecd4
+12 −7
@@ -149,7 +149,7 @@ In case the patch or patch series has to be reworked and sent out
again in a second or later revision, it is also required to add a
version number (``v2``, ``v3``, ...) into the subject prefix::

  git format-patch --subject-prefix='PATCH net-next v2' start..finish
  git format-patch --subject-prefix='PATCH bpf-next v2' start..finish

When changes have been requested to the patch series, always send the
whole patch series again with the feedback incorporated (never send
@@ -479,12 +479,13 @@ LLVM's static compiler lists the supported targets through

     $ llc --version
     LLVM (http://llvm.org/):
       LLVM version 6.0.0svn
       LLVM version 10.0.0
       Optimized build.
       Default target: x86_64-unknown-linux-gnu
       Host CPU: skylake

       Registered Targets:
         aarch64    - AArch64 (little endian)
         bpf        - BPF (host endian)
         bpfeb      - BPF (big endian)
         bpfel      - BPF (little endian)
@@ -517,6 +518,10 @@ from the git repositories::
The built binaries can then be found in the build/bin/ directory, where
you can point the PATH variable to.

Set ``-DLLVM_TARGETS_TO_BUILD`` to the target you wish to build. A full
list of targets can be found in the llvm-project/llvm/lib/Target
directory.

Q: Reporting LLVM BPF issues
----------------------------
Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code
+25 −0
@@ -724,6 +724,31 @@ want to define unused entry in BTF_ID_LIST, like::
      BTF_ID_UNUSED
      BTF_ID(struct, task_struct)

The ``BTF_SET_START/END`` macro pair defines a sorted list of BTF ID values
and their count, with the following syntax::

  BTF_SET_START(set)
  BTF_ID(type1, name1)
  BTF_ID(type2, name2)
  BTF_SET_END(set)

resulting in the following layout in the .BTF_ids section::

  __BTF_ID__set__set:
  .zero 4
  __BTF_ID__type1__name1__3:
  .zero 4
  __BTF_ID__type2__name2__4:
  .zero 4

The ``struct btf_id_set set;`` variable is defined to access the list.

The ``typeX`` name can be one of the following::

   struct, union, typedef, func

and is used as a filter when resolving the BTF ID value.

All the BTF ID lists and sets are compiled in the .BTF_ids section and
resolved during the linking phase of the kernel build by the ``resolve_btfids`` tool.
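
As a rough illustration of why the set is kept sorted, the following
userspace sketch models a count plus a sorted array of IDs and looks an ID
up with a binary search. The names ``id_set_sketch`` and ``id_set_contains``
and all ID values are hypothetical stand-ins chosen for this example; in the
kernel the IDs are filled in by ``resolve_btfids``, not written by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace model of a BTF ID set: a count plus sorted IDs. */
struct id_set_sketch {
	uint32_t cnt;
	const uint32_t *ids;	/* must be sorted in ascending order */
};

/* Three-way comparison of two uint32_t values, as bsearch() expects. */
static int cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/* Binary search is only valid because the IDs are kept sorted. */
static bool id_set_contains(const struct id_set_sketch *set, uint32_t id)
{
	assert(set != NULL);
	return bsearch(&id, set->ids, set->cnt, sizeof(id), cmp_u32) != NULL;
}
```

A caller would build the set once and then test membership with
``id_set_contains(&set, id)``; the lookup cost grows logarithmically with
the number of IDs.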

+1 −0
@@ -52,6 +52,7 @@ Program types
   prog_cgroup_sysctl
   prog_flow_dissector
   bpf_lsm
   prog_sk_lookup


Map types
+98 −0
.. SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)

=====================
BPF sk_lookup program
=====================

The BPF sk_lookup program type (``BPF_PROG_TYPE_SK_LOOKUP``) introduces programmability
into the socket lookup performed by the transport layer when a packet is to be
delivered locally.

When invoked, a BPF sk_lookup program can select a socket that will receive the
incoming packet by calling the ``bpf_sk_assign()`` BPF helper function.

Hooks for a common attach point (``BPF_SK_LOOKUP``) exist for both TCP and UDP.

Motivation
==========

The BPF sk_lookup program type was introduced to address setup scenarios where
binding sockets to an address with the ``bind()`` socket call is impractical, such
as:

1. receiving connections on a range of IP addresses, e.g. 192.0.2.0/24, when
   binding to a wildcard address ``INADDR_ANY`` is not possible due to a port
   conflict,
2. receiving connections on all or a wide range of ports, i.e. an L7 proxy use
   case.

Such setups would require creating and ``bind()``'ing one socket for each
IP address/port in the range, leading to resource consumption and potential
latency spikes during socket lookup.

Attachment
==========

A BPF sk_lookup program can be attached to a network namespace with the
``bpf(BPF_LINK_CREATE, ...)`` syscall using the ``BPF_SK_LOOKUP`` attach type and a
netns FD as the attachment ``target_fd``.

Multiple programs can be attached to one network namespace. Programs will be
invoked in the same order as they were attached.

Hooks
=====

The attached BPF sk_lookup programs run whenever the transport layer needs to
find a listening (TCP) or an unconnected (UDP) socket for an incoming packet.

Incoming traffic to established (TCP) and connected (UDP) sockets is delivered
as usual without triggering the BPF sk_lookup hook.

The attached BPF programs must return with either ``SK_PASS`` or ``SK_DROP``
verdict code. As for other BPF program types that are network filters,
``SK_PASS`` signifies that the socket lookup should continue on to regular
hashtable-based lookup, while ``SK_DROP`` causes the transport layer to drop the
packet.

A BPF sk_lookup program can also select a socket to receive the packet by
calling ``bpf_sk_assign()`` BPF helper. Typically, the program looks up a socket
in a map holding sockets, such as ``SOCKMAP`` or ``SOCKHASH``, and passes a
``struct bpf_sock *`` to ``bpf_sk_assign()`` helper to record the
selection. Selecting a socket only takes effect if the program has terminated
with ``SK_PASS`` code.

When multiple programs are attached, the end result is determined from return
codes of all the programs according to the following rules:

1. If any program returned ``SK_PASS`` and selected a valid socket, the socket
   is used as the result of the socket lookup.
2. If more than one program returned ``SK_PASS`` and selected a socket, the last
   selection takes effect.
3. If any program returned ``SK_DROP``, and no program returned ``SK_PASS`` and
   selected a socket, socket lookup fails.
4. If all programs returned ``SK_PASS`` and none of them selected a socket,
   socket lookup continues on.
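
The four rules above can be sketched as a small userspace model. This is
not the kernel implementation: the names ``verdict_sketch``,
``prog_result``, ``lookup_outcome`` and ``combine_results`` are
hypothetical, sockets are represented by plain integers, and ``-1`` stands
for "no socket selected". Results are assumed to be listed in the order in
which the programs ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SK_DROP / SK_PASS verdict codes. */
enum verdict_sketch { V_DROP = 0, V_PASS = 1 };

struct prog_result {
	enum verdict_sketch verdict;
	int selected_sk;	/* -1 when the program selected no socket */
};

enum lookup_outcome { LOOKUP_USE_SELECTED, LOOKUP_FAIL, LOOKUP_CONTINUE };

static enum lookup_outcome combine_results(const struct prog_result *res,
					   size_t n, int *sk_out)
{
	bool any_drop = false;
	int selected = -1;
	size_t i;

	assert(res != NULL || n == 0);
	assert(sk_out != NULL);

	for (i = 0; i < n; i++) {
		if (res[i].verdict == V_DROP) {
			/* A selection only counts together with a pass verdict. */
			any_drop = true;
			continue;
		}
		if (res[i].selected_sk >= 0)
			selected = res[i].selected_sk;	/* rule 2: last one wins */
	}

	if (selected >= 0) {		/* rule 1 */
		*sk_out = selected;
		return LOOKUP_USE_SELECTED;
	}
	if (any_drop)			/* rule 3 */
		return LOOKUP_FAIL;
	return LOOKUP_CONTINUE;		/* rule 4 */
}
```

Note that in this model a single pass verdict with a selected socket
overrides any number of drop verdicts, which is how rule 3 is worded.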

API
===

In its context, an instance of ``struct bpf_sk_lookup``, a BPF sk_lookup program
receives information about the packet that triggered the socket lookup. Namely:

* IP version (``AF_INET`` or ``AF_INET6``),
* L4 protocol identifier (``IPPROTO_TCP`` or ``IPPROTO_UDP``),
* source and destination IP address,
* source and destination L4 port,
* the socket that has been selected with ``bpf_sk_assign()``.

Refer to ``struct bpf_sk_lookup`` declaration in ``linux/bpf.h`` user API
header, and `bpf-helpers(7)
<https://man7.org/linux/man-pages/man7/bpf-helpers.7.html>`_ man-page section
for ``bpf_sk_assign()`` for details.

Example
=======

See ``tools/testing/selftests/bpf/prog_tests/sk_lookup.c`` for the reference
implementation.
+58 −10
@@ -258,14 +258,21 @@ socket into zero-copy mode or fail.
XDP_SHARED_UMEM bind flag
-------------------------

This flag enables you to bind multiple sockets to the same UMEM, but
only if they share the same queue id. In this mode, each socket has
their own RX and TX rings, but the UMEM (tied to the first socket
created) only has a single FILL ring and a single COMPLETION
ring. To use this mode, create the first socket and bind it in the normal
way. Create a second socket and create an RX and a TX ring, or at
least one of them, but no FILL or COMPLETION rings as the ones from
the first socket will be used. In the bind call, set the
This flag enables you to bind multiple sockets to the same UMEM. It
works on the same queue id, between queue ids and between
netdevs/devices. In this mode, each socket has its own RX and TX
rings as usual, but you are going to have one or more FILL and
COMPLETION ring pairs. You have to create one of these pairs per
unique netdev and queue id tuple that you bind to.

Let us start with the case where we would like to share a UMEM between
sockets bound to the same netdev and queue id. The UMEM (tied to the
first socket created) will only have a single FILL ring and a single
COMPLETION ring as there is only one unique netdev,queue_id tuple that
we have bound to. To use this mode, create the first socket and bind
it in the normal way. Create a second socket and create an RX and a TX
ring, or at least one of them, but no FILL or COMPLETION rings as the
ones from the first socket will be used. In the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
sockets this way.
@@ -305,11 +312,41 @@ concurrently. There are no synchronization primitives in the
libbpf code that protects multiple users at this point in time.

Libbpf uses this mode if you create more than one socket tied to the
same umem. However, note that you need to supply the
same UMEM. However, note that you need to supply the
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
xsk_socket__create calls and load your own XDP program as there is no
built in one in libbpf that will route the traffic for you.

The second case is when you share a UMEM between sockets that are
bound to different queue ids and/or netdevs. In this case you have to
create one FILL ring and one COMPLETION ring for each unique
netdev,queue_id pair. Let us say you want to create two sockets bound
to two different queue ids on the same netdev. Create the first socket
and bind it in the normal way. Create a second socket and create an RX
and a TX ring, or at least one of them, and then one FILL and
COMPLETION ring for this socket. Then in the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field as you registered the UMEM on that
socket. These two sockets will now share one and the same UMEM.

There is no need to supply an XDP program like the one in the previous
case where sockets were bound to the same queue id and
device. Instead, use the NIC's packet steering capabilities to steer
the packets to the right queue. In the previous example, there is only
one queue shared among sockets, so the NIC cannot do this steering. It
can only steer between queues.

In libbpf, you need to use the xsk_socket__create_shared() API as it
takes a reference to a FILL ring and a COMPLETION ring that will be
created for you and bound to the shared UMEM. You can use this
function for all the sockets you create, or you can use it for the
second and following ones and use xsk_socket__create() for the first
one. Both methods yield the same result.

Note that a UMEM can be shared between sockets on the same queue id
and device, as well as between queues on the same device and between
devices at the same time.
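
The rule of one FILL and one COMPLETION ring pair per unique
netdev,queue_id tuple can be illustrated with a small counting sketch. The
struct ``xsk_binding`` and the function ``ring_pairs_needed`` are
hypothetical names invented for this example and are not part of libbpf or
the kernel API; the interface index is used here as an assumed way of
identifying a netdev.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of where one socket is bound. */
struct xsk_binding {
	int ifindex;		/* assumed identifier of the netdev */
	unsigned int queue_id;
};

/*
 * Count the unique (ifindex, queue_id) tuples among the planned bindings.
 * Each unique tuple needs its own FILL/COMPLETION ring pair.
 */
static size_t ring_pairs_needed(const struct xsk_binding *b, size_t n)
{
	size_t pairs = 0, i, j;

	assert(b != NULL || n == 0);

	for (i = 0; i < n; i++) {
		for (j = 0; j < i; j++)
			if (b[j].ifindex == b[i].ifindex &&
			    b[j].queue_id == b[i].queue_id)
				break;
		if (j == i)	/* no earlier binding used this tuple */
			pairs++;
	}
	return pairs;
}
```

For example, two sockets on queue 0 of one device, a third on queue 1 of
the same device and a fourth on queue 0 of another device give three
unique tuples, so three ring pairs would be created in this model.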

XDP_USE_NEED_WAKEUP bind flag
-----------------------------

@@ -364,7 +401,7 @@ resources by only setting up one of them. Both the FILL ring and the
COMPLETION ring are mandatory as you need to have a UMEM tied to your
socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
first one does not have a UMEM and should in that case not have any
FILL or COMPLETION rings created as the ones from the shared umem will
FILL or COMPLETION rings created as the ones from the shared UMEM will
be used. Note, that the rings are single-producer single-consumer, so
do not try to access them from multiple processes at the same
time. See the XDP_SHARED_UMEM section.
@@ -567,6 +604,17 @@ A: The short answer is no, that is not supported at the moment. The
   switch, or other distribution mechanism, in your NIC to direct
   traffic to the correct queue id and socket.

Q: My packets are sometimes corrupted. What is wrong?

A: Care has to be taken not to feed the same buffer in the UMEM into
   more than one ring at the same time. If you for example feed the
   same buffer into the FILL ring and the TX ring at the same time, the
   NIC might receive data into the buffer at the same time it is
   sending it. This will cause some packets to become corrupted. Same
   thing goes for feeding the same buffer into the FILL rings
   belonging to different queue ids or netdevs bound with the
   XDP_SHARED_UMEM flag.
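
One possible way to guard against this in an application is to track which
ring, if any, currently holds each buffer. The sketch below is only an
illustration under stated assumptions: ``frame_owner``, ``frame_give``,
``frame_take_back`` and the fixed frame count are hypothetical and do not
correspond to any libbpf or kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Who currently holds a UMEM frame in this model. */
enum frame_owner { OWNER_APP = 0, OWNER_FILL, OWNER_TX };

#define NUM_FRAMES_SKETCH 8	/* arbitrary size for the example */

/* Zero-initialized, so every frame starts out owned by the application. */
static enum frame_owner frame_state[NUM_FRAMES_SKETCH];

/* Hand a frame to a ring; refuse if it is already sitting in some ring. */
static bool frame_give(size_t idx, enum frame_owner ring)
{
	assert(ring != OWNER_APP);
	if (idx >= NUM_FRAMES_SKETCH || frame_state[idx] != OWNER_APP)
		return false;
	frame_state[idx] = ring;
	return true;
}

/* Call once the frame has come back via the RX or COMPLETION ring. */
static void frame_take_back(size_t idx)
{
	if (idx < NUM_FRAMES_SKETCH)
		frame_state[idx] = OWNER_APP;
}
```

With such bookkeeping, an attempt to place a frame on the TX ring while it
is still on a FILL ring is rejected by ``frame_give()`` instead of silently
corrupting a packet.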

Credits
=======
