Commit f7fb7c1a authored by David S. Miller's avatar David S. Miller
Browse files


Daniel Borkmann says:

====================
pull-request: bpf-next 2019-03-04

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add AF_XDP support to libbpf. Rationale is to facilitate writing
   AF_XDP applications by offering higher-level APIs that hide many
   of the details of the AF_XDP uapi. Sample programs are converted
   over to this new interface as well, from Magnus.

2) Introduce a new cant_sleep() macro for annotation of functions
   that cannot sleep and use it in BPF_PROG_RUN() to assert that
   BPF programs run under preemption disabled context, from Peter.

3) Introduce per BPF prog stats in order to monitor the usage
   of BPF; this is controlled by kernel.bpf_stats_enabled sysctl
   knob where monitoring tools can make use of this to efficiently
   determine the average cost of programs, from Alexei.

4) Split up BPF selftest's test_progs similarly as we already
   did with test_verifier. This allows to further reduce merge
   conflicts in future and to get more structure into our
   quickly growing BPF selftest suite, from Stanislav.

5) Fix a bug in BTF's dedup algorithm which can cause an infinite
   loop in some circumstances; also various BPF doc fixes and
   improvements, from Andrii.

6) Various BPF sample cleanups and migration to libbpf in order
   to further isolate the old sample loader code (so we can get
   rid of it at some point), from Jakub.

7) Add a new BPF helper for BPF cgroup skb progs that allows
   to set ECN CE code point and a Host Bandwidth Manager (HBM)
   sample program for limiting the bandwidth used by v2 cgroups,
   from Lawrence.

8) Enable write access to skb->queue_mapping from tc BPF egress
   programs in order to let BPF pick TX queue, from Jesper.

9) Fix a bug in BPF spinlock handling for map-in-map which did
   not propagate spin_lock_off to the meta map, from Yonghong.

10) Fix a bug in the new per-CPU BPF prog counters to properly
    initialize stats for each CPU, from Eric.

11) Add various BPF helper prototypes to selftest's bpf_helpers.h,
    from Willem.

12) Fix various BPF samples bugs in XDP and tracing progs,
    from Toke, Daniel and Yonghong.

13) Silence preemption splat in test_bpf after BPF_PROG_RUN()
    enforces it now everywhere, from Anders.

14) Fix a signedness bug in libbpf's btf_dedup_ref_type() to
    get error handling working, from Dan.

15) Fix bpftool documentation and auto-completion with regards
    to stream_{verdict,parser} attach types, from Alban.
====================

Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents 8c4238df 87dab7c3
Loading
Loading
Loading
Loading
+12 −12
Original line number Diff line number Diff line
@@ -36,27 +36,27 @@ consideration important quirks of other architectures) and
defines calling convention that is compatible with C calling
convention of the linux kernel on those architectures.

Q: can multiple return values be supported in the future?
Q: Can multiple return values be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF allows only register R0 to be used as return value.

Q: can more than 5 function arguments be supported in the future?
Q: Can more than 5 function arguments be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set.
(unlike x64 ISA that allows msft, cdecl and other conventions)

Q: can BPF programs access instruction pointer or return address?
Q: Can BPF programs access instruction pointer or return address?
-----------------------------------------------------------------
A: NO.

Q: can BPF programs access stack pointer ?
Q: Can BPF programs access stack pointer ?
------------------------------------------
A: NO.

Only frame pointer (register R10) is accessible.
From compiler point of view it's necessary to have stack pointer.
For example LLVM defines register R11 as stack pointer in its
For example, LLVM defines register R11 as stack pointer in its
BPF backend, but it makes sure that generated code never uses it.

Q: Does C-calling convention diminishes possible use cases?
@@ -66,8 +66,8 @@ A: YES.
BPF design forces addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps with
seamless interoperability between them. It lets kernel call into
BPF programs and programs call kernel helpers with zero overhead.
As all of them were native C code. That is particularly the case
BPF programs and programs call kernel helpers with zero overhead,
as all of them were native C code. That is particularly the case
for JITed BPF programs that are indistinguishable from
native kernel C code.

@@ -75,9 +75,9 @@ Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
------------------------------------------------------------------------
A: Soft yes.

At least for now until BPF core has support for
At least for now, until BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read only sections and all other normal constructs
jump tables, read-only sections, and all other normal constructs
that C code can produce.

Q: Can loops be supported in a safe way?
@@ -109,16 +109,16 @@ For example why BPF_JNE and other compare and jumps are not cpu-like?
A: This was necessary to avoid introducing flags into ISA which are
impossible to make generic and efficient across CPU architectures.

Q: why BPF_DIV instruction doesn't map to x64 div?
Q: Why BPF_DIV instruction doesn't map to x64 div?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because if we picked one-to-one relationship to x64 it would have made
it more complicated to support on arm64 and other archs. Also it
needs div-by-zero runtime check.

Q: why there is no BPF_SDIV for signed divide operation?
Q: Why there is no BPF_SDIV for signed divide operation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because it would be rarely used. llvm errors in such case and
prints a suggestion to use unsigned divide instead
prints a suggestion to use unsigned divide instead.

Q: Why BPF has implicit prologue and epilogue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+147 −169

File changed.

Preview size limit exceeded, changes collapsed.

+35 −1
Original line number Diff line number Diff line
@@ -295,6 +295,41 @@ using::
For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

FAQ
=======

Q: I am not seeing any traffic on the socket. What am I doing wrong?

A: When a netdev of a physical NIC is initialized, Linux usually
   allocates one Rx and Tx queue pair per core. So on a 8 core system,
   queue ids 0 to 7 will be allocated, one per core. In the AF_XDP
   bind call or the xsk_socket__create libbpf function call, you
   specify a specific queue id to bind to and it is only the traffic
   towards that queue you are going to get on you socket. So in the
   example above, if you bind to queue 0, you are NOT going to get any
   traffic that is distributed to queues 1 through 7. If you are
   lucky, you will see the traffic, but usually it will end up on one
   of the queues you have not bound to.

   There are a number of ways to solve the problem of getting the
   traffic you want to the queue id you bound to. If you want to see
   all the traffic, you can force the netdev to only have 1 queue, queue
   id 0, and then bind to queue 0. You can use ethtool to do this::

   sudo ethtool -L <interface> combined 1

   If you want to only see part of the traffic, you can program the
   NIC through ethtool to filter out your traffic to a single queue id
   that you can bind your XDP socket to. Here is one example in which
   UDP traffic to and from port 4242 are sent to queue 2::

   sudo ethtool -N <interface> rx-flow-hash udp4 fn
   sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
   4242 action 2

   A number of other ways are possible all up to the capabilitites of
   the NIC you have.

Credits
=======

@@ -309,4 +344,3 @@ Credits
- Michael S. Tsirkin
- Qi Z Zhang
- Willem de Bruijn
+1 −1
Original line number Diff line number Diff line
@@ -829,7 +829,7 @@ tracing filters may do to maintain counters of events, for example. Register R9
is not used by socket filters either, but more complex filters may be running
out of registers and would have to resort to spill/fill to stack.

Internal BPF can used as generic assembler for last step performance
Internal BPF can be used as a generic assembler for last step performance
optimizations, socket filters and seccomp are using it as assembler. Tracing
filters may use it as assembler to generate code from kernel. In kernel usage
may not be bounded by security considerations, since generated internal BPF code
+9 −0
Original line number Diff line number Diff line
@@ -16,6 +16,7 @@
#include <linux/rbtree_latch.h>
#include <linux/numa.h>
#include <linux/wait.h>
#include <linux/u64_stats_sync.h>

struct bpf_verifier_env;
struct perf_event;
@@ -340,6 +341,12 @@ enum bpf_cgroup_storage_type {

#define MAX_BPF_CGROUP_STORAGE_TYPE __BPF_CGROUP_STORAGE_MAX

struct bpf_prog_stats {
	u64 cnt;
	u64 nsecs;
	struct u64_stats_sync syncp;
};

struct bpf_prog_aux {
	atomic_t refcnt;
	u32 used_map_cnt;
@@ -389,6 +396,7 @@ struct bpf_prog_aux {
	 * main prog always has linfo_idx == 0
	 */
	u32 linfo_idx;
	struct bpf_prog_stats __percpu *stats;
	union {
		struct work_struct work;
		struct rcu_head	rcu;
@@ -559,6 +567,7 @@ void bpf_map_area_free(void *base);
void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);

extern int sysctl_unprivileged_bpf_disabled;
extern int sysctl_bpf_stats_enabled;

int bpf_map_new_fd(struct bpf_map *map, int flags);
int bpf_prog_new_fd(struct bpf_prog *prog);
Loading