Commit 2aec17f1 authored by David S. Miller's avatar David S. Miller
Browse files

Merge branch 'fix-indirect-flow_block-infrastructure'



Pablo Neira Ayuso says:

====================
the indirect flow_block infrastructure, revisited

This series fixes b5140a36 ("netfilter: flowtable: add indr block
setup support") that adds support for the indirect block for the
flowtable. This patch crashes the kernel with the TC CT action.

[  630.908086] BUG: kernel NULL pointer dereference, address: 00000000000000f0
[  630.908233] #PF: error_code(0x0000) - not-present page
[  630.908304] PGD 800000104addd067 P4D 800000104addd067 PUD 104311d067 PMD 0
[  630.908380] Oops: 0000 [#1] SMP PTI [  630.908615] RIP: 0010:nf_flow_table_indr_block_cb+0xc0/0x190 [nf_flow_table]
[  630.908690] Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 4c 89 75 a0 4c 89 65 a8 4d 89 ee 49 89 dd 4c 89 fe 48 c7 c7 b7 64 36 a0 31 c0 e8 ce ed d8 e0 <49> 8b b7 f0 00 00 00 48 c7 c7 c8 64      36 a0 31 c0 e8 b9 ed d8 e0 49[  630.908790] RSP: 0018:ffffc9000895f8c0 EFLAGS: 00010246
[...]
[  630.910774] Call Trace:
[  630.911192]  ? mlx5e_rep_indr_setup_block+0x270/0x270 [mlx5_core]
[  630.911621]  ? mlx5e_rep_indr_setup_block+0x270/0x270 [mlx5_core]
[  630.912040]  ? mlx5e_rep_indr_setup_block+0x270/0x270 [mlx5_core]
[  630.912443]  flow_block_cmd+0x51/0x80
[  630.912844]  __flow_indr_block_cb_register+0x26c/0x510
[  630.913265]  mlx5e_nic_rep_netdevice_event+0x9e/0x110 [mlx5_core]
[  630.913665]  notifier_call_chain+0x53/0xa0
[  630.914063]  raw_notifier_call_chain+0x16/0x20
[  630.914466]  call_netdevice_notifiers_info+0x39/0x90
[  630.914859]  register_netdevice+0x484/0x550
[  630.915256]  __ip_tunnel_create+0x12b/0x1f0 [ip_tunnel]
[  630.915661]  ip_tunnel_init_net+0x116/0x180 [ip_tunnel]
[  630.916062]  ipgre_tap_init_net+0x22/0x30 [ip_gre]
[  630.916458]  ops_init+0x44/0x110
[  630.916851]  register_pernet_operations+0x112/0x200

A workaround patch to cure this crash has been proposed. However, there
is another problem: The indirect flow_block still does not work for the
new TC CT action. The problem is that the existing flow_indr_block_entry
callback assumes you can look up for the flowtable from the netdevice to
get the flow_block. This flow_block allows you to offload the flows via
TC_SETUP_CLSFLOWER. Unfortunately, it is not possible to get the
flow_block from the TC CT flowtables because they are _not_ bound to any
specific netdevice.

= What is the indirect flow_block infrastructure?

The indirect flow_block infrastructure allows drivers to offload
tc/netfilter rules that belong to software tunnel netdevices, e.g.
vxlan.

This indirect flow_block infrastructure relates tunnel netdevices with
drivers because there is no obvious way to relate these two things
from the control plane.

= How does the indirect flow_block work before this patchset?

Front-ends register the indirect block callback through
flow_indr_add_block_cb() if they support for offloading tunnel
netdevices.

== Setting up an indirect block

1) Drivers track tunnel netdevices via NETDEV_{REGISTER,UNREGISTER} events.
   If there is a new tunnel netdevice that the driver can offload, then the
   driver invokes __flow_indr_block_cb_register() with the new tunnel
   netdevice and the driver callback. The __flow_indr_block_cb_register()
   call iterates over the list of the front-end callbacks.

2) The front-end callback sets up the flow_block_offload structure and it
   invokes the driver callback to set up the flow_block.

3) The driver callback now registers the flow_block structure and it
   returns the flow_block back to the front-end.

4) The front-end gets the flow_block object and it is now ready to
   offload rules for this tunnel netdevice.

A simplified callgraph is represented below.

        Front-end                      Driver

                                   NETDEV_REGISTER
                                         |
                     __flow_indr_block_cb_register(netdev, cb_priv, driver_cb)
                                         | [1]
            .--------------frontend_indr_block_cb(cb_priv, driver_cb)
            |
            .
   setup_flow_block_offload(bo)
            | [2]
       driver_cb(bo, cb_priv) -----------.
                                         |
                                         \/
                                  set up flow_blocks [3]
                                         |
      add rules to flow_block <----------
      TC_SETUP_CLSFLOWER [4]

== Releasing the indirect flow_block

There are two possibilities, either tunnel netdevice is removed or
a netdevice (port representor) is removed.

=== Tunnel netdevice is removed

Driver waits for the NETDEV_UNREGISTER event that announces the tunnel
netdevice removal. Then, it calls __flow_indr_block_cb_unregister() to
remove the flow_block and rules.  Callgraph is very similar to the one
described above.

=== Netdevice is removed (port representor)

Driver calls __flow_indr_block_cb_unregister() to remove the existing
netfilter/tc rule that belong to the tunnel netdevice.

= How does the indirect flow_block work after this patchset?

Drivers register the indirect flow_block setup callback through
flow_indr_dev_register() if they support for offloading tunnel
netdevices.

== Setting up an indirect flow_block

1) Frontends check if dev->netdev_ops->ndo_setup_tc is unset. If so,
   frontends call flow_indr_dev_setup_offload(). This call invokes
   the drivers' indirect flow_block setup callback.

2) The indirect flow_block setup callback sets up a flow_block structure
   which relates the tunnel netdevice and the driver.

3) The front-end uses flow_block and offload the rules.

Note that the operational to set up (non-indirect) flow_block is very
similar.

== Releasing the indirect flow_block

=== Tunnel netdevice is removed

This calls flow_indr_dev_setup_offload() to set down the flow_block and
remove the offloaded rules. This alternate path is exercised if
dev->netdev_ops->ndo_setup_tc is unset.

=== Netdevice is removed (port representor)

If a netdevice is removed, then it might need to to clean up the
offloaded tc/netfilter rules that belongs to the tunnel netdevice:

1) The driver invokes flow_indr_dev_unregister() when a netdevice is
   removed.

2) This call iterates over the existing indirect flow_blocks
   and it invokes the cleanup callback to let the front-end remove the
   tc/netfilter rules. The cleanup callback already provides the
   flow_block that the front-end needs to clean up.

        Front-end                      Driver

                                         |
                            flow_indr_dev_unregister(...)
                                         |
                         iterate over list of indirect flow_block
                               and invoke cleanup callback
                                         |
            .-----------------------------
            |
            .
   frontend_flow_block_cleanup(flow_block)
            .
            |
           \/
   remove rules to flow_block
      TC_SETUP_CLSFLOWER

= About this patchset

This patchset aims to address the existing TC CT problem while
simplifying the indirect flow_block infrastructure. Saving 300 LoC in
the flow_offload core and the drivers. The operational gets aligned with
the (non-indirect) flow_blocks logic. Patchset is composed of:

Patch #1 add nf_flow_table_gc_cleanup() which is required by the
         netfilter's flowtable new indirect flow_block approach.

Patch #2 adds the flow_block_indr object which is actually part of
         of the flow_block object. This stores the indirect flow_block
         metadata such as the tunnel netdevice owner and the cleanup
         callback (in case the tunnel netdevice goes away).

         This patch adds flow_indr_dev_{un}register() to allow drivers
         to offer netdevice tunnel hardware offload to the front-ends.
         Then, front-ends call flow_indr_dev_setup_offload() to invoke
         the drivers to set up the (indirect) flow_block.

Patch #3 add the tcf_block_offload_init() helper function, this is
         a preparation patch to adapt the tc front-end to use this
         new indirect flow_block infrastructure.

Patch #4 updates the tc and netfilter front-ends to use the new
         indirect flow_block infrastructure.

Patch #5 updates the mlx5 driver to use the new indirect flow_block
         infrastructure.

Patch #6 updates the nfp driver to use the new indirect flow_block
         infrastructure.

Patch #7 updates the bnxt driver to use the new indirect flow_block
         infrastructure.

Patch #8 removes the indirect flow_block infrastructure version 1,
         now that frontends and drivers have been translated to
         version 2 (coming in this patchset).
====================

Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents a01c2454 709ffbe1
Loading
Loading
Loading
Loading
+0 −1
Original line number Diff line number Diff line
@@ -1870,7 +1870,6 @@ struct bnxt {
	u8			dsn[8];
	struct bnxt_tc_info	*tc_info;
	struct list_head	tc_indr_block_list;
	struct notifier_block	tc_netdev_nb;
	struct dentry		*debugfs_pdev;
	struct device		*hwmon_dev;
};
+12 −39
Original line number Diff line number Diff line
@@ -1939,53 +1939,25 @@ static int bnxt_tc_setup_indr_block(struct net_device *netdev, struct bnxt *bp,
	return 0;
}

static int bnxt_tc_setup_indr_cb(struct net_device *netdev, void *cb_priv,
				 enum tc_setup_type type, void *type_data)
{
	switch (type) {
	case TC_SETUP_BLOCK:
		return bnxt_tc_setup_indr_block(netdev, cb_priv, type_data);
	default:
		return -EOPNOTSUPP;
	}
}

static bool bnxt_is_netdev_indr_offload(struct net_device *netdev)
{
	return netif_is_vxlan(netdev);
}

static int bnxt_tc_indr_block_event(struct notifier_block *nb,
				    unsigned long event, void *ptr)
static int bnxt_tc_setup_indr_cb(struct net_device *netdev, void *cb_priv,
				 enum tc_setup_type type, void *type_data)
{
	struct net_device *netdev;
	struct bnxt *bp;
	int rc;

	netdev = netdev_notifier_info_to_dev(ptr);
	if (!bnxt_is_netdev_indr_offload(netdev))
		return NOTIFY_OK;

	bp = container_of(nb, struct bnxt, tc_netdev_nb);
		return -EOPNOTSUPP;

	switch (event) {
	case NETDEV_REGISTER:
		rc = __flow_indr_block_cb_register(netdev, bp,
						   bnxt_tc_setup_indr_cb,
						   bp);
		if (rc)
			netdev_info(bp->dev,
				    "Failed to register indirect blk: dev: %s\n",
				    netdev->name);
		break;
	case NETDEV_UNREGISTER:
		__flow_indr_block_cb_unregister(netdev,
						bnxt_tc_setup_indr_cb,
						bp);
	switch (type) {
	case TC_SETUP_BLOCK:
		return bnxt_tc_setup_indr_block(netdev, cb_priv, type_data);
	default:
		break;
	}

	return NOTIFY_DONE;
	return -EOPNOTSUPP;
}

static const struct rhashtable_params bnxt_tc_flow_ht_params = {
@@ -2074,8 +2046,8 @@ int bnxt_init_tc(struct bnxt *bp)

	/* init indirect block notifications */
	INIT_LIST_HEAD(&bp->tc_indr_block_list);
	bp->tc_netdev_nb.notifier_call = bnxt_tc_indr_block_event;
	rc = register_netdevice_notifier(&bp->tc_netdev_nb);

	rc = flow_indr_dev_register(bnxt_tc_setup_indr_cb, bp);
	if (!rc)
		return 0;

@@ -2101,7 +2073,8 @@ void bnxt_shutdown_tc(struct bnxt *bp)
	if (!bnxt_tc_flower_enabled(bp))
		return;

	unregister_netdevice_notifier(&bp->tc_netdev_nb);
	flow_indr_dev_unregister(bnxt_tc_setup_indr_cb, bp,
				 bnxt_tc_setup_indr_block_cb);
	rhashtable_destroy(&tc_info->flow_table);
	rhashtable_destroy(&tc_info->l2_table);
	rhashtable_destroy(&tc_info->decap_l2_table);
+8 −73
Original line number Diff line number Diff line
@@ -306,20 +306,6 @@ mlx5e_rep_indr_block_priv_lookup(struct mlx5e_rep_priv *rpriv,
	return NULL;
}

static void mlx5e_rep_indr_unregister_block(struct mlx5e_rep_priv *rpriv,
					    struct net_device *netdev);

void mlx5e_rep_indr_clean_block_privs(struct mlx5e_rep_priv *rpriv)
{
	struct mlx5e_rep_indr_block_priv *cb_priv, *temp;
	struct list_head *head = &rpriv->uplink_priv.tc_indr_block_priv_list;

	list_for_each_entry_safe(cb_priv, temp, head, list) {
		mlx5e_rep_indr_unregister_block(rpriv, cb_priv->netdev);
		kfree(cb_priv);
	}
}

static int
mlx5e_rep_indr_offload(struct net_device *netdev,
		       struct flow_cls_offload *flower,
@@ -423,9 +409,14 @@ mlx5e_rep_indr_setup_block(struct net_device *netdev,
			   struct flow_block_offload *f,
			   flow_setup_cb_t *setup_cb)
{
	struct mlx5e_priv *priv = netdev_priv(rpriv->netdev);
	struct mlx5e_rep_indr_block_priv *indr_priv;
	struct flow_block_cb *block_cb;

	if (!mlx5e_tc_tun_device_to_offload(priv, netdev) &&
	    !(is_vlan_dev(netdev) && vlan_dev_real_dev(netdev) == rpriv->netdev))
		return -EOPNOTSUPP;

	if (f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
		return -EOPNOTSUPP;

@@ -492,76 +483,20 @@ int mlx5e_rep_indr_setup_cb(struct net_device *netdev, void *cb_priv,
	}
}

static int mlx5e_rep_indr_register_block(struct mlx5e_rep_priv *rpriv,
					 struct net_device *netdev)
{
	int err;

	err = __flow_indr_block_cb_register(netdev, rpriv,
					    mlx5e_rep_indr_setup_cb,
					    rpriv);
	if (err) {
		struct mlx5e_priv *priv = netdev_priv(rpriv->netdev);

		mlx5_core_err(priv->mdev, "Failed to register remote block notifier for %s err=%d\n",
			      netdev_name(netdev), err);
	}
	return err;
}

static void mlx5e_rep_indr_unregister_block(struct mlx5e_rep_priv *rpriv,
					    struct net_device *netdev)
{
	__flow_indr_block_cb_unregister(netdev, mlx5e_rep_indr_setup_cb,
					rpriv);
}

static int mlx5e_nic_rep_netdevice_event(struct notifier_block *nb,
					 unsigned long event, void *ptr)
{
	struct mlx5e_rep_priv *rpriv = container_of(nb, struct mlx5e_rep_priv,
						     uplink_priv.netdevice_nb);
	struct mlx5e_priv *priv = netdev_priv(rpriv->netdev);
	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);

	if (!mlx5e_tc_tun_device_to_offload(priv, netdev) &&
	    !(is_vlan_dev(netdev) && vlan_dev_real_dev(netdev) == rpriv->netdev))
		return NOTIFY_OK;

	switch (event) {
	case NETDEV_REGISTER:
		mlx5e_rep_indr_register_block(rpriv, netdev);
		break;
	case NETDEV_UNREGISTER:
		mlx5e_rep_indr_unregister_block(rpriv, netdev);
		break;
	}
	return NOTIFY_OK;
}

int mlx5e_rep_tc_netdevice_event_register(struct mlx5e_rep_priv *rpriv)
{
	struct mlx5_rep_uplink_priv *uplink_priv = &rpriv->uplink_priv;
	int err;

	/* init indirect block notifications */
	INIT_LIST_HEAD(&uplink_priv->tc_indr_block_priv_list);

	uplink_priv->netdevice_nb.notifier_call = mlx5e_nic_rep_netdevice_event;
	err = register_netdevice_notifier_dev_net(rpriv->netdev,
						  &uplink_priv->netdevice_nb,
						  &uplink_priv->netdevice_nn);
	return err;
	return flow_indr_dev_register(mlx5e_rep_indr_setup_cb, rpriv);
}

void mlx5e_rep_tc_netdevice_event_unregister(struct mlx5e_rep_priv *rpriv)
{
	struct mlx5_rep_uplink_priv *uplink_priv = &rpriv->uplink_priv;

	/* clean indirect TC block notifications */
	unregister_netdevice_notifier_dev_net(rpriv->netdev,
					      &uplink_priv->netdevice_nb,
					      &uplink_priv->netdevice_nn);
	flow_indr_dev_unregister(mlx5e_rep_indr_setup_cb, rpriv,
				 mlx5e_rep_indr_setup_tc_cb);
}

#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
+0 −4
Original line number Diff line number Diff line
@@ -33,7 +33,6 @@ void mlx5e_rep_encap_entry_detach(struct mlx5e_priv *priv,

int mlx5e_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
		       void *type_data);
void mlx5e_rep_indr_clean_block_privs(struct mlx5e_rep_priv *rpriv);

bool mlx5e_rep_tc_update_skb(struct mlx5_cqe64 *cqe,
			     struct sk_buff *skb,
@@ -65,9 +64,6 @@ static inline int
mlx5e_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
		   void *type_data) { return -EOPNOTSUPP; }

static inline void
mlx5e_rep_indr_clean_block_privs(struct mlx5e_rep_priv *rpriv) {}

struct mlx5e_tc_update_priv;
static inline bool
mlx5e_rep_tc_update_skb(struct mlx5_cqe64 *cqe,
+0 −1
Original line number Diff line number Diff line
@@ -1018,7 +1018,6 @@ destroy_tises:
static void mlx5e_cleanup_uplink_rep_tx(struct mlx5e_rep_priv *rpriv)
{
	mlx5e_rep_tc_netdevice_event_unregister(rpriv);
	mlx5e_rep_indr_clean_block_privs(rpriv);
	mlx5e_rep_bond_cleanup(rpriv);
	mlx5e_rep_tc_cleanup(rpriv);
}
Loading