Commit 53a25406 authored by Ondrej Zajicek (work)

Merge branch 'oz-trie-table'

parents 4c6ee53f 24600c64
+91 −17
@@ -252,16 +252,9 @@ The global best route selection algorithm is (roughly) as follows:
</itemize>

<p><label id="dsc-table-sorted">Usually, a routing table just chooses a selected
route from a list of entries for one network. But if the <cf/sorted/ option is
activated, these lists of entries are kept completely sorted (according to
preference or some protocol-dependent metric). This is needed for some features
of some protocols (e.g. <cf/secondary/ option of BGP protocol, which allows to
accept not just a selected route, but the first route (in the sorted list) that
is accepted by filters), but it is incompatible with some other features (e.g.
<cf/deterministic med/ option of BGP protocol, which activates a way of choosing
selected route that cannot be described using comparison and ordering). Minor
advantage is that routes are shown sorted in <cf/show route/, minor disadvantage
is that it is slightly more computationally expensive.
route from a list of entries for one network. Optionally, these lists of entries
are kept completely sorted (according to preference or some protocol-dependent
metric). See <ref id="rtable-sorted" name="sorted"> table option for details.

<sect>Routes and network types
<label id="routes">
@@ -628,18 +621,73 @@ include "tablename.conf";;
	<cf/protocol/ times, and the <cf/iso long ms/ format for <cf/base/ and
	<cf/log/ times.

	<tag><label id="opt-table"><m/nettype/ table <m/name/ [sorted]</tag>
	Create a new routing table. The default routing tables <cf/master4/ and
	<cf/master6/ are created implicitly, other routing tables have to be
	added by this command.  Option <cf/sorted/ can be used to enable sorting
	of routes, see <ref id="dsc-table-sorted" name="sorted table">
	description for details.
	<tag><label id="opt-table"><m/nettype/ table <m/name/ [ { <m/option/; [<m/.../] } ]</tag>
	Define a new routing table. The default routing tables <cf/master4/ and
	<cf/master6/ are defined implicitly, other routing tables have to be
	defined by this option. See the <ref id="rtable-opts"
	name="routing table configuration section"> for routing table options.

	<tag><label id="opt-eval">eval <m/expr/</tag>
	Evaluates given filter expression. It is used by the developers for testing of filters.
</descrip>


<sect>Routing table options
<label id="rtable-opts">

<p>Most routing tables do not need any options and are defined without an option
block, but there are still some options to tweak routing table behavior. Note
that implicit tables (<cf/master4/ and <cf/master6/) can be redefined in order
to set options.

<descrip>
	<tag><label id="rtable-sorted">sorted <m/switch/</tag>
	Usually, a routing table just chooses the selected (best) route from a
	list of routes for each network, while keeping remaining routes unsorted.
	If enabled, these lists of routes are kept completely sorted (according
	to preference or some protocol-dependent metric).

	This is needed for some protocol features (e.g. the <cf/secondary/ option
	of the BGP protocol, which allows accepting not just the selected route,
	but the first route (in the sorted list) that is accepted by filters),
	but it is incompatible with some other features (e.g. the
	<cf/deterministic med/ option of the BGP protocol, which selects routes
	in a way that cannot be described using comparison and ordering). A
	minor advantage is that routes are shown sorted in <cf/show route/; a
	minor disadvantage is that it is slightly more computationally
	expensive. Default: off.

	<tag><label id="rtable-trie">trie <m/switch/</tag>
	BIRD routing tables are implemented with hash tables, which are efficient
	for exact-match lookups, but inconvenient for longest-match lookups or
	interval lookups (finding superprefixes or subprefixes). This option
	activates an additional trie structure that is used to accelerate these
	lookups, while using the hash table for exact-match lookups.

	This is advantageous for <ref id="rpki" name="RPKI"> (on ROA tables),
	for <ref id="bgp-gateway" name="recursive next-hops"> (on IGP tables),
	and is required for <ref id="bgp-validate" name="flowspec validation">
	(on base IP tables). Another advantage is that interval results (like
	from <cf/show route in .../ command) are lexicographically sorted. The
	disadvantage is that trie-enabled routing tables require more memory,
	which may be an issue especially in multi-table setups. Default: off.

	<tag><label id="rtable-min-settle-time">min settle time <m/time/</tag>
	Specify a minimum value of the settle time. When a ROA table changes,
	automatic <ref id="proto-rpki-reload" name="RPKI reload"> may be
	triggered, after a short settle time. Minimum settle time is a delay
	from the last ROA table change to wait for more updates. Default: 1 s.


	<tag><label id="rtable-max-settle-time">max settle time <m/time/</tag>
	Specify a maximum value of the settle time. When a ROA table changes,
	automatic <ref id="proto-rpki-reload" name="RPKI reload"> may be
	triggered, after a short settle time. Maximum settle time is an upper
	limit to the settle time from the initial ROA table change even if
	there are consecutive updates gradually renewing the settle time.
	Default: 20 s.
</descrip>


<sect>Protocol options
<label id="protocol-opts">

@@ -2290,6 +2338,7 @@ avoid routing loops.
<item> <rfc id="8092"> - BGP Large Communities Attribute
<item> <rfc id="8203"> - BGP Administrative Shutdown Communication
<item> <rfc id="8212"> - Default EBGP Route Propagation Behavior without Policies
<item> <rfc id="9117"> - Revised Validation Procedure for BGP Flow Specifications
</itemize>

<sect1>Route selection rules
@@ -2674,7 +2723,7 @@ using the following configuration parameters:

	<tag><label id="bgp-error-wait-time">error wait time <m/number/,<m/number/</tag>
	Minimum and maximum delay in seconds between a protocol failure (either
	local or reported by the peer) and automatic restart. Doesn't apply
	local or reported by the peer) and automatic restart. Does not apply
	when <cf/disable after error/ is configured. If consecutive errors
	happen, the delay is increased exponentially until it reaches the
	maximum. Default: 60, 300.
@@ -2852,6 +2901,31 @@ be used in explicit configuration.
	explicitly (to conserve memory). This option requires that the connected
	routing table is <ref id="dsc-table-sorted" name="sorted">. Default: off.

	<tag><label id="bgp-validate">validate <m/switch/</tag>
	Apply the flowspec validation procedure as described in <rfc id="8955">
	section 6 and <rfc id="9117">. The validation procedure enforces that
	only routers in the forwarding path for a network can originate flowspec
	rules for that network. The validation procedure should be used for EBGP
	to prevent injection of malicious flowspec rules from outside, but it
	should also be used for IBGP to ensure that selected flowspec rules are
	consistent with selected IP routes. The validation procedure uses an IP
	routing table (<ref id="bgp-base-table" name="base table">, see below)
	against which flowspec rules are validated. This option is limited to
	flowspec channels. Default: off (for compatibility reasons).

	Note that currently the flowspec validation does not work reliably
	together with <ref id="bgp-import-table" name="import table"> option
	enabled on flowspec channels.

	<tag><label id="bgp-base-table">base table <m/name/</tag>
	Specifies an IP table used for the flowspec validation procedure. The
	table must have the <cf/trie/ option enabled, otherwise the validation
	procedure would not work. The type of the table must be <cf/ipv4/ for
	<cf/flow4/ channels and <cf/ipv6/ for <cf/flow6/ channels. This option
	is limited to flowspec channels. Default: the main table of the
	<cf/ipv4/ / <cf/ipv6/ channel of the same BGP instance, or the
	<cf/master4/ / <cf/master6/ table if there is no such channel.

	<tag><label id="bgp-extended-next-hop">extended next hop <m/switch/</tag>
	BGP expects that announced next hops have the same address family as
	associated network prefixes. This option provides an extension to use
+81 −4
@@ -140,18 +140,23 @@ struct f_tree {
  void *data;
};

#define TRIE_STEP		4
#define TRIE_STACK_LENGTH	33

struct f_trie_node4
{
  ip4_addr addr, mask, accept;
  uint plen;
  struct f_trie_node4 *c[2];
  u16 plen;
  u16 local;
  struct f_trie_node4 *c[1 << TRIE_STEP];
};

struct f_trie_node6
{
  ip6_addr addr, mask, accept;
  uint plen;
  struct f_trie_node6 *c[2];
  u16 plen;
  u16 local;
  struct f_trie_node6 *c[1 << TRIE_STEP];
};
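
With `TRIE_STEP` set to 4, each node holds `1 << TRIE_STEP` = 16 child pointers, so one step down the trie consumes four address bits instead of one. The sketch below shows one way such a child index could be derived from an IPv4 address. It is an illustration only: `demo_child_index` and the `DEMO_*` names are hypothetical and not part of the header, and the real lookup code is not shown in this diff. A stack length of 33 would be consistent with 128 / 4 + 1 levels for IPv6, though the header does not state this.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TRIE_STEP      4                      /* mirrors TRIE_STEP above */
#define DEMO_TRIE_CHILDREN  (1 << DEMO_TRIE_STEP)  /* 16 children per node */

/*
 * Hypothetical helper: return the DEMO_TRIE_STEP address bits that start at
 * bit position pos (0 = most significant bit), usable as an index into a
 * 16-entry child array. Requires pos + DEMO_TRIE_STEP <= 32.
 */
static inline unsigned
demo_child_index(uint32_t addr, unsigned pos)
{
  return (addr >> (32 - DEMO_TRIE_STEP - pos)) & (DEMO_TRIE_CHILDREN - 1);
}
```

For the address 10.10.64.0 (`0x0A0A4000`), the four bits starting at position 16 form the nibble `0x4`, so a node at depth 16 would pick child 4 under this assumption.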

struct f_trie_node
@@ -168,9 +173,20 @@ struct f_trie
  u8 zero;
  s8 ipv4;				/* -1 for undefined / empty */
  u16 data_size;			/* Additional data for each trie node */
  u32 prefix_count;			/* Works only for restricted tries (pxlen == l == h) */
  struct f_trie_node root;		/* Root trie node */
};

struct f_trie_walk_state
{
  u8 ipv4;
  u8 accept_length;			/* Current inter-node prefix position */
  u8 start_pos;				/* Initial prefix position in stack[0] */
  u8 local_pos;				/* Current intra-node prefix position */
  u8 stack_pos;				/* Current node in stack below */
  const struct f_trie_node *stack[TRIE_STACK_LENGTH];
};

struct f_tree *f_new_tree(void);
struct f_tree *build_tree(struct f_tree *);
const struct f_tree *find_tree(const struct f_tree *t, const struct f_val *val);
@@ -181,9 +197,70 @@ void tree_walk(const struct f_tree *t, void (*hook)(const struct f_tree *, void
struct f_trie *f_new_trie(linpool *lp, uint data_size);
void *trie_add_prefix(struct f_trie *t, const net_addr *n, uint l, uint h);
int trie_match_net(const struct f_trie *t, const net_addr *n);
int trie_match_longest_ip4(const struct f_trie *t, const net_addr_ip4 *net, net_addr_ip4 *dst, ip4_addr *found0);
int trie_match_longest_ip6(const struct f_trie *t, const net_addr_ip6 *net, net_addr_ip6 *dst, ip6_addr *found0);
void trie_walk_init(struct f_trie_walk_state *s, const struct f_trie *t, const net_addr *from);
int trie_walk_next(struct f_trie_walk_state *s, net_addr *net);
int trie_same(const struct f_trie *t1, const struct f_trie *t2);
void trie_format(const struct f_trie *t, buffer *buf);

static inline int
trie_match_next_longest_ip4(net_addr_ip4 *n, ip4_addr *found)
{
  while (n->pxlen)
  {
    n->pxlen--;
    ip4_clrbit(&n->prefix, n->pxlen);

    if (ip4_getbit(*found, n->pxlen))
      return 1;
  }

  return 0;
}

static inline int
trie_match_next_longest_ip6(net_addr_ip6 *n, ip6_addr *found)
{
  while (n->pxlen)
  {
    n->pxlen--;
    ip6_clrbit(&n->prefix, n->pxlen);

    if (ip6_getbit(*found, n->pxlen))
      return 1;
  }

  return 0;
}
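
The two inline helpers above shorten the prefix one bit at a time and stop at the next length whose bit is set in `found`. The following standalone sketch reproduces that loop on a plain `uint32_t`, assuming that `found` is a bitmask in which bit p (counted from the most significant bit) marks a covering prefix of length p. The `demo_*` names are hypothetical stand-ins for `ip4_getbit()`, `ip4_clrbit()` and `net_addr_ip4`, not BIRD's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for net_addr_ip4: just a prefix and its length. */
struct demo_net { uint32_t prefix; unsigned pxlen; };

/* Hypothetical stand-ins for ip4_getbit()/ip4_clrbit(); bit 0 is the MSB. */
static inline int
demo_getbit(uint32_t a, unsigned pos)
{ return (a >> (31 - pos)) & 1; }

static inline void
demo_clrbit(uint32_t *a, unsigned pos)
{ *a &= ~(UINT32_C(1) << (31 - pos)); }

/* Same loop shape as trie_match_next_longest_ip4() in the header. */
static int
demo_next_longest(struct demo_net *n, uint32_t found)
{
  while (n->pxlen)
  {
    n->pxlen--;                         /* shorten the prefix by one bit */
    demo_clrbit(&n->prefix, n->pxlen);  /* zero the bit that fell off */

    if (demo_getbit(found, n->pxlen))   /* is a prefix of this length marked? */
      return 1;
  }

  return 0;
}
```

Starting from 10.10.64.0/18 with lengths 16 and 8 marked in `found`, successive calls yield 10.10.0.0/16, then 10.0.0.0/8, and then return 0 once the length reaches zero.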


#define TRIE_WALK_TO_ROOT_IP4(trie, net, dst) ({		\
  net_addr_ip4 dst;						\
  ip4_addr _found;						\
  for (int _n = trie_match_longest_ip4(trie, net, &dst, &_found); \
       _n;							\
       _n = trie_match_next_longest_ip4(&dst, &_found))

#define TRIE_WALK_TO_ROOT_IP6(trie, net, dst) ({		\
  net_addr_ip6 dst;						\
  ip6_addr _found;						\
  for (int _n = trie_match_longest_ip6(trie, net, &dst, &_found); \
       _n;							\
       _n = trie_match_next_longest_ip6(&dst, &_found))

#define TRIE_WALK_TO_ROOT_END })


#define TRIE_WALK(trie, net, from) ({				\
  net_addr net;							\
  struct f_trie_walk_state tws_;				\
  trie_walk_init(&tws_, trie, from);				\
  while (trie_walk_next(&tws_, &net))

#define TRIE_WALK_END })
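
The `TRIE_WALK` / `TRIE_WALK_END` pair opens a GNU C statement expression, declares the loop variable and walk state inside it, and leaves a `while` header dangling so that the caller supplies the loop body before the closing macro. The toy below applies the same pattern to a simple integer range so it can be compiled without BIRD. All `demo_*` / `DEMO_*` names are hypothetical, and it relies on the GCC/Clang statement-expression extension just as the macros above do.

```c
#include <assert.h>

/* Hypothetical iterator state, playing the role of struct f_trie_walk_state. */
struct demo_walk_state { int cur, end; };

static void
demo_walk_init(struct demo_walk_state *s, int lo, int hi)
{ s->cur = lo; s->end = hi; }

/* Store the next value in *out and return 1, or return 0 when exhausted. */
static int
demo_walk_next(struct demo_walk_state *s, int *out)
{
  if (s->cur >= s->end)
    return 0;

  *out = s->cur++;
  return 1;
}

/* Same shape as TRIE_WALK: open a scope, declare state, leave a loop header. */
#define DEMO_WALK(var, lo, hi) ({				\
  int var;							\
  struct demo_walk_state dws_;					\
  demo_walk_init(&dws_, lo, hi);				\
  while (demo_walk_next(&dws_, &var))

#define DEMO_WALK_END })

/* Sum the integers in [lo, hi) using the macro pair. */
static int
demo_sum(int lo, int hi)
{
  int sum = 0;

  DEMO_WALK(i, lo, hi)
  {
    sum += i;
  }
  DEMO_WALK_END;

  return sum;
}
```

Because the variable and state live inside the statement expression, they go out of scope at `DEMO_WALK_END`, which lets several walks with the same variable name appear in one function.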


#define F_CMP_ERROR 999

const char *f_type_name(enum f_type t);
+33 −0
@@ -499,6 +499,33 @@ prefix set pxs;

	bt_assert(1.2.0.0/16 ~ [ 1.0.0.0/8{ 15 , 17 } ]);
	bt_assert([ 10.0.0.0/8{ 15 , 17 } ] != [ 11.0.0.0/8{ 15 , 17 } ]);

	/* Formatting of prefix sets, some cases are a bit strange */
	bt_assert(format([ 0.0.0.0/0 ]) = "[0.0.0.0/0]");
	bt_assert(format([ 10.10.0.0/32 ]) = "[10.10.0.0/32{0.0.0.1}]");
	bt_assert(format([ 10.10.0.0/17 ]) = "[10.10.0.0/17{0.0.128.0}]");
	bt_assert(format([ 10.10.0.0/17{17,19} ]) = "[10.10.0.0/17{0.0.224.0}]"); # 224 = 128+64+32
	bt_assert(format([ 10.10.128.0/17{18,19} ]) = "[10.10.128.0/18{0.0.96.0}, 10.10.192.0/18{0.0.96.0}]"); # 96 = 64+32
	bt_assert(format([ 10.10.64.0/18- ]) = "[0.0.0.0/0, 0.0.0.0/1{128.0.0.0}, 0.0.0.0/2{64.0.0.0}, 0.0.0.0/3{32.0.0.0}, 10.10.0.0/16{255.255.0.0}, 10.10.0.0/17{0.0.128.0}, 10.10.64.0/18{0.0.64.0}]");
	bt_assert(format([ 10.10.64.0/18+ ]) = "[10.10.64.0/18{0.0.96.0}, 10.10.64.0/20{0.0.31.255}, 10.10.80.0/20{0.0.31.255}, 10.10.96.0/20{0.0.31.255}, 10.10.112.0/20{0.0.31.255}]");

	bt_assert(format([ 10.10.160.0/19 ]) = "[10.10.160.0/19{0.0.32.0}]");
	bt_assert(format([ 10.10.160.0/19{19,22} ]) = "[10.10.160.0/19{0.0.32.0}, 10.10.160.0/20{0.0.28.0}, 10.10.176.0/20{0.0.28.0}]"); # 28 = 16+8+4
	bt_assert(format([ 10.10.160.0/19+ ]) = "[10.10.160.0/19{0.0.32.0}, 10.10.160.0/20{0.0.31.255}, 10.10.176.0/20{0.0.31.255}]");

	bt_assert(format([ ::/0 ]) = "[::/0]");
	bt_assert(format([ 11:22:33:44:55:66:77:88/128 ]) = "[11:22:33:44:55:66:77:88/128{::1}]");
	bt_assert(format([ 11:22:33:44::/64 ]) = "[11:22:33:44::/64{0:0:0:1::}]");
	bt_assert(format([ 11:22:33:44::/64+ ]) = "[11:22:33:44::/64{::1:ffff:ffff:ffff:ffff}]");

	bt_assert(format([ 11:22:33:44::/65 ]) = "[11:22:33:44::/65{::8000:0:0:0}]");
	bt_assert(format([ 11:22:33:44::/65{65,67} ]) = "[11:22:33:44::/65{::e000:0:0:0}]"); # e = 8+4+2
	bt_assert(format([ 11:22:33:44:8000::/65{66,67} ]) = "[11:22:33:44:8000::/66{::6000:0:0:0}, 11:22:33:44:c000::/66{::6000:0:0:0}]"); # 6 = 4+2
	bt_assert(format([ 11:22:33:44:4000::/66- ]) = "[::/0, ::/1{8000::}, ::/2{4000::}, ::/3{2000::}, 11:22:33:44::/64{ffff:ffff:ffff:ffff::}, 11:22:33:44::/65{::8000:0:0:0}, 11:22:33:44:4000::/66{::4000:0:0:0}]");
	bt_assert(format([ 11:22:33:44:4000::/66+ ]) = "[11:22:33:44:4000::/66{::6000:0:0:0}, 11:22:33:44:4000::/68{::1fff:ffff:ffff:ffff}, 11:22:33:44:5000::/68{::1fff:ffff:ffff:ffff}, 11:22:33:44:6000::/68{::1fff:ffff:ffff:ffff}, 11:22:33:44:7000::/68{::1fff:ffff:ffff:ffff}]");
	bt_assert(format([ 11:22:33:44:c000::/67 ]) = "[11:22:33:44:c000::/67{::2000:0:0:0}]");
	bt_assert(format([ 11:22:33:44:c000::/67{67,71} ]) = "[11:22:33:44:c000::/67{::2000:0:0:0}, 11:22:33:44:c000::/68{::1e00:0:0:0}, 11:22:33:44:d000::/68{::1e00:0:0:0}]");
	bt_assert(format([ 11:22:33:44:c000::/67+ ]) = "[11:22:33:44:c000::/67{::2000:0:0:0}, 11:22:33:44:c000::/68{::1fff:ffff:ffff:ffff}, 11:22:33:44:d000::/68{::1fff:ffff:ffff:ffff}]");
}

bt_test_suite(t_prefix_set, "Testing prefix sets");
@@ -578,6 +605,12 @@ prefix set pxs;
	bt_assert(2000::/29 !~ pxs);
	bt_assert(1100::/10 !~ pxs);
	bt_assert(2010::/26 !~ pxs);

	pxs = [ 52E0::/13{13,128} ];
	bt_assert(52E7:BE81:379B:E6FD:541F:B0D0::/93 ~ pxs);

	pxs = [ 41D8:8718::/30{0,30}, 413A:99A8:6C00::/38{38,128} ];
	bt_assert(4180::/9 ~ pxs);
}

bt_test_suite(t_prefix6_set, "Testing prefix IPv6 sets");