Commit f459deee authored by Maria Matejka

Thread documentation: chapters 0, 1 and 2

parent 493d45d9
# BIRD Journey to Threads. Chapter 0: The Reason Why.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. Its concept of multiple routing
tables with pipes between them, as well as its procedural filtering language,
has been unique for a long time and is still one of the main reasons why people use
BIRD for large volumes of routing data.

## IPv4 / IPv6 duality: Solved

The original design of BIRD also has some drawbacks. One of these was the idea
of two separate daemons – one BIRD for IPv4 and another for IPv6, built from the same
codebase, cleverly using `#ifdef IPV6` constructions to implement the
common parts of BIRD algorithms and data structures only once.
If IPv6 adoption had progressed as people expected at that time,
this would have worked; after finishing the worldwide transition to IPv6, people could
have simply stopped building BIRD for IPv4 and dropped the `#ifdef`-ed code.

History went the other way, however. BIRD developers therefore decided to *integrate*
these two versions into one daemon capable of handling any address family, allowing
not only for IPv6 but for virtually anything. This rework brought quite a lot of
backward-incompatible changes, therefore we decided to release it as version 2.0.
This work was mostly finished in 2018 and as of March 2021, we have already
switched the 1.6.x branch to a bugfix-only mode.

## BIRD is single-threaded now

The second drawback is the single-threaded design. Looking back to 1998, this was
a good idea. A common PC had one single core and BIRD was targeting exactly
this segment. As the years went by, manufacturers launched multicore x86 chips
(AMD Opteron in 2004, Intel Pentium D in 2005). This ultimately led to a world
where, as of March 2021, virtually no new PC is sold with a single-core CPU.

At the same time, the speed of a single core has not been growing as fast
as the Internet. BIRD is still capable of handling the full BGP table
(868k IPv4 routes in March 2021) with one core; however, when BIRD starts, it may take
long minutes to converge.

## Intermezzo: Filters

In 2018, we took some data we had from large internet exchanges and simulated
a cold start of BIRD as a route server. We used `linux-perf` to find the most time-critical
parts of BIRD and it pointed very clearly to the filtering code. It also showed that the
IPv4 version of BIRD v1.6.x is substantially faster than the *integrated* version, while
the IPv6 version was about as fast as the *integrated* one.

Here we should explain a little bit more about how the filters really work. Let's use
a simple filter as an example:

```
filter foo {
  if net ~ [10.0.0.0/8+] then reject;
  preference = 2 * preference - 41;
  accept;
}
```

This filter gets translated into a tree-shaped (infix) internal structure.

![Example of filter internal representation](00_filter_structure.png)

When executing, the filter interpreter just walks the filter's internal structure recursively in the
right order, executes the instructions, collects their results and finishes by
either rejecting or accepting the route.
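
For illustration, a minimal sketch of such a tree-walking interpreter could look like this; the structure and names are heavily simplified and do not match BIRD's actual filter code:

```c
/* Minimal sketch of a tree-walking (infix) filter interpreter.
 * Structure and names are illustrative only, not BIRD's actual code. */

enum fi_code { FI_CONSTANT, FI_MULTIPLY, FI_SUBTRACT /* ... */ };

struct f_inst {
  enum fi_code code;        /* which instruction this node represents */
  struct f_inst *a1, *a2;   /* operand subtrees (may be NULL) */
  int val;                  /* immediate value for FI_CONSTANT */
};

/* Recursively walk the tree, evaluating the operands first. */
static int
interpret(const struct f_inst *i)
{
  switch (i->code) {
    case FI_CONSTANT: return i->val;
    case FI_MULTIPLY: return interpret(i->a1) * interpret(i->a2);
    case FI_SUBTRACT: return interpret(i->a1) - interpret(i->a2);
    /* ... real filters also handle route attributes, prefix sets, etc. */
  }
  return 0;
}
```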

## Filter rework

Further analysis of the filter code revealed an absurd-looking result. The
most executed parts of the interpreter function were the `push` CPU
instructions at its very beginning and the `pop` CPU instructions at its very
end. This came from the fact that the interpreter function was quite long, yet
most of the filter instructions took an extremely short path through it: it did all the
stack manipulation at the beginning, branched by the filter instruction type,
then executed just several CPU instructions, popped everything back from the
stack and returned.

After some thought on how to minimize stack manipulation when everything you need
is to take two numbers and multiply them, we decided to preprocess the filter's
internal structure into another structure which is much easier to execute. The
interpreter now uses a data stack and behaves generally as a
postfix-ordered language. We also considered Lua, which turned out to mean quite
a lot of work implementing all the glue while achieving about the same performance.
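
A similarly hedged sketch of the postfix approach: the instructions are laid out in a flat array and the interpreter keeps an explicit data stack instead of recursing on the CPU stack.

```c
/* Illustrative sketch of the postfix (linearized) interpreter, reusing the
 * fi_code values from the previous sketch; again not BIRD's real code. */

struct f_line_item {
  enum fi_code code;        /* which instruction */
  int val;                  /* immediate value for FI_CONSTANT */
};

/* Execute a flat array of instructions with an explicit data stack. */
static int
interpret_postfix(const struct f_line_item *items, int len)
{
  int stack[128];
  int sp = 0;

  for (int pc = 0; pc < len; pc++)
    switch (items[pc].code) {
      case FI_CONSTANT:
        stack[sp++] = items[pc].val;          /* push an operand */
        break;
      case FI_MULTIPLY:
        sp--; stack[sp-1] *= stack[sp];       /* pop two, push the product */
        break;
      case FI_SUBTRACT:
        sp--; stack[sp-1] -= stack[sp];       /* pop two, push the difference */
        break;
    }

  return stack[0];                            /* value of the whole expression */
}
```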

After these changes, we managed to reduce the filter execution time by 10–40%,
depending on how complex the filter is.
Still, even this reduction is far too little when there is one CPU core
running for several minutes while the others are sleeping.

## We need more threads

As a side effect of the rework, the new filter interpreter is also completely
thread-safe. It seemed to be the way to go – running the filters in parallel
while keeping everything else single-threaded. The main problem with this
solution is the too fine granularity of the parallel jobs; we would spend lots of
time on synchronization overhead.

Parallel execution of filters alone was also too one-sided, useful only for
configurations with complex filters. In other cases, the major problem is best
route recalculation, OSPF recalculation or kernel synchronization.
It also turned out to be quite dirty from the code cleanliness point of view.

Therefore we chose to make BIRD completely multithreaded. We designed a way
to gradually enable parallel computation and make the best use of all available CPU
cores. We have three goals:

* We want to keep current functionality. Parallel computation should never drop
  a useful feature.
* We want to take small steps. No big reworks, even though even the smallest
  possible step needs quite a lot of refactoring first.
* We want to be backwards compatible as much as possible.

*It's still a long road to the version 2.1. This series of texts should document
what is needed to be changed, why we do it and how. In the next chapter, we're
going to describe the structures for routes and their attributes. Stay tuned!*
# BIRD Journey to Threads. Chapter 1: The Route and its Attributes

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're doing a significant amount of
changes to BIRD's internal structure to make it possible to run in multiple
threads in parallel. This chapter covers the necessary changes to the data
structures that store the routing data.

*If you want to see the changes in code, look (basically) into the
`route-storage-updates` branch. Not all of them are implemented yet; however,
most of them are pretty much finished as of the end of March 2021.*

## How routes are stored

A BIRD routing table is just a hierarchical noSQL database. On the top level, the
routes are keyed by their destination, called *net*. Due to historic reasons,
the *net* is not only an *IPv4 prefix*, *IPv6 prefix*, *IPv4 VPN prefix* etc.,
but may also be an *MPLS label*, *ROA information* or a *BGP Flowspec record*. As there may
be several routes for each *net*, an obligatory part of the key is the *src*, aka the
*route source*. The route source is a tuple of the originating protocol
instance and a 32-bit unsigned integer. If a protocol wants to withdraw a route,
it is both necessary and sufficient to supply the *net* and *src* to identify which route
is to be withdrawn.

The route itself consists of (basically) a list of key-value records, with
value types ranging from a 16-bit unsigned integer for preference to a complex
BGP path structure. The keys are pre-defined by protocols (e.g. BGP path or
OSPF metrics), or by the BIRD core itself (preference, route gateway).
Finally, users can declare their own attribute keys using the keyword
`attribute` in the config.
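
For example, a custom attribute might be declared and then set in a filter roughly like this (the attribute name is made up; see the BIRD user documentation for the exact syntax):

```
attribute int my_metric;      # user-defined attribute key

filter set_my_metric {
  my_metric = 100;            # used like any other route attribute
  accept;
}
```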

## Attribute list implementation

Currently, there are three layers of route attributes. We call them *route*
(*rte*), *attributes* (*rta*) and *extended attributes* (*ea*, *eattr*).

The first layer, *rte*, contains the *net* pointer, several fixed-size route
attributes (mostly preference and protocol-specific metrics), flags, lastmod
time and a pointer to *rta*.

The second layer, *rta*, contains the *src* (a pointer to a singleton instance),
a route gateway, several other fixed-size route attributes and a pointer to
*ea* list.

The third layer, *ea* list, is a variable-length list of key-value attributes,
containing all the remaining route attributes.

The distribution of the route attributes between the attribute layers is somewhat
arbitrary. Mostly, the first and second layers hold attributes that
were thought to be accessed frequently (e.g. in best route selection) and
filled in for most routes, while the third layer is for infrequently used
and/or infrequently accessed route attributes.
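
A heavily simplified sketch of these three layers might look as follows; the field names are illustrative and the real definitions in BIRD's `nest/route.h` contain considerably more:

```c
/* Heavily simplified sketch of the three attribute layers; field names are
 * illustrative, the real structures in BIRD's nest/route.h are larger. */

struct eattr {                  /* layer 3: one key-value attribute */
  unsigned id;                  /* numerical attribute key */
  /* ... type information and the value itself ... */
};

struct ea_list {                /* layer 3: variable-length attribute list */
  struct ea_list *next;
  unsigned count;
  struct eattr attrs[];         /* the key-value records */
};

struct rta {                    /* layer 2: shared attribute block */
  struct rte_src *src;          /* route source (protocol instance + u32) */
  /* ... gateway / nexthop and other fixed-size attributes ... */
  struct ea_list *eattrs;       /* pointer to layer 3 */
};

struct rte {                    /* layer 1: the route itself */
  struct network *net;          /* destination, the primary key */
  unsigned short preference;    /* fixed-size attributes ... */
  unsigned flags;
  long lastmod;                 /* last modification time */
  struct rta *attrs;            /* pointer to layer 2 */
};
```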

## Attribute list deduplication

When protocols originate routes, there are commonly many routes with the
same attribute list. BIRD could ignore this fact; however, if you have several
tables connected with pipes, it is more memory-efficient to store each
attribute list only once.

Therefore, the two lower layers (*rta* and *ea*) are hashed and stored in a
BIRD-global database. Routes (*rte*) contain a pointer to an *rta* in this
database, which maintains a use-count for each *rta*. Attributes (*rta*) contain
a pointer to a normalized (sorted by numerical key ID) *ea*.
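
The lookup then works roughly as in this sketch; the helper names are hypothetical, the real implementation lives in BIRD's `nest/rt-attr.c`:

```c
/* Hedged sketch of attribute-list deduplication; helper names are made up,
 * BIRD's real code lives in nest/rt-attr.c (rta_lookup() and friends). */

struct rta_sk {
  struct rta_sk *hash_next;     /* chain in the global hash table */
  unsigned use_count;           /* how many routes point at this block */
  /* ... the actual route attributes ... */
};

#define RTA_HASH_SIZE 1024
static struct rta_sk *rta_hash_table[RTA_HASH_SIZE];

unsigned rta_hash(const struct rta_sk *a);          /* hash of the contents */
int rta_same(const struct rta_sk *a, const struct rta_sk *b);
struct rta_sk *rta_store(const struct rta_sk *a);   /* copy into global memory */

struct rta_sk *
rta_lookup_sketch(const struct rta_sk *tmpl)
{
  unsigned h = rta_hash(tmpl) % RTA_HASH_SIZE;

  /* Reuse an already stored attribute block with identical contents. */
  for (struct rta_sk *r = rta_hash_table[h]; r; r = r->hash_next)
    if (rta_same(r, tmpl)) {
      r->use_count++;
      return r;
    }

  /* Not found: store a new copy in the global database. */
  struct rta_sk *copy = rta_store(tmpl);
  copy->use_count = 1;
  copy->hash_next = rta_hash_table[h];
  rta_hash_table[h] = copy;
  return copy;
}
```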

## Attribute list rework

The first thing to change is the distribution of route attributes between the
attribute list layers. We decided to keep in the first layer (*rte*) only the key
and other per-record internal technical information. Therefore we move *src* to
*rte* and preference to *rta* (beside other things). *This is already done.*

We also found out that the nexthop (gateway), originally one single IP address
and an interface, has evolved into a complex attribute with several sub-attributes,
covering not only multipath routing but also MPLS stacks and other per-route
attributes. This has led to an overly complex data structure holding the nexthop set.
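
To give an idea of the complexity, a single element of the nexthop set nowadays has to carry roughly this much information (a sketch with made-up types and field names, not BIRD's exact definitions):

```c
/* Rough sketch of one element of today's nexthop set; types and field
 * names are made up here, not BIRD's exact definitions. */
struct nexthop_sk {
  unsigned char gw[16];         /* gateway address (IPv4 or IPv6) */
  int iface_index;              /* outgoing interface */
  struct nexthop_sk *next;      /* next path of a multipath route */
  unsigned char weight;         /* multipath weight */
  unsigned char label_count;    /* number of MPLS labels */
  unsigned labels[];            /* the MPLS label stack */
};
```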

We finally decided to squash *rta* and *ea* into one type of data structure,
allowing for completely dynamic route attribute lists. This is also motivated
by the addition of other *net* types (BGP FlowSpec or ROA) where lots of the fields make
no sense at all, yet we still want to use the same data structures and implementation
as we don't like duplicating code. *Multithreading doesn't depend on this change;
it is going to happen soon anyway.*

## Route storage

The process of route import from protocol into a table can be divided into several phases:

1. (In protocol code.) Create the route itself (typically from
   protocol-internal data) and choose the right channel to use.
2. (In protocol code.) Create the *rta* and *ea* and obtain an appropriate
   hashed pointer. Allocate the *rte* structure and fill it in. 
3. (Optionally.) Store the route to the *import table*.
4. Run filters. If reject, free everything.
5. Check whether this is a real change (it may be idempotent). If not, free everything and do nothing more.
6. Run the best route selection algorithm.
7. Execute exports if needed.

We found out that the *rte* structure allocation is done too early. BIRD uses
global optimized allocators for fixed-size blocks (which *rte* is) to reduce
its memory footprint, therefore the allocation of the *rte* structure would become a
synchronization point in a multithreaded environment.

The common code is also much more complicated when we have to track whether the
current *rte* has to be freed or not. This is more of a problem in export than in
import, as the export filter can also change the route (and therefore allocate
another *rte*). The changed route must therefore be freed after use. All the
route-changing code must also track whether the route is writable or
read-only.

We therefore introduce a variant of *rte* called *rte_storage*. Both of these
hold the same layer-1 route information (destination, author, cached
attribute pointer, flags etc.); however, *rte* is always local while *rte_storage*
is intended to be put in global data structures.
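
Conceptually, *rte_storage* just wraps the same route data for storing in a table, roughly as in this sketch (illustrative only, not the exact definition):

```c
/* Illustrative sketch only: rte_storage wraps the same route data as a
 * local rte, plus whatever the table needs to chain stored routes. */
struct rte_storage {
  struct rte_storage *next;   /* next stored route for the same net */
  struct rte rte;             /* the layer-1 route data itself */
};
```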

This change allows us to remove lots of the code which only tracked whether a given
*rte* was to be freed, as *rte*s are almost always allocated on the stack, naturally
limiting their lifetime. If not on the stack, it's the responsibility of the owner
to free the *rte* after the import is done.

This change also removes the need for *rte* allocation in protocol code, and
*rta* can also be safely allocated on the stack. As a result, protocols can simply
allocate all the data on the stack, call the update routine, and the common code in
BIRD's *nest* does all the storage for them.
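
A hedged sketch of what the import path then looks like from a protocol's point of view; the names and signatures are illustrative, not BIRD's exact internal API:

```c
/* Hedged sketch of the new import path from a protocol's point of view.
 * Assumes BIRD's internal headers; names and signatures are illustrative,
 * not the real API. */
static void
my_proto_announce(struct channel *ch, const net_addr *n, struct rte_src *src)
{
  /* Attribute list on the stack; the nest hashes and stores it later. */
  struct rta a = { 0 /* ... gateway and other attributes ... */ };

  /* The route itself, also on the stack; after the attribute list rework,
   * the src lives here. */
  struct rte e = {
    .src = src,
    .attrs = &a,
  };

  /* The common code copies whatever it needs into rte_storage, so nothing
   * allocated here has to outlive this call. */
  rte_update(ch, n, &e);
}
```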

Allocating *rta* on the stack is, however, not required. BGP and OSPF use this to
import several routes with the same attribute list. In BGP, this is due to the
format of BGP update messages, containing first the attributes and then the
destinations (BGP NLRIs). In OSPF, in addition to *rta* deduplication, it is
also presumed that no import filter (or at most some trivial changes) is applied,
as OSPF would typically not work well when filtered.

*This change is already done.*

## Route cleanup and table maintenance

In some cases, the route update is not originated by protocol/channel code.
When a channel shuts down, all routes originated by that channel are simply
cleaned up. Also routes with recursive nexthops may get changed without an import,
simply by a change of the underlying IGP route.

This is currently done by a `rt_event` (see `nest/rt-table.c` for source code)
which is to be converted to a parallel thread, running when nobody imports any
route. *This change is freshly done in branch `guernsey`.*

## Parallel protocol execution

The long-term goal of these reworks is to allow for completely independent
execution of all the protocols. Typically, there is no direct interaction
between protocols; everything is done through BIRD's *nest*. Protocols should
therefore run in parallel in the future and wait/lock only when something needs
to be done externally.

We also aim for a clean and documented protocol API.

*It's still a long road to the version 2.1. This series of texts should document
what is needed to be changed, why we do it and how. In the next chapter, we're
going to describe how the route is exported from table to protocols and how this
process is changing. Stay tuned!*