Commit f459deee authored by Maria Matejka

Thread documentation: chapters 0, 1 and 2

parent 493d45d9
# BIRD Journey to Threads. Chapter 0: The Reason Why.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. Its concept of multiple routing
tables with pipes between them, as well as its procedural filtering language,
has been unique for a long time and is still one of the main reasons why people use
BIRD for large volumes of routing data.

## IPv4 / IPv6 duality: Solved

The original design of BIRD also has some drawbacks. One of these was the idea
of two separate daemons – one BIRD for IPv4 and another for IPv6, built from the same
codebase, cleverly using `#ifdef IPV6` constructions to implement the
common parts of BIRD algorithms and data structures only once.
If IPv6 adoption had progressed as people expected at that time,
this would have worked; after finishing the worldwide transition to IPv6, people could
have simply stopped building BIRD for IPv4 and dropped the `#ifdef`-ed code.

History went the other way, however. BIRD developers therefore decided to *integrate*
these two versions into one daemon capable of handling any address family, allowing
not only for IPv6 but for virtually anything. This rework brought quite a lot of
backward-incompatible changes, therefore we decided to release it as version 2.0.
This work was mostly finished in 2018 and as of March 2021, we have already
switched the 1.6.x branch to a bugfix-only mode.

## BIRD is single-threaded now

The second drawback is the single-threaded design. Looking back to 1998, this was
a good idea. A common PC had one single core and BIRD was targeting exactly
this segment. As the years went by, manufacturers launched multicore x86 chips
(AMD Opteron in 2004, Intel Pentium D in 2005). This ultimately led to a world
where, as of March 2021, virtually no new PC is sold with a single-core CPU.

At the same time, the speed of a single core has not been growing as fast
as the Internet. BIRD is still capable of handling the full BGP table
(868k IPv4 routes in March 2021) with one core; however, when BIRD starts, it may take
long minutes to converge.

## Intermezzo: Filters

In 2018, we took some data we had from large internet exchanges and simulated
a cold start of BIRD as a route server. We used `linux-perf` to find the most time-critical
parts of BIRD and it pointed very clearly to the filtering code. It also showed that the
IPv4 version of BIRD v1.6.x is substantially faster than the *integrated* version, while
the IPv6 version was about as fast as the *integrated* one.

Here we should explain a little bit more about how the filters really work. Let's use
a simple filter as an example:

```
filter foo {
  if net ~ [10.0.0.0/8+] then reject;
  preference = 2 * preference - 41;
  accept;
}
```

This filter gets translated into a tree-shaped (infix) internal structure.

![Example of filter internal representation](00_filter_structure.png)

When executing, the filter interpreter just walks the filter's internal structure recursively in the
right order, executes the instructions, collects their results and finishes by
either rejecting or accepting the route.
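
For illustration, a minimal sketch of such a tree-walking interpreter could look like this; the structure and names are heavily simplified and do not match BIRD's actual filter code:

```c
/* Minimal sketch of a tree-walking (infix) filter interpreter.
 * Structure and names are illustrative only, not BIRD's actual code. */

enum fi_code { FI_CONSTANT, FI_MULTIPLY, FI_SUBTRACT /* ... */ };

struct f_inst {
  enum fi_code code;        /* which instruction this node represents */
  struct f_inst *a1, *a2;   /* operand subtrees (may be NULL) */
  int val;                  /* immediate value for FI_CONSTANT */
};

/* Recursively walk the tree, evaluating the operands first. */
static int
interpret(const struct f_inst *i)
{
  switch (i->code) {
    case FI_CONSTANT: return i->val;
    case FI_MULTIPLY: return interpret(i->a1) * interpret(i->a2);
    case FI_SUBTRACT: return interpret(i->a1) - interpret(i->a2);
    /* ... real filters also handle route attributes, prefix sets, etc. */
  }
  return 0;
}
```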

## Filter rework

Further analysis of the filter code revealed an absurd-looking result. The
most executed parts of the interpreter function were the `push` CPU
instructions at its very beginning and the `pop` CPU instructions at its very
end. This came from the fact that the interpreter function was quite long, yet
most of the filter instructions took an extremely short path through it: it did all the
stack manipulation at the beginning, branched by the filter instruction type,
then executed just several CPU instructions, popped everything back from the
stack and returned.

After some thought on how to minimize stack manipulation when everything you need
is to take two numbers and multiply them, we decided to preprocess the filter's
internal structure into another structure which is much easier to execute. The
interpreter now uses a data stack and behaves generally as a
postfix-ordered language. We also considered Lua, which turned out to mean quite
a lot of work implementing all the glue while achieving about the same performance.
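
A similarly hedged sketch of the postfix approach: the instructions are laid out in a flat array and the interpreter keeps an explicit data stack instead of recursing on the CPU stack.

```c
/* Illustrative sketch of the postfix (linearized) interpreter, reusing the
 * fi_code values from the previous sketch; again not BIRD's real code. */

struct f_line_item {
  enum fi_code code;        /* which instruction */
  int val;                  /* immediate value for FI_CONSTANT */
};

/* Execute a flat array of instructions with an explicit data stack. */
static int
interpret_postfix(const struct f_line_item *items, int len)
{
  int stack[128];
  int sp = 0;

  for (int pc = 0; pc < len; pc++)
    switch (items[pc].code) {
      case FI_CONSTANT:
        stack[sp++] = items[pc].val;          /* push an operand */
        break;
      case FI_MULTIPLY:
        sp--; stack[sp-1] *= stack[sp];       /* pop two, push the product */
        break;
      case FI_SUBTRACT:
        sp--; stack[sp-1] -= stack[sp];       /* pop two, push the difference */
        break;
    }

  return stack[0];                            /* value of the whole expression */
}
```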

After these changes, we managed to reduce the filter execution time by 10–40%,
depending on how complex the filter is.
Still, even this reduction is far too little when there is one CPU core
running for several minutes while the others are sleeping.

## We need more threads

As a side effect of the rework, the new filter interpreter is also completely
thread-safe. It seemed to be the way to go – running the filters in parallel
while keeping everything else single-threaded. The main problem with this
solution is the too fine granularity of the parallel jobs; we would spend lots of
time on synchronization overhead.

Parallel execution of filters alone was also too one-sided, useful only for
configurations with complex filters. In other cases, the major problem is best
route recalculation, OSPF recalculation or kernel synchronization.
It also turned out to be quite dirty from the code cleanliness point of view.

Therefore we chose to make BIRD completely multithreaded. We designed a way
to gradually enable parallel computation and make the best use of all available CPU
cores. We have three goals:

* We want to keep current functionality. Parallel computation should never drop
  a useful feature.
* We want to take small steps. No big reworks, even though even the smallest
  possible step needs quite a lot of refactoring first.
* We want to be backwards compatible as much as possible.

*It's still a long road to the version 2.1. This series of texts should document
what is needed to be changed, why we do it and how. In the next chapter, we're
going to describe the structures for routes and their attributes. Stay tuned!*
# BIRD Journey to Threads. Chapter 1: The Route and its Attributes

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're doing a significant amount of
changes to BIRD's internal structure to make it possible to run in multiple
threads in parallel. This chapter covers the necessary changes to the data
structures that store the routing data.

*If you want to see the changes in code, look (basically) into the
`route-storage-updates` branch. Not all of them are implemented yet; however,
most of them are pretty much finished as of the end of March 2021.*

## How routes are stored

A BIRD routing table is just a hierarchical noSQL database. On the top level, the
routes are keyed by their destination, called *net*. Due to historic reasons,
the *net* is not only an *IPv4 prefix*, *IPv6 prefix*, *IPv4 VPN prefix* etc.,
but may also be an *MPLS label*, *ROA information* or a *BGP Flowspec record*. As there may
be several routes for each *net*, an obligatory part of the key is the *src*, aka the
*route source*. The route source is a tuple of the originating protocol
instance and a 32-bit unsigned integer. If a protocol wants to withdraw a route,
it is both necessary and sufficient to supply the *net* and *src* to identify which route
is to be withdrawn.

The route itself consists of (basically) a list of key-value records, with
value types ranging from a 16-bit unsigned integer for preference to a complex
BGP path structure. The keys are pre-defined by protocols (e.g. BGP path or
OSPF metrics), or by the BIRD core itself (preference, route gateway).
Finally, users can declare their own attribute keys using the keyword
`attribute` in the config.
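
For example, a custom attribute might be declared and then set in a filter roughly like this (the attribute name is made up; see the BIRD user documentation for the exact syntax):

```
attribute int my_metric;      # user-defined attribute key

filter set_my_metric {
  my_metric = 100;            # used like any other route attribute
  accept;
}
```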

## Attribute list implementation

Currently, there are three layers of route attributes. We call them *route*
(*rte*), *attributes* (*rta*) and *extended attributes* (*ea*, *eattr*).

The first layer, *rte*, contains the *net* pointer, several fixed-size route
attributes (mostly preference and protocol-specific metrics), flags, lastmod
time and a pointer to *rta*.

The second layer, *rta*, contains the *src* (a pointer to a singleton instance),
a route gateway, several other fixed-size route attributes and a pointer to
*ea* list.

The third layer, *ea* list, is a variable-length list of key-value attributes,
containing all the remaining route attributes.

The distribution of the route attributes between the attribute layers is somewhat
arbitrary. Mostly, the first and second layers hold attributes that
were thought to be accessed frequently (e.g. in best route selection) and
filled in for most routes, while the third layer is for infrequently used
and/or infrequently accessed route attributes.
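
A heavily simplified sketch of these three layers might look as follows; the field names are illustrative and the real definitions in BIRD's `nest/route.h` contain considerably more:

```c
/* Heavily simplified sketch of the three attribute layers; field names are
 * illustrative, the real structures in BIRD's nest/route.h are larger. */

struct eattr {                  /* layer 3: one key-value attribute */
  unsigned id;                  /* numerical attribute key */
  /* ... type information and the value itself ... */
};

struct ea_list {                /* layer 3: variable-length attribute list */
  struct ea_list *next;
  unsigned count;
  struct eattr attrs[];         /* the key-value records */
};

struct rta {                    /* layer 2: shared attribute block */
  struct rte_src *src;          /* route source (protocol instance + u32) */
  /* ... gateway / nexthop and other fixed-size attributes ... */
  struct ea_list *eattrs;       /* pointer to layer 3 */
};

struct rte {                    /* layer 1: the route itself */
  struct network *net;          /* destination, the primary key */
  unsigned short preference;    /* fixed-size attributes ... */
  unsigned flags;
  long lastmod;                 /* last modification time */
  struct rta *attrs;            /* pointer to layer 2 */
};
```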

## Attribute list deduplication

When protocols originate routes, there are commonly many routes with the
same attribute list. BIRD could ignore this fact; however, if you have several
tables connected with pipes, it is more memory-efficient to store each
attribute list only once.

Therefore, the two lower layers (*rta* and *ea*) are hashed and stored in a
BIRD-global database. Routes (*rte*) contain a pointer to an *rta* in this
database, which maintains a use-count for each *rta*. Attributes (*rta*) contain
a pointer to a normalized (sorted by numerical key ID) *ea*.
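
The lookup then works roughly as in this sketch; the helper names are hypothetical, the real implementation lives in BIRD's `nest/rt-attr.c`:

```c
/* Hedged sketch of attribute-list deduplication; helper names are made up,
 * BIRD's real code lives in nest/rt-attr.c (rta_lookup() and friends). */

struct rta_sk {
  struct rta_sk *hash_next;     /* chain in the global hash table */
  unsigned use_count;           /* how many routes point at this block */
  /* ... the actual route attributes ... */
};

#define RTA_HASH_SIZE 1024
static struct rta_sk *rta_hash_table[RTA_HASH_SIZE];

unsigned rta_hash(const struct rta_sk *a);          /* hash of the contents */
int rta_same(const struct rta_sk *a, const struct rta_sk *b);
struct rta_sk *rta_store(const struct rta_sk *a);   /* copy into global memory */

struct rta_sk *
rta_lookup_sketch(const struct rta_sk *tmpl)
{
  unsigned h = rta_hash(tmpl) % RTA_HASH_SIZE;

  /* Reuse an already stored attribute block with identical contents. */
  for (struct rta_sk *r = rta_hash_table[h]; r; r = r->hash_next)
    if (rta_same(r, tmpl)) {
      r->use_count++;
      return r;
    }

  /* Not found: store a new copy in the global database. */
  struct rta_sk *copy = rta_store(tmpl);
  copy->use_count = 1;
  copy->hash_next = rta_hash_table[h];
  rta_hash_table[h] = copy;
  return copy;
}
```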

## Attribute list rework

The first thing to change is the distribution of route attributes between the
attribute list layers. We decided to keep in the first layer (*rte*) only the key
and other per-record internal technical information. Therefore we move *src* to
*rte* and preference to *rta* (beside other things). *This is already done.*

We also found out that the nexthop (gateway), originally one single IP address
and an interface, has evolved into a complex attribute with several sub-attributes,
covering not only multipath routing but also MPLS stacks and other per-route
attributes. This has led to an overly complex data structure holding the nexthop set.
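
To give an idea of the complexity, a single element of the nexthop set nowadays has to carry roughly this much information (a sketch with made-up types and field names, not BIRD's exact definitions):

```c
/* Rough sketch of one element of today's nexthop set; types and field
 * names are made up here, not BIRD's exact definitions. */
struct nexthop_sk {
  unsigned char gw[16];         /* gateway address (IPv4 or IPv6) */
  int iface_index;              /* outgoing interface */
  struct nexthop_sk *next;      /* next path of a multipath route */
  unsigned char weight;         /* multipath weight */
  unsigned char label_count;    /* number of MPLS labels */
  unsigned labels[];            /* the MPLS label stack */
};
```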

We finally decided to squash *rta* and *ea* into one type of data structure,
allowing for completely dynamic route attribute lists. This is also motivated
by the addition of other *net* types (BGP FlowSpec or ROA) where lots of the fields make
no sense at all, yet we still want to use the same data structures and implementation
as we don't like duplicating code. *Multithreading doesn't depend on this change;
it is going to happen soon anyway.*

## Route storage

The process of route import from protocol into a table can be divided into several phases:

1. (In protocol code.) Create the route itself (typically from
   protocol-internal data) and choose the right channel to use.
2. (In protocol code.) Create the *rta* and *ea* and obtain an appropriate
   hashed pointer. Allocate the *rte* structure and fill it in. 
3. (Optionally.) Store the route to the *import table*.
4. Run filters. If reject, free everything.
5. Check whether this is a real change (it may be idempotent). If not, free everything and do nothing more.
6. Run the best route selection algorithm.
7. Execute exports if needed.

We found out that the *rte* structure allocation is done too early. BIRD uses
global optimized allocators for fixed-size blocks (which *rte* is) to reduce
its memory footprint, therefore the allocation of the *rte* structure would become a
synchronization point in a multithreaded environment.

The common code is also much more complicated when we have to track whether the
current *rte* has to be freed or not. This is more of a problem in export than in
import, as the export filter can also change the route (and therefore allocate
another *rte*). The changed route must therefore be freed after use. All the
route-changing code must also track whether the route is writable or
read-only.

We therefore introduce a variant of *rte* called *rte_storage*. Both of these
hold the same layer-1 route information (destination, author, cached
attribute pointer, flags etc.); however, *rte* is always local while *rte_storage*
is intended to be put in global data structures.
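
Conceptually, *rte_storage* just wraps the same route data for storing in a table, roughly as in this sketch (illustrative only, not the exact definition):

```c
/* Illustrative sketch only: rte_storage wraps the same route data as a
 * local rte, plus whatever the table needs to chain stored routes. */
struct rte_storage {
  struct rte_storage *next;   /* next stored route for the same net */
  struct rte rte;             /* the layer-1 route data itself */
};
```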

This change allows us to remove lots of the code which only tracked whether a given
*rte* was to be freed, as *rte*s are almost always allocated on the stack, naturally
limiting their lifetime. If not on the stack, it's the responsibility of the owner
to free the *rte* after the import is done.

This change also removes the need for *rte* allocation in protocol code, and
*rta* can also be safely allocated on the stack. As a result, protocols can simply
allocate all the data on the stack, call the update routine, and the common code in
BIRD's *nest* does all the storage for them.
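
A hedged sketch of what the import path then looks like from a protocol's point of view; the names and signatures are illustrative, not BIRD's exact internal API:

```c
/* Hedged sketch of the new import path from a protocol's point of view.
 * Assumes BIRD's internal headers; names and signatures are illustrative,
 * not the real API. */
static void
my_proto_announce(struct channel *ch, const net_addr *n, struct rte_src *src)
{
  /* Attribute list on the stack; the nest hashes and stores it later. */
  struct rta a = { 0 /* ... gateway and other attributes ... */ };

  /* The route itself, also on the stack; after the attribute list rework,
   * the src lives here. */
  struct rte e = {
    .src = src,
    .attrs = &a,
  };

  /* The common code copies whatever it needs into rte_storage, so nothing
   * allocated here has to outlive this call. */
  rte_update(ch, n, &e);
}
```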

Allocating *rta* on the stack is, however, not required. BGP and OSPF use this to
import several routes with the same attribute list. In BGP, this is due to the
format of BGP update messages, containing first the attributes and then the
destinations (BGP NLRIs). In OSPF, in addition to *rta* deduplication, it is
also presumed that no import filter (or at most some trivial changes) is applied,
as OSPF would typically not work well when filtered.

*This change is already done.*

## Route cleanup and table maintenance

In some cases, the route update is not originated by protocol/channel code.
When a channel shuts down, all routes originated by that channel are simply
cleaned up. Also routes with recursive nexthops may get changed without an import,
simply by a change of the underlying IGP route.

This is currently done by a `rt_event` (see `nest/rt-table.c` for source code)
which is to be converted to a parallel thread, running when nobody imports any
route. *This change is freshly done in branch `guernsey`.*

## Parallel protocol execution

The long-term goal of these reworks is to allow for completely independent
execution of all the protocols. Typically, there is no direct interaction
between protocols; everything is done through BIRD's *nest*. Protocols should
therefore run in parallel in the future and wait/lock only when something needs
to be done externally.

We also aim for a clean and documented protocol API.

*It's still a long road to the version 2.1. This series of texts should document
what is needed to be changed, why we do it and how. In the next chapter, we're
going to describe how the route is exported from table to protocols and how this
process is changing. Stay tuned!*