Commit 1378fa94 authored by Christophe Favergeon

Improved compute graph documentation

More details about the new dynamic mode.
parent fc38d7d2
+46 −0
# Cyclo static scheduling

Cyclo-static scheduling was added in version `1.7.0` of the Python wrapper and is available with CMSIS-DSP version >= `1.12`.

## What problem is it trying to solve?

Let's consider a sample rate converter from 48 kHz to 44.1 kHz.

For each input sample, on average it produces 44.1 / 48 = 0.91875 samples.

There are two ways to do this:

- One can observe that 48000/44100 = 160/147. So each time 160 samples are consumed, 147 samples are produced.
- The number of samples produced can vary from one execution of the node to the next so that, on average, 0.91875 samples are generated per input sample.

In the first case, it is synchronous but you need to wait for 160 input samples before being able to do some processing. This introduces latency and, depending on the sample rate and use case, this latency may be too big. We need more flexibility.

In the second case, we have the flexibility but it is no longer synchronous because the resampler is not producing the same number of samples at each execution.

But we can observe that even if it is no longer stationary, it is periodic. After consuming 160 samples, the behavior repeats.

One can use the resampler in the [SpeexDSP](https://gitlab.xiph.org/xiph/speexdsp) project to test. If we decide to consume only 40 input samples per execution to reduce latency, then the SpeexDSP resampler will produce 37, 37, 37 and 36 samples for the first 4 executions.

And (40+40+40+40)/(37+37+37+36) = 160 / 147.

So the flow of data on the output is not static but it is periodic.
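The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: `periodic_output_counts` is a hypothetical helper, and the rounding rule (rounding the cumulative output count up) is an assumption chosen because it reproduces the pattern quoted above. It is not claimed to be how SpeexDSP computes its output sizes.

```python
from fractions import Fraction
from math import ceil, gcd


def periodic_output_counts(in_rate, out_rate, block):
    """Output sample counts over one period when `block` inputs are consumed per execution."""
    ratio = Fraction(out_rate, in_rate)  # 44100/48000 reduces to 147/160
    # Number of executions before the consumed input is a multiple of the denominator.
    executions = ratio.denominator // gcd(block, ratio.denominator)
    counts = []
    for k in range(executions):
        # Assumption: the cumulative number of produced samples is rounded up.
        before = ceil(k * block * ratio)
        after = ceil((k + 1) * block * ratio)
        counts.append(after - before)
    return counts


counts = periodic_output_counts(48000, 44100, 40)
print(counts, sum(counts))
```

Under these assumptions, the period has 4 executions and the counts sum to 147 for 160 consumed samples.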

This is now supported in the CMSIS-DSP compute graph and on each IO one can define a period. For this example, it could be:

```python
b=Sampler("sampler",floatType,40,[37,37,37,36])
```

Note that in the C++ class, the template parameters giving the number of samples produced or consumed on an IO can no longer be used in this case. The value is still generated but now represents the maximum over a period.

And, in the `run` function, you need to pass the number of samples read or written to the read and write buffer functions:

```c
this->getWriteBuffer(nbOfSamplesForCurrentExecution)
```

For synchronous nodes, nothing changes: they are coded as in previous versions.
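As a rough illustration of what a cyclo-static IO has to track, here is a small Python model. It is not the C++ API: the class `CycloStaticOutput` and its method `next_count` are hypothetical names, and the model only shows the number of samples that a `run` function would request at each execution, together with the maximum over the period.

```python
class CycloStaticOutput:
    """Toy model of an output IO that follows a periodic production pattern."""

    def __init__(self, period):
        self.period = list(period)
        # Plays the role of the generated template value: the maximum over a period.
        self.max_samples = max(self.period)
        self.step = 0

    def next_count(self):
        """Number of samples to request from the write buffer for this execution."""
        nb = self.period[self.step]
        self.step = (self.step + 1) % len(self.period)
        return nb


out = CycloStaticOutput([37, 37, 37, 36])
requested = [out.next_count() for _ in range(8)]
print(requested, out.max_samples)
```

The value returned by `next_count` corresponds to the argument a node would pass to `getWriteBuffer` for the current execution.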

The drawback of cyclo-static scheduling is that the schedule length increases. Take the first example, with a source producing 5 samples and a node consuming 7 samples. If the source is replaced by another source producing [5,5], the result is not equivalent. In the first case, we can have only one execution of the source. In the second case, the schedule will always contain an even number of executions of the source, so the schedule will be longer. Memory usage will be the same (FIFOs of the same size).
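The effect on the schedule length can be estimated with a short computation. The sketch below assumes that the balance equation is solved on whole cycles (one cycle being one full period of a node) and that each cycle then expands into as many executions as the period has steps. The function `executions` is a hypothetical helper for a single edge, not part of the compute graph tools.

```python
from math import gcd


def executions(produced, consumed):
    """Executions of (producer, consumer) for one edge.

    `produced` and `consumed` list the samples handled at each step of a period.
    """
    per_producer_cycle = sum(produced)
    per_consumer_cycle = sum(consumed)
    g = gcd(per_producer_cycle, per_consumer_cycle)
    producer_cycles = per_consumer_cycle // g
    consumer_cycles = per_producer_cycle // g
    # Each cycle expands into one execution per step of the period.
    return producer_cycles * len(produced), consumer_cycles * len(consumed)


print(executions([5], [7]))     # synchronous source
print(executions([5, 5], [7]))  # cyclo-static source with a period of 2
```

Under these assumptions, the source producing [5,5] doubles the number of executions of both nodes compared with the source producing 5.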

Since schedules tend to be bigger with cyclo-static scheduling, a new code generation mode has been introduced and is enabled by default: instead of a sequence of function calls, the schedule is now encoded as an array of numbers, and a switch / case selects the function to be called.
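The idea of a schedule stored as an array can be sketched in Python. The generated code is C++ and is not reproduced here. The schedule below is a hypothetical one, written by hand for a source producing 5 samples, a filter consuming 7 and (as an assumption) producing 5, and a sink consuming 5.

```python
SRC, FILTER, SINK = 0, 1, 2

# Hypothetical schedule stored as an array of node identifiers.
SCHEDULE = [SRC, SRC, FILTER, SINK, SRC, FILTER, SINK, SRC, SRC, FILTER, SINK,
            SRC, FILTER, SINK, SRC, FILTER, SINK]


def run_schedule(schedule):
    fifo_in = fifo_out = 0  # levels of the src->filter and filter->sink FIFOs
    runs = {SRC: 0, FILTER: 0, SINK: 0}
    for node_id in schedule:
        # This chain plays the role of the switch / case in the generated code.
        if node_id == SRC:
            fifo_in += 5
        elif node_id == FILTER:
            assert fifo_in >= 7, "FIFO underflow"
            fifo_in -= 7
            fifo_out += 5
        elif node_id == SINK:
            assert fifo_out >= 5, "FIFO underflow"
            fifo_out -= 5
        else:
            raise ValueError(f"unknown node id {node_id}")
        runs[node_id] += 1
    return runs, fifo_in, fifo_out


runs, fifo_in, fifo_out = run_schedule(SCHEDULE)
print(runs, fifo_in, fifo_out)
```

The array grows with the schedule, but the dispatch code stays the same size, which is the point of this code generation mode.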
 No newline at end of file
+147 −0
# Dynamic Data Flow

Versions of the compute graph corresponding to CMSIS-DSP version >= `1.14.3` and Python wrapper version >= `1.10.0` support a new dynamic / asynchronous mode.

With a dynamic flow, the flow of data can change at each execution. The IOs can generate or consume a different amount of data at each execution of their node (including no data).

This can be useful for sample oriented use cases where not all samples are available but a processing must nevertheless take place each time a subset of samples is available (samples could come from sensors).

With a dynamic flow and scheduling, there is no longer any way to ensure that there won't be FIFO underflow or overflow due to scheduling. As a consequence, the nodes must be able to check for this problem and decide what to do.

* A sink may decide to generate fake data in case of FIFO underflow
* A source may decide to skip some data in case of FIFO overflow
* Another node may decide to do nothing and skip the execution
* Another node may decide to raise an error.

With dynamic scheduling, a node must implement the function `prepareForRunning` and decide what to do.

Three error / status codes are reserved for this. They are defined in the header `cg_status.h`. This header is not included by default, but if you define your own error codes, they should be consistent with `cg_status.h` and use the same values for the three status / error codes used in dynamic mode:

* `CG_SUCCESS`  = 0 : Node can execute
* `CG_SKIP_EXECUTION` = -5 : Node will skip the execution
* `CG_BUFFER_ERROR` = -6 : Unrecoverable error due to FIFO underflow / overflow (only raised by pure functions, like the CMSIS-DSP ones, that are called directly)

Any other returned value will stop the execution.
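The way a scheduler could react to these codes can be sketched in Python. Only the three numeric values come from the text above. The function `run_once` and the tuple layout of `nodes` are hypothetical and do not describe the generated C++ scheduler.

```python
CG_SUCCESS = 0
CG_SKIP_EXECUTION = -5
CG_BUFFER_ERROR = -6


def run_once(nodes):
    """Attempt each node once. `nodes` holds (name, prepare, run) tuples."""
    executed = []
    for name, prepare, run in nodes:
        status = prepare()
        if status == CG_SKIP_EXECUTION:
            continue  # this node is skipped, the schedule goes on
        if status != CG_SUCCESS:
            return status, executed  # any other value stops the execution
        run()
        executed.append(name)
    return CG_SUCCESS, executed


nodes = [
    ("src", lambda: CG_SUCCESS, lambda: None),
    ("filter", lambda: CG_SKIP_EXECUTION, lambda: None),
    ("sink", lambda: CG_BUFFER_ERROR, lambda: None),
    ("never_reached", lambda: CG_SUCCESS, lambda: None),
]
status, executed = run_once(nodes)
print(status, executed)
```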

The dynamic mode (also named asynchronous) is enabled with the option `asynchronous`.

The system will still compute a scheduling and FIFO sizes as if the flow was static. We can see the static flow as an average of the dynamic flow. In dynamic mode, the FIFOs may need to be bigger than the ones computed in static mode.  The static estimation is giving a first idea of what the size of the FIFOs should be. The size can be increased by specifying a percent increase with option `FIFOIncrease`.
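As a rough idea of what a percent increase means for a FIFO size, here is a small computation. The helper `increased_fifo_size` is hypothetical, and rounding up to a whole number of samples is an assumption. The exact rounding applied by the tool for `FIFOIncrease` is not stated here.

```python
def increased_fifo_size(static_size, percent):
    """Grow a statically computed FIFO size by `percent`, rounding up (assumed)."""
    return (static_size * (100 + percent) + 99) // 100


for size in (5, 10, 11):
    print(size, increased_fifo_size(size, 10))
```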

For pure compute functions (like CMSIS-DSP ones), which are not packaged into a C++ class, there is no way to customize the decision logic in case of a problem with a FIFO. Instead, there is a global option: `asyncDefaultSkip`.

When `true`, a pure function that cannot run will just skip the execution. With `false`, the execution will stop. For any other decision algorithm, the pure function needs to be packaged in a C++ class.

`Duplicate` nodes are skipping the execution in case of problems with FIFOs. If it is not the wanted behavior, you can either:

* Replace the Duplicate class by a custom one by changing the class name with option `duplicateNodeClassName` on the graph.
* Don't use the automatic duplication feature and introduce your duplicate nodes in the compute graph

When you don't want to generate or consume data in a node, just don't call the functions `getReadBuffer` or `getWriteBuffer` for your IOs.

## prepareForRunning

The method `prepareForRunning` is needed to check if the node execution is going to be possible.

Inside this function, you have access to methods like:

* `willOverflow`
* `willUnderflow`

In case of several IOs, you may also have:

* `willOverflow1`
* `willOverflow2` 

etc ...

The functions have an interface like:

```C++
bool willOverflow(int nb = outputSize)
```

or

```C++
bool willUnderflow(int nb = inputSize)
```

The `inputSize` and `outputSize` are coming from the template arguments. So, by default the node is using the parameters of the static compute graph.

You may want to read or write more or less than what is defined in the static compute graph. But it must be coherent with the `run` function.

If you use `willOverflow(4)` to check if you can write `4` samples in the output in the `prepareForRunning` function, then in the `run` function you must access the write buffer by requesting `4` samples with `getWriteBuffer(4)`.

If you don't want to write or read on an IO, just don't use the functions `getWriteBuffer` and `getReadBuffer` in the `run` function.

It is also possible to use the functions `willOverflow`, `willUnderflow` in the `run` function. It can be used to avoid calling the `getReadBuffer` and `getWriteBuffer` when you nevertheless want to run the node although some FIFOs cannot be used.

**WARNING**: You are responsible for checking if a FIFO is going to underflow or overflow **before** using `getReadBuffer` or `getWriteBuffer`.

If `getReadBuffer` or `getWriteBuffer` causes an underflow or overflow of the FIFO, memory will be corrupted and the compute graph will no longer work.
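The contract between `prepareForRunning` and `run` can be modelled in Python. This is a toy model, not the C++ classes: `Fifo`, `Node` and their methods are hypothetical names that mirror `willOverflow`, `willUnderflow`, `getReadBuffer` and `getWriteBuffer`, and the FIFO only counts samples.

```python
CG_SUCCESS = 0
CG_SKIP_EXECUTION = -5


class Fifo:
    """Toy FIFO that only tracks how many samples it holds."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.count = 0

    def will_overflow(self, nb):
        return self.count + nb > self.capacity

    def will_underflow(self, nb):
        return self.count < nb

    def write(self, nb):  # stands in for getWriteBuffer(nb)
        assert not self.will_overflow(nb), "would corrupt memory in a real FIFO"
        self.count += nb

    def read(self, nb):  # stands in for getReadBuffer(nb)
        assert not self.will_underflow(nb), "would corrupt memory in a real FIFO"
        self.count -= nb


class Node:
    """Reads `nb_in` samples and writes `nb_out` samples per execution."""

    def __init__(self, src, dst, nb_in, nb_out):
        self.src, self.dst, self.nb_in, self.nb_out = src, dst, nb_in, nb_out

    def prepare_for_running(self):
        if self.src.will_underflow(self.nb_in) or self.dst.will_overflow(self.nb_out):
            return CG_SKIP_EXECUTION
        return CG_SUCCESS

    def run(self):
        # Same sample counts as the ones checked in prepare_for_running.
        self.src.read(self.nb_in)
        self.dst.write(self.nb_out)
        return CG_SUCCESS


src_fifo, dst_fifo = Fifo(8), Fifo(4)
node = Node(src_fifo, dst_fifo, nb_in=4, nb_out=4)
first = node.prepare_for_running()   # input FIFO is empty
src_fifo.write(4)
second = node.prepare_for_running()  # enough input and room in the output
if second == CG_SUCCESS:
    node.run()
third = node.prepare_for_running()   # input is empty again and output is full
print(first, second, third)
```

The assertions in `read` and `write` stand for the memory corruption that the real FIFOs would suffer, since those do not check anything themselves.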

## Graph constraints

The dynamic / asynchronous mode uses a synchronous graph as the average / ideal case. But it is important to understand that we are no longer in static / synchronous mode, and some static graphs may be too complex for the dynamic mode. Let's take the following graph as an example:

![async_topological2](documentation/async_topological2.png)

The generated schedule is:

```
src
src
src
src
src
filter
sink
sink
sink
sink
sink
```

If we use a strategy of skipping the execution of a node in case of overflow / underflow, what will happen is:

* Schedule execution 1
  * First `src` node execution is successful since there is a sample
  * All other execution attempts will be skipped 
* Schedule execution 2
  * First `src` node execution is successful since there is a sample
  * All other execution attempts will be skipped
* ...
* Schedule execution 5:
  * First `src` node execution is successful since there is a sample
  * 4 other `src` node executions are skipped
  * The `filter` execution can finally take place since enough data has been generated



In summary, in asynchronous mode it is useless to attempt to run the same node several times in the same scheduling iteration unless we are sure there will always be enough data. In the previous example, only the first attempt at running `src` does something. The other attempts are always skipped.
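The behavior described above can be reproduced with a small simulation. It is a sketch under stated assumptions: one new sample becomes available per schedule execution, `src` writes 1 sample, `filter` reads 5 and writes 5, `sink` reads 1, both FIFOs hold 5 samples, and every node skips its execution when it cannot run. The function `simulate` is a hypothetical helper.

```python
def simulate(iterations=5, fifo_size=5):
    schedule = ["src"] * 5 + ["filter"] + ["sink"] * 5
    fifo_in = fifo_out = 0  # levels of the src->filter and filter->sink FIFOs
    attempts = runs = 0
    filter_iterations = []
    for iteration in range(1, iterations + 1):
        available = 1  # assumption: one new sample per schedule execution
        for node in schedule:
            attempts += 1
            if node == "src" and available >= 1 and fifo_in < fifo_size:
                available -= 1
                fifo_in += 1
            elif node == "filter" and fifo_in >= 5 and fifo_out + 5 <= fifo_size:
                fifo_in -= 5
                fifo_out += 5
                filter_iterations.append(iteration)
            elif node == "sink" and fifo_out >= 1:
                fifo_out -= 1
            else:
                continue  # skipped: not enough data or not enough room
            runs += 1
    return attempts, runs, filter_iterations


attempts, runs, filter_iterations = simulate()
print(attempts, runs, filter_iterations)
```

Most attempts in this simulation are skipped, which is the overhead the text warns about.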



Instead, one could try the following graph:

![async_topological1](documentation/async_topological1.png)

With this graph, each node execution will be attempted only once during an execution.

But the `filter` needs 5 samples, so we need to increase the size of the FIFOs from `1` to `5` or the `filter` node will never be executed. 

It is possible with the option `FIFOIncrease` but it is better to make it explicit with the following graph:

![async_topological3](documentation/async_topological3.png)

In this case, the FIFO is big enough. `src` node will be executed each time there is a sample. `filter` will execute only when 5 samples have been accumulated in the FIFO. Each node execution is only attempted once during a schedule.



As a consequence, the recommendation in dynamic / asynchronous mode is to:

* Ensure that the amount of data produced and consumed on each FIFO end is the same (so that each node execution is attempted only once during a schedule)
* Use the maximum amount of samples required on both ends of the FIFO
  * Here `src` is generating at most `1` sample, `filter` needs 5. So we use `5` on both ends of the FIFO
* More complex graphs will create a useless overhead in dynamic / asynchronous mode
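For comparison, the recommended structure can be simulated in the same spirit. The sketch assumes that each node is attempted once per schedule, that `src` writes the single sample it has even though `5` is declared on its output, that `filter` needs 5 samples, and that `sink` consumes 5 samples at once (an assumption, since the graph is only shown as an image). `simulate_recommended` is a hypothetical helper.

```python
def simulate_recommended(iterations=5, fifo_size=5):
    fifo_in = fifo_out = 0  # levels of the src->filter and filter->sink FIFOs
    attempts = runs = 0
    for _ in range(iterations):
        # src: one sample per schedule execution, written if there is room.
        attempts += 1
        if fifo_in + 1 <= fifo_size:
            fifo_in += 1
            runs += 1
        # filter: runs only once 5 samples have been accumulated.
        attempts += 1
        if fifo_in >= 5 and fifo_out + 5 <= fifo_size:
            fifo_in -= 5
            fifo_out += 5
            runs += 1
        # sink: assumed to consume 5 samples at once.
        attempts += 1
        if fifo_out >= 5:
            fifo_out -= 5
            runs += 1
    return attempts, runs


print(simulate_recommended())
```

With these assumptions, far fewer execution attempts are wasted than with a schedule that repeats `src` and `sink` five times.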
+16 −8
@@ -21,15 +21,13 @@ The following matrix `M` is created from the previous graph. The first column re

![math-matrix1](documentation/math-matrix1.png)

The first row thus mean that an execution of the filter is consuming 7 samples on the first edge and execution of the source is producing 5 samples. The sink is not connected to the first edge so the value is 0.


The first row means that an execution of the filter is consuming 7 samples on the first edge and execution of the source is producing 5 samples. The sink is not connected to the first edge so the value is 0.

If a node is run `nb` times then the matrix can be used to compute the state of the edges after this execution.

A vector `s` can be used to represent how many times each node is executed. Then `M.s` is the amount of data produced / consumed on each edge.

If `f` is the state of the edges (amount of data on each edge) then we have:
If `f` is the state of the edges (amount of data on each edge) then, after execution of the nodes as described with `s`, we have:

`f' = M . s + f`

@@ -39,11 +37,21 @@ If we want to find a scheduling of this graph allowing to stream samples from th

`M . s = 0`

The theory is showing that if the graph is schedulable, the space of solution has dimension 1. So we can find a solution with minimal integer values for the coefficients by just scaling any solution.
The theory shows that if the graph is schedulable, the space of solutions has dimension 1. So we can find a solution with minimal integer values for the coefficients by:

* Converting the solution (which may be rational) to integers
* Using the greatest common divisor to find the smallest solution

In the above example, we find the scheduling vector : `s={5,5,7}`
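The two steps can be sketched in Python with exact rational arithmetic. The helper `smallest_integer_solution` is hypothetical. The matrix below is an assumption as well: columns are taken in the order (filter, sink, source), consumption is written as a negative number, and the second edge is assumed to carry 5 samples per execution on both ends.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def smallest_integer_solution(rational):
    """Scale a rational solution of M . s = 0 to the smallest positive integer one."""
    # Multiply by the least common multiple of the denominators to get integers.
    lcm_den = reduce(lambda a, b: a * b // gcd(a, b),
                     (x.denominator for x in rational), 1)
    integers = [int(x * lcm_den) for x in rational]
    # Divide by the greatest common divisor to get the smallest solution.
    g = reduce(gcd, integers)
    return [i // g for i in integers]


# Assumed matrix: rows are edges, columns are (filter, sink, source).
M = [[-7, 0, 5],
     [5, -5, 0]]
s = smallest_integer_solution([Fraction(10, 7), Fraction(10, 7), Fraction(2)])
edge_balance = [sum(m * x for m, x in zip(row, s)) for row in M]
print(s, edge_balance)
```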

Once we know how many time each node must be executed, we can try to find a schedule minimizing the memory usage. The algorithm computes a topological sort of the graph and starts from the sinks. A node is scheduled if it has enough data on its edges : a normalized measure is being used on each edge. The amount of data is not directly used but it is normalized by the amount of data read or produced by the node in a given execution. The idea is to run the node as soon as enough data is available to make the execution of the node possible.
Once we know how many times each node must be executed, we can try to find a schedule minimizing the memory usage. The algorithm computes a topological sort of the graph and starts from the sinks. A node is scheduled if it has enough data on its edges: a normalized measure is used on each edge. The amount of data is not used directly but is normalized by the amount of data read or produced by the node in a given execution. The idea is to run the node as soon as enough data is available to make the execution of the node possible.

For instance, the 2 following cases are equivalent for the algorithm:

* A FIFO containing 128 samples and connected to a node consuming 128 samples
* A FIFO containing 1 sample and connected to a node consuming 1 sample

The algorithm is considering those 2 FIFOs as filled in the same way.
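A possible reading of this normalized measure is the ratio between the samples present in the FIFO and the samples handled by the connected node in one execution. The helper `normalized_fill` is a hypothetical name, and the exact formula used by the scheduler is not given here, so this is only an illustration.

```python
from fractions import Fraction


def normalized_fill(samples_in_fifo, samples_per_execution):
    """Fill level expressed in number of possible executions of the connected node."""
    return Fraction(samples_in_fifo, samples_per_execution)


print(normalized_fill(128, 128), normalized_fill(1, 1), normalized_fill(64, 128))
```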

The graph is structured in layers : nodes are in the same layer if their distance to the sinks is the same.

@@ -67,5 +75,5 @@ So we can reuse the previous theory if we assume that each node execution is in

Once we have computed the matrix and the scheduling solution, the details of the schedule are computed using a different granularity: the cycles are no longer considered as a whole; instead, each execution step inside each cycle is used.

As consequence, the effect of the cyclo static scheduling is just to increase the length of the final scheduling sequence since each node will have to be executed a number of times which is constrained by the least common multiples of the period of the connected nodes.
As a consequence, the effect of cyclo-static scheduling is just to increase the length of the final scheduling sequence, since each node has to be executed a number of times that is constrained by the least common multiple of the periods of the connected nodes.
+71 −147

File changed.

Preview size limit exceeded, changes collapsed.

+4.76 KiB