Commit 4806c7f0 authored by Christophe Favergeon, committed by GitHub

Streamdoc (#94)

Reworked the documentation and examples for the compute graph.
parent d7e4dea5
+9 −9
# Dynamic Data Flow

This feature is illustrated in the [Example 10 : The dynamic dataflow mode](examples/example10/README.md)

Versions of the compute graph corresponding to CMSIS-DSP version >= `1.14.3` and Python wrapper version >= `1.10.0` support a new dynamic / asynchronous mode.

 With a dynamic flow, the flow of data is potentially changing at each execution. The IOs can generate or consume a different amount of data at each execution of their node (including no data).
@@ -13,7 +15,7 @@ With a dynamic flow and scheduling, there is no more any way to ensure that ther
* Another node may decide to do nothing and skip the execution
* Another node may decide to raise an error.

With dynamic scheduling, a node must implement the function `prepareForRunning` and decide what to do.
With dynamic flow, a node must implement the function `prepareForRunning` and decide what to do.

Three error / status codes are reserved for this. They are defined in the header `cg_status.h`. This header is not included by default, but if you define your own error codes, they should be consistent with `cg_status` and use the same values for the three status / error codes which are used in dynamic mode:

@@ -23,9 +25,9 @@ With dynamic scheduling, a node must implement the function `prepareForRunning`

Any other returned value will stop the execution.
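
The decision a node has to take before running can be sketched as below. This is only an illustration of the logic, not the library code: `Decision`, `Policy` and `prepareDecision` are hypothetical names, and the real status values are the ones defined in `cg_status.h` (they are not reproduced here). A real node would take this decision inside its `prepareForRunning` function.

```c++
// Hypothetical names for illustration only.
// The real status / error codes are defined in cg_status.h.
enum class Decision { Run, Skip, Error };

// What the node wants to do when a FIFO problem is detected.
enum class Policy { SkipOnProblem, ErrorOnProblem };

// True when the input FIFO does not contain enough samples.
constexpr bool wouldUnderflow(int samplesAvailable, int samplesNeeded)
{
    return samplesAvailable < samplesNeeded;
}

// True when the output FIFO has not enough free slots.
constexpr bool wouldOverflow(int freeSlots, int samplesProduced)
{
    return freeSlots < samplesProduced;
}

// Decide whether the node runs, is skipped, or raises an error.
constexpr Decision prepareDecision(int samplesAvailable, int samplesNeeded,
                                   int freeSlots, int samplesProduced,
                                   Policy policy)
{
    return (!wouldUnderflow(samplesAvailable, samplesNeeded) &&
            !wouldOverflow(freeSlots, samplesProduced))
               ? Decision::Run
               : (policy == Policy::SkipOnProblem ? Decision::Skip
                                                  : Decision::Error);
}
```

With this sketch, a node needing 5 samples when only 1 is available is skipped under `Policy::SkipOnProblem` and reports an error under `Policy::ErrorOnProblem`.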

The dynamic mode (also named asynchronous), is enabled with option : `asynchronous`
The dynamic mode (also named asynchronous), is enabled with option : `asynchronous` of the configuration object used with the scheduling functions.

The system will still compute a scheduling and FIFO sizes as if the flow was static. We can see the static flow as an average of the dynamic flow. In dynamic mode, the FIFOs may need to be bigger than the ones computed in static mode.  The static estimation is giving a first idea of what the size of the FIFOs should be. The size can be increased by specifying a percent increase with option `FIFOIncrease`.
The system will still compute a synchronous scheduling and FIFO sizes as if the flow was static. We can see the static flow as an average of the dynamic flow. In dynamic mode, the FIFOs may need to be bigger than the ones computed in static mode.  The static estimation is giving a first idea of what the size of the FIFOs should be. The size can be increased by specifying a percent increase with option `FIFOIncrease`.
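
As a rough idea of the arithmetic behind a percent increase, here is a small sketch. The function name `increasedFifoSize` and the rounding (up to the next integer) are assumptions made for the example. The rounding really applied by the Python scripts for `FIFOIncrease` is not described here.

```c++
// Hypothetical helper: size of a FIFO after a percent increase.
// Assumption: the result is rounded up to the next integer.
constexpr int increasedFifoSize(int staticSize, int percentIncrease)
{
    // Integer ceiling of staticSize * (100 + percentIncrease) / 100
    return (staticSize * (100 + percentIncrease) + 99) / 100;
}
```

For instance, with a static estimation of 128 samples and an increase of 10 percent, this sketch gives 141 samples.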

For pure compute functions (like the CMSIS-DSP ones), which are not packaged into a C++ class, there is no way to customize the decision logic in case of a problem with a FIFO. There is a global option : `asyncDefaultSkip`.

@@ -82,7 +84,7 @@ If the `getReadBuffer` and `getWriteBuffer` are causing an underflow or overflow

## Graph constraints

The dynamic / asynchronous mode is using a synchronous graph as average / ideal case. But it is important to understand that we are no more in static / synchronous mode and some static graph may be too complex for the dynamic mode. Let's take the following graph as example:
The dynamic mode is using a synchronous graph as the average / ideal case. But it is important to understand that we are no longer in static / synchronous mode and some static graphs may be too complex for the dynamic mode. Let's take the following graph as an example:

![async_topological2](documentation/async_topological2.png)

@@ -104,14 +106,14 @@ sink

If we use a strategy of skipping the execution of a node in case of overflow / underflow, what will happen is:

* Schedule execution 1
* Schedule iteration  1
  * First `src` node execution is successful since there is a sample
  * All other execution attempts will be skipped 
* Schedule execution 2
* Schedule iteration  2
  * First `src` node execution is successful since there is a sample
  * All other execution attempts will be skipped
* ...
* Schedule execution 5:
* Schedule iteration  5:
  * First `src` node execution is successful since there is a sample
  * 4 other `src` node executions are skipped
  * The `filter` execution can finally take place since enough data has been generated
@@ -143,5 +145,3 @@ As consequence, the recommendation in dynamic / asynchronous mode is to:
* Ensure that the amount of data produced and consumed on each FIFO end is the same (so that each node execution is attempted only once during a schedule)
* Use the maximum amount of samples required on both ends of the FIFO
  * Here `src` is generating at most `1` sample, `filter` needs 5. So we use `5` on both ends of the FIFO
* More complex graphs will create a useless overhead in dynamic / asynchronous mode
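
The overhead of the skipping strategy can be illustrated with a small simulation. It is a sketch under assumptions: `iterationsUntilFilterRuns` is a hypothetical function, the schedule is assumed to attempt `src` several times and then `filter` once per iteration, and each successful `src` execution is assumed to move one sample into the FIFO.

```c++
// Hypothetical simulation of the skipping strategy.
// Returns the schedule iteration at which `filter` can run for the
// first time, or -1 when the arguments would never let it run.
constexpr int iterationsUntilFilterRuns(int srcAttemptsPerIteration,
                                        int filterNeeds,
                                        int newSamplesPerIteration)
{
    if (srcAttemptsPerIteration <= 0 || newSamplesPerIteration <= 0)
    {
        return -1;
    }
    int fifo = 0;
    int iteration = 0;
    while (true)
    {
        ++iteration;
        int available = newSamplesPerIteration;
        for (int attempt = 0; attempt < srcAttemptsPerIteration; ++attempt)
        {
            if (available > 0)
            {
                // src runs: one sample goes into the FIFO
                --available;
                ++fifo;
            }
            // otherwise the src execution is skipped
        }
        if (fifo >= filterNeeds)
        {
            // filter has enough data and can finally run
            return iteration;
        }
        // otherwise the filter execution is skipped
    }
}
```

With 5 `src` attempts per iteration, a `filter` needing 5 samples and 1 new sample per iteration, the function returns 5: four out of five `src` attempts are wasted in each iteration.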
+2 −0
# Cyclo static scheduling

This feature is illustrated in the  [cyclo](examples/cyclo/README.md) example.

Beginning with the version `1.7.0` of the Python wrapper and version >= `1.12` of CMSIS-DSP, cyclo static scheduling has been added.

## What is the problem it is trying to solve ?
+128 −60
@@ -20,8 +20,8 @@ The read buffer and write buffers used to interact with a FIFO have the alignmen

If the number of samples read is `NR` and the number of samples written is `NW`, the alignments (in number of samples) may be:

* `r0 . NR` (where `r0` is an integer with `r0 >= 0`)
* `w . NW - r1 . NR` (where `r1 ` and `w` are integers with `r1 >= 0` and `w >= 0`)
* `r0 . NR` for a read buffer in the FIFO (where `r0` is an integer with `r0 >= 0`)
* `w . NW - r1 . NR` for a write buffer in the FIFO (where `r1 ` and `w` are integers with `r1 >= 0` and `w >= 0`)

If you need a stronger alignment, you'll need to choose `NR` and `NW` in the right way.

@@ -29,7 +29,7 @@ For instance, if you need an alignment on a multiple of `16` bytes with a buffer

If you can't freely choose the values of `NR` and `NW` then you may need to do a copy inside your component to align the buffer (of course only if the overhead due to the lack of alignment is bigger than the cost of doing a copy).
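
A sufficient condition for the alignment can be checked at build time. The sketch below assumes that the memory of the FIFO itself starts on an aligned address. `offsetsStayAligned` is a hypothetical helper, not part of the compute graph API. If `NR` and `NW` samples both cover a multiple of the requested alignment, then every offset of the form `r0 . NR` or `w . NW - r1 . NR` is also a multiple of this alignment.

```c++
#include <cstddef>

// Hypothetical helper: true when all read and write offsets in the FIFO
// stay on a multiple of alignBytes.
// Assumption: the FIFO memory itself starts on an aligned address.
constexpr bool offsetsStayAligned(std::size_t sampleSizeInBytes,
                                  std::size_t NR,
                                  std::size_t NW,
                                  std::size_t alignBytes)
{
    return ((NR * sampleSizeInBytes) % alignBytes == 0) &&
           ((NW * sampleSizeInBytes) % alignBytes == 0);
}
```

With samples of 4 bytes and an alignment of 16 bytes, `NR = 4` and `NW = 8` satisfy the condition but `NR = 3` does not.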

## Memory sharing
## Memory sharing example

When `memoryOptimization` is enabled, the memory may be reused for different FIFOs to minimize the memory usage. But the scheduling algorithm is not trying to optimize this. So depending on how the graph was scheduled, the level of sharing may be different.

@@ -42,21 +42,23 @@ If you share memory, you are using reference semantic and it should be hidden fr
One could define an audio buffer data type :

```c++
template<int nbSamples,int refCount>
template<int nbSamples,
         int refCount>
struct SharedAudioBuf
{
 float32_t *buf;
 static int getNbSamples() {return nbSamples;};
};

template<int nbSamples,int refCount>
template<int nbSamples,
         int refCount>
using SharedBuf = struct SharedAudioBuf<nbSamples,refCount>;

```

The template tracks the number of samples and the reference count.
The template tracks the number of samples and the reference count statically. `refCount` is not a value of the struct. It is a template argument : a number at type level.

The FIFO are no more containing the float samples but only the shared buffers.
The FIFOs no longer contain the audio samples but only pointers to shared buffers of samples.

In this example, instead of having a length of 128 `float` samples, a FIFO would have a length of one `SharedBuf<128,r>` sample.

@@ -64,7 +66,7 @@ An example of compute graph could be:

![shared_buffer](documentation/shared_buffer.png)

The copy of a `SharedBuf<NB,REF>` is copying a pointer to a buffer and not the buffer. It is reference semantic and the buffer should not be modified if the ref count if > 1.
A copy of the struct `SharedBuf<NB,REF>` is copying a pointer to a buffer and not the buffer. It is reference semantics and the buffer should not be modified if the ref count is > 1.

In the above graph, there is a processing node doing in-place modification of the buffer and it could have a template specialization defined as:

@@ -84,7 +86,7 @@ public GenericNode<SharedBuf<NB,1>,1,
The meaning is:

* The input and output FIFOs have a length of 1 sample
* The sample has a type `SharedBuf<NB,1>`
* The sample has a type `SharedBuf<NB,1>` for both input and output
* The reference count is statically known to be 1 so it is safe to do in place modifications of the buffer and the output buffer is a pointer to the input one
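
Because the reference count is part of the type, the "safe to modify in place" property can be checked by the compiler. The sketch below is an illustration only: `CanModifyInPlace` is a hypothetical trait, and `float` is used instead of `float32_t` to keep the example self-contained.

```c++
// Same idea as the shared buffer above, with float instead of float32_t
// so that the sketch does not depend on the CMSIS-DSP headers.
template<int nbSamples,
         int refCount>
struct SharedAudioBuf
{
    float *buf;
    static constexpr int getNbSamples() { return nbSamples; }
};

// Hypothetical trait: by default, in-place modification is not allowed.
template<typename T>
struct CanModifyInPlace
{
    static constexpr bool value = false;
};

// It is only allowed when the reference count is statically 1.
template<int NB>
struct CanModifyInPlace<SharedAudioBuf<NB, 1>>
{
    static constexpr bool value = true;
};
```

A node doing in-place processing could then use a `static_assert` on `CanModifyInPlace<...>::value` so that connecting it to a buffer with a reference count of 2 fails at build time.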

In case of duplication, the template specialization could look like:
@@ -257,67 +259,133 @@ public:

The `input` and `output` arrays, used in the sink / source, are defined as extern. The source is reading from `input` and the sink is writing to `output`.

If we look at the asm code generated with `-Ofast` with armclang `AC6` and for one iteration of the schedule, we get:
The generated scheduler is:

```txt
PUSH     {r4-r6,lr}
MOVW     r5,#0x220
MOVW     r1,#0x620
MOVT     r5,#0x3000
MOV      r4,r0
MOVT     r1,#0x3000
MOV      r0,r5
MOV      r2,#0x200
BL       __aeabi_memcpy4 ; 0x10000a94
MOVW     r6,#0x420
MOV      r0,r5
MOVT     r6,#0x3000
MOVS     r2,#0x80
VMOV.F32 s0,#0.5
MOV      r1,r6
BL       arm_offset_f32 ; 0x10002cd0
MOV      r0,#0x942c
MOV      r1,r6
MOVT     r0,#0x3000
MOV      r2,#0x200
BL       __aeabi_memcpy4 ; 0x10000a94
MOVS     r1,#0
MOVS     r0,#1
STR      r1,[r4,#0]
POP      {r4-r6,pc}
```

It is the code you would get if you were manually writing a call to the corresponding CMSIS-DSP function. All the C++ templates have disappeared. The switch / case used to implement the scheduler has also been removed.
```C++
uint32_t scheduler(int *error)
{
    int cgStaticError=0;
    uint32_t nbSchedule=0;
    int32_t debugCounter=1;

    CG_BEFORE_FIFO_INIT;
    /*
    Create FIFOs objects
    */
    FIFO<float32_t,FIFOSIZE0,1,0> fifo0(buf0);
    FIFO<float32_t,FIFOSIZE1,1,0> fifo1(buf1);

    CG_BEFORE_NODE_INIT;
    /* 
    Create node objects
    */
    ProcessingNode<float32_t,128,float32_t,128> proc(fifo0,fifo1);
    Sink<float32_t,128> sink(fifo1);
    Source<float32_t,128> source(fifo0);

    /* Run several schedule iterations */
    CG_BEFORE_SCHEDULE;
    while((cgStaticError==0) && (debugCounter > 0))
    {
        /* Run a schedule iteration */
        CG_BEFORE_ITERATION;
        for(unsigned long id=0 ; id < 3; id++)
        {
            CG_BEFORE_NODE_EXECUTION;

The code was generated with `memoryOptimization` enabled and the Python script detected in this case that the FIFOs are used as arrays. As consequence, there is no FIFO update code. They are used as normal arrays.
            switch(schedule[id])
            {
                case 0:
                {
                   cgStaticError = proc.run();
                }
                break;

The generated code is as efficient as something manually coded.
                case 1:
                {
                   cgStaticError = sink.run();
                }
                break;

The sink and the sources have been replaced by a `memcpy`. The call to the CMSIS-DSP function is just loading the registers and branching to the CMSIS-DSP function.
                case 2:
                {
                   cgStaticError = source.run();
                }
                break;

The input buffer `input` is at address `0x30000620`.
                default:
                break;
            }
            CG_AFTER_NODE_EXECUTION;
            CHECKERROR;
        }
       debugCounter--;
       CG_AFTER_ITERATION;
       nbSchedule++;
    }

The `output` buffer is at address `0x3000942c`.
errorHandling:
    CG_AFTER_SCHEDULE;
    *error=cgStaticError;
    return(nbSchedule);
}
```

We can see in the code:
If we look at the asm of the scheduler generated for a Cortex-M7 with `-Ofast` with armclang `AC6.19` and for **one** iteration of the schedule, we get (disassembly is from uVision IDE):

```txt
MOVW     r1,#0x620
...
MOVT     r1,#0x3000
0x000004B0 B570      PUSH          {r4-r6,lr}
    97:             b[i] = input[i]; 
0x000004B2 F2402518  MOVW          r5,#0x218
0x000004B6 F2406118  MOVW          r1,#0x618
0x000004BA F2C20500  MOVT          r5,#0x2000
0x000004BE 4604      MOV           r4,r0
0x000004C0 F2C20100  MOVT          r1,#0x2000
0x000004C4 F44F7200  MOV           r2,#0x200
0x000004C8 4628      MOV           r0,r5
0x000004CA F00BF8E6  BL.W          0x0000B69A __aeabi_memcpy4
0x000004CE EEB60A00  VMOV.F32      s0,#0.5
   131:         arm_offset_f32(a,0.5,b,inputSize); 
0x000004D2 F2404618  MOVW          r6,#0x418
0x000004D6 F2C20600  MOVT          r6,#0x2000
0x000004DA 2280      MOVS          r2,#0x80
0x000004DC 4628      MOV           r0,r5
0x000004DE 4631      MOV           r1,r6
0x000004E0 F002FC5E  BL.W          0x00002DA0 arm_offset_f32
    63:             output[i] = b[i]; 
0x000004E4 F648705C  MOVW          r0,#0x8F5C
0x000004E8 F44F7200  MOV           r2,#0x200
0x000004EC F2C20000  MOVT          r0,#0x2000
0x000004F0 4631      MOV           r1,r6
0x000004F2 F00BF8D2  BL.W          0x0000B69A __aeabi_memcpy4
   163:        CG_AFTER_ITERATION; 
   164:        nbSchedule++; 
   165:     } 
   166:  
   167: errorHandling: 
   168:     CG_AFTER_SCHEDULE; 
   169:     *error=cgStaticError; 
   170:     return(nbSchedule); 
0x000004F6 F2402014  MOVW          r0,#0x214
0x000004FA F2C20000  MOVT          r0,#0x2000
0x000004FE 6801      LDR           r1,[r0,#0x00]
0x00000500 3101      ADDS          r1,r1,#0x01
0x00000502 6001      STR           r1,[r0,#0x00]
   171: } 
0x00000504 2001      MOVS          r0,#0x01
0x00000506 2100      MOVS          r1,#0x00
   169:     *error=cgStaticError; 
0x00000508 6021      STR           r1,[r4,#0x00]
0x0000050A BD70      POP           {r4-r6,pc}
```

or

```
MOV      r0,#0x942c
...
MOVT     r0,#0x3000
```
It is the code you would get if you were manually writing a call to the corresponding CMSIS-DSP functions. All the C++ templates have disappeared. The switch / case used to implement the scheduler has also been removed.

just before the `memcpy`
The code was generated with `memoryOptimization` enabled and the Python script detected in this case that the FIFOs are used as arrays. As a consequence, there is no FIFO update code. They are used as normal arrays.

The generated code is as efficient as something manually coded.

The sink and the sources have been replaced by a `memcpy`. The call to the CMSIS-DSP function is just loading the registers and branching to the CMSIS-DSP function.

It is not always as ideal as in this example. But it demonstrates that the use of C++ templates and a Python code generator is enabling a low overhead solution to the problem of streaming and compute graph.
+98 −0
# Introduction

Embedded systems are often used to implement streaming solutions : the software is processing and / or generating streams of samples. The software is made of components that have no concept of streams : they are working with buffers. As a consequence, implementing a streaming solution is forcing the developer to think about scheduling questions, FIFO sizing etc ...

The CMSIS-DSP compute graph is a **low overhead** solution to this problem : it makes it easier to build streaming solutions by connecting components and computing a scheduling at **build time**. The use of C++ templates also enables the compiler to have more information about the components for better code generation.

A dataflow graph is a representation of how compute blocks are connected to implement a streaming processing. 

Here is an example with 3 nodes:

- A source
- A filter
- A sink

Each node is producing and consuming some amount of samples. For instance, the source node is producing 5 samples each time it is run. The filter node is consuming 7 samples each time it is run.

The FIFOs lengths are represented on each edge of the graph : 11 samples for the leftmost FIFO and 5 for the other one.

In blue, the amount of samples generated or consumed by a node each time it is called.

<img src="examples/example1/docassets/graph1.PNG" alt="graph1" style="zoom:100%;" />

When the processing is applied to a stream of samples then the problem to solve is : 

> **how the blocks must be scheduled and the FIFOs connecting the block dimensioned**

The general problem can be very difficult. But, if some constraints are applied to the graph then some algorithms can compute a static schedule at build time.

When the following constraints are satisfied we say we have a Synchronous / Static Dataflow Graph:

- Each node is always consuming and producing the same number of samples (static / synchronous flow)

The CMSIS-DSP Compute Graph Tools are a set of Python scripts and C++ classes with the following features:

- A compute graph and its static flow can be described in Python
- The Python script will compute a static schedule and the optimal FIFO sizes
- A static schedule is:
  - A periodic sequence of functions calls
  - A periodic execution where the FIFOs remain bounded
  - A periodic execution with no deadlock : when a node is run there is enough data available to run it 
- The Python script will generate a [Graphviz](https://graphviz.org/) representation of the graph 
- The Python script will generate a C++ implementation of the static schedule 
- The Python script can also generate a Python implementation of the static schedule (for use with the CMSIS-DSP Python wrapper)

There is no FIFO underflow or overflow due to the scheduling. If there are not enough cycles to run the processing, the real-time constraints will be broken and the solution won't work. But this problem is independent from the scheduling itself.
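
For a single edge of the graph, the number of executions per period follows from a balance equation : the producer must write as many samples as the consumer reads. The sketch below is only an illustration of this arithmetic with hypothetical names (`gcdInt`, `Repetitions`, `balanceEdge`). It is not the algorithm implemented by the Python scripts.

```c++
// Greatest common divisor (Euclid's algorithm).
constexpr int gcdInt(int a, int b)
{
    return b == 0 ? a : gcdInt(b, a % b);
}

// Number of executions of the two nodes of an edge during one period.
struct Repetitions
{
    int producerRuns;
    int consumerRuns;
};

// Smallest positive solution of:
//   producerRuns * produced == consumerRuns * consumed
constexpr Repetitions balanceEdge(int produced, int consumed)
{
    return Repetitions{consumed / gcdInt(produced, consumed),
                       produced / gcdInt(produced, consumed)};
}
```

With a source producing 5 samples and a filter consuming 7 samples, the source has to run 7 times and the filter 5 times during one period (35 samples on each side).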

# Why it is useful

Without any scheduling tool for a dataflow graph, there is a problem of modularity : a change on a node may impact other nodes in the graph. For instance, if the number of samples consumed by a node is changed:

- You may need to change how many samples are produced by the predecessor blocks in the graph (assuming it is possible)
- You may need to change how many times the predecessor blocks must run
- You may have to change the FIFO sizes

With the CMSIS-DSP Compute Graph (CG) Tools you don't have to think about those details while you are still experimenting with your data processing pipeline. It makes it easier to experiment, add or remove blocks, change their parameters.

The tools will generate a schedule and the FIFOs. Even if you don't use this at the end for a final implementation, the information could be useful : is the schedule too long ? Are the FIFOs too big ? Is there too much latency between the sources and the sinks ?

Let's look at an (artificial) example:

<img src="examples/example1/docassets/graph1.PNG" alt="graph1" style="zoom:100%;" />

Without a tool, the user would probably try to modify the number of samples so that the number of samples produced is equal to the number of samples consumed. With the CG Tools we know that such a graph can be scheduled and that the FIFO sizes need to be 11 and 5.

The periodic schedule generated for this graph has a length of 17. It is long for such a small graph because 5 and 7 are not very well chosen values. But it is working even with those values.

The schedule is (the number of samples in the FIFOs after the execution of the nodes are displayed in the brackets):

```
source [ 5   0]
source [10   0]
filter [ 3   5]
sink   [ 3   0]
source [ 8   0]
filter [ 1   5]
sink   [ 1   0]
source [ 6   0]
source [11   0]
filter [ 4   5]
sink   [ 4   0]
source [ 9   0]
filter [ 2   5]
sink   [ 2   0]
source [ 7   0]
filter [ 0   5]
sink   [ 0   0]
```

At the end, both FIFOs are empty so the schedule can be run again : it is periodic !
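
The schedule above can be reproduced with a small simulation. It is a sketch under assumptions: the strategy used here (run the node closest to the sink that is able to run) is only a guess that happens to give the same sequence, and the algorithm really used by the Python scripts may be different. `ScheduleStats` and `simulateSchedule` are hypothetical names.

```c++
// Length of one period and maximum number of samples seen in each FIFO.
struct ScheduleStats
{
    int length;
    int maxFifo0;
    int maxFifo1;
};

// Simulate source -> fifo0 -> filter -> fifo1 -> sink.
// Assumption: the rates are consistent, so both FIFOs become empty again
// (otherwise this loop would never stop).
constexpr ScheduleStats simulateSchedule(int sourceOut, int filterIn,
                                         int filterOut, int sinkIn)
{
    int fifo0 = 0;
    int fifo1 = 0;
    ScheduleStats stats{0, 0, 0};
    do
    {
        if (fifo1 >= sinkIn)
        {
            // sink runs
            fifo1 -= sinkIn;
        }
        else if (fifo0 >= filterIn)
        {
            // filter runs
            fifo0 -= filterIn;
            fifo1 += filterOut;
        }
        else
        {
            // source runs
            fifo0 += sourceOut;
        }
        ++stats.length;
        if (fifo0 > stats.maxFifo0) { stats.maxFifo0 = fifo0; }
        if (fifo1 > stats.maxFifo1) { stats.maxFifo1 = fifo1; }
    } while ((fifo0 != 0) || (fifo1 != 0));
    return stats;
}
```

For the graph of this example (source producing 5, filter consuming 7 and producing 5, sink consuming 5), the simulation stops after 17 node executions with at most 11 samples in the first FIFO and 5 in the second one.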

The compute graph is focusing on the synchronous / static case but some extensions have been introduced for more flexibility:

* A [cyclo-static scheduling](CycloStatic.md) (nearly static)
* A [dynamic/asynchronous](Async.md) mode

Here is a summary of the different configurations supported by the compute graph. The cyclo-static scheduling is part of the static flow mode.

![supported_configs](documentation/supported_configs.png)
 No newline at end of file
+15 −419

File changed.