Multiple streamlined accelerators (streaming partitions) per design, reading from single AXI HP port #1555

Daniel2291 · 2026-04-03T10:32:25Z

Daniel2291
Apr 3, 2026

Hello all!

I want to ask if anybody has worked on a data handling case similar to mine and reached successful results.
First, big thanks to the authors and community for the great work, which I am using towards my MSc thesis!

Idea:
I am designing a tiny CNN with less than 1k parameters, 4bit quantization, max PE/SIMD, reaching 60k fps with FINN-R.
I want to further increase the total throughput.

Question:
Has anybody researched options for attaching multiple instances (2/3/4) of the streamlined accelerator built by FINN to a single AXI SmartConnect/Interconnect block?
Of course, while making sure the total bandwidth on the HP0 AXI port stays, with some margin, below the theoretical limit of 3-4 GB/s.
So the system would be able to process a batch of 8, e.g. 2 streamlined partitions per HP port, in case 4 HP AXI ports are available on the FPGA.

Best regards,
Daniel
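
As a rough check of the bandwidth budget described above, the sketch below estimates the read traffic that several accelerator instances would place on one HP port. The 60k fps figure and the 3-4 GB/s limit come from the post; the 64x64x3 UINT8 frame size is taken from the input shape reported later in this thread, and the 80% margin and all function names are assumptions made for illustration.

```python
def stream_bandwidth(fps, bytes_per_frame, n_instances=1):
    """Sustained bytes/s needed to feed n accelerator instances."""
    return fps * bytes_per_frame * n_instances


def max_instances(fps, bytes_per_frame, limit, margin=0.8):
    """Largest instance count whose read traffic stays below margin * limit."""
    return int((limit * margin) // (fps * bytes_per_frame))


# Assumed figures: 60k fps per instance and a 64x64x3 UINT8 input frame.
FPS = 60_000
IN_BYTES = 64 * 64 * 3      # bytes per input frame
HP_LIMIT = 3e9              # lower end of the 3-4 GB/s estimate

read_bw = stream_bandwidth(FPS, IN_BYTES, n_instances=2)
print(f"read traffic for 2 instances: {read_bw / 1e9:.2f} GB/s")
print("instances per port at 80% margin:", max_instances(FPS, IN_BYTES, HP_LIMIT))
```

The output stream is a single value per frame in this design, so the input side is likely to dominate; the same function can be applied to the write direction with the output frame size.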


fpjentzsch · 2026-04-07T13:39:46Z

fpjentzsch
Apr 7, 2026
Collaborator

Hi,

thanks for your interest in the project!

So you want to instantiate multiple instances of the same FINN accelerator and feed data in and out via DMA? Then it should be possible to simply connect several of the DMA IP cores generated by FINN to one or more SmartConnects and slightly adjust the FINN-generated driver to control multiple DMA cores (instead of just looking for "idma_0", "odma_0").

Please note that there is an outdated PR (#789) that introduces additional parallelism (along "samples"/"pixels"), which could be significantly more efficient than instantiating an accelerator with lower parallelism multiple times due to non-linear resource scaling effects. We used it for this paper: https://ieeexplore.ieee.org/document/9933377
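
A minimal sketch of the driver-side change, assuming the DMA cores follow the "idma_<n>"/"odma_<n>" naming mentioned above: instead of looking up one fixed pair, the driver could collect every matching pair by index. The function name `find_dma_pairs` is hypothetical, and the list of names would have to come from the loaded overlay (for example the keys of PYNQ's `ip_dict`, which is an assumption about the deployment setup).

```python
import re


def find_dma_pairs(ip_names):
    """Group names of the form idma_<n>/odma_<n> into (input, output) pairs."""
    pattern = re.compile(r"([io])dma_(\d+)")
    found = {"i": {}, "o": {}}
    for name in ip_names:
        match = pattern.fullmatch(name)
        if match:
            direction, index = match.group(1), int(match.group(2))
            found[direction][index] = name
    # Only keep indices for which both an input and an output DMA exist.
    shared = sorted(found["i"].keys() & found["o"].keys())
    return [(found["i"][k], found["o"][k]) for k in shared]


# Hypothetical IP names for a design with two accelerator instances.
names = ["idma_0", "odma_0", "idma_1", "odma_1", "smartconnect_0"]
for idma, odma in find_dma_pairs(names):
    print(idma, "->", odma)
```

Each returned pair can then be driven like the single pair in the generated driver, with the batch split across the instances.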

I believe @lstasytis looked at a revival of that PR recently. Maybe he can chime in and point you somewhere.

2 replies

Daniel2291 Apr 8, 2026
Author

Hi,

Many thanks for the answer. I was also thinking about modifying the driver, but thought that it could only be configured for a single idma/odma pair. I noted the M unrolling parameter (along pixels) in your original papers, but saw in a few discussions that this is not currently supported. If you believe it might work, I am willing to try.

Best regards,
Daniel

lstasytis Apr 27, 2026

Hi @Daniel2291 , so we've been doing some work on using the MMV for our own model (exact same situation - saturated SIMD/PE, but wanted to push for higher throughput).
There is a forked dev branch in which we implemented modifications to the python/compiler side of finn to support this parameter for MVAU and the Thresholding layers. We just haven't gotten around to consolidating it into a nice clean PR to push upstream yet.

Note that if you want to use the M parameter for the Thresholding layer, you would also need to incorporate a modification made to the hlslib repository (link: https://github.com/MrMudkip9352/finn-hlslib_mmv/tree/dev )

The branch with the FINN-side changes: https://github.com/MrMudkip9352/finn_mmv/tree/dev

Daniel2291 · 2026-04-28T08:41:07Z

Daniel2291
Apr 28, 2026
Author

Hi all,

Many thanks for the advice and support. Before going into M>1, I have run into another issue that I have found hard to solve over the past few days. It concerns the correctness of the preprocessing/postprocessing of the CNN input/output. I have read all the discussions and examples, but my model still does not output the expected data in a real FPGA test (the outputs are very large, on the order of thousands). It performs well when tested before build and deployment.

My main concern has been the fusing of the Add/Mul nodes into the thresholds. For that reason I have tried to copy the CNV example, using a UINT8 -> float [-1, 1] preprocessing step and a QuantIdentity that are the same as in the example:

    def forward(self, x):
        x = x * (2.0 / 255.0)
        x = x - 1.0
        return x

    self.input_quant = QuantIdentity(
        bit_width=4,
        min_val=-1.0,
        max_val=1.0 - 2.0 ** (-7),
        narrow_range=False,
        restrict_scaling_type=RestrictValueType.POWER_OF_TWO,
        return_quant_tensor=True,
    )

My CNN has a 3-channel input, 3 conv layers, 1 depthwise separable layer, a global average pool (QuantAvgPool2d) with k = image size, and a 1x24 1D conv classifier at the end, resulting in a binary output (1/0). I am using the build settings from MobileNet v1 with w4a4 quantization.

I believe I managed to get my input solved. I am attaching the ONNX shapes for the QONNX-to-FINN stage and for the final bitstream:

[image: image.png] [image: image.png]

My output is currently the result of the last conv layer: 8-bit weights x 8-bit activations -> INT24 output. I have been trying to postprocess in software, as I do not have any thresholds at the end, but even when applying the omitted Mul blocks in software, the numbers are very large and do not correspond to the software results in any case.

Do you have any general advice on how to debug this or search for a solution? Is there a specific order of the Mul/Add blocks with respect to the MultiThreshold nodes that I have to keep? I believe this might be causing the weird outputs. Any advice on output blocks in the case of a binary decision? I saw that Top-K might not be ideal.

I am attaching my QONNX export script for reference, which contains the model. I am completing my MSc defence at TU Delft soon and will be very happy to add the model to the examples if desirable. Again, huge thanks for the support.

Best regards,
Daniel
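
The software postprocessing described above can be sketched as follows, assuming the Mul nodes left outside the dataflow partition are plain scalar scale factors and that the binary decision is a comparison against a threshold. The scale values, the raw accumulator value, the zero threshold, and the function names are all hypothetical; the real factors have to be read from the ONNX graph before partitioning.

```python
import math


def dequantize(raw_acc, scales, bias=0.0):
    """Apply the scalar factors of the removed Mul nodes, then an optional bias."""
    value = float(raw_acc)
    for scale in scales:
        value *= scale
    return value + bias


def binary_decision(logit, threshold=0.0):
    """Map the rescaled output to a 1/0 class label."""
    return 1 if logit > threshold else 0


# Hypothetical example: raw integer accumulator and two scale factors.
raw = -1200
scales = [2.0 ** -7, 0.05]
logit = dequantize(raw, scales)
print(logit, binary_decision(logit))
```

If the rescaled value still differs from the software reference by orders of magnitude, the mismatch is more likely inside the accelerator graph than in the postprocessing.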


1 reply

fpjentzsch Apr 28, 2026
Collaborator

Your screenshots are not loading for me.

If both the node-by-node and the stitched-IP RTL simulations are okay but it fails on hardware, I would double-check whether the driver is using the correct datatypes and shapes for input/output folding/packing/unpacking/unfolding.
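
One way to sanity-check the driver's shapes is sketched below. It assumes three shapes per stream (normal, folded, packed) and that the packed innermost dimension is the folded innermost dimension times the element bit width, rounded up to whole bytes. That packing rule and the function name `check_io_shapes` are assumptions for illustration and should be compared against the shapes and datatypes in the generated driver.

```python
import math


def check_io_shapes(normal, folded, packed, bitwidth):
    """Return a list of problems; an empty list means the shapes are consistent."""
    problems = []
    if math.prod(normal) != math.prod(folded):
        problems.append(f"element count differs: normal {normal} vs folded {folded}")
    if tuple(packed[:-1]) != tuple(folded[:-1]):
        problems.append(f"outer dims differ: folded {folded} vs packed {packed}")
    expected_bytes = math.ceil(folded[-1] * bitwidth / 8)
    if packed[-1] != expected_bytes:
        problems.append(
            f"innermost packed dim is {packed[-1]}, expected {expected_bytes} bytes"
        )
    return problems


# Illustrative input stream: 64x64x3 image with 8-bit elements.
print(check_io_shapes((1, 64, 64, 3), (1, 64, 64, 1, 3), (1, 64, 64, 1, 3), 8))
```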

Are you using the latest dev branch? Maybe you are encountering a bug that was fixed in the meantime?

Daniel2291 · 2026-04-28T12:54:28Z

Daniel2291
Apr 28, 2026
Author

Hi Felix,

Thank you for the fast response. I am using finn v0.10.1-6-g8ac41e46.

I tried to implement the verification steps. They show correctness up to the convert_to_hw step, and verification then fails in cppsim:

    sample=007 ok=True ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=4.57764e-05
    sample=008 ok=True ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=6.10352e-05
    sample=009 ok=True ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=1.52588e-05
    [SUMMARY] custom_step_dw_convert_to_hw_layers: 10/10 passed

    === folded_hls_cppsim ===
    sample=000 ok=False ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=48014.1
    [FAIL] saved debug context to verify_debug_semnet
    reference: [[-137.6152]]
    got : [[47876.465]]
    sample=001 ok=False ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=47385.2
    [FAIL] saved debug context to verify_debug_semnet
    reference: [[251.37274]]
    got : [[47636.59]]
    sample=002 ok=False ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=48056.2
    [FAIL] saved debug context to verify_debug_semnet
    reference: [[-144.09836]]
    got : [[47912.12]]

I am attaching the end of the build for the two cases: convert to HW and the next step, creating a dataflow partition. Note that the Mul blocks near the output disappear during partition creation. My last convolution is a QuantConv2d (kernel_size = 1); maybe I can try QuantLinear instead. Overall, the large accumulation from the preceding average pool might be producing those outputs.

<img width="321" height="1011" alt="create_dataflow_partition" src="https://github.com/user-attachments/assets/089373d5-d0e2-4d95-97e2-8766d83920b1" />
<img width="321" height="1011" alt="convert_to_hw" src="https://github.com/user-attachments/assets/1a433521-8a06-45e6-bfdb-c5278d2c1008" />

```
self.gap1 = qnn.TruncAvgPool2d(
    kernel_size=16,
    trunc_quant=TruncTo8bit,
)
self.classifier = qnn.QuantConv2d(
    c3,
    num_outputs,
    kernel_size=1,
    weight_bit_width=4,
    bias=True,
    return_quant_tensor=True,
)
```

EDIT1. After some internal debugging I found that VVAU (DW) behaves differently after convert to HW. I am investigating the build flow around it.

EDIT2. (cppsim passes, RTL fails) The wrong values after VVAU were introduced because the convert-to-HW step automatically set the PE value to the number of output channels, while the later steps assumed the value set by the user in folding.json. With PE = output channels, cppsim works fine. In RTL I am getting:

    AssertionError: This value is not permitted by chosen dtype.
    Input: Transpose_0_out0 [1, 64, 64, 3] UINT8
    UserWarning: The values of tensor MVAU_0_out0 can't be represented with the set datatype annotation (INT32), they will be rounded to match the datatype annotation.

Best regards,
Daniel
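
The PE mismatch described in EDIT2 could be caught early with a comparison like the sketch below. It assumes the folding configuration and the attributes actually set on the nodes are both available as plain dictionaries keyed by node name; how those dictionaries are obtained from the build is left open, and the function name and example values are hypothetical.

```python
def pe_simd_mismatches(folding_cfg, node_attrs, keys=("PE", "SIMD")):
    """List (node, key, configured, actual) tuples where the two sources disagree."""
    mismatches = []
    for node, wanted in folding_cfg.items():
        actual = node_attrs.get(node)
        if actual is None:
            continue  # entry without a matching node, e.g. a defaults section
        for key in keys:
            if key in wanted and key in actual and wanted[key] != actual[key]:
                mismatches.append((node, key, wanted[key], actual[key]))
    return mismatches


# Hypothetical values: the user asked for PE=4, the node ended up with PE=24.
folding_cfg = {"Defaults": {}, "VVAU_0": {"PE": 4}}
node_attrs = {"VVAU_0": {"PE": 24}}
print(pe_simd_mismatches(folding_cfg, node_attrs))
```

Running such a check after each build step shows at which step the configured and actual values diverge.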


0 replies
