Multiple streamlined accelerators (streaming partitions) per design, reading from single AXI HP port #1555
Hello all! I want to ask whether anybody has worked on a data handling case similar to mine and achieved successful results. Idea: Question: Best regards,
Replies: 3 comments 3 replies
Hi,
thanks for your interest in the project!
So you want to instantiate multiple instances of the same FINN accelerator and feed data in and out via DMA? Then it should be possible to connect several of the DMA IP cores generated by FINN to one or more SmartConnects and slightly adjust the FINN-generated driver to control multiple DMA cores (instead of just looking for "idma_0", "odma_0").
Please note that there is an outdated PR (#789) that introduces additional parallelism (along "samples"/"pixels"), which could be significantly more efficient than instantiating an accelerator with lower parallelism multiple times, due to non-linear resource scaling effects. We used it for this paper: https://ieeexplore.ieee.org/document/9933377
I believe @lstasytis looked at a revival of that PR recently. Maybe he can chime in and point you somewhere.
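As a rough illustration of the driver adjustment described above, the sketch below collects numbered DMA pairs from a list of IP names instead of hard-coding "idma_0" and "odma_0". It assumes that additional DMA cores follow the same numbered naming (with or without the underscore) and that the names come from something like the keys of a PYNQ overlay's `ip_dict`. `find_dma_pairs` is a hypothetical helper, not part of the FINN-generated driver.

```python
import re


def find_dma_pairs(ip_names):
    """Return sorted (index, input_dma, output_dma) tuples for matching DMA names.

    Hypothetical helper: assumes DMA cores are named idma_<n> / odma_<n>
    (the underscore is optional) and only reports indices that have both.
    """
    pattern = re.compile(r"^(idma|odma)_?(\d+)$")
    found = {"idma": {}, "odma": {}}
    for name in ip_names:
        match = pattern.match(name)
        if match:
            kind, index = match.group(1), int(match.group(2))
            found[kind][index] = name
    common = sorted(set(found["idma"]) & set(found["odma"]))
    return [(i, found["idma"][i], found["odma"][i]) for i in common]


# Example IP names; the non-DMA entry and the unpaired idma_2 are ignored.
names = ["idma_0", "odma_0", "idma_1", "odma_1", "smartconnect_0", "idma_2"]
pairs = find_dma_pairs(names)
```

A driver could then loop over `pairs` and start one input/output transfer per accelerator instance, rather than addressing a single fixed pair.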
Hi all,
Many thanks for the advice and support.
Before going into M>1, I have run into another issue that I have found hard
to solve over the past few days. It concerns the correctness of the
preprocessing/postprocessing of the CNN input/output. I have read all the
discussions and examples, but my model still does not produce the expected
data in the real FPGA test (the outputs are on the order of thousands). It
performs well when tested before the build and deploy.
My main concern has been fusing the Add/Mul nodes into thresholds. For that
reason I have tried to copy the CNV example, using a UINT8 -> float [-1, 1]
preprocessing step and a QuantIdentity that are the same as in the example:
    def forward(self, x):
        x = x * (2.0 / 255.0)
        x = x - 1.0
        return x

    self.input_quant = QuantIdentity(
        bit_width=4,
        min_val=-1.0,
        max_val=1.0 - 2.0 ** (-7),
        narrow_range=False,
        restrict_scaling_type=RestrictValueType.POWER_OF_TWO,
        return_quant_tensor=True,
    )
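To make the input range concrete, here is a small plain-Python sketch of the same affine preprocessing, evaluated at the two ends of the UINT8 range. It only reproduces the arithmetic of `forward()`; the comparison against `MAX_VAL` is an observation about the numbers, and how the quantizer treats a value above its upper bound (presumably by saturating) is an assumption rather than something verified here.

```python
def preprocess(pixel):
    """Map a UINT8 pixel value to a float, mirroring the forward() arithmetic."""
    return pixel * (2.0 / 255.0) - 1.0


# Upper bound passed to the quantizer in the snippet above.
MAX_VAL = 1.0 - 2.0 ** (-7)

lowest = preprocess(0)     # lower end of the input range
highest = preprocess(255)  # upper end of the input range

# True when the largest preprocessed input lies above the quantizer's max_val.
exceeds_max_val = highest > MAX_VAL
```

The lower end maps to -1.0 and the upper end lands at approximately 1.0, slightly above `max_val`.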
My CNN has a 3-channel input, 3 conv layers, 1 depthwise-separable layer, a
global average pool (QuantAvgPool2d) with k = image size, and a 1d conv
classifier (1x24) at the end, resulting in a binary output (1/0). I am using
the build settings from Mobilenet v1 with w4a4 quantization.
I believe I managed to solve my input. I am attaching the ONNX shapes at the
QONNX-to-FINN stage and for the final bitstream:
[image: image.png]
[image: image.png]
My output is currently the result of the last conv layer: 8-bit weights x
8-bit activations -> INT24 output. I have been trying to postprocess in
software, as I do not have any thresholds at the end, but even when applying
the omitted Mul blocks in software, the numbers are very large and do not
match the software results in any case.
Do you have any general advice on how to debug this or search for a solution?
Is there a specific order of the Mul/Add blocks with respect to the
MultiThreshold nodes that I have to keep? I believe this might be causing
the weird outputs.
Any advice on output blocks in the case of a binary decision? I saw that
Top-K might not be ideal.
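One way to reason about the software postprocessing described above: if the Mul/Add nodes left out of the hardware form an affine map, the raw integer accumulator has to pass through that same map, in the same order, before it can be compared with the software reference. The sketch below assumes a single scalar `scale` and `bias` and a decision threshold of 0. All three values are hypothetical placeholders and would have to be read from the Mul/Add nodes of the actual graph.

```python
def dequantize(raw_acc, scale, bias=0.0):
    """Apply the omitted Mul (scale) first, then the omitted Add (bias)."""
    return raw_acc * scale + bias


def binary_decision(value, threshold=0.0):
    """Reduce a single logit to a 1/0 label by comparing against a threshold."""
    return 1 if value > threshold else 0


# Hypothetical numbers, chosen only to show the order of operations.
raw = -2048
logit = dequantize(raw, scale=0.0625, bias=1.5)
label = binary_decision(logit)
```

Swapping the order (adding the bias before scaling) gives a different result, so the order in which the nodes appear in the graph matters when replicating them in software.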
I am attaching my QONNX export script for reference, which contains the
model.
I am completing my MSc defence at TU Delft soon and will be very happy to
add the model to the examples if desired. Again, huge thanks for the
support.
Best regards,
Daniel
… Hi @Daniel2291, we've been doing some work on using the MMV for our own
model (exact same situation: saturated SIMD/PE, but we wanted to push for
higher throughput).
There is a forked dev branch in which we implemented modifications to the
Python/compiler side of FINN to support this parameter for the MVAU and
Thresholding layers. We just haven't gotten around to consolidating it into
a nice clean PR to push upstream yet.
Note that if you want to use the M parameter for the Thresholding layer,
you would also need to incorporate a modification made to the hlslib
repository (link: https://github.com/MrMudkip9352/finn-hlslib_mmv ).
The branch with the FINN-side changes:
https://github.com/MrMudkip9352/finn_mmv/tree/dev
Hi Felix,
Thank you for the fast response. I am using finn v0.10.1-6-g8ac41e46.
I implemented the verification steps. They show correct results up to the
convert_to_hw step, and the failure first appears in cppsim:
sample=007 ok=True ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=4.57764e-05
sample=008 ok=True ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=6.10352e-05
sample=009 ok=True ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=1.52588e-05
[SUMMARY] custom_step_dw_convert_to_hw_layers: 10/10 passed
=== folded_hls_cppsim ===
sample=000 ok=False ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=48014.1
[FAIL] saved debug context to verify_debug_semnet
reference: [[-137.6152]]
got : [[47876.465]]
sample=001 ok=False ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=47385.2
[FAIL] saved debug context to verify_debug_semnet
reference: [[251.37274]]
got : [[47636.59]]
sample=002 ok=False ref_shape=(1, 1) got_shape=(1, 1) max_abs_diff=48056.2
[FAIL] saved debug context to verify_debug_semnet
reference: [[-144.09836]]
got : [[47912.12]]
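The per-step comparison behind a log like this can be sketched in a few lines: compute the maximum absolute difference per sample, then report the first build step at which any sample fails. The tolerance `atol` and the helper names are assumptions for illustration; the actual verification script is not shown in this thread.

```python
def max_abs_diff(reference, got):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(got):
        raise ValueError("shape mismatch between reference and result")
    return max(abs(r - g) for r, g in zip(reference, got))


def check_sample(reference, got, atol=1e-3):
    """Return (ok, diff); atol is an assumed tolerance, not a FINN default."""
    diff = max_abs_diff(reference, got)
    return diff <= atol, diff


def first_failing_step(results):
    """Given {step_name: [ok, ...]} in build order, return the first step with a failure."""
    for step, oks in results.items():
        if not all(oks):
            return step
    return None


# Values taken from sample=000 of the cppsim log above.
ok, diff = check_sample([-137.6152], [47876.465])

# Pass/fail pattern as reported in the log.
results = {
    "custom_step_dw_convert_to_hw_layers": [True] * 10,
    "folded_hls_cppsim": [False, False, False],
}
failing = first_failing_step(results)
```

Knowing the first failing step narrows the search to the transformations applied between that step and the previous one.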
I am attaching the end of the build for the two cases: convert to HW and the
next step, creating a dataflow partition. Note that the Mul blocks near the
output disappear during partition creation.
My last convolution is a QuantConv2d (kernel_size = 1); maybe I can try
QuantLinear instead. Overall, the large accumulation from the preceding
average pool might be producing those outputs.
<img width="321" height="1011" alt="create_dataflow_partition" src="https://github.com/user-attachments/assets/089373d5-d0e2-4d95-97e2-8766d83920b1" />
<img width="321" height="1011" alt="convert_to_hw" src="https://github.com/user-attachments/assets/1a433521-8a06-45e6-bfdb-c5278d2c1008" />
```
self.gap1 = qnn.TruncAvgPool2d(
    kernel_size=16,
    trunc_quant=TruncTo8bit,
)
self.classifier = qnn.QuantConv2d(
    c3,
    num_outputs,
    kernel_size=1,
    weight_bit_width=4,
    bias=True,
    return_quant_tensor=True,
)
```
EDIT1. After some internal debugging I found that the VVAU (depthwise) behaves differently after the convert-to-HW step. I am investigating the build flow around it.
EDIT2. (cppsim passes, RTL fails) The wrong values after the VVAU were introduced because the convert-to-HW step automatically set the PE value to the number of output channels, while the later steps assumed the value set by the user in folding.json. With PE set to the number of output channels, cppsim works fine.
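A mismatch like the one in EDIT2 can be caught before simulation by comparing the PE value each node actually carries with the one requested in the folding configuration. The sketch below works on plain dictionaries. The node names, the `{"node": {"PE": ...}}` layout with a `"Defaults"` entry, and the helper `find_pe_mismatches` are illustrative assumptions, not FINN API; in a real flow the applied values would be read from the node attributes of the model.

```python
def find_pe_mismatches(folding_cfg, applied_attrs):
    """Return {node: (requested_pe, applied_pe)} for nodes whose PE values differ."""
    mismatches = {}
    for node, requested in folding_cfg.items():
        # Skip the defaults entry and nodes without an explicit PE request.
        if node == "Defaults" or "PE" not in requested:
            continue
        applied = applied_attrs.get(node, {}).get("PE")
        if applied is not None and applied != requested["PE"]:
            mismatches[node] = (requested["PE"], applied)
    return mismatches


# Hypothetical node names and values, for illustration only.
folding_cfg = {
    "Defaults": {},
    "VVAU_hls_0": {"PE": 4},
    "MVAU_hls_0": {"PE": 8, "SIMD": 3},
}
applied = {
    "VVAU_hls_0": {"PE": 32},
    "MVAU_hls_0": {"PE": 8, "SIMD": 3},
}
mismatches = find_pe_mismatches(folding_cfg, applied)
```

Here only the depthwise node is reported, with the requested and applied PE values side by side.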
In RTL simulation I am getting the following:
AssertionError: This value is not permitted by chosen dtype.
Input: Transpose_0_out0 [1, 64, 64, 3] UINT8
UserWarning: The values of tensor MVAU_0_out0 can't be represented with the set datatype annotation (INT32), they will be rounded to match the datatype annotation.
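For reference, a value fits an integer datatype annotation only if it is a whole number inside that type's range. A minimal sketch of that check follows, with the hypothetical helpers `int_range` and `fits_int_dtype` standing in for the real datatype classes. Whether a fractional value or an out-of-range value triggers the messages above is not established in this thread.

```python
def int_range(bits, signed):
    """Inclusive (min, max) of a two's-complement or unsigned integer type."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def fits_int_dtype(value, bits, signed=True):
    """True if value is a whole number within the range of the given integer type."""
    lo, hi = int_range(bits, signed)
    return float(value).is_integer() and lo <= value <= hi


uint8_ok = fits_int_dtype(255, 8, signed=False)         # inside UINT8
uint8_overflow = fits_int_dtype(256, 8, signed=False)   # one past the UINT8 maximum
int32_fractional = fits_int_dtype(47876.465, 32)        # in range, but not a whole number
```

Under this check, a fractional value fails even for INT32 although its magnitude is far below the INT32 limits.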
Best regards,
Daniel
…On Tue, 28 Apr 2026 at 12:06, Felix Jentzsch wrote:
Your screenshots are not loading for me.
If both node-by-node and stitched-IP RTL simulation are okay but it fails
on hardware, I would double-check whether the driver is using the correct
datatypes and shapes for input/output folding/packing/unpacking/unfolding.
Are you using the latest dev branch? Maybe you are encountering a bug that
was fixed in the meantime?
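To illustrate the kind of driver-side datatype check suggested here: a 24-bit signed result read back as raw bytes has to be sign-extended, otherwise small negative values show up as large positive numbers. The sketch assumes one INT24 value per 3 bytes in little-endian order, which is an assumption about the buffer layout and not something confirmed in this thread. The FINN-generated driver has its own packing utilities, and `unpack_int24` is a hypothetical stand-in.

```python
def unpack_int24(buffer, signed=True):
    """Decode consecutive 3-byte little-endian integers from a bytes object."""
    if len(buffer) % 3 != 0:
        raise ValueError("buffer length must be a multiple of 3 bytes")
    return [
        int.from_bytes(buffer[i:i + 3], byteorder="little", signed=signed)
        for i in range(0, len(buffer), 3)
    ]


# Encode a small negative value as 3 bytes, then decode it both ways.
raw = (-138).to_bytes(3, byteorder="little", signed=True)
as_signed = unpack_int24(raw, signed=True)
as_unsigned = unpack_int24(raw, signed=False)
```

Decoding the same bytes as unsigned yields a value close to 2**24, which shows how a wrong output datatype in the driver alone can produce very large numbers.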