Add RoCEv2AxiStreamRdma module + cocotb testbench #1438

Draft
ruck314 wants to merge 10 commits into dcqcnEn from rocev2-axistream-rdma
Conversation

ruck314 commented Jun 16, 2026

Contributor

Summary

Upstreams RoCEv2AxiStreamRdma (developed in the Simple-10GbE-RUDP-KCU105-Example) into surf's RoCEv2 RTL, and adds a cocotb testbench that fills the module's previously-missing test coverage.

RoCEv2AxiStreamRdma owns the RoCEv2 host interface for an AXI-Stream payload source:

  • buffers the inbound stream in a store-and-forward repack FIFO (VALID_THOLD_G=0, 32-byte internal width),
  • issues one RDMA-WRITE-with-immediate work request per complete buffered packet (event-driven, no software poke),
  • serves the engine's DMA read by draining that packet into the 290-bit RoceDmaReadResp (endianSwap(data) & byteEn & isFirst & isLast),
  • counts work completions through a single merged AXI-Lite register file.
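
As a rough illustration of the 290-bit packing described in the list above, here is a minimal Python sketch. It assumes that `&` is VHDL concatenation (leftmost field in the most significant bits), that lane 0 of the unswapped beat is bits 7:0, and that endianSwap reverses the byte lanes. The function and constant names are hypothetical and are not taken from the RTL.

```python
BEAT_BYTES = 32                               # 32-byte internal width
RESP_WIDTH = 8 * BEAT_BYTES + BEAT_BYTES + 2  # 256 data + 32 byteEn + isFirst + isLast = 290


def pack_dma_read_resp(data: bytes, byte_en: int, is_first: bool, is_last: bool) -> int:
    """Model endianSwap(data) & byteEn & isFirst & isLast as one integer.

    Assumption: data[0] is lane 0 (bits 7:0) before the swap. After the swap
    data[0] sits in the top lane, which is the big-endian reading of `data`.
    """
    assert len(data) == BEAT_BYTES
    assert 0 <= byte_en < (1 << BEAT_BYTES)
    word = int.from_bytes(data, "big")      # endianSwap(data)
    word = (word << BEAT_BYTES) | byte_en   # & byteEn
    word = (word << 1) | int(is_first)      # & isFirst
    word = (word << 1) | int(is_last)       # & isLast
    return word
```

Under these assumptions isLast lands in bit 0, isFirst in bit 1, byteEn in bits 33:2, and the swapped data in bits 289:34.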

It exposes GEN_SYNC_FIFO_G and a separate sAxisClk/sAxisRst so the payload source can live in its own clock domain.

Testbench

tests/ethernet/RoCEv2/test_RoCEv2AxiStreamRdma.py, together with a record-flattening RoCEv2AxiStreamRdmaWrapper.vhd that mirrors RoceConfiguratorWrapper, emulates the engine side (workReq accept → one dmaReadReq → multi-beat dmaReadResp drain → success workComp). It drives the slave payload at full rate with configurable engine latency and backpressure to build deep repack-FIFO occupancy, and checks:

  • per-beat dmaReadResp byte-order / isFirst / isLast / byteEn against the source (catches lane-swap, drop, duplicate, reorder),
  • SuccessCounter / UnsuccessCounter,
  • a watchdog so any dispatch stall surfaces as a timeout.
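
The per-beat check in the list above could be modeled along these lines. This is a sketch of a reference model, not the code in test_RoCEv2AxiStreamRdma.py: `Beat`, `expected_beats`, and `first_mismatch` are hypothetical names, beats are kept in source lane order (before any endianSwap), and bit i of `byte_en` is assumed to mark lane i as valid.

```python
from typing import List, NamedTuple, Optional

BEAT_BYTES = 32  # internal beat width from the PR summary


class Beat(NamedTuple):
    data: bytes
    byte_en: int
    is_first: bool
    is_last: bool


def expected_beats(packet: bytes) -> List[Beat]:
    """Split a source packet into the beats the checker expects to observe."""
    assert packet, "an empty packet produces no beats"
    beats = []
    for off in range(0, len(packet), BEAT_BYTES):
        chunk = packet[off:off + BEAT_BYTES]
        beats.append(Beat(
            data=chunk.ljust(BEAT_BYTES, b"\x00"),   # zero-pad a partial beat
            byte_en=(1 << len(chunk)) - 1,           # one enable bit per valid byte
            is_first=(off == 0),
            is_last=(off + BEAT_BYTES >= len(packet)),
        ))
    return beats


def first_mismatch(expected: List[Beat], observed: List[Beat]) -> Optional[int]:
    """Index of the first differing beat, or None when the streams match."""
    for i, (exp, obs) in enumerate(zip(expected, observed)):
        if exp != obs:
            return i
    if len(expected) != len(observed):
        return min(len(expected), len(observed))  # dropped or duplicated beat
    return None
```

Comparing whole beats in order is what lets a single check flag lane swaps, drops, duplicates, and reorders.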

Cases: single-packet sanity, full-rate high-occupancy, dmaReadResp backpressure, an all-knobs-adversarial run, and a mid-stream engine stall (FIFO saturates then resumes). All cases pass under GHDL, with no deadlock observed under full-rate occupancy or backpressure.

CI

Enables the tests/ethernet/RoCEv2 directory in surf_ci.yml (all 7 tests in that directory pass locally).

Status

Draft — opening for CI and review.

ruck314 added 7 commits June 16, 2026 12:20
RoCEv2AxiStreamRdma owns the RoCEv2 host interface for an AXI-Stream payload
source: it buffers the inbound stream in a store-and-forward repack FIFO, issues
one RDMA-WRITE-with-immediate work request per complete buffered packet, serves
the engine's DMA read by draining that packet into the 290-bit RoceDmaReadResp,
and counts work completions through a single merged AXI-Lite register file.

Add a cocotb testbench (with a record-flattening wrapper) that emulates the
engine side and drives the payload at full rate with configurable engine latency
and backpressure to build deep repack-FIFO occupancy. It checks the dmaReadResp
byte-order / isFirst / isLast / byteEn packing against the source, the
success/unsuccess completion counters, and uses a watchdog so any dispatch stall
fails as a timeout. Enable the tests/ethernet/RoCEv2 directory in surf CI.

Pre-existing VSG when_001 violation: move "else" to the right of the closing
")" of the conditional expression. Whitespace-only.

The lockstep dispatch FSM strands in ST2_DRAIN when the RoCEv2 engine stops
draining (e.g. QP teardown on host disconnect): ST2_DRAIN's exit waits on a
FIFO drain that never arrives and never tests DispatchEnable, and there is no
software-accessible soft reset (only roceRst, which needs an FPGA reload). A
partial packet left in the store-and-forward FIFO when the source is cut
mid-frame also fuses with the next frame on re-arm.

Hold the repack FIFO in reset and force the dispatch + REPACK FSMs to IDLE
whenever DispatchEnable=0, so clearing DispatchEnable (the documented stop()
path) fully quiesces and flushes the datapath and a stop/restart recovers
without an FPGA reload. No new register; the PyRogue device map is unchanged.

Add two cocotb cases that reproduce the wedge (engine_teardown_then_restart)
and the partial-packet fusion (partial_packet_then_rearm); both are RED on the
pre-fix RTL and GREEN after.
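
The wedge and its fix can be illustrated with a toy next-state function. This is a simplified sketch, not the RTL: only ST2_DRAIN and IDLE are named in the commit message, so `ST1_REQ`, the input names, and the `fixed` switch are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    ST1_REQ = auto()     # hypothetical intermediate state
    ST2_DRAIN = auto()


def next_state(state: State, dispatch_enable: bool, packet_ready: bool,
               drained: bool, fixed: bool = True) -> State:
    """One clock of a toy dispatch FSM.

    fixed=False models the old behavior: ST2_DRAIN waits only on the drain.
    fixed=True forces IDLE whenever DispatchEnable is 0.
    """
    if fixed and not dispatch_enable:
        return State.IDLE
    if state is State.IDLE:
        return State.ST1_REQ if (dispatch_enable and packet_ready) else State.IDLE
    if state is State.ST1_REQ:
        return State.ST2_DRAIN
    # ST2_DRAIN: the only exit is a completed drain
    return State.IDLE if drained else State.ST2_DRAIN
```

With `fixed=False` and the engine no longer draining, the model stays in ST2_DRAIN no matter how DispatchEnable is toggled; with `fixed=True` a single cycle of DispatchEnable=0 returns it to IDLE.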

…trol

Replace the one-shot RDMA-WRITE streaming source with a SEND-with-immediate
datapath that has native FW<->NIC flow control (no software credit register
in the real-time path):

  * FILL copies each complete FIFO packet into an addressable replay-RAM slot
    (a power-of-2 ring), gated on a free slot.
  * SERVE replays the wr_id-addressed slot READ-ONLY, so a blue-rdma RNR/timeout
    retry that re-issues the DMA read for the same wr_id re-reads byte-identical
    bytes -- the property the old one-shot source lacked.
  * DISPATCH issues one IBV_WR_SEND_WITH_IMM (opCode 0x3) per filled slot; RETH
    (rAddr/rKey) is driven to 0 so a full host RQ makes the NIC RNR-NAK.
  * COMPLETION frees the oldest slot per work completion (ACK-paced). "No free
    slot -> FILL stalls -> FIFO fills -> PRBS source backpressured" is the
    FW-internal, ACK-driven, software-free flow-control loop.
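
The ring behavior in the four bullets above can be sketched as a small software model. It is an illustration under assumptions, not the RTL: the class and method names are hypothetical, and it assumes wr_id is a free-running fill counter whose low bits select the slot.

```python
from typing import Optional


class ReplayRing:
    """Toy model of the replay ring: FILL writes, SERVE reads, COMPLETION frees."""

    def __init__(self, slots: int):
        assert slots > 0 and slots & (slots - 1) == 0, "ring size must be a power of 2"
        self.mask = slots - 1
        self.ram = [b""] * slots
        self.head = 0   # next wr_id that FILL will use
        self.tail = 0   # oldest wr_id not yet completed

    def has_free_slot(self) -> bool:
        return (self.head - self.tail) < len(self.ram)

    def fill(self, packet: bytes) -> Optional[int]:
        """Copy a complete packet into a slot; None means FILL stalls."""
        if not self.has_free_slot():
            return None
        wr_id = self.head
        self.ram[wr_id & self.mask] = packet
        self.head += 1
        return wr_id

    def serve(self, wr_id: int) -> bytes:
        """Read-only replay, so a retry for the same wr_id sees identical bytes."""
        return self.ram[wr_id & self.mask]

    def complete(self) -> None:
        """One work completion frees the oldest slot."""
        assert self.tail != self.head, "completion with no outstanding slot"
        self.tail += 1
```

A `None` from `fill` corresponds to the stall that lets the FIFO fill and backpressure the source; the next `complete` call releases it.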

Guardrails: assert configured Len <= MAX_BEATS_G*32 (one PMTU/slot, the
whole-slot-replay precondition), and drain an oversized packet's tail to tLast
(F_FLUSH) so it cannot misframe the next packet.

cocotb: 12/12 pass (adds send-opcode/zeroed-RETH, immediate channel/slot,
retry-rereads-same-payload, ring-backpressure, oversized-reframe). Validated
end-to-end on KCU105 + ConnectX-5: clean PRBS at 25 kHz and at TrigDly=0
(free-run) with rxErrors=0 -- native RNR backpressure self-paces the source.

Instantiate surf.AxiStreamMon on the FIFO drain stream (fifoMaster/fifoSlave)
to measure frame count/size/rate/bandwidth of the PRBS packets drained into
the replay ring. Single clock (statusClk = axisClk = roceClk, COMMON_CLK_G).

  * New generic ROCE_CLK_FREQ_G (real, default 156.25E+6) -> AXIS_CLK_FREQ_G.
  * Status outputs exposed read-only on the merged AXI-Lite map at a new 0x200
    block (frameCnt, frameRate/max/min, bandwidth/max/min, frameSize/max/min).
  * monRst = roceRst or resetCounters: a ResetCounters (0x108) write now clears
    the monitor statistics (frameCnt + all min/max) alongside the FW counters.

cocotb: 13/13 (adds reset_counters_clears_axistreammon). Validated on KCU105 +
ConnectX-5: frameRate=25 kHz, frameSize=4064 B, bandwidth=0.813 Gb/s match the
configured stream, and root.CountReset() zeroes all monitor registers.
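
As a quick consistency check on the hardware figures above, stream bandwidth is frame rate times frame size times 8 bits. The helper name is hypothetical, and the calculation counts stream bytes only.

```python
def bandwidth_gbps(frame_rate_hz: float, frame_size_bytes: int) -> float:
    """Stream bandwidth in Gb/s from frame rate (Hz) and frame size (bytes)."""
    return frame_rate_hz * frame_size_bytes * 8 / 1e9


# 25 kHz frames of 4064 bytes, as reported for the KCU105 run
measured = bandwidth_gbps(25e3, 4064)
print(f"{measured:.3f} Gb/s")
```

For 25 kHz and 4064 B this gives 0.8128 Gb/s, which rounds to the reported 0.813 Gb/s.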

Regenerate mkQP.v from a locally-patched RetryHandleSQ.bsv so the SQ treats
rnr_retry=7 as infinite per the IB spec. Previously disableRetryCntReg (which
gates both the timeout AND the RNR retry-count decrements) was derived only from
getMaxRetryCnt; with retry_count=3 and rnr_retry=7 the RNR counter (rnrCntReg)
still decremented and hit 0 after 7 RNR NAKs -> RETRY_LIMIT_EXC -> SQ ERROR ->
permanent datapath wedge (only a QP recreate / transport soft-reset recovered).

BSV fix (separate RNR disable flag), applied to a local clone of
FilMarini/blue-rdma src/RetryHandleSQ.bsv and regenerated with bsc 2023.01:
  * add disableRnrCntReg <= (getMaxRnrCnt == INFINITE_RETRY)
  * gate the RNR rnrCntReg decrement on !disableRnrCntReg
  * RNR limit check: !disableRnrCntReg && isZero(rnrCntReg)
The timeout-retry path (getMaxRetryCnt/disableRetryCntReg) is unchanged.
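
A simplified Python model of the RNR counter shows the difference the separate disable flag makes. It is a sketch, not a translation of RetryHandleSQ.bsv: it assumes INFINITE_RETRY is 7, that the old shared flag was set only when retry_count equals INFINITE_RETRY, and that the counter decrements once per RNR NAK. The function name is hypothetical.

```python
from typing import Optional

INFINITE_RETRY = 7  # assumed encoding of "retry forever"


def rnr_naks_until_limit(rnr_retry: int, retry_count: int,
                         naks: int, fixed: bool) -> Optional[int]:
    """Number of RNR NAKs after which the counter reaches 0, or None if it never does."""
    if fixed:
        disable_rnr = (rnr_retry == INFINITE_RETRY)     # separate RNR disable flag
    else:
        disable_rnr = (retry_count == INFINITE_RETRY)   # old flag, derived from retry_count only
    rnr_cnt = rnr_retry
    for n in range(1, naks + 1):
        if disable_rnr:
            continue                # infinite retry: never decrement
        rnr_cnt -= 1
        if rnr_cnt == 0:
            return n                # retry limit reached in this model
    return None
```

With retry_count=3 and rnr_retry=7, the pre-fix model reaches the limit after 7 NAKs, while the fixed model never does.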

NOTE: the BSV fix is NOT upstreamed to the fork; a future regeneration from
FilMarini/blue-rdma must re-apply it. Only mkQP.v changed (the SQ retry handler);
mkTransportLayer.v and mkAxiSTransportLayer.v regenerate byte-identical, so the
top-level interface is unchanged.

Validated on KCU105 + ConnectX-5: a 10 s SW block of PrbsRx (rxEnable=False) at
the default min_rnr_timer now recovers losslessly (rxErrors=0, stream resumes at
full rate) where it previously wedged; normal 25 kHz PASS and TrigDly=0 max-rate
remain clean.

RoCEv2AxiStreamRdma now measures each SEND's length per-packet from the inbound
tLast (FILL stores the drained byte count per replay slot; DISPATCH drives
workReq.len from it) instead of a software-programmed Len register, so the PRBS
PacketLength can change at runtime without a stop/reload.

- MAX_BEATS_G generic -> MAX_BEATS_C constant (128); add MAX_FRAME_BYTES_C
  (= MAX_BEATS_C*32 = one PMTU) as the per-SEND cap, with a compile-time assert
  that it fits the engine's 13-bit DMA-read len field.
- Offset 0x04 Len (RW) -> MaxSize (RO) readback of the FW cap; drop the static
  lenMatch error check (the slot-full condition is the terminator).
- Replay a PARTIAL final beat correctly: bitReverse(byteEn) so the valid bytes
  that endianSwap moves to the high lanes are marked there, and only flag a
  partial beat that is NOT the last (a partial final beat is legitimate).
- Move the RoCEv2AxiStreamRdma PyRogue device into surf
  (python/surf/ethernet/roce); MaxSize is RO.
- Tests: add dynamic_frame_size, partial_final_beat_byteen,
  maxsize_reads_fw_constant; retarget the oversized test at the cap (16/16).
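
The partial-final-beat fix can be pictured with a short sketch. It assumes endianSwap reverses the 32 byte lanes and that bit i of byteEn marks lane i as valid; the helper names are hypothetical.

```python
BEAT_BYTES = 32


def bit_reverse(value: int, width: int = BEAT_BYTES) -> int:
    """Reverse the bit order of `value` within `width` bits."""
    assert 0 <= value < (1 << width)
    return int(format(value, f"0{width}b")[::-1], 2)


def swap_beat(data: bytes, byte_en: int):
    """Move lane i to lane 31-i for both the data and its byte enables."""
    assert len(data) == BEAT_BYTES
    return data[::-1], bit_reverse(byte_en)


# A final beat with 5 valid bytes in lanes 0..4
data = b"ABCDE".ljust(BEAT_BYTES, b"\x00")
swapped_data, swapped_en = swap_beat(data, 0x1F)
print(hex(swapped_en))
```

After the swap the five valid bytes occupy lanes 27..31, and the reversed enable mask marks exactly those lanes. Without the bit reversal the mask would still point at lanes 0..4, which now hold padding.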

codecov-commenter commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 14.62094% with 473 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (dcqcnEn@37a5f1b). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tests/ethernet/RoCEv2/test_RoCEv2AxiStreamRdma.py | 14.62% | 473 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             dcqcnEn    #1438   +/-   ##
==========================================
  Coverage           ?   25.05%           
==========================================
  Files              ?      265           
  Lines              ?    21535           
  Branches           ?        0           
==========================================
  Hits               ?     5396           
  Misses             ?    16139           
  Partials           ?        0           


ruck314 added 3 commits June 17, 2026 17:08
A frame whose length exceeds the per-SEND cap (MaxSize = one PMTU) is now DROPPED
in FILL (new F_DROP state): its tail is flushed and the slot is NOT published, so
no isRespErr SEND is dispatched. An errored SEND would put the blue-rdma SQ into
its ERROR state (statusSQ_comm_isERR), which only a QP reset (SW) clears, wedging
the datapath until a GUI restart. Dropping keeps the SQ healthy so the path
self-heals the moment the frame size returns to <= MaxSize, with no SW involvement.

- New OversizeCount RO register at 0x10C (count of over-cap frames dropped),
  cleared by ResetCounters.
- cocotb: oversized_packet_dropped_and_reframes verifies the over-cap frame
  produces no SEND, OversizeCount increments, and the next frame dispatches
  cleanly (16/16 pass).
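
The drop policy can be modeled in a few lines. This is a sketch of the intended behavior, not the F_DROP state itself: the function name is hypothetical, and the cap of 128 beats of 32 bytes is taken from the MAX_BEATS_C and MAX_FRAME_BYTES_C description in the earlier commit.

```python
from typing import List, Tuple

MAX_FRAME_BYTES = 128 * 32  # MAX_BEATS_C * 32 = one PMTU (4096 bytes)


def fill_frames(frames: List[bytes]) -> Tuple[List[bytes], int]:
    """Publish frames that fit the cap; drop and count the ones that do not."""
    published = []
    oversize_count = 0
    for frame in frames:
        if len(frame) > MAX_FRAME_BYTES:
            oversize_count += 1   # tail flushed, slot not published, no SEND
            continue
        published.append(frame)
    return published, oversize_count
```

Because an over-cap frame never reaches DISPATCH in this model, the frames on either side of it are published unchanged, which is the self-healing property the commit describes.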

…SEND

Regenerate the blue-rdma transport core with MAX_QP_WR raised 4 -> 16
(src/Settings.bsv). The RDMA-SEND datapath is bandwidth-delay-product
limited: with only ~4 SENDs in flight against a ~2.5 us completion latency,
MonBandwidth caps at ~8.1 Gb/s. A 16-deep SQ hides the latency and reaches
line rate (8.10 -> 9.75 Gb/s at 4096 B, frame rate 247 -> 298 kHz), PRBS
integrity intact, timing closed. RING_SLOTS_G (16) already matches the
deeper window. The rnr_retry=7-as-infinite RNR fix is preserved.

A SoftReset (0xF50) resets the transport core but not this configurator
(the one-shot must survive its own pulse). If the pulse lands while a
metadata exchange is in flight, the core never answers the pre-reset
request and the FSM stalls in GET_RESPONSE_S, ignoring the next
SendMetaData -- RecvMetaData never returns to 1 and a software reconnect
wedges until an FPGA reload. Snap the FSM to IDLE_S (and drop the in-flight
tx) for the soft-reset pulse so the configurator comes out resynced with
the core.
2 participants