Add RoCEv2AxiStreamRdma module + cocotb testbench #1438
Draft
ruck314 wants to merge 10 commits into
RoCEv2AxiStreamRdma owns the RoCEv2 host interface for an AXI-Stream payload source: it buffers the inbound stream in a store-and-forward repack FIFO, issues one RDMA-WRITE-with-immediate work request per complete buffered packet, serves the engine's DMA read by draining that packet into the 290-bit RoceDmaReadResp, and counts work completions through a single merged AXI-Lite register file. Add a cocotb testbench (with a record-flattening wrapper) that emulates the engine side and drives the payload at full rate with configurable engine latency and backpressure to build deep repack-FIFO occupancy. It checks the dmaReadResp byte-order / isFirst / isLast / byteEn packing against the source, the success/unsuccess completion counters, and uses a watchdog so any dispatch stall fails as a timeout. Enable the tests/ethernet/RoCEv2 directory in surf CI.
Pre-existing VSG when_001 violation: move "else" to the right of the closing ")" of the conditional expression. Whitespace-only.
The lockstep dispatch FSM strands in ST2_DRAIN when the RoCEv2 engine stops draining (e.g. QP teardown on host disconnect): ST2_DRAIN's exit waits on a FIFO drain that never arrives and never tests DispatchEnable, and there is no software-accessible soft reset (only roceRst, which needs an FPGA reload). A partial packet left in the store-and-forward FIFO when the source is cut mid-frame also fuses with the next frame on re-arm. Hold the repack FIFO in reset and force the dispatch + REPACK FSMs to IDLE whenever DispatchEnable=0, so clearing DispatchEnable (the documented stop() path) fully quiesces and flushes the datapath and a stop/restart recovers without an FPGA reload. No new register; the PyRogue device map is unchanged. Add two cocotb cases that reproduce the wedge (engine_teardown_then_restart) and the partial-packet fusion (partial_packet_then_rearm); both are RED on the pre-fix RTL and GREEN after.
…trol
Replace the one-shot RDMA-WRITE streaming source with a SEND-with-immediate
datapath that has native FW<->NIC flow control (no software credit register
in the real-time path):
* FILL copies each complete FIFO packet into an addressable replay-RAM slot
(a power-of-2 ring), gated on a free slot.
* SERVE replays the wr_id-addressed slot READ-ONLY, so a blue-rdma RNR/timeout
retry that re-issues the DMA read for the same wr_id re-reads byte-identical
bytes -- the property the old one-shot source lacked.
* DISPATCH issues one IBV_WR_SEND_WITH_IMM (opCode 0x3) per filled slot; RETH
(rAddr/rKey) is driven to 0 so a full host RQ makes the NIC RNR-NAK.
* COMPLETION frees the oldest slot per work completion (ACK-paced). "No free
slot -> FILL stalls -> FIFO fills -> PRBS source backpressured" is the
FW-internal, ACK-driven, software-free flow-control loop.
Guardrails: assert configured Len <= MAX_BEATS_G*32 (one PMTU/slot, the
whole-slot-replay precondition), and drain an oversized packet's tail to tLast
(F_FLUSH) so it cannot misframe the next packet.
cocotb: 12/12 pass (adds send-opcode/zeroed-RETH, immediate channel/slot,
retry-rereads-same-payload, ring-backpressure, oversized-reframe). Validated
end-to-end on KCU105 + ConnectX-5: clean PRBS at 25 kHz and at TrigDly=0
(free-run) with rxErrors=0 -- native RNR backpressure self-paces the source.
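The FILL / SERVE / COMPLETION loop described above can be sketched as a small behavioral model. This is a Python illustration only, not the RTL: the class name `ReplayRing`, its method names, and the choice to use the slot index directly as the wr_id are assumptions made for the example.

```python
class ReplayRing:
    """Behavioral model of a power-of-2 replay ring (illustrative, not the RTL)."""

    def __init__(self, slots=16):
        assert slots > 0 and slots & (slots - 1) == 0, "ring size must be a power of 2"
        self.slots = slots
        self.ram = [None] * slots   # one packet payload per slot
        self.wr_ptr = 0             # next slot FILL will write
        self.rd_ptr = 0             # oldest slot not yet completed
        self.used = 0               # filled slots awaiting a work completion

    def fill(self, payload):
        """FILL: copy a complete packet into a free slot.

        Returns the wr_id (assumed here to be the slot index), or None when
        no slot is free, which is the stall that backpressures the source.
        """
        if self.used == self.slots:
            return None
        wr_id = self.wr_ptr
        self.ram[wr_id] = bytes(payload)
        self.wr_ptr = (self.wr_ptr + 1) & (self.slots - 1)
        self.used += 1
        return wr_id

    def serve(self, wr_id):
        """SERVE: read-only replay, so a retry for the same wr_id sees identical bytes."""
        return self.ram[wr_id]

    def complete(self):
        """COMPLETION: one work completion frees the oldest slot."""
        assert self.used > 0, "completion without an outstanding slot"
        self.rd_ptr = (self.rd_ptr + 1) & (self.slots - 1)
        self.used -= 1
```

In this model a fifth `fill()` on a 4-slot ring returns `None` until `complete()` is called, which mirrors the "no free slot -> FILL stalls" behavior, and calling `serve()` twice with the same wr_id returns the same payload because SERVE never modifies the slot.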
Instantiate surf.AxiStreamMon on the FIFO drain stream (fifoMaster/fifoSlave)
to measure frame count/size/rate/bandwidth of the PRBS packets drained into
the replay ring. Single clock (statusClk = axisClk = roceClk, COMMON_CLK_G).
* New generic ROCE_CLK_FREQ_G (real, default 156.25E+6) -> AXIS_CLK_FREQ_G.
* Status outputs exposed read-only on the merged AXI-Lite map at a new 0x200
block (frameCnt, frameRate/max/min, bandwidth/max/min, frameSize/max/min).
* monRst = roceRst or resetCounters: a ResetCounters (0x108) write now clears
the monitor statistics (frameCnt + all min/max) alongside the FW counters.
cocotb: 13/13 (adds reset_counters_clears_axistreammon). Validated on KCU105 +
ConnectX-5: frameRate=25 kHz, frameSize=4064 B, bandwidth=0.813 Gb/s match the
configured stream, and root.CountReset() zeroes all monitor registers.
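As a quick cross-check of the hardware numbers above, the reported bandwidth should equal frame rate times frame size. The sketch below assumes the figure counts payload bits only and that 1 Gb/s means 1e9 bit/s; the helper name `bandwidth_gbps` is hypothetical.

```python
def bandwidth_gbps(frame_rate_hz, frame_size_bytes):
    """Payload bandwidth in Gb/s (assumes 1 Gb/s = 1e9 bit/s, payload bytes only)."""
    return frame_rate_hz * frame_size_bytes * 8 / 1e9


# 25 kHz frames of 4064 B, as reported for the KCU105 + ConnectX-5 run
bw = bandwidth_gbps(25e3, 4064)
print(f"{bw:.3f} Gb/s")  # 0.813 Gb/s
```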
Regenerate mkQP.v from a locally-patched RetryHandleSQ.bsv so the SQ treats rnr_retry=7 as infinite per the IB spec. Previously disableRetryCntReg (which gates both the timeout AND the RNR retry-count decrements) was derived only from getMaxRetryCnt; with retry_count=3 and rnr_retry=7 the RNR counter (rnrCntReg) still decremented and hit 0 after 7 RNR NAKs -> RETRY_LIMIT_EXC -> SQ ERROR -> permanent datapath wedge (only a QP recreate / transport soft-reset recovered).
BSV fix (separate RNR disable flag), applied to a local clone of FilMarini/blue-rdma src/RetryHandleSQ.bsv and regenerated with bsc 2023.01:
* add disableRnrCntReg <= (getMaxRnrCnt == INFINITE_RETRY)
* gate the RNR rnrCntReg decrement on !disableRnrCntReg
* RNR limit check: !disableRnrCntReg && isZero(rnrCntReg)
The timeout-retry path (getMaxRetryCnt/disableRetryCntReg) is unchanged.
NOTE: the BSV fix is NOT upstreamed to the fork; a future regeneration from FilMarini/blue-rdma must re-apply it.
Only mkQP.v changed (the SQ retry handler); mkTransportLayer.v and mkAxiSTransportLayer.v regenerate byte-identical, so the top-level interface is unchanged.
Validated on KCU105 + ConnectX-5: a 10 s SW block of PrbsRx (rxEnable=False) at the default min_rnr_timer now recovers losslessly (rxErrors=0, stream resumes at full rate) where it previously wedged; normal 25 kHz PASS and TrigDly=0 max-rate remain clean.
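The effect of the separate RNR disable flag can be illustrated with a small Python model. This is a sketch, not a translation of RetryHandleSQ.bsv: the class `RnrRetryModel`, the `fixed` switch, and the check-then-decrement ordering (a QP with rnr_retry=N errors on the NAK that arrives with the counter already at zero) are assumptions made for the example.

```python
INFINITE_RETRY = 7  # rnr_retry = 7 is treated as "retry indefinitely"


class RnrRetryModel:
    """Toy model of the SQ's RNR retry counter (illustrative only)."""

    def __init__(self, rnr_retry, fixed=True):
        # With the fix, the RNR path gets its own disable flag.
        # fixed=False models the old behavior, where no flag protected the RNR counter.
        self.disable_rnr_cnt = fixed and rnr_retry == INFINITE_RETRY
        self.rnr_cnt = rnr_retry
        self.error = False  # models the SQ ERROR state

    def on_rnr_nak(self):
        if self.error or self.disable_rnr_cnt:
            return  # already wedged, or counting disabled (infinite retry)
        if self.rnr_cnt == 0:
            self.error = True  # retry limit exceeded
        else:
            self.rnr_cnt -= 1
```

With `fixed=False` and rnr_retry=7 the counter reaches 0 after seven NAKs and the next NAK sets `error`; with `fixed=True` the counter never moves, so a long receiver stall cannot push the model into the error state.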
RoCEv2AxiStreamRdma now measures each SEND's length per-packet from the inbound tLast (FILL stores the drained byte count per replay slot; DISPATCH drives workReq.len from it) instead of a software-programmed Len register, so the PRBS PacketLength can change at runtime without a stop/reload.
- MAX_BEATS_G generic -> MAX_BEATS_C constant (128); add MAX_FRAME_BYTES_C (= MAX_BEATS_C*32 = one PMTU) as the per-SEND cap, with a compile-time assert that it fits the engine's 13-bit DMA-read len field.
- Offset 0x04 Len (RW) -> MaxSize (RO) readback of the FW cap; drop the static lenMatch error check (the slot-full condition is the terminator).
- Replay a PARTIAL final beat correctly: bitReverse(byteEn) so the valid bytes that endianSwap moves to the high lanes are marked there, and only flag a partial beat that is NOT the last (a partial final beat is legitimate).
- Move the RoCEv2AxiStreamRdma PyRogue device into surf (python/surf/ethernet/roce); MaxSize is RO.
- Tests: add dynamic_frame_size, partial_final_beat_byteen, maxsize_reads_fw_constant; retarget the oversized test at the cap (16/16).
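The partial-final-beat fix and the frame cap can be checked with a few lines of Python. The sketch assumes that endianSwap reverses byte order across the whole 32-byte beat and that bit i of byteEn marks byte lane i; the helper names are hypothetical, and only the constants MAX_BEATS_C and MAX_FRAME_BYTES_C come from the commit message.

```python
BEAT_BYTES = 32
MAX_BEATS_C = 128
MAX_FRAME_BYTES_C = MAX_BEATS_C * BEAT_BYTES  # 4096 bytes, one PMTU
DMA_LEN_BITS = 13

# Counterpart of the compile-time assert: the cap must fit the 13-bit len field.
assert MAX_FRAME_BYTES_C < (1 << DMA_LEN_BITS)


def endian_swap(data):
    """Reverse byte order across the beat (assumed behavior of endianSwap)."""
    return data[::-1]


def bit_reverse(value, width=BEAT_BYTES):
    """Mirror a width-bit mask so that bit i moves to bit (width - 1 - i)."""
    out = 0
    for i in range(width):
        if (value >> i) & 1:
            out |= 1 << (width - 1 - i)
    return out


def replay_beat(data, valid_bytes):
    """Return (swapped data, byteEn) for a beat whose low `valid_bytes` lanes are valid."""
    byte_en = (1 << valid_bytes) - 1  # valid bytes occupy the low lanes before the swap
    return endian_swap(data), bit_reverse(byte_en)
```

For a final beat with 5 valid bytes the mask before the swap is 0x0000001F; after the swap the valid bytes sit in lanes 27 to 31, so the mask has to be 0xF8000000, which is what `bit_reverse` produces. Without the reversal the mask would still point at the low lanes, which after the swap hold the unused bytes.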
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## dcqcnEn #1438 +/- ##
==========================================
Coverage ? 25.05%
==========================================
Files ? 265
Lines ? 21535
Branches ? 0
==========================================
Hits ? 5396
Misses ? 16139
Partials ? 0
A frame whose length exceeds the per-SEND cap (MaxSize = one PMTU) is now DROPPED in FILL (new F_DROP state): its tail is flushed and the slot is NOT published, so no isRespErr SEND is dispatched. An errored SEND would put the blue-rdma SQ into its ERROR state (statusSQ_comm_isERR), which only a QP reset (SW) clears, wedging the datapath until a GUI restart. Dropping keeps the SQ healthy so the path self-heals the moment the frame size returns to <= MaxSize, with no SW involvement.
- New OversizeCount RO register at 0x10C (count of over-cap frames dropped), cleared by ResetCounters.
- cocotb: oversized_packet_dropped_and_reframes verifies the over-cap frame produces no SEND, OversizeCount increments, and the next frame dispatches cleanly (16/16 pass).
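The drop policy can be summarized with a frame-level Python model. It is a sketch of the behavior described above, not the FILL state machine: the class `FillModel` and its method names are hypothetical, and the 4096-byte cap assumes MaxSize equals MAX_BEATS_C*32 with MAX_BEATS_C = 128.

```python
MAX_SIZE = 4096  # assumed MaxSize: 128 beats * 32 bytes (one PMTU)


class FillModel:
    """Frame-level model of FILL with the oversize drop (illustrative only)."""

    def __init__(self, max_size=MAX_SIZE):
        self.max_size = max_size
        self.oversize_count = 0  # models the OversizeCount register
        self.published = []      # frames that would be dispatched as SENDs

    def push_frame(self, frame):
        """Return True if the frame is published, False if it is dropped."""
        if len(frame) > self.max_size:
            self.oversize_count += 1  # whole frame discarded, no slot published
            return False
        self.published.append(bytes(frame))
        return True

    def reset_counters(self):
        """Models a ResetCounters write clearing OversizeCount."""
        self.oversize_count = 0
```

An over-cap frame only increments the counter, and the next in-range frame is published normally, which is the self-healing property the commit relies on.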
…SEND
Regenerate the blue-rdma transport core with MAX_QP_WR raised 4 -> 16 (src/Settings.bsv). The RDMA-SEND datapath is bandwidth-delay-product limited: with only ~4 SENDs in flight against a ~2.5 us completion latency, MonBandwidth caps at ~8.1 Gb/s. A 16-deep SQ hides the latency and reaches line rate (8.10 -> 9.75 Gb/s at 4096 B, frame rate 247 -> 298 kHz), PRBS integrity intact, timing closed. RING_SLOTS_G (16) already matches the deeper window. The rnr_retry=7-as-infinite RNR fix is preserved.
A SoftReset (0xF50) resets the transport core but not this configurator (the one-shot must survive its own pulse). If the pulse lands while a metadata exchange is in flight, the core never answers the pre-reset request and the FSM stalls in GET_RESPONSE_S, ignoring the next SendMetaData -- RecvMetaData never returns to 1 and a software reconnect wedges until an FPGA reload. Snap the FSM to IDLE_S (and drop the in-flight tx) for the soft-reset pulse so the configurator comes out resynced with the core.
Summary
Upstreams RoCEv2AxiStreamRdma (developed in the Simple-10GbE-RUDP-KCU105-Example) into surf's RoCEv2 RTL, and adds a cocotb testbench that fills the module's previously-missing test coverage.
RoCEv2AxiStreamRdma owns the RoCEv2 host interface for an AXI-Stream payload source:
* buffers the inbound stream in a store-and-forward repack FIFO (VALID_THOLD_G=0, 32-byte internal width),
* issues one RDMA-WRITE-with-immediate work request per complete buffered packet,
* serves the engine's DMA read by draining that packet into the 290-bit RoceDmaReadResp (endianSwap(data) & byteEn & isFirst & isLast),
* counts work completions through a single merged AXI-Lite register file.
It exposes GEN_SYNC_FIFO_G + a separate sAxisClk/sAxisRst so the payload source can live in its own clock domain.
Testbench
tests/ethernet/RoCEv2/test_RoCEv2AxiStreamRdma.py (+ a record-flattening RoCEv2AxiStreamRdmaWrapper.vhd, mirroring RoceConfiguratorWrapper) emulates the engine side (workReq accept → one dmaReadReq → multi-beat dmaReadResp drain → success workComp) and drives the slave payload at full rate with configurable engine latency and backpressure to build deep repack-FIFO occupancy. It checks:
* dmaReadResp byte-order / isFirst / isLast / byteEn against the source (catches lane-swap, drop, duplicate, reorder),
* SuccessCounter / UnsuccessCounter,
* a watchdog, so any dispatch stall fails as a timeout.
Cases: single-packet sanity, full-rate high-occupancy, dmaReadResp backpressure, an all-knobs-adversarial run, and a mid-stream engine stall (FIFO saturates then resumes). All pass under GHDL; the module is robust under full-rate occupancy/backpressure with no deadlock.
CI
Enables the tests/ethernet/RoCEv2 directory in surf_ci.yml (all 7 tests in that directory pass locally).
Status
Draft: opening for CI and review.