support TBO decode in Deepseek v4 #1275

Open
ZhangLirong-amd wants to merge 3 commits into main from tbo_decode_v4
ZhangLirong-amd (Collaborator) commented Jun 18, 2026

Motivation

image

Technical Details

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings June 18, 2026 03:41

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds Two-Batch Overlap (TBO) decode support for the Deepseek V4 stack, with the goal of keeping DP collectives and CUDAGraph/HIPGraph replay stable when decode is split into concurrent micro-batches.

Changes:

  • Add DP-synchronized per-ubatch sizing/metadata in the TBO wrapper so MoE DP collectives use consistent per-ubatch token counts.
  • Introduce Deepseek V4 decode-path adjustments for TBO (stable scratch buffers for graph capture, avoid using padded block_table rows, and disable async/dual-stream paths that are unsafe under concurrent ubatch threads).
  • Add Deepseek V4 attention metadata support for building per-ubatch decode metadata into ub{0,1}_* buffer sets, and tighten TBO capture gating to bs > 2.
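
The DP-synchronized sizing in the first bullet can be illustrated with a small sketch. This is not the PR's code: `split_evenly`, `dp_unified_ubatch_sizes`, and the pad-to-the-maximum-across-ranks rule are assumptions made for illustration, and the DP ranks are simulated with a plain list instead of a real all-reduce.

```python
from typing import List


def split_evenly(bs: int, num_ubatches: int) -> List[int]:
    """Split `bs` requests into near-equal parts (hypothetical split rule)."""
    base, rem = divmod(bs, num_ubatches)
    return [base + (1 if i < rem else 0) for i in range(num_ubatches)]


def dp_unified_ubatch_sizes(per_rank_bs: List[int], num_ubatches: int = 2) -> List[int]:
    """Return one padded size per ubatch that every simulated DP rank agrees on.

    Assumption: each ubatch is padded to the largest size any rank needs, so a
    collective issued for ubatch i sees the same token count on all ranks.
    A real implementation would compute the maximum with an all-reduce.
    """
    per_rank_splits = [split_evenly(bs, num_ubatches) for bs in per_rank_bs]
    return [max(split[i] for split in per_rank_splits) for i in range(num_ubatches)]


# Three simulated ranks holding 5, 8, and 3 decode requests.
sizes = dp_unified_ubatch_sizes([5, 8, 3])
```

With these inputs the per-rank splits are `[3, 2]`, `[4, 4]`, and `[2, 1]`, so every rank would pad both ubatches to 4 tokens.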

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| atom/utils/tbo/ubatch_wrapper.py | Builds per-ubatch DPMetadata and uses DP-unified per-ubatch padded decode batch sizing; threads receive per-ubatch DP metadata. |
| atom/models/deepseek_v4.py | Adds per-ubatch fixed-address scratch for graph stability; fixes decode top-k sizing under padded TBO metadata; disables async compressor under TBO. |
| atom/model_ops/moe.py | Disables custom CA/IPC all-gather during TBO overlap to avoid cross-thread corruption/deadlock. |
| atom/model_ops/module_dispatch_ops.py | Disables dual-stream MoE forwarding while TBO overlap is active. |
| atom/model_ops/attentions/deepseek_v4_attn.py | Adds TBO decode metadata preparation + per-ubatch buffer allocation and prefixes to avoid cross-ubatch buffer sharing. |
| atom/model_engine/model_runner.py | Updates TBO capture gating from bs >= 2 to bs > 2 for decode. |
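
The per-ubatch, fixed-address buffers mentioned in the table can be pictured as follows. This is a minimal sketch under stated assumptions: `allocate_ubatch_buffers` and `write_in_place` are hypothetical helpers, the buffer names are illustrative, and plain Python lists stand in for the device tensors the real code would use. The point is only that each ubatch gets its own prefixed buffer set, and that updates happen in place so a captured graph keeps seeing the same object.

```python
from typing import Dict, List


def allocate_ubatch_buffers(
    names: List[str], max_bs: int, num_ubatches: int = 2
) -> Dict[str, List[int]]:
    """Allocate one independent buffer per (ubatch, name), keyed 'ub<i>_<name>'."""
    return {
        f"ub{ub}_{name}": [0] * max_bs
        for ub in range(num_ubatches)
        for name in names
    }


def write_in_place(buf: List[int], values: List[int]) -> None:
    """Overwrite the leading entries without rebinding or resizing the buffer."""
    if len(values) > len(buf):
        raise ValueError("values do not fit in the preallocated buffer")
    buf[: len(values)] = values


# Hypothetical buffer names; allocated once, before any graph capture.
buffers = allocate_ubatch_buffers(["block_table", "context_lens"], max_bs=4)
ub0_lens = buffers["ub0_context_lens"]

# A later decode step refreshes ubatch 0 only; ubatch 1's buffer is untouched.
write_in_place(ub0_lens, [17, 9])
```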

Comment on lines +1232 to +1236
from atom.utils.tbo.ubatch_wrapper import UBatchWrapper

ctx = get_forward_context()
padded_list = [
    UBatchWrapper._decode_ub_padded_bs(ctx, i, N, bs) for i in range(N)
Copilot AI review requested due to automatic review settings June 18, 2026 04:19

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment on lines +1185 to +1190
self._ubatch_decode_meta = None
if (
    self.model_runner.config.enable_tbo_decode
    and scheduled_bs > 2
    and not batch.is_dummy_run
):
Comment on lines +1232 to +1237
from atom.utils.tbo.ubatch_wrapper import UBatchWrapper

ctx = get_forward_context()
padded_list = [
    UBatchWrapper._decode_ub_padded_bs(ctx, i, N, bs) for i in range(N)
]
Comment on lines +2314 to 2317
  # Create ubatch slices for TBO capture (need > 2 requests)
  ubatch_slices = None
- if is_tbo and self.config.enable_tbo_decode and bs >= 2:
+ if is_tbo and self.config.enable_tbo_decode and bs > 2:
      ubatch_slices = maybe_create_ubatch_slices(
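
The tightened gate can be sketched as a standalone function. This is an illustration only: `sketch_ubatch_slices` is a hypothetical stand-in, not the real `maybe_create_ubatch_slices` (whose signature is not shown in this PR page), and the near-equal split is an assumption.

```python
from typing import List, Optional


def sketch_ubatch_slices(
    bs: int, enable_tbo_decode: bool, num_ubatches: int = 2
) -> Optional[List[slice]]:
    """Return request slices per ubatch, or None when TBO should not be used."""
    # Mirrors the gate in the diff: TBO decode requires strictly more than 2 requests.
    if not enable_tbo_decode or bs <= 2:
        return None
    base, rem = divmod(bs, num_ubatches)
    slices: List[slice] = []
    start = 0
    for i in range(num_ubatches):
        size = base + (1 if i < rem else 0)  # assumed near-equal split
        slices.append(slice(start, start + size))
        start += size
    return slices


requests = ["r0", "r1", "r2", "r3", "r4"]
slices = sketch_ubatch_slices(len(requests), enable_tbo_decode=True)
ubatches = [requests[s] for s in slices] if slices else [requests]
```

Under the old `bs >= 2` condition a batch of exactly two requests would have been split; in this sketch it falls through to the single-batch path.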
Comment thread atom/model_ops/moe.py
Comment on lines +251 to 256
from atom.utils.tbo.ubatching import tbo_active

use_cag = use_cag and not tbo_active()
gathered_hidden_states = get_dp_group().all_gather(
    padded_x, use_custom=use_cag, dim=0
)
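
The pattern in this hunk, falling back from the custom all-gather whenever TBO overlap is active, can be sketched with a process-wide flag. How `tbo_active` is actually implemented in `atom.utils.tbo.ubatching` is not shown here, so `tbo_overlap`, `tbo_active_sketch`, and `choose_all_gather` below are hypothetical names, and the counter-based flag is an assumption.

```python
import threading
from contextlib import contextmanager

_tbo_depth = 0                 # number of overlap regions currently open
_tbo_lock = threading.Lock()   # the flag is shared by all ubatch threads


@contextmanager
def tbo_overlap():
    """Mark a region in which ubatch threads run concurrently (hypothetical)."""
    global _tbo_depth
    with _tbo_lock:
        _tbo_depth += 1
    try:
        yield
    finally:
        with _tbo_lock:
            _tbo_depth -= 1


def tbo_active_sketch() -> bool:
    """Stand-in for the real tbo_active(); True while any overlap region is open."""
    with _tbo_lock:
        return _tbo_depth > 0


def choose_all_gather(use_cag: bool) -> str:
    """Pick the all-gather path, disabling the custom one under TBO overlap."""
    use_cag = use_cag and not tbo_active_sketch()
    return "custom" if use_cag else "default"


outside = choose_all_gather(True)
with tbo_overlap():
    inside = choose_all_gather(True)
```

The flag is deliberately process-wide rather than thread-local in this sketch, since both ubatch threads need to observe the same state.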
2 participants