fix(voice): skip end-of-turn metrics on stale/out-of-order speaking anchor #6098

Open
anshulkulhari7 wants to merge 2 commits into livekit:main from anshulkulhari7:fix/eou-metrics-stale-speaking-anchor

Conversation

@anshulkulhari7

Summary

Fixes #6093.

For some user turns, the reported transcription_delay / end_of_turn_delay metrics on ChatMessage are extremely large (often >200s) even though the session recordings show no real delay, and stopped_speaking_at can precede started_speaking_at. In other turns the fields are missing entirely.

Root cause

The metrics are computed in _bounce_eou_task (audio_recognition.py) from three captured anchors:

started_speaking_at  = speech_start_time
stopped_speaking_at  = last_speaking_time           # the internal _last_speaking_time anchor
transcription_delay  = max(last_final_transcript_time - last_speaking_time, 0)
end_of_turn_delay    = time.time() - last_speaking_time

The guard around this block only checked that the three values were not None. When the turn detector commits a user turn whose _last_speaking_time was never refreshed for that segment (e.g. consecutive same-role turns split from one continuous utterance, with no VAD speech-stop/start cycle between them), the anchor is left over from an earlier point in the session and can predate the start of the current turn.

In that case the not-None guard still passes, so end_of_turn_delay = now - last_speaking_time becomes ~200s and stopped_speaking_at ends up before started_speaking_at, exactly the payload reported in the issue.
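
To make the arithmetic concrete, here is a small standalone sketch of the failure mode. The timestamps are invented for illustration (they are not taken from the reported payload), and `now` stands in for `time.time()`.

```python
# Illustrative timestamps only (seconds); not taken from a real session.
speech_start_time = 1_000_220.0           # start of the current turn
last_speaking_time = 1_000_000.0          # stale anchor left over from an earlier turn
last_final_transcript_time = 1_000_220.4  # final transcript for the current turn
now = 1_000_220.5                         # stands in for time.time()

# The original guard: only checks that the three anchors are present.
anchors = (speech_start_time, last_speaking_time, last_final_transcript_time)
guard_passes = all(a is not None for a in anchors)

# Same formulas as in the root-cause description above.
transcription_delay = max(last_final_transcript_time - last_speaking_time, 0)
end_of_turn_delay = now - last_speaking_time

# The anchor is out of order, which the not-None guard cannot detect.
anchor_is_stale = last_speaking_time < speech_start_time

print(guard_passes, anchor_is_stale, end_of_turn_delay)
```

Both delays come out at roughly 220 seconds because they are measured from an anchor that belongs to an earlier turn, while the not-None guard reports nothing wrong.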

This is the same class of bug noted in #2361 / #5669 / #4388 (stale/0 anchor), now manifesting as an out-of-order anchor on adjacent turns within one long utterance.

Fix

An anchor that predates the start of the turn (last_speaking_time < speech_start_time) is logically impossible — you cannot stop speaking before the turn started. The existing code already has a policy for unreliable timing (see the in-code comment): skip the calculation and report the metrics as None, because that is better than emitting a likely-wrong value. This change extends that same policy to the out-of-order case.

The computation is extracted into a small pure helper, _compute_end_of_turn_metrics, which:

  • returns None for all four metrics when any anchor is missing or when last_speaking_time < speech_start_time (stale/out-of-order), and
  • otherwise returns the same values as before (with end_of_turn_delay now clamped to >= 0, consistent with the existing transcription_delay clamp).

This makes the behaviour directly unit-testable without audio/STT/VAD.
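
The policy described above can be sketched as follows. This is not the code from the PR: the names `EndOfTurnMetricsSketch` and `compute_end_of_turn_metrics_sketch`, the parameter list, and the explicit `now` argument are assumptions made so the example is self-contained and deterministic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EndOfTurnMetricsSketch:
    """Hypothetical stand-in for the PR's metrics container; all fields default to None."""

    started_speaking_at: float | None = None
    stopped_speaking_at: float | None = None
    transcription_delay: float | None = None
    end_of_turn_delay: float | None = None


def compute_end_of_turn_metrics_sketch(
    speech_start_time: float | None,
    last_speaking_time: float | None,
    last_final_transcript_time: float | None,
    now: float,  # assumption: passed in rather than read from time.time()
) -> EndOfTurnMetricsSketch:
    # Missing anchor: skip the calculation and report every metric as None.
    if (
        speech_start_time is None
        or last_speaking_time is None
        or last_final_transcript_time is None
    ):
        return EndOfTurnMetricsSketch()

    # Stale/out-of-order anchor: speech cannot stop before the turn started.
    if last_speaking_time < speech_start_time:
        return EndOfTurnMetricsSketch()

    # Well-ordered anchors: same formulas as before, with both delays clamped to >= 0.
    return EndOfTurnMetricsSketch(
        started_speaking_at=speech_start_time,
        stopped_speaking_at=last_speaking_time,
        transcription_delay=max(last_final_transcript_time - last_speaking_time, 0.0),
        end_of_turn_delay=max(now - last_speaking_time, 0.0),
    )


# Well-ordered turn: small, bounded delays.
print(compute_end_of_turn_metrics_sketch(100.0, 101.0, 101.25, now=101.5))
# Stale anchor that predates the turn start: all four metrics are None.
print(compute_end_of_turn_metrics_sketch(220.0, 0.0, 220.25, now=220.5))
```

Taking the current time as an argument is a choice made for this sketch; it keeps the function free of clock reads so the boundary case (anchor equal to the turn start) can be checked with fixed numbers.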

Testing

New unit test module tests/test_end_of_turn_metrics.py exercises the pure helper with crafted timestamps (no audio):

  • test_normal_turn_produces_small_bounded_delays — well-ordered turn yields the expected sub-second delays.
  • test_stale_anchor_predating_turn_start_is_skipped — regression for this issue, using the exact ~220s numbers from the reported payload; all four metrics must be None.
  • test_anchor_equal_to_start_is_accepted — boundary (last_speaking_time == speech_start_time) stays valid.
  • test_missing_anchor_is_skipped — any missing anchor skips the calculation.
$ uv run pytest tests/test_end_of_turn_metrics.py --unit -q
......                                                                   [100%]
6 passed in 0.02s

Confirmed that the regression test fails without the fix (with the ordering guard reverted): test_stale_anchor_predating_turn_start_is_skipped failed with started_speaking_at=1781342804.815377, end_of_turn_delay=220.28458189964294, i.e. the bogus >200s value. The existing tests/test_speech_start_time_persistence.py still passes.

ruff check, ruff format --check, and mypy are clean on the changed files.

AI disclosure

This change was AI-assisted; all logic, tests, and verification were reviewed by the author.

…nchor

When the turn detector commits a user turn whose _last_speaking_time anchor
was never refreshed for that segment (e.g. consecutive same-role turns split
from one continuous utterance), the anchor can be left over from an earlier
point in the session and predate the start of the current turn. The metric
computation only guarded against None values, so it still produced
transcription_delay / end_of_turn_delay on the order of hundreds of seconds
and a stopped_speaking_at that precedes started_speaking_at.

Treat an out-of-order anchor (last_speaking_time < speech_start_time) the
same as unreliable VAD timing: skip the calculation and report the metrics
as None rather than emitting a likely-wrong value. Extract the computation
into a pure _compute_end_of_turn_metrics helper and add unit tests covering
the normal, boundary, stale-anchor, and missing-anchor cases.

Fixes livekit#6093
@anshulkulhari7 anshulkulhari7 requested a review from a team as a code owner June 14, 2026 13:01
@CLAassistant

CLAassistant commented Jun 14, 2026

CLA assistant check
All committers have signed the CLA.

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.


@chenghao-mou chenghao-mou left a comment

Member

Thanks for the PR! I have one small comment.



@dataclass
class _EndOfTurnMetrics:

Member

Since _EndOfTurnInfo is internal, I would go one step further and replace the four variables in _EndOfTurnInfo with this directly, so we don't have to unpack those values or pass them around individually.

Author

Done — folded the four metric fields on _EndOfTurnInfo into a single metrics: _EndOfTurnMetrics. The computed object is now passed straight through; _user_turn_completed_task, _init_metrics_from_end_of_turn, and the turn span read info.metrics.*. mypy strict (593 files) and the unit tests pass.
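
For readers following the thread, the shape of that refactor might look roughly like the sketch below. It is illustrative only: `MetricsSketch`, `EndOfTurnInfoSketch`, the `transcript` field, and `describe_turn` are hypothetical names, not the actual definitions in the repository.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MetricsSketch:
    """Hypothetical stand-in for _EndOfTurnMetrics."""

    started_speaking_at: float | None = None
    stopped_speaking_at: float | None = None
    transcription_delay: float | None = None
    end_of_turn_delay: float | None = None


@dataclass
class EndOfTurnInfoSketch:
    """Hypothetical stand-in for _EndOfTurnInfo with a single nested metrics field."""

    transcript: str  # hypothetical field, for illustration only
    # One nested object instead of four separate metric fields.
    metrics: MetricsSketch = field(default_factory=MetricsSketch)


def describe_turn(info: EndOfTurnInfoSketch) -> str:
    # Readers go through info.metrics.* instead of unpacking individual values.
    delay = info.metrics.end_of_turn_delay
    if delay is None:
        return "end_of_turn_delay unavailable"
    return f"end_of_turn_delay={delay:.2f}s"


skipped = EndOfTurnInfoSketch(transcript="hello")
measured = EndOfTurnInfoSketch(
    transcript="hello", metrics=MetricsSketch(end_of_turn_delay=0.5)
)
print(describe_turn(skipped))
print(describe_turn(measured))
```

With this layout, the computed metrics object can be attached as-is, and a skipped calculation is simply the default instance with every field set to None.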

… duplicated fields

Per review on livekit#6098: _EndOfTurnInfo (internal) carried the same four metric
fields as _EndOfTurnMetrics. Replace them with a single metrics field so the
computed value is passed through directly instead of unpacked and repacked.
Readers (_user_turn_completed_task, _init_metrics_from_end_of_turn) and the
turn span now read info.metrics.*.

Development

Successfully merging this pull request may close these issues.

transcription_delay / end_of_turn_delay incorrectly large (>200s) or missing — stopped_speaking_at predates started_speaking_at

3 participants