Skip to content

feat(voice): surface inference quota errors instead of going silent#6012

Open
shawnfeldman wants to merge 4 commits into
mainfrom
sf/agents-issue-6009
Open

feat(voice): surface inference quota errors instead of going silent#6012
shawnfeldman wants to merge 4 commits into
mainfrom
sf/agents-issue-6009

Conversation

@shawnfeldman

Copy link
Copy Markdown

Summary

Fixes #6009 β€” a voice agent goes silently unresponsive when the LLM endpoint returns HTTP 429 {"type": "inference_quota_exceeded", ...} (e.g. the project is out of LiveKit Inference credits). Today the agent joins the room, publishes its track, and then never speaks: STT transcribes fine, the reply produces no text, TTS is never invoked, and the session absorbs three silent dead turns before closing with no audible or visible signal. The gateway hands us everything we need (type, hint, quota_type) in the response body β€” we just never looked at it.

This PR makes that failure perceptible by default.

What changed

A. Typed detection β€” new APIQuotaExceededError(APIStatusError) exposing quota_type, category, hint, and remaining_limit decoded from the gateway body. It defaults to retryable=False since quota exhaustion won't recover on an immediate retry. create_api_error_from_http and a APIQuotaExceededError.from_response(...) factory construct it whenever the body is an inference_quota_exceeded payload. Exported from livekit.agents.

B. Plugin wiring β€” the inference LLM plugin (also the base class for the OpenAI-compatible livekit-plugins-openai LLMStream) raises the typed error on a 429 quota body, so ev.error / APIError.body are reliable.

C. Surface on the first occurrence β€” AgentSession._on_error no longer absorbs a terminal quota error up to max_unrecoverable_errors. The 3-strike tolerance still applies to transient errors; a known-terminal quota error closes immediately.

D. Perceptible signal, by default β€” a new AgentSession(error_message=...) option speaks a fallback line just before the session closes on an unrecoverable error:

  • omit it (default): quota errors speak the gateway hint (generic fallback if absent); other errors stay silent β€” no behavior change for the generic path.
  • error_message="...": speak that message on any unrecoverable error.
  • error_message=None: disable spoken errors entirely.

Spoken delivery is best-effort (bounded by a timeout, wrapped so a failing TTS can't stall teardown). The error and close events still fire regardless, carrying the typed error for frontend handlers.

E. Docs, example, tests β€” docstrings on the new error + option, a quota_exceeded.py example, and unit tests.

Acceptance criteria

  • 429 inference_quota_exceeded is detectable via a typed error (APIQuotaExceededError) and a documented body field.
  • By default (or via the one-line error_message option) the user gets a perceptible signal β€” never pure silence.
  • A terminal quota error surfaces on the first occurrence, not after max_unrecoverable_errors.
  • Docs include the recipe + body contract; an example demonstrates it; tests cover 429-quota β†’ surfaced + spoken.
  • Agent Builder preview rendering β€” out of scope here (first-party frontend); the typed error/close events now carry the structured info needed to render it.

Test plan

  • uv run pytest tests/test_exceptions.py tests/test_unrecoverable_error.py --unit β€” new tests for the typed error, the inference-plugin conversion, surface-on-first-occurrence, and the spoken fallback (default hint / generic / disabled / custom).
  • make check (ruff format + lint + strict mypy) passes.
  • Full uv run pytest --unit suite: 832 passed, 4 skipped (the 9 test_room.py errors are pre-existing FileNotFoundError from a missing local server binary, unrelated to this change).

Notes / follow-ups

  • The typed error covers quota_type llm/stt/tts/bargein, but only the LLM HTTP path is wired today β€” the inference STT/TTS WebSocket paths don't surface the JSON body on connect, so they'd need separate handling.
  • Forwarding the structured error over the wire for the Agent Builder preview is a separate, frontend-facing change.

πŸ€– Generated with Claude Code

When the LLM endpoint returns HTTP 429 with body `inference_quota_exceeded`
(e.g. a project is out of LiveKit Inference credits), a voice agent would
join, publish its track, and then never speak β€” absorbing three silent dead
turns before closing with no audible or visible signal. The diagnosis
(`type`, `hint`, `quota_type`) was in the response body but never surfaced.

This makes the failure perceptible:

- Add `APIQuotaExceededError(APIStatusError)`, carrying `quota_type`,
  `category`, `hint`, and `remaining_limit` decoded from the gateway body.
  It defaults to `retryable=False` (quota exhaustion won't recover on an
  immediate retry). `create_api_error_from_http` and a `from_response`
  factory build it whenever the body is a quota payload.
- The inference LLM plugin (also the base for the OpenAI-compatible plugin)
  raises the typed error on a 429 quota body.
- `AgentSession` now surfaces a terminal quota error on the FIRST occurrence
  rather than after `max_unrecoverable_errors`, and speaks a fallback line
  before closing. New `error_message` option: omit it to speak the gateway
  `hint` for quota errors (generic fallback otherwise), pass a string to
  speak it on any unrecoverable error, or pass `None` to disable. Spoken
  delivery is best-effort and never blocks teardown.
- Add `quota_exceeded.py` example, plus unit tests for the typed error,
  the plugin conversion, and the surface/speak behavior.

Fixes #6009

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@chenghao-mou chenghao-mou requested a review from a team June 8, 2026 17:53

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.

Open in Devin Review

Comment on lines +1519 to +1527
if not terminal:
if error.type == "llm_error":
self._llm_error_counts += 1
if self._llm_error_counts <= self.conn_options.max_unrecoverable_errors:
return
elif error.type == "tts_error":
self._tts_error_counts += 1
if self._tts_error_counts <= self.conn_options.max_unrecoverable_errors:
return

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Unrecoverable STT errors still close the session on first occurrence

The _on_error method at livekit-agents/livekit/agents/voice/agent_session.py:1519-1527 only counts and tolerates repeated errors for llm_error and tts_error types. An unrecoverable stt_error or realtime_model_error falls through both branches and immediately triggers session close β€” even if terminal is False. This is pre-existing behavior (unchanged by this PR) and may be intentional (STT errors are rarer and harder to recover from), but it means a single transient STT failure that's marked non-recoverable will close the session without the max_unrecoverable_errors tolerance that LLM/TTS get.

Open in Devin Review

Was this helpful? React with πŸ‘ or πŸ‘Ž to provide feedback.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct that this is pre-existing β€” stt_error and realtime_model_error fall through to close on the first unrecoverable occurrence, and this PR preserves that (the new terminal short-circuit sits in front of the unchanged llm/tts counters). It's arguably intentional: an STT stream persists for the whole agent lifetime (unlike LLM/TTS, which are recreated per response), so it isn't recreated to "retry" the way the tolerance assumes.

This PR does make a clean future fix possible: _on_error now keys off the generic APIError.terminal flag, so extending max_unrecoverable_errors tolerance to STT/realtime (or marking specific errors terminal) is a localized change. Leaving it out of scope here to keep the diff focused on the quota issue.

@@ -0,0 +1,80 @@
import logging

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we need an example for gateway errors. Maybe we should put this in docs instead?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. For now I've kept the example but fixed a real bug in it (it used ev.error where it needs ev.error.error β€” ErrorEvent.error is the LLMError/STTError wrapper, so the isinstance guard was always False). The minimal @session.on("error") recipe also lives in the APIQuotaExceededError docstring.

Happy to delete the example and move it to the docs site instead if you'd prefer that β€” just say the word and I'll drop examples/voice_agents/quota_exceeded.py + its README entry.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought, we can leave it here so we have some example for other error handling as well. This is hard to document for all other vendors.

remaining_limit: str | None = None,
) -> None:
# quota exhaustion won't recover on an immediate retry
if retryable is None:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we differentiate retry behaviour based on concurrency/RPM quotas vs credits difference?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in e7357c6 β€” APIQuotaExceededError now derives retryable/terminal from category (verified against agent-gateway/pkg/quota/response.go::quotaHint):

  • Terminal + non-retryable: credit exhaustion β€” MaxGatewayCredits, MaxBargeInRequests ("Wait for the next billing cycle…").
  • Retryable + non-terminal: rate/concurrency limits β€” MaxConcurrentGatewayLLMRpm/Tpm, MaxConcurrentGatewaySTT/TTS, MaxBargeInRPM, etc. (and any unknown/missing category, to avoid regressions).

So a transient rate-limit 429 is now retried with backoff by the stream and falls through max_unrecoverable_errors exactly as before this PR; only true credit exhaustion closes on the first turn. Added tests for both classes (retried vs terminal).

ivr_detection: bool = False,
user_away_timeout: float | None = 15.0,
session_close_transcript_timeout: float = 2.0,
error_message: NotGivenOr[str | None] = NOT_GIVEN,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the docs, we recommend using pre-recorded audio for errors like this: https://docs.livekit.io/reference/agents/events/#pre-recorded-audio
so that out of quota from TTS won't block this.

Should we just introduce APIError.terminal in addition to retryable so users can use the same error handling easier (so that this behaviour is not limited to quota errors)?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both addressed in e7357c6:

APIError.terminal β€” added a generic terminal: bool = False on APIError (independent of retryable: retryable governs in-request retries, terminal governs whether higher-level loops should give up at once). AgentSession._on_error now keys off error.error.terminal instead of isinstance(..., APIQuotaExceededError), so any terminal error (incl. future 401 bad-key / 400 bad-request cases) can short-circuit the max_unrecoverable_errors tolerance without further session changes. APIQuotaExceededError sets it from category.

Pre-recorded audio β€” documented the limitation: the error_message fallback synthesizes through the configured TTS, so it can't be heard if TTS itself is the exhausted/failed resource. The error_message docstring and a code comment in _try_speak_error_message now point to @session.on("error") + session.say(..., audio=...) with pre-recorded audio (linking the events docs) for a TTS-resilient signal.

@shawnfeldman shawnfeldman left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

Reviewed the diff against the actual gateway contract in agent-gateway (pkg/quota/), the base LLM retry loop, and the existing reviewer comments. The shape is solid and the spoken-fallback machinery is carefully guarded; one classification issue is worth fixing before merge.

Important

1. Transient rate-limit 429s get misclassified as terminal + non-retryable (regression)

APIQuotaExceededError defaults retryable=False for every inference_quota_exceeded body (_exceptions.py:157-158), and AgentSession._on_error treats every instance as terminal, closing on the first occurrence (agent_session.py:1517). But the gateway emits type: inference_quota_exceeded for two very different classes of 429:

category meaning recovers?
MaxGatewayCredits (+ MaxBargeInRequests) credit / quota exhausted only at next billing cycle β†’ terminal
MaxConcurrentGatewayLLMRpm, MaxConcurrentGatewayLLMTpm (+ STT/TTS/bargein concurrency) per-minute rate / concurrency limit within ~a minute via backoff β†’ transient

Authoritative source: agent-gateway/pkg/quota/check.go:68-79 and pkg/quota/response.go:55-77 β€” the hint for the rate-limit categories literally says "Reduce request rate … or upgrade", vs "Wait for the next billing cycle" for credits.

The live impact today is on the LLM path (the only one wired): before this PR, a pre-stream 429 raised APIStatusError(retryable=True) (inference/llm.py:395; 429 is exempt from the 4xxβ†’non-retryable rule in APIStatusError.__init__), so the base stream retried it with backoff (llm/llm.py:215-262, gated on e.retryable). After this PR a single MaxConcurrentGatewayLLMRpm/…Tpm spike is now (a) non-retryable and (b) kills the session permanently on the first turn β€” and speaks an "out of credits / temporarily unavailable" line for what is actually a recoverable rate-limit blip. That's exactly what @chenghao-mou flagged on _exceptions.py.

The class already knows about the split β€” the category docstring (_exceptions.py:134-135) calls out MaxConcurrentGatewayLLMRpm as a "rate-limit variant" β€” it just doesn't act on it. Suggested fix: derive retryable/terminal from category rather than from the body type alone. Only the credit/quota-exhaustion categories (MaxGatewayCredits, MaxBargeInRequests) should be retryable=False + terminal; the rate/concurrency categories should stay retryable and fall through the existing max_unrecoverable_errors tolerance. Then have _on_error key off that derived flag instead of isinstance(..., APIQuotaExceededError).

Suggestions

2. The spoken fallback can't be heard when TTS itself is the exhausted resource. When quota_type == "tts" (MaxConcurrentGatewayTTS), _try_speak_error_message tries to say() through the very resource that's rate-limited. Latent today (only the LLM path is wired, so quota_type == "llm" and the separate TTS still works), but it becomes real once the STT/TTS WS paths surface the body β€” your own follow-up note. As @chenghao-mou said, the pre-recorded audio path is the robust signal there; worth a code comment acknowledging the limitation.

3. Echoing @chenghao-mou: consider a generic APIError.terminal. Hardcoding isinstance(error.error, APIQuotaExceededError) in the session couples teardown policy to one exception subclass. A terminal attribute on APIError (default False) would be the natural home for the credits-vs-rate-limit bit from #1, and would also let other genuinely terminal errors (401 bad key, 400 bad request) short-circuit the tolerance loop without further session changes.

4. Test the transient case. Both new test files only exercise MaxGatewayCredits bodies β€” i.e. the case the PR handles correctly. Adding a MaxConcurrentGatewayLLMRpm body that asserts retryable is True / non-terminal is what would have caught #1.

5. Example vs docs placement (@chenghao-mou) β€” defer to maintainer convention; no strong opinion.

Strengths

  • Typed error is clean: body backfill, from_response factory, exported from livekit.agents, and explicit fields correctly take precedence over the body (nicely tested).
  • Best-effort spoken fallback is well-guarded β€” 16s playout timeout + try/except + _activity/output.audio None checks mean a failing TTS can't stall teardown; error/close still fire regardless.
  • Default path is backwards-compatible (silent unless a quota error or an explicit error_message).
  • Surfacing true credit exhaustion on the first occurrence instead of after 3 dead turns is the right call; docstrings, example, and tests are thorough and make check is green.

πŸ€– Reviewed with Claude Code β€” gateway contract cross-checked against agent-gateway/pkg/quota.

Address review feedback on #6012:

- Regression fix (@chenghao-mou, gateway-contract review): the gateway returns
  `inference_quota_exceeded` for BOTH terminal credit exhaustion
  (MaxGatewayCredits, MaxBargeInRequests) and transient rate/concurrency limits
  (MaxConcurrentGatewayLLMRpm/Tpm, …). The first cut made every quota 429
  non-retryable + terminal, which killed the session on the first turn and spoke
  "out of credits" for a recoverable rate-limit blip. Now APIQuotaExceededError
  derives retryable/terminal from `category`: only credit-exhaustion categories
  are terminal + non-retryable; everything else (and unknown categories) stays
  retryable + non-terminal and falls through the existing tolerance.

- Add a generic `APIError.terminal` flag (default False) and key
  AgentSession._on_error off it instead of `isinstance(APIQuotaExceededError)`,
  so any terminal error can short-circuit the tolerance loop (@chenghao-mou).

- Fix the consumer-facing snippets: `ErrorEvent.error` is the LLMError/STTError
  wrapper, so the underlying exception is at `ev.error.error`. Corrected the
  APIQuotaExceededError docstring example and examples/voice_agents/quota_exceeded.py
  (the isinstance guard was always False before).

- Document the TTS-quota limitation: the spoken fallback synthesizes through TTS,
  so for a TTS-resilient signal use @session.on("error") + say(..., audio=...)
  with pre-recorded audio.

- Tests: rate-limit body is retryable + non-terminal and is retried by the stream
  / tolerated by the session; unknown category defaults to transient; custom
  error_message overrides the hint; realtime quota path; spoken fallback survives
  a failing TTS; no-audio-output path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@shawnfeldman

Copy link
Copy Markdown
Author

Thanks for the thorough review (and the gateway-contract cross-check) β€” pushed e7357c6 addressing it.

Fixed

1. Credit-vs-rate-limit regression (the important one). The gateway returns inference_quota_exceeded for two different conditions, and the first cut treated them identically (non-retryable + terminal), which would kill the session on turn 1 and speak "out of credits" for a recoverable rate-limit blip. APIQuotaExceededError now derives retryable/terminal from category (verified against agent-gateway/pkg/quota/response.go::quotaHint):

category class retryable terminal
MaxGatewayCredits, MaxBargeInRequests credit exhaustion ❌ βœ…
MaxConcurrentGatewayLLMRpm/Tpm, …STT/TTS, MaxBargeInRPM, … rate / concurrency βœ… ❌
unknown / missing (safe default) βœ… ❌

So transient limits are retried with backoff and fall through max_unrecoverable_errors exactly as before; only true credit exhaustion surfaces on the first turn.

2. Generic APIError.terminal. Added terminal: bool = False on APIError (independent of retryable). AgentSession._on_error keys off error.error.terminal instead of isinstance(..., APIQuotaExceededError), so any terminal error can short-circuit the tolerance loop in future without further session changes.

3. ev.error β†’ ev.error.error. ErrorEvent.error is the LLMError/STTError/… wrapper; the underlying exception is at .error. The docstring example and quota_exceeded.py had the guard one level too shallow (always False) β€” fixed both.

4. TTS-quota / pre-recorded audio. Documented that the spoken fallback synthesizes through TTS (so it can't be heard if TTS itself is the failed resource); the docstring + a code comment point to the @session.on("error") + say(..., audio=...) pre-recorded-audio recipe.

Tests added

Rate-limit body is retryable + non-terminal and is retried by the stream / tolerated by the session; unknown category defaults to transient; custom error_message overrides the hint; realtime quota path; spoken fallback survives a failing TTS; no-audio-output path. make check green; full --unit suite passes (the test_room.py errors are a pre-existing missing-binary issue).

Open question

  • Example vs docs placement (@chenghao-mou): kept + fixed for now; glad to move it to the docs site and drop the example file if you prefer.
  • STT/realtime first-occurrence close (Devin): pre-existing and arguably intentional (STT streams persist for the agent lifetime); the new terminal flag makes a future tolerance change localized. Left out of scope here.

@shawnfeldman shawnfeldman requested a review from chenghao-mou June 9, 2026 15:48

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 4 additional findings in Devin Review.

Open in Devin Review

Comment on lines +300 to +304
quota_error = APIQuotaExceededError.from_response(
display, status_code=status, request_id=request_id, body=body
)
if quota_error is not None:
return quota_error

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 Inference STT/TTS don't pass response body to create_api_error_from_http

The inference STT and TTS plugins call create_api_error_from_http(e.message, status=e.status) without a body= parameter (e.g. livekit-agents/livekit/agents/inference/tts.py:494, livekit-agents/livekit/agents/inference/stt.py:890). Since the quota detection in create_api_error_from_http (livekit-agents/livekit/agents/_exceptions.py:300-304) relies on body being a dict with type == "inference_quota_exceeded", quota errors from the STT/TTS websocket connection path will never produce a typed APIQuotaExceededError β€” they'll remain plain APIStatusError. The LLM path works because it catches openai.APIStatusError which provides e.body as a parsed dict. This means that if STT or TTS hits a quota exhaustion, the agent won't get the terminal/immediate-close behavior or the spoken hint. This is likely a known limitation β€” aiohttp.ClientResponseError doesn't expose a parsed JSON body β€” but it's worth noting for future work.

Open in Devin Review

Was this helpful? React with πŸ‘ or πŸ‘Ž to provide feedback.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed and documented in 8411107. Traced the full path:

  • The gateway's limitsMiddleware rejects a quota-exhausted STT/TTS connection pre-upgrade (agent-gateway/pkg/middleware/limits.go β†’ HandleJSONError) with a JSON 429 body, setting only Content-Type (no useful headers).
  • On a failed handshake aiohttp raises WSServerHandshakeError, which on 3.14 exposes only status/message/headers β€” the response body is discarded. So there's no body= to pass at the connect site.

So it's a genuine limitation, but there's a sharper reason to leave STT/TTS as a plain (retryable) APIStatusError rather than guess: without the body's category the SDK can't tell terminal credit exhaustion from a transient rate limit. Forcing terminal/non-retryable on every STT/TTS 429 would reintroduce exactly the rate-limit regression fixed earlier in this PR; leaving it untyped keeps the safe (retryable β†’ existing tolerance) behavior.

Added a code comment at both connect sites (inference/tts.py, inference/stt.py) explaining this so it's discoverable. A real fix would need the gateway to surface the category another way (e.g. a response header aiohttp keeps, or a post-upgrade close frame) β€” tracked as future work.

The gateway's limits middleware rejects a quota-exhausted STT/TTS WebSocket
connection pre-upgrade with a JSON 429 body, but aiohttp discards a failed
handshake's response body (WSServerHandshakeError exposes only status/headers),
so the connect path can't pass body= to create_api_error_from_http. STT/TTS
quota errors therefore stay a plain (retryable) APIStatusError rather than a
typed APIQuotaExceededError.

Leaving them untyped is also the safe default: without the body's `category`
the SDK can't distinguish terminal credit exhaustion from a transient rate
limit, so it must not force terminal/non-retryable behavior. Documented at both
connect sites; typing STT/TTS quota errors is future work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@shawnfeldman shawnfeldman left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

Reviewed with 6 specialized agents (code-reviewer, silent-failure-hunter, pr-test-analyzer, type-design-analyzer, comment-analyzer, code-simplifier); Critical/Important findings were independently verified against the gateway source and the pinned openai SDK before posting.

Critical Issues (must fix)

  1. [code-reviewer, silent-failure-hunter, pr-test-analyzer β€” independently converged; verified empirically] The production quota path never fires: e.body is a str, not the quota dict β€” livekit-agents/livekit/agents/inference/llm.py:461-471
    The openai SDK's _make_status_error (verified in 2.30.0, openai/_client.py:809) unwraps the body before constructing the exception: data = body.get("error", body) if is_mapping(body) else body. The gateway's 429 payload always carries a top-level "error" key whose value is a string (agent-gateway/pkg/http/error.go β€” Error string, no omitempty; set from q.Error.Error() in pkg/quota/response.go::QuotaExceededResponse). Reproduced with a gateway-shaped 429 through the real SDK:
    exception class: RateLimitError
    e.body type:     str   ('LLM token credit quota exceeded, category: MaxGatewayCredits, ...')
    isinstance dict: False
    re-parse works:  e.response.json()["type"] == "inference_quota_exceeded"
    
    So from_response(body=e.body) hits isinstance(body, dict) β†’ False and returns None: every real quota 429 degrades to a plain retryable APIStatusError β†’ retried β†’ absorbed under max_unrecoverable_errors β†’ the exact silent dead-turns of #6009 persist, while all new tests pass. The tests mask this by hand-constructing openai.APIStatusError(body["error"], response=response, body=body) with the full dict (tests/test_exceptions.py:162-163), bypassing the SDK's unwrapping that real traffic goes through.
    Fix: recover the full body before detection:
    except openai.APIStatusError as e:
        body: object = e.body
        if not isinstance(body, dict):
            try:
                body = e.response.json()
            except Exception:
                body = e.body
        quota_error = APIQuotaExceededError.from_response(
            e.message, status_code=e.status_code, request_id=e.request_id, body=body
        )
    And make the test fixture go through the SDK path so this regression class is caught, e.g. openai.AsyncOpenAI(api_key="x")._make_status_error_from_response(httpx.Response(429, json=quota_body, request=...)), or swap in a real client over httpx.MockTransport (httpx already imported, no new dep).

Important Issues (should fix)

  1. [type-design-analyzer β€” verified] Unvalidated wire-data extraction can crash inside the error handler β€” livekit-agents/livekit/agents/_exceptions.py:205-217
    The four body.get(...) reads are unvalidated Any. An unhashable category (e.g. a list) raises TypeError at the frozenset membership test β€” inside the except openai.APIStatusError block, degrading the typed path to a generic error. Non-str values (int remaining_limit, list hint) silently violate the str | None annotations (strict mypy can't see it) and a non-str hint can flow into session.say(). The gateway sends strings today, but inference.LLM(base_url=...) is user-pointable. Fix (~4 lines): coerce each field, e.g. quota_type = v if isinstance(v := body.get("quota_type"), str) else None.

  2. [comment-analyzer, code-simplifier β€” both found independently] The example's copy-paste snippet is a SyntaxError, and its handler mislabels transient errors β€” examples/voice_agents/quota_exceeded.py

    • Lines 69-72: the commented await ctx.room.local_participant.set_attributes(...) sits inside the sync def on_error handler β€” uncommenting it is SyntaxError: 'await' outside async function. Show asyncio.create_task(...) (the pattern error_callback.py already uses).
    • Line 57: # quota errors are non-retryable; they will fail identically every turn β€” wrong for half the class. Rate-limit-category APIQuotaExceededErrors are retryable/non-terminal, and this handler does receive them (the LLM base emits an error event per retry attempt; agent_activity._on_error forwards every LLMError unconditionally). As written, the example logs/forwards "out of credits" for transient rate limits. Gate on err.terminal and fix the comment.
    • Lines 25-26: "By default such an error makes the agent join the room ... and then never speak" β€” stale; this PR changed that default, and it contradicts bullet (1) three lines later. Rephrase as before/after.
  3. [comment-analyzer β€” verified] Comment/docstring claims that contradict the code β€”

    • _exceptions.py:155-156: "By default AgentSession already speaks the hint and closes on the first occurrence" β€” only true for terminal instances with error_message unset; a transient rate-limit instance of this same class does neither. Add the qualifier.
    • inference/llm.py:462-463: "surface it as a typed, non-retryable error" β€” wrong for rate-limit bodies; the PR's own test_inference_llm_retries_rate_limit_429 asserts the retry. Reword.
    • voice/agent_session.py:345: https://docs.livekit.io/reference/agents/events/#pre-recorded-audio appears to be a guessed deep link β€” no docs.livekit.io/reference/... URL exists anywhere in the repo (the established pattern is docs.livekit.io/agents/...). Replace or drop before it 404-rots in a public docstring.

Suggestions

  1. [silent-failure-hunter] Non-terminal closes still end in dead air by default β€” agent_session.py:1561-1567. Sustained rate-limiting (or any generic LLM/TTS error) past the tolerance closes the session with nothing spoken β€” same end-user symptom as #6009. That matches the PR's stated scope ("other errors stay silent"), but the inline comment's justification ("speaking ... for a recoverable blip would be misleading") is wrong in context: _resolve_error_message is only reached when the session is irrevocably closing β€” the blip case never gets there. Either speak DEFAULT_ERROR_MESSAGE on any error-close by default, or fix the comment and note the limitation in the error_message docstring.
  2. [silent-failure-hunter] The speak-path warning can't fire in its dominant failure mode β€” agent_session.py:1590-1596. SpeechHandle._done_fut is only ever resolved with set_result(None), so a TTS that fails synthesis (e.g. the same depleted project) returns from wait_for_playout() normally β€” the except Exception log never fires, and the subsequent TTSError is dropped by the _closing_task guard. Log at info before speaking; log at warning when _on_error drops an error while _closing_task is set.
  3. [silent-failure-hunter] Document the STT-first ordering caveat β€” for a fully-depleted all-inference project (the configuration the example itself promotes), the STT WS connect 429s first, STT errors have no tolerance counter, and the session closes untyped/silent within seconds β€” before the LLM path and the spoken hint ever engage. The stt.py/tts.py NOTEs are accurate per-path, but this end-to-end consequence is documented nowhere; worth a sentence in the error_message docstring and the example.
  4. [pr-test-analyzer] Test gaps worth closing: the 16s _ERROR_MESSAGE_PLAYOUT_TIMEOUT is never exercised (a hanging-TTS test with the constant monkeypatched would pin it, rated 7/10); tolerance accumulation is unpinned (two errors with max_unrecoverable_errors=1, 6/10); error_message="" silently behaves like None, contradicting the docstring (6/10 β€” document or normalize); _spy_say discards kwargs so add_to_chat_ctx=False is unpinned (a regression would write error messages into chat history).
  5. [type-design-analyzer] Polish: add terminal to APIError.__repr__/APIStatusError.__repr__ (debugging "why did my session close on turn 1" wants it); one docstring line that from_response detection is body-keyed (status intentionally ignored); decide terminal vs permanent naming now β€” free before merge, impossible after in a public SDK (keeping terminal is defensible).
  6. [code-simplifier] Tests: build _rate_limit_error's body as {**INFERENCE_QUOTA_BODY, "category": ..., "hint": ...} so the category delta is visible by construction; optionally extract the repeated emit-error/await-close plumbing (~6 lines Γ— 9 tests) into two helpers, matching test_agent_session.py style.

Strengths

  • The categoryβ†’terminal/retryable taxonomy is exactly right per the gateway source (MaxGatewayCredits/MaxBargeInRequests terminal; everything else transient), and unknown categories failing open to transient is the correct version-skew posture β€” pinned by a dedicated test.
  • The session-level tests are genuinely behavioral (close events, reason, error identity, exact spoken text) and are the first coverage ever for the _on_error close path, including the realtime path and TTS-failure-during-farewell resilience.
  • _close_on_error's try/finally + bounded playout + _aclose_impl's force-interrupt means teardown is provably bounded even with a wedged TTS; concurrent aclose() is serialized safely.
  • The stt.py/tts.py NOTE comments are fully accurate (verified against aiohttp's handshake behavior and the gateway middleware), and the APIQuotaExceededError docstring with a working copy-pasteable example is unusually good.
  • APIError.terminal as a class attribute keeps reads safe on third-party subclasses that predate the PR β€” nice backward-compat detail.

πŸ€– Generated with Claude Code

Address review findings on #6012:

- The openai SDK narrows a mapping error body to its `error` value before
  raising (`_make_status_error`), and the gateway's `error` field is a bare
  string β€” so `e.body` never carried the `inference_quota_exceeded` payload
  and the typed quota path never fired against real traffic. Re-parse
  `e.response` to recover the full body, and build the test fixture through
  `_make_status_error_from_response` so the SDK's narrowing stays exercised
  (the quota test fails against the unfixed plugin).
- Coerce non-str quota body fields to None: an unhashable `category` raised
  TypeError at the frozenset membership test, and non-str values violated
  the `str | None` annotations (the endpoint is user-pointable).
- Example: gate the "out of credits" handler on `err.terminal` (the handler
  also receives transient rate-limit/retry events), fix the stale intro, and
  make the commented set_attributes snippet runnable from a sync handler.
- Docs: scope the APIQuotaExceededError docstring claim to terminal errors,
  reword the "non-retryable" comment in inference/llm.py, fix the
  pre-recorded-audio docs anchor.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@shawnfeldman

Copy link
Copy Markdown
Author

Pushed 4b82451 addressing the Critical + Important findings. make check green; full --unit suite 842 passed / 4 skipped (same pre-existing test_room.py missing-binary errors).

Fixed

1. e.body is a string in production (the Critical). Confirmed: the openai SDK's _make_status_error does body.get("error", body) before constructing the exception, and the gateway's error field is a bare string β€” so the typed quota path never fired on real traffic. The plugin now recovers the full payload via e.response.json() when e.body isn't a dict. The fixture in tests/test_exceptions.py now builds the 429 through the SDK's own _make_status_error_from_response (with a pinned assert not isinstance(status_error.body, dict)), so it exercises the narrowing real traffic goes through β€” verified it fails against the unfixed plugin exactly the way production did (string body β†’ plain retryable 429 β†’ retried β†’ absorbed silently).

2. Wire-data coercion. APIQuotaExceededError now drops non-str body fields (_str_or_none) β€” an unhashable category raised TypeError at the frozenset check, inside the except handler, and non-str values violated the str | None annotations. The endpoint is user-pointable, so the body is untrusted. New test pins it (unhashable category β†’ fields None, stays transient).

3. Example. on_error now gates on err.terminal (the handler also sees transient rate-limit/retry events β€” plain isinstance would report "out of credits" for a recoverable blip); rewrote the stale "by default … never speak" intro as before/after; the commented set_attributes snippet now spawns a task instead of awaiting inside the sync handler.

4. Doc accuracy. Scoped the APIQuotaExceededError docstring claim to terminal errors; reworded the "non-retryable" comment in inference/llm.py; the pre-recorded-audio docs link was actually the right page (the same one linked above) with a wrong fragment β€” fixed to #pre-recorded-audio-fallback.

Pushing back on the suggestions

  • Speak on every unrecoverable close by default β€” out of scope on purpose. That flips the default for every error class and every existing app (including ones already rendering their own error UX off the error event), not just the silent-quota case this PR targets. And on a transient-quota close, the only message available is the gateway hint ("Reduce request rate…"), which is developer-facing β€” the wrong thing to speak at an end user. error_message="..." is the supported one-liner for exactly that behavior.
  • Speak-path logging β€” the TTS base already warns per failed synthesis attempt, and _try_speak_error_message warns with traceback on timeout/sync failure; a session-level "speaking error message" info line is fine polish for a follow-up, not load-bearing here.
  • STT-first ordering caveat in the docstring β€” the connect-site NOTEs in inference/stt.py/tts.py are the canonical home for that mechanism and already describe it; copying gateway middleware-ordering details into the error_message docstring bakes server behavior into API docs that will drift. Typed STT/TTS quota errors stay the "future work" called out in the description.
  • Hanging-TTS timeout / tolerance-accumulation tests β€” agreed they'd be nice; the can't-stall invariant is already enforced by asyncio.wait_for + _aclose_impl's force-interrupt and covered by test_speaking_error_message_survives_tts_failure. Follow-up material rather than gating.
  • terminal vs permanent naming β€” keeping terminal: it's the name requested in review above, and it describes the consumer contract (the session gives up at once) rather than the error's persistence.
  • Smaller polish (terminal in __repr__, documenting error_message="", test-helper extraction) β€” happy to fold into a follow-up; skipped here to keep the diff reviewable.


return None

async def _close_on_error(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead of adding all these internal functions, we can add a default error hook (like what customers would do) if an error message is given?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Voice agent goes silently unresponsive on 429 inference_quota_exceeded β€” no audible/visible signal to the end user

2 participants