feat(voice): surface inference quota errors instead of going silent #6012
shawnfeldman wants to merge 4 commits into
When the LLM endpoint returns HTTP 429 with body `inference_quota_exceeded` (e.g. a project is out of LiveKit Inference credits), a voice agent would join, publish its track, and then never speak, absorbing three silent dead turns before closing with no audible or visible signal. The diagnosis (`type`, `hint`, `quota_type`) was in the response body but never surfaced.

This makes the failure perceptible:

- Add `APIQuotaExceededError(APIStatusError)`, carrying `quota_type`, `category`, `hint`, and `remaining_limit` decoded from the gateway body. It defaults to `retryable=False` (quota exhaustion won't recover on an immediate retry). `create_api_error_from_http` and a `from_response` factory build it whenever the body is a quota payload.
- The inference LLM plugin (also the base for the OpenAI-compatible plugin) raises the typed error on a 429 quota body.
- `AgentSession` now surfaces a terminal quota error on the FIRST occurrence rather than after `max_unrecoverable_errors`, and speaks a fallback line before closing. New `error_message` option: omit it to speak the gateway `hint` for quota errors (generic fallback otherwise), pass a string to speak it on any unrecoverable error, or pass `None` to disable. Spoken delivery is best-effort and never blocks teardown.
- Add `quota_exceeded.py` example, plus unit tests for the typed error, the plugin conversion, and the surface/speak behavior.

Fixes #6009

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    if not terminal:
        if error.type == "llm_error":
            self._llm_error_counts += 1
            if self._llm_error_counts <= self.conn_options.max_unrecoverable_errors:
                return
        elif error.type == "tts_error":
            self._tts_error_counts += 1
            if self._tts_error_counts <= self.conn_options.max_unrecoverable_errors:
                return
🚩 Unrecoverable STT errors still close the session on first occurrence
The _on_error method at livekit-agents/livekit/agents/voice/agent_session.py:1519-1527 only counts and tolerates repeated errors for llm_error and tts_error types. An unrecoverable stt_error or realtime_model_error falls through both branches and immediately triggers session close, even if terminal is False. This is pre-existing behavior (unchanged by this PR) and may be intentional (STT errors are rarer and harder to recover from), but it means a single transient STT failure that's marked non-recoverable will close the session without the max_unrecoverable_errors tolerance that LLM/TTS get.
Correct that this is pre-existing: stt_error and realtime_model_error fall through to close on the first unrecoverable occurrence, and this PR preserves that (the new terminal short-circuit sits in front of the unchanged llm/tts counters). It's arguably intentional: an STT stream persists for the whole agent lifetime (unlike LLM/TTS, which are recreated per response), so it isn't recreated to "retry" the way the tolerance assumes.
This PR does make a clean future fix possible: _on_error now keys off the generic APIError.terminal flag, so extending max_unrecoverable_errors tolerance to STT/realtime (or marking specific errors terminal) is a localized change. Leaving it out of scope here to keep the diff focused on the quota issue.
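For readers following the thread, the counting logic under discussion can be sketched in isolation. This is only an illustration, not the `AgentSession` code: `ErrorTolerance`, `should_close`, and the `recoverable` argument are hypothetical names, and only the `llm_error`/`tts_error` counters, the `terminal` flag, and `max_unrecoverable_errors` are taken from the diff above.

```python
from dataclasses import dataclass


@dataclass
class ErrorTolerance:
    """Hypothetical stand-in for the per-session error counters."""

    max_unrecoverable_errors: int = 3
    llm_error_count: int = 0
    tts_error_count: int = 0

    def should_close(self, error_type: str, *, recoverable: bool, terminal: bool) -> bool:
        """Return True when the session should close after this error."""
        if recoverable:
            return False
        if not terminal:
            if error_type == "llm_error":
                self.llm_error_count += 1
                if self.llm_error_count <= self.max_unrecoverable_errors:
                    return False
            elif error_type == "tts_error":
                self.tts_error_count += 1
                if self.tts_error_count <= self.max_unrecoverable_errors:
                    return False
        # stt_error / realtime_model_error have no counter, so they reach this
        # line on the first unrecoverable occurrence, as do terminal errors.
        return True
```

Under these assumptions, extending the tolerance to STT would mean adding one more counter branch, which is the "localized change" the reply refers to.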
    @@ -0,0 +1,80 @@
    import logging
I am not sure if we need an example for gateway errors. Maybe we should put this in docs instead?
Fair point. For now I've kept the example but fixed a real bug in it (it used ev.error where it needs ev.error.error; ErrorEvent.error is the LLMError/STTError wrapper, so the isinstance guard was always False). The minimal @session.on("error") recipe also lives in the APIQuotaExceededError docstring.
Happy to delete the example and move it to the docs site instead if you'd prefer that; just say the word and I'll drop examples/voice_agents/quota_exceeded.py + its README entry.
On second thought, we can leave it here so we have an example for other error handling as well. This is hard to document for all the other vendors.
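The wrapper bug mentioned in this thread is easy to reproduce with stand-in classes. The sketch below does not import the SDK: `ErrorEvent`, `LLMError`, and `APIQuotaExceededError` here are simplified placeholders that only mirror the nesting described above (the event's `error` is a wrapper, and the exception sits at `error.error`).

```python
from dataclasses import dataclass


class APIQuotaExceededError(Exception):
    """Stand-in for the typed quota error; not the real class."""


@dataclass
class LLMError:
    """Stand-in for the wrapper carried by the error event."""

    error: Exception


@dataclass
class ErrorEvent:
    """Stand-in for the event passed to an "error" handler."""

    error: LLMError


def is_quota_error(ev: ErrorEvent) -> bool:
    # ev.error is the wrapper, so the exception sits one level deeper.
    return isinstance(ev.error.error, APIQuotaExceededError)


def is_quota_error_buggy(ev: ErrorEvent) -> bool:
    # Checks the wrapper itself, which is never the exception type.
    return isinstance(ev.error, APIQuotaExceededError)
```

With a quota exception wrapped in an `LLMError`, the first guard matches and the second never does, which is why the original example's handler body was unreachable.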
        remaining_limit: str | None = None,
    ) -> None:
        # quota exhaustion won't recover on an immediate retry
        if retryable is None:
Should we differentiate retry behaviour between concurrency/RPM quotas and credit exhaustion?
Done in e7357c6: APIQuotaExceededError now derives retryable/terminal from category (verified against agent-gateway/pkg/quota/response.go::quotaHint):
- Terminal + non-retryable: credit exhaustion → MaxGatewayCredits, MaxBargeInRequests ("Wait for the next billing cycle…").
- Retryable + non-terminal: rate/concurrency limits → MaxConcurrentGatewayLLMRpm/Tpm, MaxConcurrentGatewaySTT/TTS, MaxBargeInRPM, etc. (and any unknown/missing category, to avoid regressions).
So a transient rate-limit 429 is now retried with backoff by the stream and falls through max_unrecoverable_errors exactly as before this PR; only true credit exhaustion closes on the first turn. Added tests for both classes (retried vs terminal).
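The category split described in this reply can be written down as a small lookup. This is a sketch under assumptions rather than the code in e7357c6: `TERMINAL_CATEGORIES` and `classify_quota_category` are hypothetical names, and the two terminal category strings are the ones quoted in this thread.

```python
from __future__ import annotations

# Categories named in this thread as credit exhaustion (assumed exact spelling).
TERMINAL_CATEGORIES = frozenset({"MaxGatewayCredits", "MaxBargeInRequests"})


def classify_quota_category(category: str | None) -> tuple[bool, bool]:
    """Return (retryable, terminal) for a quota category.

    Unknown or missing categories are treated as transient, so a new
    gateway category never closes a session on the first turn by accident.
    """
    terminal = category in TERMINAL_CATEGORIES
    return (not terminal, terminal)
```

Failing open to "transient" is the design choice worth noting: a misclassified rate limit costs a few retries, while a misclassified credit error would only delay the close until the tolerance runs out.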
    ivr_detection: bool = False,
    user_away_timeout: float | None = 15.0,
    session_close_transcript_timeout: float = 2.0,
    error_message: NotGivenOr[str | None] = NOT_GIVEN,
In the docs, we recommend using pre-recorded audio for errors like this: https://docs.livekit.io/reference/agents/events/#pre-recorded-audio
so that a TTS that is itself out of quota won't block the message.
Should we just introduce APIError.terminal in addition to retryable, so users can apply the same error handling more easily (and this behaviour is not limited to quota errors)?
Both addressed in e7357c6:
- APIError.terminal: added a generic terminal: bool = False on APIError (independent of retryable: retryable governs in-request retries, terminal governs whether higher-level loops should give up at once). AgentSession._on_error now keys off error.error.terminal instead of isinstance(..., APIQuotaExceededError), so any terminal error (incl. future 401 bad-key / 400 bad-request cases) can short-circuit the max_unrecoverable_errors tolerance without further session changes. APIQuotaExceededError sets it from category.
- Pre-recorded audio: documented the limitation. The error_message fallback synthesizes through the configured TTS, so it can't be heard if TTS itself is the exhausted/failed resource. The error_message docstring and a code comment in _try_speak_error_message now point to @session.on("error") + session.say(..., audio=...) with pre-recorded audio (linking the events docs) for a TTS-resilient signal.
shawnfeldman left a comment
PR Review Summary
Reviewed the diff against the actual gateway contract in agent-gateway (pkg/quota/), the base LLM retry loop, and the existing reviewer comments. The shape is solid and the spoken-fallback machinery is carefully guarded; one classification issue is worth fixing before merge.
Important
1. Transient rate-limit 429s get misclassified as terminal + non-retryable (regression)
APIQuotaExceededError defaults retryable=False for every inference_quota_exceeded body (_exceptions.py:157-158), and AgentSession._on_error treats every instance as terminal, closing on the first occurrence (agent_session.py:1517). But the gateway emits type: inference_quota_exceeded for two very different classes of 429:
| category | meaning | recovers? |
|---|---|---|
| MaxGatewayCredits (+ MaxBargeInRequests) | credit / quota exhausted | only at next billing cycle → terminal |
| MaxConcurrentGatewayLLMRpm, MaxConcurrentGatewayLLMTpm (+ STT/TTS/bargein concurrency) | per-minute rate / concurrency limit | within ~a minute via backoff → transient |
Authoritative source: agent-gateway/pkg/quota/check.go:68-79 and pkg/quota/response.go:55-77. The hint for the rate-limit categories literally says "Reduce request rate … or upgrade", vs "Wait for the next billing cycle" for credits.
The live impact today is on the LLM path (the only one wired): before this PR, a pre-stream 429 raised APIStatusError(retryable=True) (inference/llm.py:395; 429 is exempt from the 4xx → non-retryable rule in APIStatusError.__init__), so the base stream retried it with backoff (llm/llm.py:215-262, gated on e.retryable). After this PR a single MaxConcurrentGatewayLLMRpm/…Tpm spike is now (a) non-retryable and (b) kills the session permanently on the first turn, and speaks an "out of credits / temporarily unavailable" line for what is actually a recoverable rate-limit blip. That's exactly what @chenghao-mou flagged on _exceptions.py.
The class already knows about the split: the category docstring (_exceptions.py:134-135) calls out MaxConcurrentGatewayLLMRpm as a "rate-limit variant"; it just doesn't act on it. Suggested fix: derive retryable/terminal from category rather than from the body type alone. Only the credit/quota-exhaustion categories (MaxGatewayCredits, MaxBargeInRequests) should be retryable=False + terminal; the rate/concurrency categories should stay retryable and fall through the existing max_unrecoverable_errors tolerance. Then have _on_error key off that derived flag instead of isinstance(..., APIQuotaExceededError).
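To make the "retried with backoff, gated on `e.retryable`" behaviour concrete, here is a simplified synchronous sketch. It is not the base stream's loop (which is async and lives in the SDK): `RetryableError`, `call_with_retries`, and the injectable `sleep` parameter are hypothetical, chosen only to show why flipping `retryable` to `False` for a rate-limit 429 removes the recovery path.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical error type carrying a `retryable` flag."""

    def __init__(self, message: str, *, retryable: bool) -> None:
        super().__init__(message)
        self.retryable = retryable


def call_with_retries(
    fn: Callable[[], T],
    *,
    max_retries: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff while errors are retryable."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RetryableError as e:
            # Non-retryable errors, or the last attempt, propagate to the caller.
            if not e.retryable or attempt == max_retries:
                raise
            sleep(base_delay * (2**attempt))
    raise AssertionError("unreachable")
```

A transient limit that clears within a few attempts never reaches the session-level tolerance in this model, while a non-retryable error is raised after a single call.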
Suggestions
2. The spoken fallback can't be heard when TTS itself is the exhausted resource. When quota_type == "tts" (MaxConcurrentGatewayTTS), _try_speak_error_message tries to say() through the very resource that's rate-limited. Latent today (only the LLM path is wired, so quota_type == "llm" and the separate TTS still works), but it becomes real once the STT/TTS WS paths surface the body β your own follow-up note. As @chenghao-mou said, the pre-recorded audio path is the robust signal there; worth a code comment acknowledging the limitation.
3. Echoing @chenghao-mou: consider a generic APIError.terminal. Hardcoding isinstance(error.error, APIQuotaExceededError) in the session couples teardown policy to one exception subclass. A terminal attribute on APIError (default False) would be the natural home for the credits-vs-rate-limit bit from #1, and would also let other genuinely terminal errors (401 bad key, 400 bad request) short-circuit the tolerance loop without further session changes.
4. Test the transient case. Both new test files only exercise MaxGatewayCredits bodies, i.e. the case the PR handles correctly. Adding a MaxConcurrentGatewayLLMRpm body that asserts retryable is True / non-terminal is what would have caught #1.
5. Example vs docs placement (@chenghao-mou): defer to maintainer convention; no strong opinion.
Strengths
- Typed error is clean: body backfill, `from_response` factory, exported from `livekit.agents`, and explicit fields correctly take precedence over the body (nicely tested).
- Best-effort spoken fallback is well-guarded: 16s playout timeout + `try/except` + `_activity`/`output.audio` None checks mean a failing TTS can't stall teardown; `error`/`close` still fire regardless.
- Default path is backwards-compatible (silent unless a quota error or an explicit `error_message`).
- Surfacing true credit exhaustion on the first occurrence instead of after 3 dead turns is the right call; docstrings, example, and tests are thorough and `make check` is green.
🤖 Reviewed with Claude Code; gateway contract cross-checked against agent-gateway/pkg/quota.
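The "best-effort, bounded" pattern praised in the strengths above can be illustrated with plain asyncio. This is a sketch, not the PR's `_try_speak_error_message`: `speak_best_effort` and its `speak` callable are hypothetical, and the 16-second default simply mirrors the timeout mentioned in the review.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("error-fallback-sketch")


async def speak_best_effort(
    speak: Callable[[], Awaitable[None]], *, timeout: float = 16.0
) -> bool:
    """Await speak() but never let it block or raise; return True if it finished."""
    try:
        await asyncio.wait_for(speak(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        logger.warning("error message playout timed out after %.1fs", timeout)
    except Exception:
        logger.warning("failed to speak error message", exc_info=True)
    return False
```

Because every failure mode collapses to a boolean, the caller can run its teardown unconditionally afterwards, which is the property the review calls out.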
Address review feedback on #6012:

- Regression fix (@chenghao-mou, gateway-contract review): the gateway returns `inference_quota_exceeded` for BOTH terminal credit exhaustion (MaxGatewayCredits, MaxBargeInRequests) and transient rate/concurrency limits (MaxConcurrentGatewayLLMRpm/Tpm, …). The first cut made every quota 429 non-retryable + terminal, which killed the session on the first turn and spoke "out of credits" for a recoverable rate-limit blip. Now APIQuotaExceededError derives retryable/terminal from `category`: only credit-exhaustion categories are terminal + non-retryable; everything else (and unknown categories) stays retryable + non-terminal and falls through the existing tolerance.
- Add a generic `APIError.terminal` flag (default False) and key AgentSession._on_error off it instead of `isinstance(APIQuotaExceededError)`, so any terminal error can short-circuit the tolerance loop (@chenghao-mou).
- Fix the consumer-facing snippets: `ErrorEvent.error` is the LLMError/STTError wrapper, so the underlying exception is at `ev.error.error`. Corrected the APIQuotaExceededError docstring example and examples/voice_agents/quota_exceeded.py (the isinstance guard was always False before).
- Document the TTS-quota limitation: the spoken fallback synthesizes through TTS, so for a TTS-resilient signal use @session.on("error") + say(..., audio=...) with pre-recorded audio.
- Tests: rate-limit body is retryable + non-terminal and is retried by the stream / tolerated by the session; unknown category defaults to transient; custom error_message overrides the hint; realtime quota path; spoken fallback survives a failing TTS; no-audio-output path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thanks for the thorough review (and the gateway-contract cross-check); pushed e7357c6 addressing it.

Fixed

1. Credit-vs-rate-limit regression (the important one). The gateway returns […] So transient limits are retried with backoff and fall through […]
2. Generic […]
3. […]
4. TTS-quota / pre-recorded audio. Documented that the spoken fallback synthesizes through TTS (so it can't be heard if TTS itself is the failed resource); the docstring + a code comment point to the […]

Tests added

Rate-limit body is retryable + non-terminal and is retried by the stream / tolerated by the session; unknown category defaults to transient; custom […]

Open question

[…]
    quota_error = APIQuotaExceededError.from_response(
        display, status_code=status, request_id=request_id, body=body
    )
    if quota_error is not None:
        return quota_error
🚩 Inference STT/TTS don't pass response body to create_api_error_from_http
The inference STT and TTS plugins call create_api_error_from_http(e.message, status=e.status) without a body= parameter (e.g. livekit-agents/livekit/agents/inference/tts.py:494, livekit-agents/livekit/agents/inference/stt.py:890). Since the quota detection in create_api_error_from_http (livekit-agents/livekit/agents/_exceptions.py:300-304) relies on body being a dict with type == "inference_quota_exceeded", quota errors from the STT/TTS websocket connection path will never produce a typed APIQuotaExceededError; they'll remain plain APIStatusError. The LLM path works because it catches openai.APIStatusError which provides e.body as a parsed dict. This means that if STT or TTS hits a quota exhaustion, the agent won't get the terminal/immediate-close behavior or the spoken hint. This is likely a known limitation (aiohttp.ClientResponseError doesn't expose a parsed JSON body), but it's worth noting for future work.
Confirmed and documented in 8411107. Traced the full path:
- The gateway's `limitsMiddleware` rejects a quota-exhausted STT/TTS connection pre-upgrade (agent-gateway/pkg/middleware/limits.go → `HandleJSONError`) with a JSON 429 body, setting only `Content-Type` (no useful headers).
- On a failed handshake aiohttp raises `WSServerHandshakeError`, which on 3.14 exposes only `status`/`message`/`headers`; the response body is discarded. So there's no `body=` to pass at the connect site.
So it's a genuine limitation, but there's a sharper reason to leave STT/TTS as a plain (retryable) APIStatusError rather than guess: without the body's category the SDK can't tell terminal credit exhaustion from a transient rate limit. Forcing terminal/non-retryable on every STT/TTS 429 would reintroduce exactly the rate-limit regression fixed earlier in this PR; leaving it untyped keeps the safe (retryable → existing tolerance) behavior.
Added a code comment at both connect sites (inference/tts.py, inference/stt.py) explaining this so it's discoverable. A real fix would need the gateway to surface the category another way (e.g. a response header aiohttp keeps, or a post-upgrade close frame); tracked as future work.
The gateway's limits middleware rejects a quota-exhausted STT/TTS WebSocket connection pre-upgrade with a JSON 429 body, but aiohttp discards a failed handshake's response body (WSServerHandshakeError exposes only status/headers), so the connect path can't pass body= to create_api_error_from_http. STT/TTS quota errors therefore stay a plain (retryable) APIStatusError rather than a typed APIQuotaExceededError.

Leaving them untyped is also the safe default: without the body's `category` the SDK can't distinguish terminal credit exhaustion from a transient rate limit, so it must not force terminal/non-retryable behavior. Documented at both connect sites; typing STT/TTS quota errors is future work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
shawnfeldman left a comment
PR Review Summary
Reviewed with 6 specialized agents (code-reviewer, silent-failure-hunter, pr-test-analyzer, type-design-analyzer, comment-analyzer, code-simplifier); Critical/Important findings were independently verified against the gateway source and the pinned openai SDK before posting.
Critical Issues (must fix)
- [code-reviewer, silent-failure-hunter, pr-test-analyzer: independently converged; verified empirically] The production quota path never fires: `e.body` is a `str`, not the quota dict (`livekit-agents/livekit/agents/inference/llm.py:461-471`)
  The openai SDK's `_make_status_error` (verified in 2.30.0, `openai/_client.py:809`) unwraps the body before constructing the exception: `data = body.get("error", body) if is_mapping(body) else body`. The gateway's 429 payload always carries a top-level `"error"` key whose value is a string (`agent-gateway/pkg/http/error.go`: `Error string`, no omitempty; set from `q.Error.Error()` in `pkg/quota/response.go::QuotaExceededResponse`). Reproduced with a gateway-shaped 429 through the real SDK:

      exception class: RateLimitError
      e.body type: str ('LLM token credit quota exceeded, category: MaxGatewayCredits, ...')
      isinstance dict: False
      re-parse works: e.response.json()["type"] == "inference_quota_exceeded"

  So `from_response(body=e.body)` hits `isinstance(body, dict)` → `False` and returns `None`: every real quota 429 degrades to a plain retryable `APIStatusError` → retried → absorbed under `max_unrecoverable_errors` → the exact silent dead-turns of #6009 persist, while all new tests pass. The tests mask this by hand-constructing `openai.APIStatusError(body["error"], response=response, body=body)` with the full dict (`tests/test_exceptions.py:162-163`), bypassing the SDK's unwrapping that real traffic goes through.
  Fix: recover the full body before detection:

      except openai.APIStatusError as e:
          body: object = e.body
          if not isinstance(body, dict):
              try:
                  body = e.response.json()
              except Exception:
                  body = e.body
          quota_error = APIQuotaExceededError.from_response(
              e.message, status_code=e.status_code, request_id=e.request_id, body=body
          )

  And make the test fixture go through the SDK path so this regression class is caught, e.g. `openai.AsyncOpenAI(api_key="x")._make_status_error_from_response(httpx.Response(429, json=quota_body, request=...))`, or swap in a real client over `httpx.MockTransport` (httpx already imported, no new dep).
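The narrowing problem described above can be reproduced without the SDK. The sketch below only imitates the quoted one-liner with a plain `dict` check; `narrow_error_body`, `recover_quota_body`, and `is_quota_body` are hypothetical helper names, and the payload fields used in any example (`error`, `type`, `category`) are the ones named in this review rather than a full gateway schema.

```python
import json
from typing import Any


def narrow_error_body(body: Any) -> Any:
    """Imitate the narrowing quoted above: keep only the `error` value of a mapping."""
    return body.get("error", body) if isinstance(body, dict) else body


def recover_quota_body(narrowed: Any, raw_response_text: str) -> Any:
    """Fall back to re-parsing the raw response when the narrowed body is not a dict."""
    if isinstance(narrowed, dict):
        return narrowed
    try:
        return json.loads(raw_response_text)
    except ValueError:
        # Not JSON: keep whatever the client gave us.
        return narrowed


def is_quota_body(body: Any) -> bool:
    """Body-keyed detection: a dict whose `type` is the quota marker."""
    return isinstance(body, dict) and body.get("type") == "inference_quota_exceeded"
```

A test that feeds the full dict straight to the detection step skips `narrow_error_body` entirely, which is the masking effect the finding describes.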
Important Issues (should fix)
- [type-design-analyzer: verified] Unvalidated wire-data extraction can crash inside the error handler (`livekit-agents/livekit/agents/_exceptions.py:205-217`)
  The four `body.get(...)` reads are unvalidated `Any`. An unhashable `category` (e.g. a list) raises `TypeError` at the frozenset membership test, inside the `except openai.APIStatusError` block, degrading the typed path to a generic error. Non-str values (int `remaining_limit`, list `hint`) silently violate the `str | None` annotations (strict mypy can't see it) and a non-str `hint` can flow into `session.say()`. The gateway sends strings today, but `inference.LLM(base_url=...)` is user-pointable. Fix (~4 lines): coerce each field, e.g. `quota_type = v if isinstance(v := body.get("quota_type"), str) else None`.
- [comment-analyzer, code-simplifier: both found independently] The example's copy-paste snippet is a SyntaxError, and its handler mislabels transient errors (`examples/voice_agents/quota_exceeded.py`)
  - Lines 69-72: the commented `await ctx.room.local_participant.set_attributes(...)` sits inside the sync `def on_error` handler; uncommenting it is `SyntaxError: 'await' outside async function`. Show `asyncio.create_task(...)` (the pattern `error_callback.py` already uses).
  - Line 57: `# quota errors are non-retryable; they will fail identically every turn` is wrong for half the class. Rate-limit-category `APIQuotaExceededError`s are retryable/non-terminal, and this handler does receive them (the LLM base emits an `error` event per retry attempt; `agent_activity._on_error` forwards every `LLMError` unconditionally). As written, the example logs/forwards "out of credits" for transient rate limits. Gate on `err.terminal` and fix the comment.
  - Lines 25-26: "By default such an error makes the agent join the room ... and then never speak" is stale; this PR changed that default, and it contradicts bullet (1) three lines later. Rephrase as before/after.
- [comment-analyzer: verified] Comment/docstring claims that contradict the code
  - `_exceptions.py:155-156`: "By default `AgentSession` already speaks the `hint` and closes on the first occurrence" is only true for terminal instances with `error_message` unset; a transient rate-limit instance of this same class does neither. Add the qualifier.
  - `inference/llm.py:462-463`: "surface it as a typed, non-retryable error" is wrong for rate-limit bodies; the PR's own `test_inference_llm_retries_rate_limit_429` asserts the retry. Reword.
  - `voice/agent_session.py:345`: `https://docs.livekit.io/reference/agents/events/#pre-recorded-audio` appears to be a guessed deep link; no `docs.livekit.io/reference/...` URL exists anywhere in the repo (the established pattern is `docs.livekit.io/agents/...`). Replace or drop before it 404-rots in a public docstring.
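The coercion fix suggested in the first item can be sketched as follows. This is an illustration of the idea rather than the code that landed: `str_or_none` and `extract_quota_fields` are hypothetical helpers, and only the four field names come from the PR.

```python
from __future__ import annotations

from typing import Any


def str_or_none(value: Any) -> str | None:
    """Keep a wire value only if it is a string."""
    return value if isinstance(value, str) else None


def extract_quota_fields(body: dict[str, Any]) -> dict[str, str | None]:
    """Read the four quota fields, coercing anything that is not a str to None."""
    return {
        key: str_or_none(body.get(key))
        for key in ("quota_type", "category", "hint", "remaining_limit")
    }
```

After coercion, `category` is always hashable, so a later frozenset membership test cannot raise, and a non-string `hint` never reaches the speech path.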
Suggestions
- [silent-failure-hunter] Non-terminal closes still end in dead air by default (`agent_session.py:1561-1567`). Sustained rate-limiting (or any generic LLM/TTS error) past the tolerance closes the session with nothing spoken, the same end-user symptom as #6009. That matches the PR's stated scope ("other errors stay silent"), but the inline comment's justification ("speaking ... for a recoverable blip would be misleading") is wrong in context: `_resolve_error_message` is only reached when the session is irrevocably closing; the blip case never gets there. Either speak `DEFAULT_ERROR_MESSAGE` on any error-close by default, or fix the comment and note the limitation in the `error_message` docstring.
- [silent-failure-hunter] The speak-path warning can't fire in its dominant failure mode (`agent_session.py:1590-1596`). `SpeechHandle._done_fut` is only ever resolved with `set_result(None)`, so a TTS that fails synthesis (e.g. the same depleted project) returns from `wait_for_playout()` normally; the `except Exception` log never fires, and the subsequent `TTSError` is dropped by the `_closing_task` guard. Log at info before speaking; log at warning when `_on_error` drops an error while `_closing_task` is set.
- [silent-failure-hunter] Document the STT-first ordering caveat: for a fully-depleted all-inference project (the configuration the example itself promotes), the STT WS connect 429s first, STT errors have no tolerance counter, and the session closes untyped/silent within seconds, before the LLM path and the spoken hint ever engage. The stt.py/tts.py NOTEs are accurate per-path, but this end-to-end consequence is documented nowhere; worth a sentence in the `error_message` docstring and the example.
- [pr-test-analyzer] Test gaps worth closing: the 16s `_ERROR_MESSAGE_PLAYOUT_TIMEOUT` is never exercised (a hanging-TTS test with the constant monkeypatched would pin it, rated 7/10); tolerance accumulation is unpinned (two errors with `max_unrecoverable_errors=1`, 6/10); `error_message=""` silently behaves like `None`, contradicting the docstring (6/10; document or normalize); `_spy_say` discards kwargs so `add_to_chat_ctx=False` is unpinned (a regression would write error messages into chat history).
- [type-design-analyzer] Polish: add `terminal` to `APIError.__repr__` / `APIStatusError.__repr__` (debugging "why did my session close on turn 1" wants it); one docstring line that `from_response` detection is body-keyed (status intentionally ignored); decide `terminal` vs `permanent` naming now: free before merge, impossible after in a public SDK (keeping `terminal` is defensible).
- [code-simplifier] Tests: build `_rate_limit_error`'s body as `{**INFERENCE_QUOTA_BODY, "category": ..., "hint": ...}` so the category delta is visible by construction; optionally extract the repeated emit-error/await-close plumbing (~6 lines × 9 tests) into two helpers, matching `test_agent_session.py` style.
Strengths
- The category → terminal/retryable taxonomy is exactly right per the gateway source (`MaxGatewayCredits`/`MaxBargeInRequests` terminal; everything else transient), and unknown categories failing open to transient is the correct version-skew posture, pinned by a dedicated test.
- The session-level tests are genuinely behavioral (close events, reason, error identity, exact spoken text) and are the first coverage ever for the `_on_error` close path, including the realtime path and TTS-failure-during-farewell resilience.
- `_close_on_error`'s try/finally + bounded playout + `_aclose_impl`'s force-interrupt means teardown is provably bounded even with a wedged TTS; concurrent `aclose()` is serialized safely.
- The stt.py/tts.py NOTE comments are fully accurate (verified against aiohttp's handshake behavior and the gateway middleware), and the `APIQuotaExceededError` docstring with a working copy-pasteable example is unusually good.
- `APIError.terminal` as a class attribute keeps reads safe on third-party subclasses that predate the PR; nice backward-compat detail.
🤖 Generated with Claude Code
Address review findings on #6012:

- The openai SDK narrows a mapping error body to its `error` value before raising (`_make_status_error`), and the gateway's `error` field is a bare string, so `e.body` never carried the `inference_quota_exceeded` payload and the typed quota path never fired against real traffic. Re-parse `e.response` to recover the full body, and build the test fixture through `_make_status_error_from_response` so the SDK's narrowing stays exercised (the quota test fails against the unfixed plugin).
- Coerce non-str quota body fields to None: an unhashable `category` raised TypeError at the frozenset membership test, and non-str values violated the `str | None` annotations (the endpoint is user-pointable).
- Example: gate the "out of credits" handler on `err.terminal` (the handler also receives transient rate-limit/retry events), fix the stale intro, and make the commented set_attributes snippet runnable from a sync handler.
- Docs: scope the APIQuotaExceededError docstring claim to terminal errors, reword the "non-retryable" comment in inference/llm.py, fix the pre-recorded-audio docs anchor.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pushed 4b82451 addressing the Critical + Important findings.

Fixed

1. […]
2. Wire-data coercion. […]
3. Example. […]
4. Doc accuracy. Scoped the […]

Pushing back on the suggestions

[…]
        return None


    async def _close_on_error(
Maybe instead of adding all these internal functions, we can add a default error hook (like what customers would do) if an error message is given?
Summary

Fixes #6009: a voice agent goes silently unresponsive when the LLM endpoint returns `HTTP 429 {"type": "inference_quota_exceeded", ...}` (e.g. the project is out of LiveKit Inference credits). Today the agent joins the room, publishes its track, and then never speaks: STT transcribes fine, the reply produces no text, TTS is never invoked, and the session absorbs three silent dead turns before closing with no audible or visible signal. The gateway hands us everything we need (`type`, `hint`, `quota_type`) in the response body; we just never looked at it.

This PR makes that failure perceptible by default.

What changed

A. Typed detection: new `APIQuotaExceededError(APIStatusError)` exposing `quota_type`, `category`, `hint`, and `remaining_limit` decoded from the gateway body. It defaults to `retryable=False` since quota exhaustion won't recover on an immediate retry. `create_api_error_from_http` and a `APIQuotaExceededError.from_response(...)` factory construct it whenever the body is an `inference_quota_exceeded` payload. Exported from `livekit.agents`.

B. Plugin wiring: the inference LLM plugin (also the base class for the OpenAI-compatible `livekit-plugins-openai` `LLMStream`) raises the typed error on a 429 quota body, so `ev.error` / `APIError.body` are reliable.

C. Surface on the first occurrence: `AgentSession._on_error` no longer absorbs a terminal quota error up to `max_unrecoverable_errors`. The 3-strike tolerance still applies to transient errors; a known-terminal quota error closes immediately.

D. Perceptible signal, by default: a new `AgentSession(error_message=...)` option speaks a fallback line just before the session closes on an unrecoverable error:

- Omitted: for quota errors, speak the gateway `hint` (generic fallback if absent); other errors stay silent, so there is no behavior change for the generic path.
- `error_message="..."`: speak that message on any unrecoverable error.
- `error_message=None`: disable spoken errors entirely.

Spoken delivery is best-effort (bounded by a timeout, wrapped so a failing TTS can't stall teardown). The `error` and `close` events still fire regardless, carrying the typed error for frontend handlers.

E. Docs, example, tests: docstrings on the new error + option, a `quota_exceeded.py` example, and unit tests.

Acceptance criteria

- `429 inference_quota_exceeded` is detectable via a typed error (`APIQuotaExceededError`) and a documented `body` field.
- […] `error_message` option) the user gets a perceptible signal, never pure silence.
- […] `max_unrecoverable_errors`.
- […] `error`/`close` events now carry the structured info needed to render it.

Test plan

- `uv run pytest tests/test_exceptions.py tests/test_unrecoverable_error.py --unit`: new tests for the typed error, the inference-plugin conversion, surface-on-first-occurrence, and the spoken fallback (default hint / generic / disabled / custom).
- `make check` (ruff format + lint + strict mypy) passes.
- `uv run pytest --unit` suite: 832 passed, 4 skipped (the 9 `test_room.py` errors are pre-existing `FileNotFoundError` from a missing local server binary, unrelated to this change).

Notes / follow-ups

- […] `quota_type` `llm`/`stt`/`tts`/`bargein`, but only the LLM HTTP path is wired today; the inference STT/TTS WebSocket paths don't surface the JSON body on connect, so they'd need separate handling.

🤖 Generated with Claude Code
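The three `error_message` modes described in section D can be sketched as one resolution function. This follows the description text only and makes several assumptions: `NOT_GIVEN` here is a local sentinel standing in for the SDK's, `DEFAULT_ERROR_MESSAGE` is placeholder wording, and `resolve_error_message` with its `is_quota_error`/`hint` arguments is a hypothetical signature.

```python
from __future__ import annotations

from typing import Any

NOT_GIVEN: Any = object()  # hypothetical sentinel standing in for the SDK's NOT_GIVEN
DEFAULT_ERROR_MESSAGE = "Sorry, something went wrong."  # placeholder text, not the SDK's


def resolve_error_message(
    error_message: Any, *, is_quota_error: bool, hint: str | None
) -> str | None:
    """Pick the line to speak before closing, or None to stay silent."""
    if error_message is None:
        return None  # spoken errors disabled
    if error_message is not NOT_GIVEN:
        return error_message  # explicit string: used for any unrecoverable error
    if is_quota_error:
        return hint or DEFAULT_ERROR_MESSAGE  # default: gateway hint, generic fallback
    return None  # default: other errors stay silent
```

A sentinel is needed because `None` already carries meaning ("disable"), so "not passed" has to be distinguishable from it.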