refactor(backends): move generate_from_raw hook firing into Backend base class#1264
Open
ajbozarth wants to merge 1 commit into
Open
Conversation
…ase class Closes generative-computing#1183. Folds the additive half of generative-computing#1218 Part 1. - `generate_from_raw` becomes a `@final` wrapper on `Backend` that owns pre/post/error hook firing; backends implement `_generate_from_raw` returning `(results, usage)`. - All five backends drop the inline gen_id, latency timing, and three hook-fire blocks; wrapper catches `BaseException` (matches chat-path generative-computing#1181). - `generation_batch_pre_call` payload mutations now propagate to the backend impl (model_options/format/tool_calls), matching the chat path. Adds `model_options` field to `GenerationBatchPreCallPayload`. Closes a gap from generative-computing#1218 where the batch hook was wired for telemetry observation only. - Standardizes `self._model_id` and `self._provider` on every backend. Inlines model-id resolution into ollama/huggingface `__init__`s and deletes the `_get_*_model_id` helpers; renames `_hf_model_id` to `_model_id`. - Backends with per-MOT token counts (ollama, hf, watsonx) now populate `mot.generation.usage` per MOT; openai/litellm leave it `None` since their APIs only report whole-batch usage. `mot._meta["usage"]` writes preserved for generative-computing#1218's follow-up. - Adds `TestGenerationBatchHookCallSites` mirroring the chat-path tests; updates the custom-backend doc snippet. Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull Request
Issue
Fixes #1183. Folds the additive half of #1218 Part 1.
Description
Brings the raw (batch) generation path in line with the chat path's
existing wrapper pattern:
Backend.generate_from_rawbecomes a@finalwrapper that owns hook firing, and backends implement only the new
_generate_from_rawabstract. The threegeneration_batch_*hooks nowfire from one place instead of being duplicated inline in all five
backends.
Reviewer call-outs:
Tuple return on
_generate_from_raw. Backends returntuple[list[ModelOutputThunk], dict | None]—(results, usage). Thewrapper unpacks the tuple, fires post_call with the aggregate, and
returns just
resultsto callers (public signature unchanged).Considered moving usage aggregation into the wrapper, but openai and
litellm only report whole-batch usage at the response root, so the
impl is the only place that knows the right shape. Matches the
existing
_generate_from_context -> tuple[MOT, Context]precedent.Standardized
self._model_idandself._providerinstead of newabstracts. The wrapper needs
modelandproviderfor hookpayloads (pre_call has no MOT yet; error may have no MOT). Considered
adding
_provider_name/_resolved_model_id()abstract members.Instead, finished the partial convention 3 of 5 backends already
used: every backend now sets
self._model_id: strandself._provider: strin__init__. Inlines model-id resolution intoollama and huggingface init bodies and deletes the now-unused
_get_*_model_idhelpers. RenamesLocalHFBackend._hf_model_idto_model_idfor consistency. ~10 chat-path setter sites switch toreading these attributes — no value changes, just one canonical
source per backend.
BaseExceptionon the raw-path error wrapper, matching chat. Thechat-path wrapper's
BaseExceptionwas added in refactor(telemetry)!: move backend tracing onto plugin/hook pattern #1181; the raw pathwas tentatively left at
except Exception. The same gap exists:synchronous
KeyboardInterrupt/asyncio.CancelledError/SystemExitinside the impl currently bypass the error hooksilently. This PR aligns to
BaseException. Behavior change: rawcancellation/interrupts now fire
generation_batch_errorbeforepropagating.
Batch pre_call payload mutations now propagate. When the batch
hooks were added in fix(backends): unify raw-path token usage on mot.generation.usage; guard eval_count=None #1218 they were wired for telemetry-only
observation: a plugin could mutate
model_options/format/tool_callson the pre_call payload, and the chat path would honorthose mutations, but the batch path silently dropped them. Same
plugin, same intent, different result depending on which API the
user reached. The wrapper now captures the post-hook payload and
reassigns the locals before calling
_generate_from_raw, identicalto the chat-path idiom. Adds
model_optionstoGenerationBatchPreCallPayload(it didn't exist on the batchpayload at all). Test inverted from
_are_not_propagatedto_propagates, mirroring the chat-path test.Folded the additive half of fix(backends): unify raw-path token usage on mot.generation.usage; guard eval_count=None #1218 Part 1. Backends with per-MOT
token counts (ollama, huggingface, watsonx) now also populate
mot.generation.usageper MOT. Backends with whole-batch-only usage(openai, litellm) leave per-MOT
mot.generation.usage = Noneandsurface the aggregate via the tuple return. Null-token policy:
all-or-nothing — if any of
prompt_tokens/completion_tokens/total_tokenscannot be determined for a MOT, that MOT'sgeneration.usagestaysNone(matchesdict | Nonetyping; ollamadocs document
Optional[int]as "not yet available" rather than"zero"). Watsonx's previous undocumented
or 0coerce switches tothe same policy.
The existing
mot._meta["usage"]writes are preserved everywhere —budget_forcing_alg.pystill reads from them. fix(backends): unify raw-path token usage on mot.generation.usage; guard eval_count=None #1218's remainingscope (consumer-side reads in
budget_forcing_alg.pyand themot._meta["usage"]deprecation path) stays open for follow-up.Originally folded in to support generic usage aggregation in the
wrapper; kept after the openai/litellm pivot to call-out 1's tuple
return because the per-MOT writes are still the right thing for the
3 backends whose APIs expose them.
Custom-backend doc updated.
docs/docs/community/building-extensions.mdwas teaching users tooverride the public
generate_from_context/generate_from_rawdirectly; those are
@finalnow. The example was updated toimplement
_generate_from_context/_generate_from_rawand set_model_id/_provider.Test coverage. Adds
TestGenerationBatchHookCallSitesintest/plugins/test_hook_call_sites.pymirroring the chat-pathfiring-site tests (9 tests).
_MockBackendextended to cover bothpaths with hardcoded behavior; error case via inline subclass per
the existing
RecordingBackend(_MockBackend)pattern. Mocks intest/stdlib/test_streaming.py,test/core/test_logger_plugin_hooks.py,and
test/stdlib/frameworks/test_react_framework.pyupdated to thenew abstract contract.
Testing
Attribution
Adding a new component, requirement, sampling strategy, or tool?
If your PR adds or modifies one of the types below, check the matching box. A checklist of type-specific review items will be posted as a comment.
NOTE: Please ensure you have an issue that has been acknowledged by a core contributor and routed you to open a pull request against this repository. Otherwise, please open an issue before continuing with this pull request.