feat(rpc): report why a tx-status request timed out#15950
Conversation
Transaction-status requests (`tx`, `EXPERIMENTAL_tx_status`, and the `wait_until` path of `send_tx`/`broadcast_tx_commit`) that time out before reaching the requested `wait_until` finality previously returned a context-free `TIMEOUT_ERROR`. They now return a `TIMEOUT_ERROR` whose `reason` explains what happened: - `pending`: the transaction was observed but is still below the requested finality; carries the last-known status so callers can re-poll. - `not_observed`: the transaction was never seen on chain. - `error`: the node could not produce a status before the timeout. Also refactors `tx_status_fetch` into focused helpers (`poll_tx_status`, `detect_invalid_tx`, `tx_status_on_timeout`). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #15950 +/- ##
==========================================
+ Coverage 72.46% 72.50% +0.03%
==========================================
Files 946 946
Lines 204323 204322 -1
Branches 204323 204322 -1
==========================================
+ Hits 148071 148144 +73
+ Misses 51299 51218 -81
- Partials 4953 4960 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Bump the spec version to 1.2.12 and regenerate it to include the new `TimeoutErrorReason` schema (`pending`/`not_observed`/`error`) on the `TIMEOUT_ERROR` of `RpcTransactionError`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR enhances the JSON-RPC transaction status APIs to preserve and return why a tx-status request timed out by attaching a structured reason to TIMEOUT_ERROR (not_observed, pending with last-known status, or error with debug info). It also refactors tx_status_fetch into smaller helpers and updates the OpenAPI spec accordingly.
Changes:
- Extend
RpcTransactionError::TimeoutErrorto includereason: TimeoutErrorReasonand add unit tests to validate serialization/round-tripping. - Refactor
tx_status_fetchpolling intopoll_tx_status,validate_tx, andtx_status_on_timeoutwhile recording a more informative timeout reason. - Update OpenAPI version + schema to document the new
TIMEOUT_ERROR.info.reasonpayload, and document the behavior in the changelog.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
CHANGELOG.md |
Documents new TIMEOUT_ERROR semantics for tx-status-related timeouts. |
chain/jsonrpc/src/lib.rs |
Implements timeout-reason reporting and refactors tx-status polling logic. |
chain/jsonrpc/src/api/transactions.rs |
Maps TxStatusError::TimeoutError into the richer TimeoutErrorReason::Error form. |
chain/jsonrpc/openapi/src/main.rs |
Bumps OpenAPI spec version. |
chain/jsonrpc/openapi/openapi.json |
Updates schema to require TIMEOUT_ERROR.info.reason and defines TimeoutErrorReason. |
chain/jsonrpc-primitives/src/types/transactions.rs |
Adds TimeoutErrorReason and makes TimeoutError carry reason; adds serde tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| near_jsonrpc_primitives::types::transactions::RpcTransactionError::TimeoutError | ||
| near_jsonrpc_primitives::types::transactions::RpcTransactionError::TimeoutError { | ||
| reason: TimeoutErrorReason::NotObserved, | ||
| } |
| near_jsonrpc_primitives::types::transactions::RpcTransactionError::TimeoutError | ||
| let debug_info = format!("tx_exists timeout, last error: {:?}", last_error); | ||
| let reason = TimeoutErrorReason::Error { debug_info }; | ||
| near_jsonrpc_primitives::types::transactions::RpcTransactionError::TimeoutError { reason } |
There was a problem hiding this comment.
Why do have to fully-qualify this? Looks pretty ugly to me.
There was a problem hiding this comment.
yeah I hate it too but it is the convention in jsonrpc
There was a problem hiding this comment.
I think we can fix that, I don't see any reason why we're using fully qualified ids there 🤔
There was a problem hiding this comment.
note: This ugliness was introduced there so we eventually come back and do something about it in a constructive way.
There are too many types in nearcore that are named almost the same way which leads to problems when nearcore engineers start using internal types instead of JSON RPC ones and then introduce unexpected breaking changes by accident while changing seemingly internal types.
There was a problem hiding this comment.
Thanks for sharing the context! I had no clue about the original motivation. It's not clear to me how fully qualified types prevent using internal types though. Anyway, what would be a constructive solution? A lint that only allows imports from a certain whitelist - currently it would be sth like {std, near_jsonrpc_*, near_primitives, near_client, ... }
| ControlFlow::Break(Ok(result.into())) | ||
| } else { | ||
| ControlFlow::Continue(TimeoutErrorReason::Pending { | ||
| status: Box::new(result.into()), |
There was a problem hiding this comment.
Not really a big deal, but we're calling result.into() and Box::new() every block just to keep the reason that might or might not be used. Maybe we could stash the raw result instead and convert it in tx_status_on_timeout?
| ) -> ControlFlow<Result<RpcTransactionResponse, RpcTransactionError>, TimeoutErrorReason> { | ||
| let (tx_hash, account_id) = tx_info.to_tx_hash_and_account(); | ||
| let request = TxStatus { tx_hash, signer_account_id: account_id.clone(), fetch_receipt }; | ||
| match self.view_client_send(request).await { |
There was a problem hiding this comment.
I believe that when the RPC node doesn't track the relevant shard, ViewClientActor::get_tx_status will return Ok(TxStatusView { execution_outcome: None, status: TxExecutionStatus::None }) not an Err variant, which in turn will be converted to Pending instead of NotObserved in poll_tx_status. That seems contrary to the not_observed/pending semantics described here. Some not observed transactions will be treated as observed, but pending finalization.
| ProcessTxResponse::NoResponse => Self::TimeoutError, | ||
| ProcessTxResponse::NoResponse => Self::TimeoutError { | ||
| reason: TimeoutErrorReason::Error { | ||
| debug_info: "no response from the node".to_string(), |
There was a problem hiding this comment.
Is this message intentionally different than "the node timed out fetching the transaction status" in transactions.rs? If yes, I don't quite understand the difference. If no, maybe add some constructor with a standardized message.
jancionear
left a comment
There was a problem hiding this comment.
I think the change makes sense, but it doesn't solve the biggest UX problems that we have. I remember there were two big ones:
-
The RPC node sends the transaction to the chunk producer, which might reject the transaction based on its state (validity period, congestion, mempool full, etc), but it doesn't let the RPC node know that the transaction was rejected. In this case the user thinks that the transaction was successfully submitted and waits for the result, even though it was rejected early on. Ideally we would notify the user that the transaction was rejected and why.
-
When the RPC node doesn't track the shard it responds with a TimeoutError, which makes no sense.
I think the RPC layer deserves some sort of hollistic approach where we would go through every Err and NoResponse in the json rpc handler and the view client and make sure that the error mapping is sane.
Not sure about the backwards compatibility as well. We need to figure out how to handle this in OpenAPI and OpenRPC.
| Err(RpcTransactionError::TimeoutError { reason }) | ||
| } | ||
|
|
||
| /// Send a transaction idempotently (subsequent send of the same transaction will not cause |
There was a problem hiding this comment.
Also refactors tx_status_fetch into focused helpers (poll_tx_status, detect_invalid_tx, tx_status_on_timeout).
nit: there's no detect_invalid_tx
There was a problem hiding this comment.
Probably renamed to validate_tx. But I think that detect_invalid_tx is actually a better name, as it sometimes is unable to determine if tx is valid or not.
Sure, I wasn't aiming or claiming to solve the UX, I just wanted to improve the UI since the lack of context in the tx_status timeouts surfaces every now and then in user reports.
Thanks, that's a good point and Adam raised the same issue. Let me check with DevEx team what are our options. |
The timeouts might also be coming from this code which converts any error inside of #[must_use]
pub fn process_tx(
&self,
tx: SignedTransaction,
is_forwarded: bool,
check_only: bool,
) -> ProcessTxResponse {
unwrap_or_return!(self.process_tx_internal(&tx, is_forwarded, check_only), {
let signer = self.validator_signer.get();
let me = signer.as_ref().map(|signer| signer.validator_id());
tracing::debug!(target: "client", ?me, ?tx, "dropping tx");
ProcessTxResponse::NoResponse
})
} |
Currently tx_status returns timeout error, while hiding the actual inner error. This changes the semantics to provide fuller information about what actually led to the timeout.
Possible timeout causes:
not_observed: the transaction was never seen on chainpending: the transaction was observed but is below the requested finalityerror: the node could not produce a status before the timeoutAlso refactors
tx_status_fetchinto focused helpers (poll_tx_status,detect_invalid_tx,tx_status_on_timeout).