Return HTTP 503 on concurrent rename instead of 500 by vigneshio · Pull Request #4646 · apache/polaris

vigneshio · 2026-06-08T08:33:23Z

When a concurrent modification occurs during renameTable/renameView, the
operation now returns HTTP 503 (Service Unavailable) instead of HTTP 500
(Internal Server Error), signalling a transient, retryable condition.

Why

renameTableLike in IcebergCatalog previously threw a bare RuntimeException
for both TARGET_ENTITY_CONCURRENTLY_MODIFIED and ENTITY_CANNOT_BE_RESOLVED.
IcebergExceptionMapper maps unknown RuntimeExceptions to 500, so clients
received an opaque server error instead of a retryable signal. The code even
admitted this was temporary: "this is temporary. Should throw a special error
that will be caught and retried".

Both statuses are documented in BaseResult.ReturnStatus as retryable
concurrency conditions ("the client should retry"), and the rename
implementation (TransactionalMetaStoreManagerImpl.renameEntityInCurrentTxn)
returns both on real races, so each branch is reachable in practice.

HTTP 409 is intentionally not used here: the Iceberg REST rename endpoint
reserves 409 for "the target identifier to rename to already exists" (which
Polaris already uses via the ENTITY_ALREADY_EXISTS case). Mapping a concurrent
modification to 409 would collide with that meaning, so a spec-compliant client
would surface it as AlreadyExistsException rather than retrying. The rename
endpoint lists 503 for the transient case, so this change uses 503 instead.
(Thanks to @nandorKollar for catching the 409 semantic mismatch in review.)
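
A minimal sketch of the client-side reasoning above, assuming a client that acts only on the status codes discussed for the rename endpoint. `RenameResponseClassifier` and its `Outcome` enum are hypothetical names for illustration, not Iceberg or Polaris APIs; the point is that a 409 is reported as "already exists" and never retried, while a 503 is treated as transient.

```java
/** Hypothetical stand-in for a rename client's status handling; not an Iceberg API. */
final class RenameResponseClassifier {

  /** What the client does with a rename response. */
  enum Outcome { SUCCESS, NOT_FOUND, ALREADY_EXISTS, RETRY_LATER, UNEXPECTED_ERROR }

  private RenameResponseClassifier() {}

  static Outcome classify(int httpStatus) {
    if (httpStatus >= 200 && httpStatus < 300) {
      return Outcome.SUCCESS;
    }
    switch (httpStatus) {
      case 404:
        return Outcome.NOT_FOUND;
      case 409:
        // The rename endpoint reserves 409 for "target identifier already exists".
        return Outcome.ALREADY_EXISTS;
      case 503:
        // Transient condition: the only outcome in this sketch that is retried.
        return Outcome.RETRY_LATER;
      default:
        return Outcome.UNEXPECTED_ERROR;
    }
  }

  static boolean isRetryable(int httpStatus) {
    return classify(httpStatus) == Outcome.RETRY_LATER;
  }
}
```

With this shape, returning 409 for a concurrent modification would end up in the `ALREADY_EXISTS` branch, which is the mismatch described above.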

Changes

IcebergCatalog.renameTableLike: replaced the bare RuntimeException with
ServiceUnavailableException (HTTP 503) for concurrent rename failures
(TARGET_ENTITY_CONCURRENTLY_MODIFIED, ENTITY_CANNOT_BE_RESOLVED).
Added parameterized test testConcurrencyConflictRenameTable covering both
statuses, verifying they surface as ServiceUnavailableException (503).
CHANGELOG.md: added a Fixes entry.

Known follow-up (out of scope)

The same renameTableLike switch still lacks a case for
CATALOG_PATH_CANNOT_BE_RESOLVED (cross-namespace renames where the destination
path can't be resolved). That falls through to default → IllegalStateException → 500, whereas updateTableLike handles it as NotFoundException (404). This
is a pre-existing gap not introduced by this change; tracking separately.

Testing

./gradlew :polaris-runtime-service:compileTestJava — passes
./gradlew :polaris-runtime-service:spotlessCheck :polaris-runtime-service:checkstyleMain :polaris-runtime-service:checkstyleTest — passes
./gradlew :polaris-runtime-service:test --tests "org.apache.polaris.service.catalog.iceberg.IcebergCatalogRelationalTest.testConcurrencyConflictRenameTable" — passes (both parameterized cases)
./gradlew :polaris-runtime-service:test --tests "org.apache.polaris.service.catalog.iceberg.IcebergCatalogRelationalTest.testConcurrencyConflictUpdateTableDuringFinalTransaction" — passes (regression check)

Checklist

🛡️ Don't disclose security issues! (contact security@apache.org)
🔗 Clearly explained why the changes are needed
🧪 Added/updated tests with good coverage, or manually tested (and explained how)
💡 Added comments for complex logic
🧾 Updated CHANGELOG.md (if needed)
📚 Updated documentation in site/content/in-dev/unreleased (if needed)

nandorKollar · 2026-06-08T09:54:26Z

Although this change makes sense to me, I think there's a disconnect between the Iceberg spec and our implementation. The spec for POST /v1/{prefix}/tables/rename and POST /v1/{prefix}/views/rename states that the 409 return code means "Conflict - The target identifier to rename to already exists as a table or view", which is not exactly the same as two conflicting transactions, is it?

vigneshio · 2026-06-08T11:20:20Z

Thanks @nandorKollar, good catch.

The rename endpoints define 409 as "the target identifier to rename to already exists", and we already use 409 for that case (ENTITY_ALREADY_EXISTS → AlreadyExistsException). Mapping a concurrent-modification failure to 409 overloads that meaning, so a spec-compliant client would read it as "already exists" rather than a transient condition. (And my "consistency with updateTableLike" reasoning doesn't really apply, since 409 means a commit conflict on the update endpoint but "already exists" on rename.)

I've updated the PR to map the concurrency statuses (TARGET_ENTITY_CONCURRENTLY_MODIFIED, ENTITY_CANNOT_BE_RESOLVED) to 503 ServiceUnavailable, which the rename spec lists for the transient case, keeping 409 strictly for "already exists".

(If the team prefers server-side retry for rename, as the original TODO suggested, that would be a larger change; happy to do it as a follow-up.)

nandorKollar · 2026-06-08T11:46:25Z

+        // here because the rename endpoint reserves 409 for "target already exists" (handled by
+        // the ENTITY_ALREADY_EXISTS case above).
        case BaseResult.ReturnStatus.TARGET_ENTITY_CONCURRENTLY_MODIFIED:
        case BaseResult.ReturnStatus.ENTITY_CANNOT_BE_RESOLVED:


I think we should narrow this down to case BaseResult.ReturnStatus.TARGET_ENTITY_CONCURRENTLY_MODIFIED; I'm not sure about ENTITY_CANNOT_BE_RESOLVED. Can a retry solve the problem with entity resolution?

nandorKollar · 2026-06-08T11:51:16Z

Thanks, 503 sounds better, but it is still not the best response code IMHO. I think it is intended to tell the client to slow down. It seems to me that the Iceberg spec doesn't have a clear response code for rename operations that indicates there was a conflict and the client should retry the operation.

nandorKollar · 2026-06-08T13:59:28Z

Opened a discussion on the dev list: https://lists.apache.org/thread/tr8zh8121t2jb41s0q2yd9s73y2tp2tq

vigneshio · 2026-06-08T14:18:45Z

I'll hold off finalizing until the dev list discussion wraps up, then update this - splitting ENTITY_CANNOT_BE_RESOLVED to 404 and aligning the rest with whatever we decide. Thanks for taking it to the list.

adutra · 2026-06-10T12:28:10Z

@@ -2494,10 +2495,14 @@ private void renameTableLike(
        case BaseResult.ReturnStatus.ENTITY_NOT_FOUND:


If the goal is to tackle concurrent renames, I think we need to handle CATALOG_PATH_CANNOT_BE_RESOLVED as well. It's raised when a catalog path resolution fails during a write, see TransactionalMetaStoreManagerImpl.renameEntity(). I suggest it be mapped to NoSuchNamespaceException.

adutra · 2026-06-10T12:36:15Z

+        // here because the rename endpoint reserves 409 for "target already exists" (handled by
+        // the ENTITY_ALREADY_EXISTS case above).
        case BaseResult.ReturnStatus.TARGET_ENTITY_CONCURRENTLY_MODIFIED:
        case BaseResult.ReturnStatus.ENTITY_CANNOT_BE_RESOLVED:


ENTITY_CANNOT_BE_RESOLVED is very similar to CATALOG_PATH_CANNOT_BE_RESOLVED: in TransactionalMetaStoreManagerImpl.renameEntity() the former is raised when the old path cannot be resolved, and the latter, when the new path cannot be resolved.

polaris/polaris-core/src/main/java/org/apache/polaris/core/persistence/transactional/TransactionalMetaStoreManagerImpl.java

Lines 1205 to 1207 in 005e889

    if (resolver.isFailure()) {
      return new EntityResult(BaseResult.ReturnStatus.ENTITY_CANNOT_BE_RESOLVED, null);
    }

polaris/polaris-core/src/main/java/org/apache/polaris/core/persistence/transactional/TransactionalMetaStoreManagerImpl.java

Lines 1238 to 1240 in 005e889

    if (resolver.isFailure()) {
      return new EntityResult(BaseResult.ReturnStatus.CATALOG_PATH_CANNOT_BE_RESOLVED, null);
    }

I'd note though that the NoSQL persistence does not raise CATALOG_PATH_CANNOT_BE_RESOLVED.

I'd suggest grouping them together and throwing NoSuchNamespaceException instead.

However, NoSuchNamespaceException maps to 404 which is non-retriable. But the comments on BaseResult for both codes say they are retriable:

polaris/polaris-core/src/main/java/org/apache/polaris/core/persistence/dao/entity/BaseResult.java

Lines 76 to 84 in 46e6891

    // the specified catalog path cannot be resolved. There is a possibility that by the time a call
    // is made by the client to the persistent storage, something has changed due to concurrent
    // modification(s). The client should retry in that case.
    CATALOG_PATH_CANNOT_BE_RESOLVED(3),

    // the specified entity (and its path) cannot be resolved. There is a possibility that by the
    // time a call is made by the client to the persistent storage, something has changed due to
    // concurrent modification(s). The client should retry in that case.
    ENTITY_CANNOT_BE_RESOLVED(4),

I actually think the comments are wrong. If either the old or new path has been deleted by a concurrent commit, clients should not retry.

Interesting info, but IIRC TransactionalMetaStoreManagerImpl is not actually used in actual OSS call paths... perhaps only with the in-memory persistence, but JDBC does not use it either, I'm pretty sure 🤔

I ALWAYS get fooled by its name 😅

Looking at AtomicOperationMetaStoreManager.renameEntity() this time: oddly enough, it raises neither CATALOG_PATH_CANNOT_BE_RESOLVED nor ENTITY_CANNOT_BE_RESOLVED. It actually seems not to care about the validity of the entity path before and after 🤷‍♂️

Interesting... 🤔 @vigneshio : How did you hit these errors in practice?

If ENTITY_CANNOT_BE_RESOLVED is not a valid and expected response of a rename, then shouldn't we handle it like 'everything else' and throw new IllegalStateException( "Unknown error status " + returnedEntityResult.getReturnStatus());?

ENTITY_CANNOT_BE_RESOLVED is used by the NoSQL persistence.

adutra · 2026-06-10T12:57:03Z

+        // Transient concurrency conditions: surface as 503 so clients can retry. We avoid 409
+        // here because the rename endpoint reserves 409 for "target already exists" (handled by
+        // the ENTITY_ALREADY_EXISTS case above).
        case BaseResult.ReturnStatus.TARGET_ENTITY_CONCURRENTLY_MODIFIED:


I agree, and it's very unfortunate, that we can't use 409. It describes perfectly what happened, and it's retriable; but the Iceberg spec is very opinionated and maps this code to the "already exists" case (and seems to imply that the error is not retriable, in contradiction with the HTTP spec).

We can't force a 409 for other use cases because clients would surface the error as an AlreadyExistsException:

https://github.com/apache/iceberg/blob/17fc6da837442443421cfbac01ff2941a820ba20/core/src/main/java/org/apache/iceberg/rest/ErrorHandlers.java#L156-L157

So, I agree that ServiceUnavailableException is the least worst choice. It's retriable, which is all that counts.

Note: 429 (Too Many Requests) is also retriable but should in theory include a Retry-After header in the response. It may trigger a client throttling that would be undesirable.

Exactly, this conflict is a non-resolvable conflict, which is normally not solvable with a retry.

Retry-After is optional in 429 responses, AFAIK. If we go with 429, I'd think we should not set Retry-After to indicate that the server is not asking the client to back off, and the client is free to decide when to retry.

nandorKollar · 2026-06-12T07:20:41Z

@vigneshio it seems there is a consensus on the mailing list that 503 is the best choice among the available options for signaling a conflict error. We should probably open a follow-up issue to implement server-side retries for conflicting rename operations. However, ENTITY_CANNOT_BE_RESOLVED should return a 404 instead. cc. @flyrain @adutra @dimas-b @rmannibucau

rmannibucau · 2026-06-12T07:46:52Z

I'll just write this for the record, since you seem to be converging, but I still think this leads to wrong behavior on the client side:

503 tells the client that the server is down or overloaded, so clients may cache it per client/application (that is, not per entity); that is the wrong behavior here
429 carries retry semantics (driven by headers or a default, often an exponential backoff)
I understand the issue you point out with 409, but from a client perspective it is not an issue at all IMHO
412 can be a workaround that breaks clients and gateways less, IMHO, if 409 really is an issue (but I think Iceberg can be pinged to refine the 409; there is no valid reason it is not used there)

Another thing to consider, I think, is that if mutations are managed internally as a queue (potentially distributed, but let's stick to the design), then this ambiguity never arises and the plain Iceberg statuses are fine (mainly 409 and 404). So maybe the question is not which status to use but how to implement this correctly; I think Polaris could simply delay the response until the previous execution has ended.

The overall concern is using a status whose semantics clients and gateways/proxies already know and associating another meaning with it, which leads to wrong behavior (the global 503 cache, think circuit breakers, is a good example of that).

vigneshio · 2026-06-12T14:23:09Z

Thanks @dimas-b @nandorKollar @adutra. Based on the discussion on the dev list, I've updated the PR:

TARGET_ENTITY_CONCURRENTLY_MODIFIED → 503 (temporary error, can be retried)
ENTITY_CANNOT_BE_RESOLVED and CATALOG_PATH_CANNOT_BE_RESOLVED → 404 NoSuchNamespaceException, grouped together as suggested by @adutra.

I have also opened follow-up issue #4729 for server-side retries on rename conflicts. I agree with @rmannibucau that server-side retry is the better long-term solution, and that work can be handled separately.

Concurrent modification during renameTable/renameView now returns 503 (retriable transient conflict) instead of a bare 500. Source/target paths that cannot be resolved (ENTITY_CANNOT_BE_RESOLVED, CATALOG_PATH_CANNOT_BE_RESOLVED) now return 404 NoSuchNamespaceException rather than 500, since a concurrently dropped path is not retriable.
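
The mapping described above can be sketched as follows. This is a simplified, self-contained illustration and not the actual `IcebergCatalog.renameTableLike` code: the `ReturnStatus` enum is reduced to the statuses named in this PR (plus a hypothetical `OTHER` to exercise the default branch), and the exception classes are local stand-ins that carry the HTTP status the discussion associates with each of them.

```java
/** Simplified stand-in for the rename status switch; names mirror the PR, the types are hypothetical. */
final class RenameStatusMapping {

  enum ReturnStatus {
    SUCCESS,
    ENTITY_ALREADY_EXISTS,
    TARGET_ENTITY_CONCURRENTLY_MODIFIED,
    ENTITY_CANNOT_BE_RESOLVED,
    CATALOG_PATH_CANNOT_BE_RESOLVED,
    OTHER // hypothetical: any status the switch does not handle
  }

  /** Local stand-in for an exception that an exception mapper turns into an HTTP status. */
  static class MappedException extends RuntimeException {
    final int httpStatus;

    MappedException(int httpStatus, String message) {
      super(message);
      this.httpStatus = httpStatus;
    }
  }

  private RenameStatusMapping() {}

  static void throwIfFailed(ReturnStatus status) {
    switch (status) {
      case SUCCESS:
        return;
      case ENTITY_ALREADY_EXISTS:
        // 409 stays reserved for "target identifier already exists".
        throw new MappedException(409, "AlreadyExistsException");
      case TARGET_ENTITY_CONCURRENTLY_MODIFIED:
        // Transient conflict: retriable, so surface it as 503.
        throw new MappedException(503, "ServiceUnavailableException");
      case ENTITY_CANNOT_BE_RESOLVED:
      case CATALOG_PATH_CANNOT_BE_RESOLVED:
        // A concurrently dropped source or target path is not retriable.
        throw new MappedException(404, "NoSuchNamespaceException");
      default:
        throw new IllegalStateException("Unknown error status " + status);
    }
  }

  /** Status a client would observe; unmapped runtime exceptions become 500. */
  static int httpStatusFor(ReturnStatus status) {
    try {
      throwIfFailed(status);
      return 204;
    } catch (MappedException e) {
      return e.httpStatus;
    } catch (IllegalStateException e) {
      return 500;
    }
  }
}
```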

adutra

Thanks @vigneshio for this PR!

dimas-b · 2026-06-12T16:07:26Z

Re: 412 - I think it is closely related to If-Unmodified-Since / If-Match in the request headers, so I'm not sure it's a perfect match for this case either 🤷

rmannibucau · 2026-06-12T16:40:23Z

@dimas-b agreed, but 412 doesn't have the semantic issues of 503 and 429 in middleware (it is rarely used today AFAIK), so it is the "least worst" from my window ;) The real issue is trying to use a spec designed for assets for an API; this has always been wrong by design, and this is where *RPC-style solutions are far more relevant. Hopefully Iceberg gets a JSON-RPC catalog one day ;)

dimas-b · 2026-06-12T17:56:41Z

None of the options are ideal 😅 We're effectively trying to work around an IRC spec problem, while allowing reasonable clients to recover from this kind of failure without having to make any Polaris-specific assumptions.

As I commented on dev, from my POV 503 is the easiest to handle on the client side because it does not carry any implications about the state of the catalog. The server merely could not handle the request (from the client perspective).

With 503, the RFC is pretty lenient towards servers. I do not think clients can assume any service-wide outage on receiving a 503 response from one particular request.

As Polaris will not provide a Retry-After response header in this case, the client is free to retry at any time.
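
A minimal sketch of what "free to retry at any time" could look like on the client side, assuming the rename call is wrapped in a function that returns the HTTP status. `RenameRetry`, its parameters, and the backoff cap are hypothetical choices for illustration; with no Retry-After header, the delay policy is entirely up to the client.

```java
import java.util.function.IntSupplier;

/** Hypothetical client-side retry helper; not part of Polaris or Iceberg. */
final class RenameRetry {

  private RenameRetry() {}

  /**
   * Calls {@code attempt} until it returns something other than 503 or
   * {@code maxAttempts} is reached, and returns the last status observed.
   */
  static int callWithRetry(IntSupplier attempt, int maxAttempts, long initialBackoffMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    long backoff = initialBackoffMillis;
    int status = attempt.getAsInt();
    for (int i = 1; i < maxAttempts && status == 503; i++) {
      try {
        Thread.sleep(backoff);
      } catch (InterruptedException e) {
        // Preserve the interrupt flag and stop retrying.
        Thread.currentThread().interrupt();
        return status;
      }
      // Exponential backoff with an arbitrary cap chosen for this sketch.
      backoff = Math.min(backoff * 2, 1_000L);
      status = attempt.getAsInt();
    }
    return status;
  }
}
```

Only 503 is retried here; 404 and 409 are returned to the caller immediately, which matches the non-retriable meaning those codes have on the rename endpoint.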

rmannibucau · 2026-06-12T20:51:10Z

The 503 (Service Unavailable) status code indicates that the server is currently unable to handle the request due to a temporary overload or scheduled maintenance, which will likely be alleviated after some delay.

So, like 504, it is literally global (it is about the server), so resilience4j, MicroProfile Fault Tolerance and friends often use an implementation that opens the circuit breaker after a few occurrences. That is not insane, and it is quite common behavior on load balancers.
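
To make the concern concrete, here is a deliberately naive sketch of a breaker that is keyed per service rather than per entity. `ServiceWideBreaker` is a hypothetical class and does not reproduce the algorithm of resilience4j or any other library (those typically use failure rates over a sliding window); it only shows how a few 503 responses for one contended table could block requests for unrelated tables.

```java
/** Hypothetical, simplified circuit breaker shared by all requests to one service. */
final class ServiceWideBreaker {
  private final int threshold;
  private int consecutiveUnavailable;
  private boolean open;

  ServiceWideBreaker(int threshold) {
    if (threshold < 1) {
      throw new IllegalArgumentException("threshold must be >= 1");
    }
    this.threshold = threshold;
  }

  /** False once the breaker has opened; this sketch never closes it again. */
  boolean allowsRequests() {
    return !open;
  }

  /** Records a response status; any non-503 response resets the failure streak. */
  void record(int httpStatus) {
    if (httpStatus == 503) {
      consecutiveUnavailable++;
      if (consecutiveUnavailable >= threshold) {
        open = true;
      }
    } else {
      consecutiveUnavailable = 0;
    }
  }
}
```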

github-project-automation Bot added this to Basic Kanban Board Jun 8, 2026

github-project-automation Bot moved this to PRs In Progress in Basic Kanban Board Jun 8, 2026

vigneshio force-pushed the fix-rename-concurrent-conflict-409 branch from a20a385 to d329502 Compare June 8, 2026 11:10

vigneshio changed the title ~~Return HTTP 409 Conflict on concurrent rename instead of 500~~ Return HTTP 503 on concurrent rename instead of 500 Jun 8, 2026

nandorKollar reviewed Jun 8, 2026

adutra reviewed Jun 10, 2026

vigneshio force-pushed the fix-rename-concurrent-conflict-409 branch from d329502 to 15da8eb Compare June 12, 2026 13:42

vigneshio mentioned this pull request Jun 12, 2026

Server-side retry for conflicting rename operations #4729

Open

vigneshio force-pushed the fix-rename-concurrent-conflict-409 branch from 15da8eb to 51bea3d Compare June 12, 2026 14:26

adutra approved these changes Jun 12, 2026

github-project-automation Bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Jun 12, 2026

dimas-b approved these changes Jun 12, 2026

nandorKollar approved these changes Jun 12, 2026
