[#11726] refactor(core): harden Capability method calls with correct TCCL#11755

Open
JoegenUSTC wants to merge 2 commits into
apache:mainfrom
JoegenUSTC:fix/capability-classloader-proxy

Conversation

@JoegenUSTC

@JoegenUSTC JoegenUSTC commented Jun 22, 2026

Contributor

What changes were proposed in this pull request?

Wrap the Capability delegate returned by CapabilityHelpers.getCapability()
in a JDK dynamic proxy whose InvocationHandler executes each method call
inside classLoader.withClassLoader(), ensuring the TCCL is always set
correctly before any Capability method runs.

  • CapabilityHelpers.getCapability() now returns a classloader-aware proxy;
    falls back to the raw delegate when classLoader is null (built-in catalogs).
  • Unwraps InvocationTargetException in the proxy so callers see the original
    exception, consistent with the rest of the codebase.
  • CatalogWrapper.classLoader() (package-private) added to expose the
    IsolatedClassLoader for proxy construction; placed after public methods
    per project member ordering convention.
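
The proxy described in the bullets above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the nested `Capability` interface, its `describe()` method, and the `wrap` helper are hypothetical stand-ins, and the TCCL is switched directly on the thread rather than through `IsolatedClassLoader.withClassLoader()`.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

final class TcclProxySketch {
  /** Hypothetical stand-in for the real Capability interface. */
  public interface Capability {
    String describe();
  }

  /** Wraps the delegate so that every call runs with the given loader as TCCL. */
  static Capability wrap(Capability delegate, ClassLoader loader) {
    if (loader == null) {
      return delegate; // assumption: mirrors the "built-in catalog" fallback
    }
    InvocationHandler handler =
        (proxy, method, args) -> {
          Thread current = Thread.currentThread();
          ClassLoader previous = current.getContextClassLoader();
          current.setContextClassLoader(loader);
          try {
            return method.invoke(delegate, args);
          } catch (InvocationTargetException e) {
            throw e.getCause(); // surface the delegate's original exception
          } finally {
            current.setContextClassLoader(previous); // restore on return or throw
          }
        };
    return (Capability)
        Proxy.newProxyInstance(
            Capability.class.getClassLoader(), new Class<?>[] {Capability.class}, handler);
  }

  public static void main(String[] args) {
    // An empty child loader is enough to observe the TCCL switch.
    ClassLoader pluginLoader = new ClassLoader(TcclProxySketch.class.getClassLoader()) {};
    Capability raw =
        () -> String.valueOf(Thread.currentThread().getContextClassLoader() == pluginLoader);
    Capability proxied = wrap(raw, pluginLoader);
    System.out.println("raw call sees plugin loader as TCCL:     " + raw.describe());
    System.out.println("proxied call sees plugin loader as TCCL: " + proxied.describe());
  }
}
```

The `finally` block is what makes the restore unconditional, which is the property the exception-path test in this PR checks.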

Why are the changes needed?

CatalogWrapper.capabilities() returns a Capability object loaded by the
catalog's IsolatedClassLoader. This object is currently used outside any
withClassLoader() boundary (e.g., in TableNormalizeDispatcher,
SchemaNormalizeDispatcher). Any Capability implementation that uses
TCCL-dependent patterns — such as ServiceLoader.load() or Class.forName(name)
without an explicit classloader argument — would run with the server classloader
as TCCL and silently fail to locate classes from the catalog plugin jars.

This PR implements Expected Fix option 2 from #11726: proxy Capability
method calls through the isolated classloader automatically, so callers are
always within the correct classloader context regardless of where the invocation
occurs.

Note: this PR does not address the NoClassDefFoundError triggered by
switch-on-enum synthetic classes in existing Capability implementations.
That is a separate closed-classloader lifecycle issue handled by:

Related: #11726

Does this PR introduce any user-facing change?

No. This change is internal to the core module. No public APIs or
configuration properties are affected.

How was this patch tested?

Added TestCapabilityHelpers with unit tests covering:

  1. testGetCapabilityReturnsProxy — returned object is a JDK dynamic proxy
  2. testProxyExecutesMethodWithinCatalogClassloader — TCCL is set to the
    catalog's IsolatedClassLoader during method execution
  3. testProxyRestoresTcclAfterMethodCall — TCCL is restored after normal return
  4. testProxyRestoresTcclAfterException — TCCL is restored even when the
    delegate throws
  5. testProxyPropagatesReturnValue — return value from delegate is unchanged

@yuqi1129

Contributor

Thanks for the work on this! I dug into it a bit and want to share one concern — I could be missing something, so please correct me if I'm wrong.

I think the proxy here mainly switches the thread context classloader (TCCL) through withClassLoader. But for this particular NoClassDefFoundError, I'm not sure the TCCL is the thing that matters. The synthetic $1 class (from the switch-on-enum) is loaded by the defining classloader of HiveCatalogCapability (the IsolatedClassLoader), not by the TCCL. So changing the TCCL may not change which classloader loads $1.

Also, IsolatedClassLoader.CustomURLClassLoader.loadClass only delegates to baseClassLoader and the exec jars — it never reads the thread context classloader. So withClassLoader doesn't really affect class loading here.

What seems to actually trigger the error: the catalog cache removalListener calls CatalogWrapper.close() → classLoader.close(), which closes the IsolatedClassLoader. If a Capability instance is still used after that (and $1 hasn't been loaded yet), loading $1 from the now-closed classloader fails. Switching the TCCL doesn't reopen it.

I wrote a small standalone test to check this:

SwitchCap (no close)                                        : OK -> true
SwitchCap (close loader, then call)                         : FAILED -> NoClassDefFoundError: SwitchCap$1
SwitchCap (close loader, then call via withClassLoader/TCCL): FAILED -> NoClassDefFoundError: SwitchCap$1
IfElseCap (close loader, then call)                         : OK -> true

So in the closed-loader case, wrapping the call in withClassLoader still fails, while the if/else version (#11707) passes because there is no $1 to load.
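
The closed-loader behavior above can be approximated with plain JDK classes. The sketch below is not the reviewer's repro: `Helper`, `LazyCap`, and `ClosedLoaderSketch` are hypothetical names, and an ordinary lazily resolved helper class stands in for the synthetic `$1` class. It compiles two tiny classes at runtime, so it assumes a JDK (not a bare JRE) is available.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

final class ClosedLoaderSketch {
  /** Compiles Helper and LazyCap into a temp dir; LazyCap needs Helper only when get() runs. */
  static Path compileDemoClasses() throws IOException {
    Path dir = Files.createTempDirectory("closed-loader-demo");
    Path helper = dir.resolve("Helper.java");
    Path cap = dir.resolve("LazyCap.java");
    Files.write(
        helper,
        "public class Helper { public static boolean ok() { return true; } }"
            .getBytes(StandardCharsets.UTF_8));
    Files.write(
        cap,
        ("public class LazyCap implements java.util.function.Supplier<Boolean> {"
                + " public Boolean get() { return Helper.ok(); } }")
            .getBytes(StandardCharsets.UTF_8));
    JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
    if (javac == null) {
      throw new IllegalStateException("no system Java compiler; run on a JDK");
    }
    int rc = javac.run(null, null, null, "-d", dir.toString(), helper.toString(), cap.toString());
    if (rc != 0) {
      throw new IllegalStateException("compilation failed");
    }
    return dir;
  }

  /** Loads LazyCap from a fresh loader, optionally closes the loader, then calls get(). */
  @SuppressWarnings("unchecked")
  static String run(Path dir, boolean closeFirst, boolean setTccl) throws Exception {
    URLClassLoader loader =
        new URLClassLoader(
            new URL[] {dir.toUri().toURL()}, ClosedLoaderSketch.class.getClassLoader());
    Supplier<Boolean> cap =
        (Supplier<Boolean>) loader.loadClass("LazyCap").getDeclaredConstructor().newInstance();
    if (closeFirst) {
      loader.close(); // Helper has not been loaded yet at this point
    }
    Thread current = Thread.currentThread();
    ClassLoader previous = current.getContextClassLoader();
    if (setTccl) {
      current.setContextClassLoader(loader);
    }
    try {
      return "OK -> " + cap.get();
    } catch (NoClassDefFoundError e) {
      return "FAILED -> NoClassDefFoundError";
    } finally {
      current.setContextClassLoader(previous);
      loader.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Path dir = compileDemoClasses();
    System.out.println("no close                  : " + run(dir, false, false));
    System.out.println("close, then call          : " + run(dir, true, false));
    System.out.println("close, then call with TCCL: " + run(dir, true, true));
  }
}
```

Resolution of `Helper` goes through the defining loader of `LazyCap`, so setting the TCCL to the closed loader does not change the outcome.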

Based on this, my gentle suggestions:

  • Maybe keep [#11706] fix(catalog-hive): replace switch-on-enum with if-else in HiveCatalogCapability.caseSensitiveOnName() #11707 (or the same if/else change for Hive and Glue) — it looks like that's the part that really fixes the error, so I'd avoid closing it.
  • The proxy is still a nice hardening for capability methods that truly rely on the TCCL (e.g. ServiceLoader, or Class.forName using the TCCL). I'm just not sure it covers the $1 case here.
  • A more complete root fix might be about the classloader lifecycle (not closing the IsolatedClassLoader while capability instances are still in use), or simply avoiding the synthetic class.

What do you think? Happy to share the small repro code if it helps.

@JoegenUSTC

Contributor Author

Thanks @yuqi1129 — this is a thorough and well-evidenced analysis. Let me address each point and ask for your input on the path forward.

On TCCL and $1 loading — you're right.

The synthetic $1 class is resolved by the defining classloader of HiveCatalogCapability (the IsolatedClassLoader), not the TCCL. IsolatedClassLoader.CustomURLClassLoader.loadClass delegates only to baseClassLoader and the exec jars — it never consults the TCCL. So withClassLoader() has no effect on which classloader loads $1. Your test result makes this unambiguous.

On the closed-classloader scenario — this is the relevant failure path here.

org.apache.gravitino.catalog.hive.* is already covered by isCatalogClass(), which means isSharedClass() returns false for HiveCatalogCapability$1; the delegation order is not the issue. Your closed-classloader scenario is the correct diagnosis: the cache removalListener explicitly calls CatalogWrapper.close() → IsolatedClassLoader.close(), and any subsequent use of a Capability instance that triggers $1 loading will fail permanently.

For context: the isSharedClass delegation issue for org.apache.gravitino.hive.* classes was fixed separately in #11705, now merged.

On the root fix — a note on complexity.

The reference-counted deferred-close approach (only calling IsolatedClassLoader.close() once all Capability references are released) is the right direction, but has real implementation challenges: detecting when Capability proxies are no longer reachable requires PhantomReference/WeakReference with a ReferenceQueue, and we still need to release JAR file handles promptly to avoid resource exhaustion. We'd welcome your input on the design — especially if your repro code reveals additional constraints.
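
For discussion, here is one way a deferred close could look if callers take explicit leases instead of relying on GC-based detection. It swaps the PhantomReference/ReferenceQueue idea mentioned above for plain reference counting, so it only works when every user releases its lease. `RefCountedCloseable` and `Lease` are hypothetical names, not existing Gravitino classes.

```java
import java.io.Closeable;
import java.io.IOException;

final class RefCountedCloseable implements Closeable {
  private final Closeable resource;
  private int leases = 0;
  private boolean closeRequested = false;
  private boolean closed = false;

  RefCountedCloseable(Closeable resource) {
    this.resource = resource;
  }

  /** Takes a lease; rejected once close() has been requested. */
  synchronized Lease acquire() {
    if (closeRequested) {
      throw new IllegalStateException("resource is closing");
    }
    leases++;
    return new Lease();
  }

  /** Requests close: closes immediately if idle, otherwise when the last lease is released. */
  @Override
  public synchronized void close() throws IOException {
    closeRequested = true;
    closeIfIdle();
  }

  synchronized boolean isClosed() {
    return closed;
  }

  // Callers must hold the monitor of this object.
  private void closeIfIdle() throws IOException {
    if (closeRequested && leases == 0 && !closed) {
      closed = true;
      resource.close();
    }
  }

  /** Handle returned by acquire(); releasing it twice has no further effect. */
  final class Lease implements AutoCloseable {
    private boolean released = false;

    @Override
    public void close() throws IOException {
      synchronized (RefCountedCloseable.this) {
        if (released) {
          return;
        }
        released = true;
        leases--;
        closeIfIdle();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    RefCountedCloseable guarded =
        new RefCountedCloseable(() -> System.out.println("underlying resource closed"));
    Lease lease = guarded.acquire();
    guarded.close(); // deferred: one lease is still outstanding
    System.out.println("closed after close(): " + guarded.isClosed());
    lease.close(); // last lease released, underlying close runs now
    System.out.println("closed after release: " + guarded.isClosed());
  }
}
```

The trade-off is that a leaked lease keeps the JAR handles open indefinitely, which is the resource-exhaustion concern raised above.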

Revised scope of this PR (#11755) — would appreciate your assessment.

Given your findings, I'd like to narrow this PR to what it actually delivers: a TCCL hardening for Capability methods that genuinely depend on the thread context classloader (e.g. ServiceLoader, Class.forName). It does not fix the $1 closed-classloader problem, but it is a defensively useful change for those patterns.

Would you consider this narrowed scope acceptable as a standalone improvement? Or would you prefer it be folded into the lifecycle fix in #11726?

For the $1 issue specifically, I'll re-open #11706 as the immediate fix, with #11726 tracking the lifecycle root fix.

Please do share the repro code — it would be very valuable for #11726. Thank you again.

@JoegenUSTC JoegenUSTC changed the title from "[#11726] fix(core): wrap Capability in classloader-aware JDK proxy to prevent synthetic-class CNFE" to "improvement(core): harden Capability method calls with correct TCCL via JDK proxy" Jun 23, 2026
@JoegenUSTC JoegenUSTC changed the title from "improvement(core): harden Capability method calls with correct TCCL via JDK proxy" to "[#11726] refactor(core): harden Capability method calls with correct TCCL via JDK proxy" Jun 23, 2026
@github-actions

github-actions Bot commented Jun 23, 2026


Code Coverage Report

Overall Project 67.17% +0.05% 🟢
Files changed 72.4% 🟢

Module Coverage
aliyun 1.72% 🔴
api 46.82% 🟢
authorization-common 85.96% 🟢
aws 3.66% 🔴
azure 2.47% 🔴
catalog-common 10.4% 🔴
catalog-fileset 80.23% 🟢
catalog-glue 66.91% 🟢
catalog-hive 79.42% 🟢
catalog-jdbc-clickhouse 80.02% 🟢
catalog-jdbc-common 44.22% 🟢
catalog-jdbc-doris 80.28% 🟢
catalog-jdbc-hologres 54.03% 🟢
catalog-jdbc-mysql 79.23% 🟢
catalog-jdbc-oceanbase 80.91% 🟢
catalog-jdbc-postgresql 82.29% 🟢
catalog-jdbc-starrocks 78.51% 🟢
catalog-kafka 77.01% 🟢
catalog-lakehouse-generic 58.53% 🟢
catalog-lakehouse-hudi 79.1% 🟢
catalog-lakehouse-iceberg 85.86% 🟢
catalog-lakehouse-paimon 82.14% 🟢
catalog-model 77.72% 🟢
cli 44.51% 🟢
client-java 78.01% 🟢
common 50.17% 🟢
core 82.6% -0.36% 🟢
filesystem-hadoop3 77.27% 🟢
flink 0.0% 🔴
flink-common 47.12% 🟢
flink-runtime 0.0% 🔴
gcp 14.12% 🔴
hadoop-common 10.88% 🔴
hive-metastore-common 53.77% 🟢
iceberg-common 58.15% 🟢
iceberg-rest-server 73.9% 🟢
idp-basic 85.71% 🟢
integration-test-common 0.0% 🔴
jobs 66.17% 🟢
lance-common 20.83% 🔴
lance-rest-server 60.13% 🟢
lineage 53.02% 🟢
optimizer 82.95% 🟢
optimizer-api 21.95% 🔴
server 86.09% 🟢
server-common 74.18% 🟢
spark 28.57% 🔴
spark-common 41.66% 🟢
trino-connector 40.25% 🟢
Files
Module File Coverage
core CapabilityHelpers.java 80.35% 🟢
CatalogManager.java 68.61% 🟢

@JoegenUSTC

Contributor Author

The CI failure in :plugins:idp-basic:test is unrelated to this PR's changes.

Root cause: test state leakage between repeated Gradle invocations with different testMode/jdbcBackend combinations — user1 created in a previous run is not cleaned up, causing a duplicate key conflict (HTTP 409) in the next run.

Could a maintainer re-run the failed job, or confirm this is a known flaky test?

Copilot AI left a comment

Contributor

Pull request overview

This PR hardens catalog Capability usage by wrapping the Capability delegate returned from CatalogManager.CatalogWrapper#capabilities() in a JDK dynamic proxy that executes each method invocation inside the catalog’s IsolatedClassLoader.withClassLoader(...) scope, ensuring the correct TCCL is set for every Capability call.

Changes:

  • Add a CapabilityHelpers.getCapability(...) proxy wrapper that switches/restores TCCL per method invocation.
  • Expose the catalog wrapper’s IsolatedClassLoader via a new CatalogWrapper#classLoader() accessor to support proxying.
  • Add unit tests validating proxy creation and TCCL restoration/exception behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
core/src/main/java/org/apache/gravitino/catalog/CapabilityHelpers.java Wrap returned Capability in a JDK proxy that runs each invocation within IsolatedClassLoader.withClassLoader(...).
core/src/main/java/org/apache/gravitino/catalog/CatalogManager.java Add CatalogWrapper#classLoader() accessor to expose the isolated loader for proxy construction.
core/src/test/java/org/apache/gravitino/catalog/TestCapabilityHelpers.java Add unit tests for proxy existence and TCCL switching/restoration semantics.

Comment thread core/src/main/java/org/apache/gravitino/catalog/CatalogManager.java
@diqiu50

diqiu50 commented Jun 24, 2026

Contributor

I think the core issue is not just whether the TCCL is set correctly. The bigger boundary violation is that CatalogWrapper returns a Capability object created by the isolated classloader and lets callers invoke it outside the wrapper/classloader boundary.

A safer fix would be to avoid letting plugin-loaded executable objects escape from CatalogWrapper. Instead of returning Capability, CatalogWrapper should expose a doWithCapability(...) method and execute all Capability method calls inside classLoader.withClassLoader(...). Callers should only receive safe values/results, not the plugin Capability instance itself.

Separately, the format of this PR does not follow our style guidelines. Please update it accordingly.

…JDK proxy

Wrap the Capability delegate returned by CapabilityHelpers.getCapability()
in a JDK dynamic proxy whose InvocationHandler executes each method call
inside classLoader.withClassLoader(), ensuring the thread context
classloader (TCCL) is always set correctly before any Capability method runs.

Motivation: Some Capability implementations may use TCCL-dependent patterns
such as ServiceLoader.load() or Class.forName(name) without an explicit
classloader argument. Without this proxy, those calls would run with the
server classloader as TCCL and fail to locate classes from the catalog
plugin jars.

Scope: This PR provides TCCL hardening for the above patterns only. The
NoClassDefFoundError triggered by switch-on-enum synthetic classes is a
separate closed-classloader lifecycle issue, tracked in apache#11726 and fixed
at the catalog level in apache#11706.

Changes:
- CapabilityHelpers.getCapability() wraps delegate in JDK dynamic proxy;
  unwraps InvocationTargetException so callers see the original exception;
  returns delegate directly when classLoader is null (built-in catalogs).
- CatalogManager.CatalogWrapper.classLoader() added (package-private)
  to expose the IsolatedClassLoader for proxy construction; placed after
  public methods per project member ordering convention.
- TestCapabilityHelpers: proxy existence, TCCL switching/restoration,
  return value propagation, and exception TCCL restoration are all tested;
  @AfterEach closes IsolatedClassLoader to prevent resource leak.

Fixes apache#11726
@JoegenUSTC JoegenUSTC force-pushed the fix/capability-classloader-proxy branch from cad6324 to 1f08400 on June 24, 2026 06:49
@JoegenUSTC JoegenUSTC changed the title from "[#11726] refactor(core): harden Capability method calls with correct TCCL via JDK proxy" to "[#11726] refactor(core): harden Capability method calls with correct TCCL" Jun 24, 2026
@JoegenUSTC

Contributor Author

Thanks for the thorough review, @diqiu50.

On the format: I've updated the PR description to follow the standard
template (What changes were proposed, Why are the changes needed,
Does this PR introduce any user-facing change, How was this patch tested),
and trimmed the title to remove the implementation detail.
Let me know if anything still needs adjustment.

On the architectural concern: You're right — the idiomatic pattern in
CatalogWrapper is doWithXxxOps(fn), which keeps plugin-loaded objects
inside the classloader boundary and lets callers receive only safe results.
The current capabilities() method breaks that contract by returning a live
plugin instance that can be invoked outside withClassLoader().

The proxy in this PR patches the symptom (any invocation on the escaped instance
is still guarded by TCCL), but it does not fix the boundary violation itself.
A doWithCapabilityOps(fn) method would be the architecturally correct solution.
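
To make the boundary idea concrete, here is a rough sketch of what such a method could look like. Everything in it is an assumption for illustration: the simplified `Capability` interface with a no-argument `caseSensitiveOnName()`, the stripped-down `CatalogWrapper`, and the use of `java.util.function.Function` in place of the project's `ThrowableFunction`.

```java
import java.util.function.Function;

final class BoundarySketch {
  /** Hypothetical, simplified stand-in for the real Capability interface. */
  interface Capability {
    boolean caseSensitiveOnName();
  }

  /** Hypothetical, stripped-down wrapper; the real CatalogWrapper holds much more state. */
  static final class CatalogWrapper {
    private final Capability capability;
    private final ClassLoader classLoader;

    CatalogWrapper(Capability capability, ClassLoader classLoader) {
      this.capability = capability;
      this.classLoader = classLoader;
    }

    /** Runs fn against the plugin Capability with the catalog loader as TCCL. */
    <R> R doWithCapabilityOps(Function<Capability, R> fn) {
      Thread current = Thread.currentThread();
      ClassLoader previous = current.getContextClassLoader();
      current.setContextClassLoader(classLoader);
      try {
        return fn.apply(capability); // only the result leaves the boundary
      } finally {
        current.setContextClassLoader(previous);
      }
    }
  }

  public static void main(String[] args) {
    ClassLoader pluginLoader = new ClassLoader(BoundarySketch.class.getClassLoader()) {};
    // The fake capability reports whether it ran with the plugin loader as TCCL.
    Capability cap = () -> Thread.currentThread().getContextClassLoader() == pluginLoader;
    CatalogWrapper wrapper = new CatalogWrapper(cap, pluginLoader);
    boolean result = wrapper.doWithCapabilityOps(Capability::caseSensitiveOnName);
    System.out.println("ran inside catalog classloader context: " + result);
  }
}
```

With this shape a call site receives a plain `boolean` rather than a reference to the plugin object, so there is nothing left to invoke outside the classloader scope.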

I see two paths:

  1. Pivot this PR to implement doWithCapabilityOps(ThrowableFunction<Capability, R> fn)
    on CatalogWrapper, and refactor all ~40 getCapability() call sites across
    the 9 NormalizeDispatcher classes. This is the clean fix, but it's a broader change.
  2. Keep this PR as-is (minimal TCCL hardening for patterns like ServiceLoader /
    Class.forName), and track the doWithCapabilityOps boundary refactor as a
    follow-up under [Improvement] Ensure all Capability method calls are made within the correct classloader context #11726.

I'm happy to go with option 1 if that's what you prefer — just want to align on
direction before doing the larger refactor.

One additional note from @yuqi1129's earlier analysis: the $1 synthetic-class
NoClassDefFoundError we originally set out to fix is actually caused by a
closed classloader (cache eviction closes IsolatedClassLoader while a
Capability instance is still referenced externally), not by a missing
withClassLoader() guard. The proxy does not fix that scenario either.
I've updated the PR description accordingly — changed Fix: #11726 to
Related: #11726 — since the root CNFE requires the classloader lifecycle
work tracked in #11726, not this change.

4 participants