AWS: Fix sdk-ScheduledExecutor thread leak by sharing a single executor across S3 clients #16129
xiaoxuandev wants to merge 1 commit into apache:main
Conversation
anoopj left a comment
I think this fixes the thread exhaustion problem in #15898 (thanks for working on this fix) by sharing the S3 executors.
But don't we still have a problem with S3 clients getting leaked? That would manifest as Java objects requiring heap space, open TCP connections etc?
So I worry that we are fixing the symptom and not the root cause of the leak.
@anoopj Thanks for the review! Good question. This fix addresses the S3Client object leak as well, not just the thread exhaustion.

Why S3Clients were leaking before: each S3Client created its own `ScheduledThreadPoolExecutor` whose core threads block indefinitely on `DelayedWorkQueue.take()`. These threads are GC roots, and they hold a reference back to the executor via the worker's run loop. This anchors the executor in memory, and since the S3Client is reachable from the executor's task infrastructure, the entire chain (thread → executor → S3Client → `S3FileIO`) stays strongly reachable. GC cannot collect any of it.

Why the shared executor fixes this: with a shared executor, the S3Client no longer owns any threads. After eviction there are no GC roots holding the S3Client, so it becomes unreachable and GC collects it normally.

Verification on our test cluster (Thrift Server, `cache.expiration-interval-ms=10`, continuous queries): repeated `jmap -histo` snapshots over time show the `S3FileIO` instance count oscillating between 20 and 26, never accumulating.
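To make the GC-root mechanics concrete, here is a minimal standalone sketch (not from the PR) showing that a `ScheduledThreadPoolExecutor` with a live core thread cannot be collected even when nothing else references it:

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.ScheduledThreadPoolExecutor;

public class ExecutorGcRootDemo {
  public static void main(String[] args) throws InterruptedException {
    ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
    executor.prestartAllCoreThreads(); // core thread now blocks on DelayedWorkQueue.take()

    WeakReference<ScheduledThreadPoolExecutor> ref = new WeakReference<>(executor);
    executor = null; // drop the only direct reference, like an evicted S3Client would

    System.gc();
    Thread.sleep(100);

    // The live worker thread is a GC root that references the executor, so the
    // executor (and everything reachable from it) stays strongly reachable.
    System.out.println("collected? " + (ref.get() == null)); // prints: collected? false
    System.exit(0); // the non-daemon worker thread would otherwise keep the JVM alive
  }
}
```

In the leak scenario, the object graph hanging off those worker threads includes the S3Client and `S3FileIO`, so none of it is reclaimed either.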
Thank you for clarifying. But why does the AWS SDK's `sdk-ScheduledExecutor` thread end up holding a reference back to the S3Client in the first place?
Thank you for the explanation @xiaoxuandev. That makes sense. Basically the back reference is coming from Spark's use of `inheritableThreadLocals`.

I also did a quick cross-check on ADLS and GCS to see if we have the same problem: neither creates per-instance thread pools (both use static singletons), so they seem low risk.
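For readers following along, a minimal sketch (not from the PR) of the inheritance mechanism: worker threads created by an executor copy the creating thread's `inheritableThreadLocals` at construction time, which is how the Spark session chain ends up referenced from the SDK's scheduler threads.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

public class InheritableThreadLocalDemo {
  // Stands in for the Spark session chain (SparkSession -> ... -> S3FileIO)
  private static final InheritableThreadLocal<Object> SESSION = new InheritableThreadLocal<>();

  public static void main(String[] args) throws Exception {
    SESSION.set(new Object());

    // The first submission creates the worker thread from this (parent) thread,
    // and the worker copies the parent's inheritableThreadLocals at construction.
    ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
    Object inherited = executor.submit(SESSION::get).get();

    System.out.println("worker sees parent's value? " + (inherited != null)); // true
    executor.shutdown();
  }
}
```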
anoopj left a comment
Will there be a separate PR for Async Client?
```java
public static final boolean S3_SHARED_SCHEDULED_EXECUTOR_ENABLED_DEFAULT = true;

/**
 * Pool size for the shared ScheduledExecutorService. Matches the AWS SDK v2 default pool size.
```
Perhaps clarify that this is only used for bookkeeping and doesn't do data transfer.
updated, thanks!
```java
}

@Override
public List<Runnable> shutdownNow() {
```
Perhaps add a test for shutdown being a no-op?
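A sketch of what such a test could look like, assuming JUnit 5 with AssertJ and a `NonClosingScheduledExecutorService(delegate)` constructor (both assumptions, not confirmed by the diff):

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import org.junit.jupiter.api.Test;

public class TestNonClosingScheduledExecutorShutdown {
  @Test
  public void shutdownIsNoOp() {
    ScheduledExecutorService delegate = Executors.newScheduledThreadPool(1);
    // Constructor shape is an assumption; adapt to how the wrapper is actually built.
    ScheduledExecutorService wrapper = new NonClosingScheduledExecutorService(delegate);

    wrapper.shutdown();
    // Assuming the no-op shutdownNow() reports no drained tasks.
    assertThat(wrapper.shutdownNow()).isEmpty();

    // The delegate must survive the SDK calling shutdown() on the wrapper.
    assertThat(delegate.isShutdown()).isFalse();

    delegate.shutdownNow(); // clean up the real executor after the test
  }
}
```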
```java
this.isSharedScheduledExecutorEnabled =
    PropertyUtil.propertyAsBoolean(
        properties,
        S3_SHARED_SCHEDULED_EXECUTOR_ENABLED,
        S3_SHARED_SCHEDULED_EXECUTOR_ENABLED_DEFAULT);
this.sharedScheduledExecutorPoolSize =
    PropertyUtil.propertyAsInt(
        properties,
        S3_SHARED_SCHEDULED_EXECUTOR_POOL_SIZE,
```
Two catalogs can configure these with separate values, right? But we have one JVM-wide singleton property, which means that the tunable for one of the catalogs will be silently ignored. Is that acceptable?
Good point. Since this is a JVM-wide singleton, a per-catalog pool-size tunable doesn't make sense. I'll remove the `s3.shared-scheduled-executor.pool-size` property and hard-code the pool size to 5 (the AWS SDK v2 default). The `s3.shared-scheduled-executor.enabled` toggle is still useful as an escape hatch.
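For illustration, a minimal sketch of a JVM-wide holder with the pool size hard-coded to 5; class and field names here are hypothetical, not the PR's actual code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;

final class SharedScheduledExecutorHolder {
  // Hard-coded to the AWS SDK v2 default; a per-catalog property cannot apply
  // because every catalog in the JVM shares this single instance.
  private static final int POOL_SIZE = 5;

  private static volatile ScheduledExecutorService instance;

  private SharedScheduledExecutorHolder() {}

  static ScheduledExecutorService get() {
    if (instance == null) {
      synchronized (SharedScheduledExecutorHolder.class) {
        if (instance == null) {
          ThreadFactory daemonFactory =
              runnable -> {
                Thread thread = new Thread(runnable, "iceberg-s3-scheduled-executor");
                thread.setDaemon(true); // don't block JVM shutdown
                return thread;
              };
          instance = Executors.newScheduledThreadPool(POOL_SIZE, daemonFactory);
        }
      }
    }
    return instance;
  }
}
```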
xiaoxuandev force-pushed from f7f50ad to e6d956a
@anoopj Yes, I'll address the async client path in a follow-up PR to keep this one focused and reviewable.


Summary
When `CachingCatalog` evicts a table entry and `GlueCatalog` creates a new S3Client on reload, the old S3Client's `sdk-ScheduledExecutor` threads are never shut down (threads are GC roots and cannot be reclaimed). In long-running applications like Spark Thrift Server, this causes unbounded thread accumulation leading to a native OOM crash.

Root Cause
When `ScheduledThreadPoolExecutor` creates its core threads, they inherit the parent thread's `inheritableThreadLocals`. In Spark Thrift Server, the parent thread holds `SparkSession` → `SessionState` → `CatalogManager` → `SparkCatalog` → `FileIOTracker` → Caffeine cache → `S3FileIO`. Every `sdk-ScheduledExecutor` thread copies this chain, and since live threads are GC roots, the entire object graph is rooted: not just the threads, but also the `S3FileIO` instances, S3Clients, and everything they reference.

Fix
Share a single `ScheduledExecutorService` (pool size 5, matching the AWS SDK v2 default) across all S3Clients created by Iceberg's AWS client factories. The shared executor creates its threads only once at startup. S3FileIO instances are then held only via the Caffeine cache's normal entries: when evicted from the cache, they become unreachable and are collected normally.

This eliminates both the thread leak and the object leak by preventing the creation of new GC-rooting threads that inherit ThreadLocals.
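As a sketch of the injection point: the AWS SDK v2 `ClientOverrideConfiguration.Builder#scheduledExecutorService(...)` option can carry the shared executor into each client. The `NonClosingScheduledExecutorService` wrapper is from this PR (its constructor shape is assumed), and `SharedScheduledExecutorHolder` is the hypothetical holder sketched earlier, not the PR's actual code:

```java
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.services.s3.S3Client;

final class SharedExecutorS3Clients {
  static S3Client newClient() {
    return S3Client.builder()
        .overrideConfiguration(
            ClientOverrideConfiguration.builder()
                // Hand the SDK the shared executor, wrapped so that the SDK's
                // close path cannot shut it down when this client is closed.
                .scheduledExecutorService(
                    new NonClosingScheduledExecutorService(SharedScheduledExecutorHolder.get()))
                .build())
        .build();
  }
}
```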
Closes #15898
Changes
- New `applySharedScheduledExecutor` method on `S3FileIOProperties` that injects a shared executor into the S3 client builder
- Applied in all S3 client factories (`AwsClientFactories`, `AssumeRoleAwsClientFactory`, `LakeFormationAwsClientFactory`, `DefaultS3FileIOAwsClientFactory`)
- Shared executor wrapped in a `NonClosingScheduledExecutorService` to prevent the AWS SDK from shutting it down on `S3Client.close()` (the SDK's `AttributeMap.closeIfPossible()` calls `shutdown()` on any `ExecutorService`)
- New property `s3.shared-scheduled-executor.enabled` (default: `true`) as an opt-out switch

Testing
Known Limitations
The standard async client path (`S3AsyncClient.builder()`) is not covered. The default path uses CRT, which has its own thread model and is unaffected. The standard async builder path (when `s3.crt.enabled=false`) has the same leak; fixing it is deferred to a follow-up.