Spark: Add session-level split size override #16154
Open
gerashegalov wants to merge 7 commits into apache:main
Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
… configurations
- Removed the session configuration for split size from SparkReadConf and SparkSQLProperties.
- Updated SparkReadConf documentation to clarify the precedence of table-scoped session configurations over global settings.
- Added tests to verify that table-scoped session configurations take precedence over global configurations and that options take precedence over table-scoped configurations.

Signed-off-by: Gera Shegalov <gshegalov@nvidia.com>
…ation for scan planning
Closes #16153
What changes were made in this PR?
Add a new Spark session configuration key `spark.sql.iceberg.split-size` that allows overriding the `read.split.target-size` table property at the session level without requiring DDL changes to table metadata or source code changes to read call sites.
This is particularly useful when GPU and CPU workloads read the same Iceberg table
concurrently: GPU sessions benefit from significantly larger splits (e.g. 2GB) while CPU
sessions perform better with the default 128MB. Hardware accelerators like
RAPIDS Accelerator for Apache Spark are designed as
drop-in replacements requiring no application code changes, so a session-level knob is essential.
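The lookup order listed under "Resolution precedence" in this description can be sketched in plain Java, which may help reviewers reason about which value a given session ends up with. This is only an illustration under stated assumptions: `SplitSizeResolver` and `resolve` are hypothetical names, the maps stand in for read options, the Spark session conf, and table properties, and the final fallback to 128 MB is assumed from the default mentioned above. It is not the actual `SparkConfParser` code.

```java
import java.util.Map;

// Hypothetical stand-in for the lookup order described in this PR.
// It does not use or mirror the real SparkConfParser implementation.
final class SplitSizeResolver {
  static final String OPTION_KEY = "split-size";
  static final String SESSION_KEY = "spark.sql.iceberg.split-size";
  static final String TABLE_PROPERTY = "read.split.target-size";
  // Assumed fallback: the 128 MB default mentioned in the description.
  static final long DEFAULT_SPLIT_SIZE = 128L * 1024 * 1024;

  private SplitSizeResolver() {}

  static long resolve(
      Map<String, String> readOptions,
      Map<String, String> sessionConf,
      Map<String, String> tableProperties,
      String tableName) {
    // 1. Per-read option wins over everything else.
    String value = readOptions.get(OPTION_KEY);
    // 2. Table-qualified session key: <key>.<tableName>.
    if (value == null) {
      value = sessionConf.get(SESSION_KEY + "." + tableName);
    }
    // 3. Global session key.
    if (value == null) {
      value = sessionConf.get(SESSION_KEY);
    }
    // 4. Table property stored in table metadata.
    if (value == null) {
      value = tableProperties.get(TABLE_PROPERTY);
    }
    // 5. Assumed built-in default.
    return value != null ? Long.parseLong(value) : DEFAULT_SPLIT_SIZE;
  }

  public static void main(String[] args) {
    Map<String, String> noOptions = Map.of();
    Map<String, String> tableProps = Map.of(TABLE_PROPERTY, "268435456");

    // CPU session: nothing set at session level, so the table property applies.
    Map<String, String> cpuSession = Map.of();
    // GPU session: a 2 GB override set purely through session configuration.
    Map<String, String> gpuSession = Map.of(SESSION_KEY, "2147483648");

    System.out.println("CPU: " + resolve(noOptions, cpuSession, tableProps, "cat.db.events"));
    System.out.println("GPU: " + resolve(noOptions, gpuSession, tableProps, "cat.db.events"));
  }
}
```

Both sessions read the same table metadata here; only the session map differs, which is the GPU/CPU scenario motivating the change.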
Changes
All Spark shims (v3.4, v3.5, v4.0):
- `SparkSQLProperties`: add `SPLIT_SIZE = "spark.sql.iceberg.split-size"` constant
- `SparkReadConf`: add `.sessionConf(SparkSQLProperties.SPLIT_SIZE)` to both `splitSize()` and `splitSizeOption()` parser chains; update Javadoc to document 5-level precedence
- `SparkConfParser`: store `Table.name()` as `tableName` and in `ConfParser.parse()` try a table-qualified session key (`<key>.<tableName>`) before the global session key

v3.5 only:
- `TestSparkWriteConf`: add 4 tests for table-scoped session conf resolution

Resolution precedence
1. Read option (`split-size`)
2. Table-scoped session conf (`spark.sql.iceberg.split-size.<catalog>.<db>.<table>`)
3. Global session conf (`spark.sql.iceberg.split-size`)
4. Table property (`read.split.target-size`)

How was this patch tested?
4 new unit tests in `TestSparkWriteConf` (v3.5):