
feat(spark): add Spark 4.2 support #18621

Draft
yihua wants to merge 3 commits into apache:master from yihua:spark-4.2-support


@yihua (Contributor) commented Apr 27, 2026

Describe the issue this Pull Request addresses

This PR adds support for Hudi on Spark 4.2, using the latest preview release (4.2.0-preview4).

Summary and Changelog

Add Spark 4.2 support to Apache Hudi, introducing a new hudi-spark4.2.x adapter module that handles API changes between Spark 4.1 and 4.2.

Dependency version updates (aligned with Spark 4.2.0-preview4):

  • Scala: 2.13.17 -> 2.13.18
  • Hadoop: 3.4.2 -> 3.4.3
  • Parquet: 1.16.0 -> 1.17.0
  • Jackson: 2.20.0 -> 2.21.2
  • ORC: 2.2.1 -> 2.3.0
  • Kafka: 3.9.1 -> 3.9.2
  • Log4j: 2.20.0 -> 2.25.4
  • Avro: 1.12.1 (unchanged)
  • SLF4J: 2.0.17 (unchanged)

API changes handled:

  • InsertIntoStatement: 7 args -> 9 args (added replaceCriteriaOpt, withSchemaEvolution)
  • UnresolvedFunction: ignoreNulls parameter changed from Boolean to Option[Boolean]
  • CharType/VarcharType: added collation parameter (fixed in shared code using type-based pattern matching)
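The type-based pattern matching mentioned for CharType/VarcharType can be illustrated with a minimal sketch. The classes below are stand-ins defined only for this example, not Spark's own types, and the `collation` field with its `"UTF8_BINARY"` default is a placeholder assumption. The point is that a type pattern (`case _: CharType`) does not mention constructor arguments, so it keeps compiling when a parameter is added, whereas an extractor pattern such as `case CharType(n)` is tied to the constructor arity.

```scala
// Stand-in types for illustration only; these are NOT Spark's classes.
object CollationMatchSketch {
  sealed trait DataType
  // Hypothetical post-change shape: a second constructor parameter exists.
  final case class CharType(length: Int, collation: String = "UTF8_BINARY") extends DataType
  final case class VarcharType(length: Int, collation: String = "UTF8_BINARY") extends DataType
  case object StringType extends DataType

  // Matches on the runtime type only, so it does not depend on constructor arity.
  def isCharLike(dt: DataType): Boolean = dt match {
    case _: CharType | _: VarcharType => true
    case _ => false
  }

  // Reads the length through the field accessor instead of an extractor pattern.
  def declaredLength(dt: DataType): Option[Int] = dt match {
    case c: CharType => Some(c.length)
    case v: VarcharType => Some(v.length)
    case _ => None
  }
}
```

Code written this way can live in a shared module and compile against both the older and the newer class shape, which is presumably why the change was made in shared code rather than in the per-version adapter.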

Other:

  • Added HoodieSparkUtils.isSpark4_2 and gteqSpark4_2 version helpers
  • Added Spark4_2Adapter to SparkAdapterSupport
  • Added Spark 4.2 version dispatch in HoodieAnalysis for rules loaded via reflection
  • CI, Docker bundle validation, and release script support for Spark 4.2
  • Tests disabled on Spark 4.1+ (known issues) automatically carry forward to Spark 4.2
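As a rough illustration of what version helpers like isSpark4_2 and gteqSpark4_2 might do, the sketch below parses the major and minor components of a version string and compares them numerically. This is an assumption about the general approach, not the code in HoodieSparkUtils; the object name `SparkVersionSketch` and the explicit `version` parameter are hypothetical and exist only to keep the example self-contained.

```scala
// Hypothetical helper object; not the implementation in HoodieSparkUtils.
object SparkVersionSketch {
  // Extracts (major, minor) from strings such as "4.2.0-preview4".
  private val MajorMinor = """^(\d+)\.(\d+).*""".r

  def majorMinor(version: String): Option[(Int, Int)] = version match {
    case MajorMinor(major, minor) => Some((major.toInt, minor.toInt))
    case _ => None
  }

  // True only for the 4.2 line, including preview builds.
  def isSpark4_2(version: String): Boolean =
    majorMinor(version).contains((4, 2))

  // True for 4.2 and anything newer; compares numerically so 4.10 > 4.2.
  def gteqSpark4_2(version: String): Boolean =
    majorMinor(version).exists { case (major, minor) =>
      major > 4 || (major == 4 && minor >= 2)
    }
}
```

Comparing parsed integers rather than raw strings avoids the lexicographic trap where "4.10" sorts before "4.2".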

Impact

Adds Spark 4.2 as a supported engine version for Hudi.

Risk Level

Low

Documentation Update

Updated README.md and hudi-spark-datasource/README.md with Spark 4.2 build profile.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Approval
    • Attach the JIRA/Issue
  • Licensing
    • All new source files have Apache license header
    • Dependencies licenses are compatible with Apache license
  • Tests
    • Added or updated unit tests

github-actions bot added the size:XL (PR with lines of changes > 1000) label on Apr 28, 2026
yihua force-pushed the spark-4.2-support branch from 7c698e3 to b236759 on April 28, 2026 01:56
yihua added 2 commits April 27, 2026 21:34
Add Spark 4.2 support to Apache Hudi, introducing a new hudi-spark4.2.x
adapter module that handles API changes between Spark 4.1 and 4.2.

Dependency version updates (aligned with Spark 4.2.0-preview4):
- Scala: 2.13.17 -> 2.13.18
- Hadoop: 3.4.2 -> 3.4.3
- Parquet: 1.16.0 -> 1.17.0
- Jackson: 2.20.0 -> 2.21.2
- ORC: 2.2.1 -> 2.3.0
- Kafka: 3.9.1 -> 3.9.2
- Log4j: 2.20.0 -> 2.25.4
- lz4-java: org.lz4:1.8.0 -> at.yawk.lz4:1.10.4
- Avro: 1.12.1 (unchanged)
- SLF4J: 2.0.17 (unchanged)

API changes handled:
- InsertIntoStatement: 7 args -> 9 args (added replaceCriteriaOpt,
  withSchemaEvolution)
- UnresolvedFunction: ignoreNulls changed from Boolean to Option[Boolean]
- CharType/VarcharType: added collation parameter (fixed in shared code
  using type-based pattern matching)
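The ignoreNulls change from Boolean to Option[Boolean] is the kind of difference a per-version adapter has to bridge. The sketch below uses two stand-in case classes (hypothetical, not Spark's UnresolvedFunction) to show one way to lift the old flag into the new shape and to read it back. Treating `None` as `false` when reading is an assumption made for this example; the source does not state how Spark interprets an unset value.

```scala
// Stand-ins for the two parameter shapes; NOT Spark's UnresolvedFunction.
object IgnoreNullsSketch {
  final case class OldCall(name: String, ignoreNulls: Boolean)
  final case class NewCall(name: String, ignoreNulls: Option[Boolean])

  // Lifts the old Boolean flag into the Option-typed parameter.
  def toNew(old: OldCall): NewCall = NewCall(old.name, Some(old.ignoreNulls))

  // Assumption for this sketch: an unset flag (None) is read as false.
  def effectiveIgnoreNulls(call: NewCall): Boolean =
    call.ignoreNulls.getOrElse(false)
}
```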

Other:
- Added isSpark4_2 and gteqSpark4_2 version helpers
- Added Spark4_2Adapter to SparkAdapterSupport
- Added Spark 4.2 version dispatch in HoodieAnalysis for rules loaded
  via reflection
- Fixed lz4-java classpath conflict: Spark 4.2 relocated lz4-java from
  org.lz4 to at.yawk.lz4; made groupId/version configurable via
  properties to avoid duplicate classes on classpath
- CI, Docker bundle validation, and release script support for Spark 4.2
- TestMergeIntoTable2: error message changed from "Eagerly executed
  command failed" to "Executed command failed" in Spark 4.2
- TestMergeIntoTable: non-existent target table now throws
  SparkException wrapping AnalysisException instead of bare
  AnalysisException in Spark 4.2
yihua force-pushed the spark-4.2-support branch from efb40e8 to 999189b on April 28, 2026 04:34
In Spark 4.2, TABLE_OR_VIEW_NOT_FOUND is wrapped in a SparkException
whose cause message contains template variables instead of the expanded
error text. Check both the exception message and cause message for the
TABLE_OR_VIEW_NOT_FOUND error class.
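
The check described in this commit message can be sketched as follows. This is a minimal illustration using plain JVM exceptions, not the test code from the PR; the object and method names are hypothetical. It looks for the error class name in the exception's own message and in the message of its direct cause, tolerating null messages and a missing cause.

```scala
// Hypothetical helper; plain Throwables stand in for SparkException and AnalysisException.
object TableNotFoundCheckSketch {
  private val ErrorClass = "TABLE_OR_VIEW_NOT_FOUND"

  // True if the error class appears in the message of `t` or of its direct cause.
  def mentionsTableNotFound(t: Throwable): Boolean = {
    val ownMessage = Option(t.getMessage)
    val causeMessage = Option(t.getCause).flatMap(c => Option(c.getMessage))
    Seq(ownMessage, causeMessage).flatten.exists(_.contains(ErrorClass))
  }
}
```

Matching on the error class name rather than on the expanded error text keeps the assertion valid even when the cause message carries unexpanded template variables.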
@hudi-bot (Collaborator)

CI report:

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 45.60261% with 501 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.67%. Comparing base (642e1d3) to head (74990a7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...parquet/Spark42LegacyHoodieParquetFileFormat.scala | 0.00% | 248 Missing ⚠️ |
| ...park/sql/hudi/analysis/HoodieSpark42Analysis.scala | 46.05% | 25 Missing and 16 partials ⚠️ |
| ...ark/sql/HoodieSpark42CatalystExpressionUtils.scala | 30.00% | 3 Missing and 32 partials ⚠️ |
| ...k/sql/parser/HoodieSpark4_2ExtendedSqlParser.scala | 61.33% | 22 Missing and 7 partials ⚠️ |
| ...org/apache/spark/sql/adapter/Spark4_2Adapter.scala | 59.32% | 15 Missing and 9 partials ⚠️ |
| ...org/apache/hudi/Spark42HoodiePartitionValues.scala | 12.50% | 21 Missing ⚠️ |
| ...che/spark/sql/HoodieSpark42CatalystPlanUtils.scala | 71.18% | 13 Missing and 4 partials ⚠️ |
| ...ution/datasources/Spark42NestedSchemaPruning.scala | 6.25% | 11 Missing and 4 partials ⚠️ |
| ...n/datasources/parquet/Spark42DataSourceUtils.scala | 0.00% | 15 Missing ⚠️ |
| ...sql/hudi/Spark42ResolveHudiAlterTableCommand.scala | 45.45% | 2 Missing and 10 partials ⚠️ |

... and 9 more

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18621      +/-   ##
============================================
- Coverage     68.06%   64.67%   -3.39%     
+ Complexity    28919    23698    -5221     
============================================
  Files          2518     2068     -450     
  Lines        140570   119496   -21074     
  Branches      17416    15753    -1663     
============================================
- Hits          95680    77290   -18390     
+ Misses        37033    35049    -1984     
+ Partials       7857     7157     -700     

| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | ? |
| hadoop-mr-java-client | 44.85% <ø> (+0.01%) ⬆️ |
| spark-client-hadoop-common | 48.43% <20.00%> (+<0.01%) ⬆️ |
| spark-java-tests | 48.04% <41.15%> (-0.61%) ⬇️ |
| spark-scala-tests | 44.32% <41.69%> (-0.39%) ⬇️ |
| utilities | 37.70% <5.26%> (+0.02%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| .../main/scala/org/apache/hudi/HoodieSparkUtils.scala | 64.46% <100.00%> (-10.41%) ⬇️ |
| ...di/Spark42HoodiePartitionCDCFileGroupMapping.scala | 100.00% <100.00%> (ø) |
| .../hudi/Spark42HoodiePartitionFileSliceMapping.scala | 100.00% <100.00%> (ø) |
| ...park/sql/avro/HoodieSpark4_2AvroDeserializer.scala | 100.00% <100.00%> (ø) |
| .../spark/sql/avro/HoodieSpark4_2AvroSerializer.scala | 100.00% <100.00%> (ø) |
| ...l/parser/HoodieSpark4_2ExtendedSqlAstBuilder.scala | 18.81% <ø> (ø) |
| ...main/scala/org/apache/hudi/SparkFilterHelper.scala | 65.33% <0.00%> (ø) |
| ...in/scala/org/apache/hudi/SparkAdapterSupport.scala | 64.70% <33.33%> (-1.97%) ⬇️ |
| ...rg/apache/spark/sql/HoodieSpark42SchemaUtils.scala | 50.00% <50.00%> (ø) |
| ...ala/org/apache/hudi/Spark42HoodieFileScanRDD.scala | 0.00% <0.00%> (ø) |

... and 15 more

... and 828 files with indirect coverage changes
