WIP: V4 Adaptive Metadata Tree Prototype #16150
Draft
anoopj wants to merge 19 commits into apache:main from
Conversation
…leteFile APIs

This adapter minimizes the v4-related code changes during scan planning and commits.
…et by Default

Extends the V4 manifest writer to write manifests in either Parquet or Avro based on the file extension. A default is also added so that the SDK writes Parquet manifests when the format version is 4. This could be parameterized later, but that would require parameterizing the test suites, so I settled on a single format (Parquet) for now.

There are a few other required changes here outside of testing:
1. Handling of splitOffsets in Parquet needs to change, since BaseFile returns an immutable view that Parquet was attempting to reuse by clearing.
2. Unpartitioned tables need special care, since Parquet cannot store empty structs in the schema. Reading from Parquet manifests therefore means skipping the Parquet partition field and shifting read offsets when the partition is not defined. The read code is shared across all versions at this time, so this change affects the older Avro readers as well.
3. Some of the test code for TestReplacePartitions assumed that you could validate against a slightly different version of the table. This is a problem if the table you make is partitioned and the validation table is unpartitioned. It used to work accidentally, I think, because we would make unpartitioned operations committed to a partitioned table.
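The extension-based format selection described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual Iceberg writer code; the enum and helper names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper: pick a manifest format from the file extension,
// defaulting to Avro for legacy paths. Names are illustrative only.
public class ManifestFormatSketch {
  enum ManifestFormat { AVRO, PARQUET }

  static ManifestFormat formatFor(String manifestPath) {
    String lower = manifestPath.toLowerCase(Locale.ROOT);
    if (lower.endsWith(".parquet")) {
      return ManifestFormat.PARQUET;
    }
    // .avro and any unrecognized extension fall back to the legacy format
    return ManifestFormat.AVRO;
  }
}
```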
- ManifestReader: mark the partition field optional for unpartitioned tables instead of removing it from the projection, preserving positional access and avoiding a ClassCastException from shifted ordinals
- BaseFile: deep copy ByteBuffer values in copyByteBufferMap to prevent Parquet container reuse from corrupting bounds in copied files, which caused equality deletes to fail stats-based overlap checks
- BaseFile: guard against a null partition value in internalSet
- TestRewriteTablePathsAction: simplify the manifest file predicate to use name patterns instead of file extensions
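The ByteBuffer deep-copy fix above can be sketched as below. This is a simplified stand-in for illustration, not the BaseFile implementation; the class name and method shape are assumptions.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: copy each ByteBuffer's bytes into a fresh buffer so
// that later mutation of the source buffers (e.g. by container reuse in a
// reader) cannot corrupt the copied values.
public class ByteBufferMapCopy {
  static Map<Integer, ByteBuffer> deepCopyByteBufferMap(Map<Integer, ByteBuffer> source) {
    if (source == null) {
      return null;
    }
    Map<Integer, ByteBuffer> copy = new LinkedHashMap<>(source.size());
    for (Map.Entry<Integer, ByteBuffer> entry : source.entrySet()) {
      ByteBuffer value = entry.getValue();
      ByteBuffer copied = ByteBuffer.allocate(value.remaining());
      copied.put(value.duplicate()); // duplicate() leaves the source position untouched
      copied.flip();
      copy.put(entry.getKey(), copied);
    }
    return copy;
  }
}
```

A shallow copy would share backing arrays, so overwriting the source buffer in place would silently change the "copied" bounds, which is the failure mode described above.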
- Collapse the broken builder chain in ManifestReader.open() into a single fluent expression
- Extract manifest format determination in SnapshotProducer into a private field computed once in the constructor
- Replace the magic format version 4 with TableMetadata.MIN_FORMAT_VERSION_PARQUET_MANIFESTS in tests
- Parameterize TestManifestFileUtil across all format versions
- Fix TestJdbcCatalog.manifestFiles to use an exclusion filter instead of allowlisting file extensions
- Improve the ParquetValueReaders container-reuse comments to reference specific BaseFile fields
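The "single fluent expression" cleanup in the first bullet follows the usual builder-chain pattern, sketched here with a stand-in builder (not ManifestReader's real API):

```java
// Illustrative stand-in builder showing the before/after shape of the
// fluent-chain refactor. None of these names are the actual Iceberg API.
public class FluentChain {
  static class ReaderBuilder {
    private String path;
    private boolean caseSensitive = true;

    ReaderBuilder path(String p) {
      this.path = p;
      return this;
    }

    ReaderBuilder caseSensitive(boolean b) {
      this.caseSensitive = b;
      return this;
    }

    String build() {
      return path + ":" + caseSensitive;
    }
  }

  static String open(String path) {
    // Before the refactor the builder was assigned to a local and mutated
    // across several statements; after, it is one fluent expression.
    return new ReaderBuilder().path(path).caseSensitive(false).build();
  }
}
```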
Replace instanceof-then-cast with Java 16+ pattern matching to eliminate redundant casts in outputFile() and keyMetadataBuffer().
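For reference, the pattern-matching form mentioned above binds the cast result in the `instanceof` check itself (a Java 16+ language feature, JEP 394); the example below is generic, not the actual outputFile() or keyMetadataBuffer() code:

```java
// Java 16+ pattern matching for instanceof: the type test and the cast
// are combined into one binding, removing the redundant explicit cast.
public class PatternMatchDemo {
  static int lengthOf(Object value) {
    // Before: if (value instanceof String) { String s = (String) value; ... }
    if (value instanceof String s) {
      return s.length();
    }
    return -1;
  }
}
```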
…test names

- ParquetValueReaders: only skip recycling reuse as a scratch buffer for Guava ImmutableList / ImmutableMap
- BaseFile: factor the ByteBuffer map deep copy into deepCopyByteBufferMap
- V4Metadata: build file schema fields with ImmutableList.builderWithExpectedSize
- TestSnapshotProducer: rename the Avro manifest compression tests for clarity
…ch reuse

Reuse ArrayList/LinkedHashMap-style buffers only via instanceof; this avoids Class.forName and checks against non-API JDK types while keeping clear() safe.
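The reuse rule described above can be sketched as follows. This is an illustrative simplification, not the ParquetValueReaders code; the method name is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: reuse a passed-in container as a scratch buffer only
// when it is a known-mutable type. Calling clear() on an immutable list
// (e.g. Guava ImmutableList or List.of(...)) would throw
// UnsupportedOperationException, so unknown types get a fresh ArrayList.
public class ContainerReuse {
  static <T> List<T> reusableList(Object reuse) {
    if (reuse instanceof ArrayList) {
      @SuppressWarnings("unchecked")
      ArrayList<T> list = (ArrayList<T>) reuse;
      list.clear(); // safe: ArrayList supports mutation
      return list;
    }
    return new ArrayList<>(); // immutable or unrecognized type: never mutate it
  }
}
```

Using an `instanceof` check against a public JDK type keeps the fast path (recycling the buffer) without reflection or references to non-API implementation classes.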
For v4 tables, SnapshotProducer now writes a Parquet root manifest containing TrackedFile entries with content_type=DATA_MANIFEST instead of an Avro manifest list. BaseSnapshot detects Parquet format and reads root manifests via V4ManifestReader, converting entries back to ManifestFile objects for compatibility with the existing pipeline.
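The conversion step described above (root-manifest entries back into ManifestFile objects for the existing pipeline) looks roughly like the sketch below. All types here are simplified stand-ins, not Iceberg's real TrackedFile/ManifestFile classes:

```java
import java.util.List;
import java.util.stream.Collectors;

// Self-contained sketch of adapting v4 root-manifest entries for the legacy
// manifest-list pipeline. TrackedFile, ManifestFile, and ContentType are
// simplified stand-ins for illustration only.
public class RootManifestDispatch {
  enum ContentType { DATA_MANIFEST, DELETE_MANIFEST }

  record TrackedFile(String path, ContentType contentType) {}

  record ManifestFile(String path) {}

  static List<ManifestFile> toManifestFiles(List<TrackedFile> entries) {
    // Keep only data-manifest entries and adapt each one so the rest of the
    // pipeline can keep consuming ManifestFile objects unchanged.
    return entries.stream()
        .filter(e -> e.contentType() == ContentType.DATA_MANIFEST)
        .map(e -> new ManifestFile(e.path()))
        .collect(Collectors.toList());
  }
}
```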
WIP PR for s.apache.org/iceberg-single-file-commit
Works end to end from Spark, including scan planning and fast appends.
Not implemented in this PR: