Skip to content

feat(common): roll over commit metadata to clean#18590

Open
kbuci wants to merge 5 commits intoapache:masterfrom
kbuci:clean-rolling-metadata-only
Open

feat(common): roll over commit metadata to clean#18590
kbuci wants to merge 5 commits intoapache:masterfrom
kbuci:clean-rolling-metadata-only

Conversation

@kbuci
Copy link
Copy Markdown
Contributor

@kbuci kbuci commented Apr 25, 2026

Describe the issue this Pull Request addresses

When rolling metadata is configured (hoodie.write.rolling.metadata.keys), important metadata like schema and checkpoint keys are carried forward across commits. However, clean instants do not participate in this rolling mechanism — they neither receive rolled-over metadata nor serve as a source for subsequent lookups. After archival removes old ingestion commits, if only clean instants remain on the active timeline between surviving commits, the chain of rolled-over metadata can break.

This PR ensures that clean commits also carry rolled-over metadata in their extraMetadata field, preserving the rolling metadata chain across archival.

Summary and Changelog

  • Added a new mergeRollingMetadata(HoodieTable, HoodieWriteConfig, HoodieCleanMetadata) overload in BaseHoodieClient for clean commits, alongside the existing mergeRollingMetadata(HoodieTable, HoodieCommitMetadata) for regular commits.
  • Extracted a shared walkTimelineForRollingKeys private helper that walks backwards through all completed instants (commits, delta-commits, replace-commits, clustering, and clean) in a single unified pass, reading HoodieCommitMetadata or HoodieCleanMetadata as appropriate based on the instant's action type. Both overloads delegate to this helper.
  • Empty-string metadata values are now treated as "missing" during the walkback, ensuring operations like clustering or delete_partition that write empty schema strings don't block propagation.
  • CleanActionExecutor.runClean() now calls BaseHoodieClient.mergeRollingMetadata(table, config, metadata) inside the state-change lock (before writeTableMetadata), matching the same locking contract used for commit metadata.
  • Added testRollingMetadataPreservedInCleanCommits functional test that inserts records, upserts several times, runs clean, and verifies the completed clean instant's extraMetadata contains the rolled-over schema key.

Impact

No public API changes. The only user-visible change is that clean instants will now contain rolled-over metadata keys in their extraMetadata when hoodie.write.rolling.metadata.keys is configured. This is additive — clean instants that previously had no rolling metadata will now carry it. No behavior change when rolling metadata is not configured.

Risk Level

Low. The rolling metadata merge for clean is gated behind config.getRollingMetadataKeys().isEmpty(), so there is no impact when rolling metadata is not configured (the default). The unified timeline walk is a refactor of existing logic with the addition of clean instant handling.

Documentation Update

None. The hoodie.write.rolling.metadata.keys config already exists; this PR extends its coverage to clean instants without introducing new configs.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Apr 25, 2026
* <p>Keys already present with a non-empty value in {@code existingExtra} are skipped (empty
* strings are treated as "missing").
*/
private static Map<String, String> walkTimelineForRollingKeys(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parseTimelineForRollingMetadata

String action = instant.getAction();
Map<String, String> extraMeta = null;

if (HoodieTimeline.COMMIT_ACTION.equals(action)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add something like

Set<String> VALID_ACTIONS_FOR_ROLLING_METADATA = CollectionUtils.createSet(COMMIT_ACTION, DELTA_COMMIT_ACTION,
      CLEAN_ACTION,
      COMPACTION_ACTION, LOG_COMPACTION_ACTION, REPLACE_COMMIT_ACTION, CLUSTERING_ACTION);

to HoodieTimeline class and use it here for filtering.

log.debug("Walking back up to {} commits to find rolling metadata for keys: {}",
lookbackLimit, remainingKeys);
log.debug("Walking back up to {} instants to find rolling metadata for keys: {}", lookbackLimit, remaining);
int walked = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commitsWalkedBack

log.debug("All rolling metadata keys are present in current commit. No walkback needed.");
return;
}
Map<String, String> found = new HashMap<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

foundRollingMetadata

);
}
this.txnManager.beginStateChange(Option.of(inflightInstant), Option.empty());
// Merge rolling metadata inside the state-change lock so we read the latest timeline,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move L219 (where we reload the timeline) to here i.e. w/n the lock

"Clean's extraMetadata should contain rolled-over schema key");
assertFalse(cleanExtraMetadata.get(schemaKey).isEmpty(),
"Rolled-over schema in clean should be non-empty");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after this, can we add just 1 additional commit, but set the lookback window to just 1.
and validate the metadata gets rolled over.

@kbuci kbuci force-pushed the clean-rolling-metadata-only branch from c7730c7 to 138453d Compare April 28, 2026 00:05
@kbuci kbuci marked this pull request as ready for review April 28, 2026 00:06
Copy link
Copy Markdown

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The change extends the rolling-metadata mechanism to clean instants so the chain isn't broken when only cleans remain on the active timeline between commits after archival. One behavior change in the timeline walkback logic worth double-checking in the inline comment. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. Two naming nits in the new parseTimelineForRollingMetadata helper — the method verb and a counter variable that still says "commits" after the scope was widened to include clean instants.

return;
for (String key : rollingKeys) {
if (existingExtra.containsKey(key) && !StringUtils.isNullOrEmpty(existingExtra.get(key))) {
remaining.remove(key);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The previous code limited the lookback over getCommitsTimeline().filterCompletedInstants() (already pre-filtered to commit-type actions), but this now limits over the full filterCompletedInstants() and then skips invalid actions inside the loop. That means rollback / savepoint / restore / indexing instants in the recent window will consume the lookbackLimit budget without contributing — and on a busy table with frequent rollbacks/savepoints they could push valid commits out of the window entirely. Was that intended? It might be safer to filter to VALID_ACTIONS_FOR_ROLLING_METADATA first, then apply .limit(lookbackLimit).

- AI-generated; verify before applying. React 👍/👎 to flag quality.

}
}
if (remaining.isEmpty()) {
log.debug("All rolling metadata keys already present. No walkback needed.");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: parse typically implies deserializing raw bytes into a structure — could you rename this to something like collectRollingMetadataFromTimeline or gatherRollingMetadata? The Javadoc itself says "walks backwards... extracting", which suggests a traversal/collection verb fits better.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

continue;
}

if (HoodieTimeline.CLEAN_ACTION.equals(action)) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: this counter now tracks both commit and clean instants, but commitsWalkedBack still says "commits" — the surrounding log messages already use "instants". Something like instantsRead or instantsWalked would stay consistent.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Copy link
Copy Markdown

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! The PR extends rolling-metadata propagation to clean instants and moves the timeline reload into the state-change lock so the clean walkback sees the same view as the commit path. The locking contract and the extraMeta == null handling both look right. One transition/edge-case question worth double-checking is in the inline comment around how the lookback budget is consumed on tables being upgraded. Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here. One placement nit on the new static mergeRollingMetadata overload — otherwise the refactor reads cleanly.


log.debug("Walking back up to {} commits to find rolling metadata for keys: {}",
lookbackLimit, remainingKeys);
log.debug("Walking back up to {} instants to find rolling metadata for keys: {}", lookbackLimit, remaining);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Now that CLEAN_ACTION is in VALID_ACTIONS_FOR_ROLLING_METADATA, clean instants count against lookbackLimit even when they don't yet carry the key. On a table upgraded to this version, pre-existing clean instants written before this change have no rolling metadata in extraMetadata, so a small lookback (e.g. 1–2) can be entirely consumed by those legacy cleans and miss the key on a surviving older commit. With the prior getCommitsTimeline() filter, cleans were skipped and the budget was always spent on commit-type instants. Could you confirm this transition behavior is acceptable, or consider either (a) bumping the budget when the visited instant has no extraMeta, or (b) documenting that users may need to re-tune lookbackCommits after upgrade?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

// Reload inside the lock so mergeRollingMetadata reads the latest timeline,
// matching the same contract as mergeRollingMetadata for commit metadata.
table.getMetaClient().reloadActiveTimeline();
BaseHoodieClient.mergeRollingMetadata(table, config, metadata);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the Javadoc on the new static overload notes it's public static only because CleanActionExecutor doesn't extend BaseHoodieClient — that's a hint the rolling-metadata logic wants to live in a dedicated helper (e.g. RollingMetadataUtils) rather than as a static method hanging off the client base class. Both call sites would then read more naturally and the instance/static split on BaseHoodieClient goes away.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.58209% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.06%. Comparing base (642e1d3) to head (61d52a7).

Files with missing lines Patch % Lines
.../java/org/apache/hudi/client/BaseHoodieClient.java 82.53% 6 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18590      +/-   ##
============================================
- Coverage     68.06%   68.06%   -0.01%     
- Complexity    28919    28926       +7     
============================================
  Files          2518     2518              
  Lines        140570   140604      +34     
  Branches      17416    17422       +6     
============================================
+ Hits          95680    95702      +22     
- Misses        37033    37045      +12     
  Partials       7857     7857              
Flag Coverage Δ
common-and-other-modules 44.36% <8.95%> (+<0.01%) ⬆️
hadoop-mr-java-client 44.81% <8.95%> (-0.03%) ⬇️
spark-client-hadoop-common 48.43% <70.14%> (+<0.01%) ⬆️
spark-java-tests 48.64% <79.10%> (-0.01%) ⬇️
spark-scala-tests 44.69% <8.95%> (-0.01%) ⬇️
utilities 37.70% <8.95%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...java/org/apache/hudi/config/HoodieWriteConfig.java 89.97% <100.00%> (ø)
...e/hudi/table/action/clean/CleanActionExecutor.java 87.82% <100.00%> (+0.07%) ⬆️
...che/hudi/common/table/timeline/HoodieTimeline.java 100.00% <100.00%> (ø)
.../java/org/apache/hudi/client/BaseHoodieClient.java 90.71% <82.53%> (-3.84%) ⬇️

... and 12 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for updating the config docs — the wording now correctly reflects that rolling metadata propagates through both commits and clean instants, and that the lookback budget covers both. The new diff only touches Javadoc on two ConfigProperty definitions, so the prior code-level comments (the readability suggestion to extract a RollingMetadataUtils helper, and the transition-behavior question on upgraded tables where pre-existing legacy clean instants consume lookback budget without contributing) are not addressed by this update and remain open for the author to consider. No new issues from this pass — please take a look at the still-open comments above, and this should be ready for a Hudi committer or PMC member to take it from here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants