WIP: ReadRestrictions E2E POC with Iceberg Generics #16131

Draft
singhpk234 wants to merge 46 commits into apache:main from singhpk234:feature/load-table-return-policy-wip-generics

Conversation

@singhpk234
Contributor

No description provided.

propose a new structure
Flatten Projection into Action with an "action" discriminator property,
matching the BaseUpdate pattern. Each Action subtype now uses allOf to
extend Action with a const action value, making them distinguishable in JSON.
…lyExpression

- Replace 4 original actions (mask-hash-sha256, replace-with-null,
  mask-alphanumeric, apply-transform) with 10 predefined action types:
  mask-alphanum, mask-to-default, replace-with-null, show-first-4,
  show-last-4, truncate-to-year, truncate-to-month, sha-256-global,
  sha-256-query-local, apply-expression
- Add discriminator mapping for all action subtypes
- Add detailed descriptions for each action including type applicability,
  encoding rules for SHA-256 variants, and type-specific defaults
- Replace Term-based ApplyTransform with Expression-based ApplyExpression
- Update required-column-projections rules for action model
- Regenerate Python from updated YAML spec
- Mark ApplyExpression.expression as required
- Fix SHA-256 binary output type from binary(32) to binary
- Add NULL-in/NULL-out global rule for all actions
- Specify Unicode code point semantics for masking actions
- Add signed two's complement for SHA-256 int/long output
- Add UTC truncation rule for timestamptz types
- Clarify empty ReadRestrictions equivalence and server type validation
- Specify minimum 16-byte salt for sha-256-query-local
- Clarify unrecognized action types trigger MUST-fail
- Add "Applicable to: all data types" for ApplyExpression
- Regenerate Python from updated YAML spec
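The NULL-in/NULL-out rule and the Unicode code-point semantics above can be sketched in plain Java. This is a standalone illustration, not code from the PR; the replacement characters ('X' for letters, '9' for digits) are assumptions for the sketch, not values taken from the spec:

```java
public class MaskAlphanumSketch {

  // Hypothetical sketch: iterate Unicode code points (not UTF-16 chars) so a
  // surrogate pair masks as one character; 'X' and '9' are illustrative only.
  public static String maskAlphanum(String value) {
    if (value == null) {
      return null; // global rule: NULL input MUST produce NULL output
    }
    StringBuilder out = new StringBuilder();
    value.codePoints().forEach(cp -> {
      if (Character.isLetter(cp)) {
        out.appendCodePoint('X');
      } else if (Character.isDigit(cp)) {
        out.appendCodePoint('9');
      } else {
        out.appendCodePoint(cp); // non-alphanumeric code points pass through
      }
    });
    return out.toString();
  }
}
```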
SHA-256 digest truncated to 16 bytes does not produce valid RFC 4122
UUIDs (version and variant bits are not set), so uuid is removed from
the input encoding, output encoding, and applicable types for both
sha-256-global and sha-256-query-local actions.
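The incompatibility is easy to see with JDK classes alone (a standalone sketch, not PR code): packing the first 16 digest bytes into a UUID leaves the RFC 4122 version and variant bits unset.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

public class TruncatedDigestUuid {

  // Build a UUID directly from the first 16 bytes of a SHA-256 digest.
  // Nothing sets the version (4 bits) or variant (2 bits) required by
  // RFC 4122, so the result is almost never a valid UUID.
  public static UUID fromTruncatedDigest(byte[] input) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-256").digest(input);
      ByteBuffer bb = ByteBuffer.wrap(digest, 0, 16);
      return new UUID(bb.getLong(), bb.getLong());
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    UUID u = fromTruncatedDigest(new byte[] {1, 2, 3});
    // a valid RFC 4122 UUID needs version in 1..5 and variant == 2;
    // raw digest bits leave these effectively random
    System.out.println("version=" + u.version() + " variant=" + u.variant());
  }
}
```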
Collapse 10 Spark Catalyst expression classes into a single generic
IcebergRestricted wrapper + an Actions.bind(Action, Type) factory
that returns SerializableFunctions. Engine integrations (Spark,
Flink, Trino) can now reuse the spec-defined masking semantics
instead of reimplementing them per-engine.

Follows the existing Transforms/Expressions bind pattern. Also fixes
row-filter evaluation order so filters run on original values before
masks are applied, matching the spec.
The spec mandated 9999-12-31T00:00:00 as the mask-to-default sentinel
for timestamp_ns and timestamptz_ns, but that value overflows a 64-bit
signed integer as nanoseconds from epoch (max representable is
~2262-04-11). Engines MUST use exactly the listed default, so specifying
an unrepresentable value broke cross-engine consistency.

Use 2261-12-31T00:00:00 as the closest clean far-future sentinel that
fits. Add a note explaining the overflow so implementers don't mistake
it for a typo.
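The overflow can be verified with JDK time arithmetic (a standalone sketch, not code from the PR):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class NanosSentinelOverflow {

  // True when Dec 31 of the given year (midnight UTC) cannot be expressed
  // as signed 64-bit nanoseconds from the Unix epoch.
  public static boolean overflowsNanos(int year) {
    long seconds = LocalDateTime.of(year, 12, 31, 0, 0).toEpochSecond(ZoneOffset.UTC);
    try {
      Math.multiplyExact(seconds, 1_000_000_000L);
      return false;
    } catch (ArithmeticException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // the largest instant representable as int64 nanos from epoch
    System.out.println(Instant.EPOCH.plusNanos(Long.MAX_VALUE)); // 2262-04-11T23:47:16.854775807Z
    System.out.println("9999-12-31 overflows: " + overflowsNanos(9999));
    System.out.println("2261-12-31 fits: " + !overflowsNanos(2261));
  }
}
```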
Introduces the ReadRestrictions data model (Action, ReadRestrictions,
ReadRestrictionsAware), JSON parsers (ActionParser, ReadRestrictionsParser),
and the LoadTable response plumbing that threads server-provided restrictions
through BaseTable and RESTTable into the Spark SparkTable. Pairs with the
Actions factory and Spark IcebergRestricted expression already on this branch
to deliver end-to-end enforcement.
…lter

Drop CodegenFallback from both expressions and implement doGenCode:

- IcebergRestricted: emit inline Java for child eval, Spark-to-Iceberg
  type conversion, mask.apply via addReferenceObj, and Iceberg-to-Spark
  type conversion. Primitive cases collapse into one branch via
  CodeGenerator.boxedType + primitiveTypeName. ByteBuffer and Decimal
  conversions route through a companion IcebergRestrictedCodegenSupport
  object to sidestep Janino's limited covariant-return handling.

- IcebergRowFilterExpr: delegate doGenCode to the child's genCode so
  the filter participates in the whole-stage loop while prettyName,
  sql, simpleString, and toString keep EXPLAIN output opaque.

Add seven codegen tests exercising the compiled path (asserting the
UnsafeProjection is not the Interpreted fallback) across String,
Integer, Long, Binary, Decimal, null input, and RowFilter.
The rule currently rewrites only top-level column projections; an action
with a fieldId pointing to a nested field (or to a fieldId absent from
the current schema) would otherwise be silently dropped, letting the
unmasked column through. Validate up front and throw IllegalStateException.
The spec permits actions on any fieldId; this is a temporary rule-level
limitation, not a spec restriction, and the comment notes to lift it
when nested struct-path rewriting is implemented.
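The fail-closed check described above can be sketched as follows (a hypothetical illustration with schema lookup simplified to a map; names are not the PR's):

```java
import java.util.Map;
import java.util.Set;

public class FailClosedValidation {

  // Hypothetical sketch: any action whose fieldId is not a top-level field
  // of the current schema fails the query up front, instead of being
  // silently dropped and letting the unmasked column through.
  public static void validateTopLevel(Set<Integer> actionFieldIds, Map<Integer, String> topLevelFieldsById) {
    for (int fieldId : actionFieldIds) {
      if (!topLevelFieldsById.containsKey(fieldId)) {
        throw new IllegalStateException(
            "Cannot apply read restriction: field id " + fieldId
                + " is nested or absent from the current schema");
      }
    }
  }
}
```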
First step in migrating read-restriction masking from custom
IcebergRestricted expressions (hand-rolled doGenCode) toward Spark's
native ScalarFunction + ApplyFunctionExpression path, which gets
whole-stage codegen for free. Mirrors the precedent in
ReplaceStaticInvoke that constructs ApplyFunctionExpression
programmatically inside a Catalyst rule.

Adds MaskAlphanumFunction following the BucketFunction / TruncateFunction
shape: UnboundFunction.bind(StructType) returns a BoundMaskAlphanum
with a static invoke(UTF8String) magic method for codegen, delegating
to Actions.bind(MaskAlphanum, STRING) for the canonical masking
semantics so eval and codegen paths share one implementation and
Flink/Trino can reuse the core factory unchanged.

Registers the function as system.iceberg_mask_alphanum in SparkFunctions.

ApplyReadRestrictions now routes mask-alphanum actions through
ApplyFunctionExpression; all other 9 action types fall through to the
existing IcebergRestricted path unchanged so this ships as a non-
breaking incremental step. The remaining actions will migrate in
follow-up commits (Sha256/Mask-to-default/Truncate each need per-type
bound classes).
Capture the three reasons — native whole-stage codegen via the static
invoke magic method, forward-compatibility with a future Spark table
API that could publish masking expressions as V2 function calls (same
shape used for CHECK constraints today), and cross-engine reuse of
Actions.bind in core — on the MaskAlphanumFunction class javadoc so
future contributors have the rationale in one place as more masks
migrate off IcebergRestricted.
…pat parser

Three of five reviewer-flagged items from the Anton/Ryan synthesis:

1. Salt out of the Catalyst resolution inner loop
   ApplyReadRestrictions previously generated a new SecureRandom salt
   inside the per-attribute match inside resolveOperators, so fixed-point
   re-entry could regenerate the salt mid-planning and break determinism
   within a query. Generate the salt once per apply() call and thread it
   through buildMaskExpression so all Sha256QueryLocal actions in the
   rewrite share a stable salt.

2. Rename ReadRestrictionsAware -> SupportsReadRestrictions
   Match the capability-marker naming used elsewhere in Iceberg
   (SupportsDistributedScanPlanning, SupportsReplaceView, etc.). Centralize
   the sniff in TableUtil.readRestrictions(Table) following the
   TableUtil.formatVersion precedent (apache#11620), so engines have one decision
   locus and future Flink/Trino integrations don't re-invent the check.

3. Forward-compat for unknown action discriminators
   ActionParser used to throw IllegalArgumentException on an unrecognized
   action type, blocking older clients from interoperating with newer
   servers. Introduce Action.Unknown that preserves (fieldId, rawTypeString)
   so parsing is lossless; enforcement (Actions.bind, rule binding) fails
   closed when it encounters Unknown so silent bypass of unmasked data is
   impossible. Mirrors ImmutableUnknownViewRepresentation in the view spec.
   The existing 'unknownActionTypeRejected' test is renamed and updated to
   assert preservation instead of rejection.

Deferred to follow-up PRs:
  - Bind row filter at parse time (needs schema plumbing in LoadTableResponseParser)
  - TestApplyReadRestrictions + TestReadRestrictedQuery (needs SparkSession fixture)
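The salt fix in item 1 boils down to hoisting the random generation out of the per-attribute loop. A minimal sketch (standalone, not the PR's code):

```java
import java.security.SecureRandom;

public class QuerySalt {

  private static final SecureRandom RANDOM = new SecureRandom();

  // Hypothetical sketch: generate the salt once per apply() pass and thread
  // the same array to every sha-256-query-local action, so fixed-point
  // re-entry of the rule cannot change the salt mid-planning.
  public static byte[] newQuerySalt() {
    byte[] salt = new byte[16]; // spec minimum: 16 bytes
    RANDOM.nextBytes(salt);
    return salt;
  }
}
```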
Spec (ReadRestrictions description in open-api/rest-catalog-open-api.yaml:3500):

    For all actions, if the input column value is NULL, the output MUST be NULL.

MaskToDefaultFunction was returning the type-specific default for null input
(e.g., 999999999 for int, the empty byte buffer for binary) which violates
that clause. Guard the null case to return null first, matching what every
other action implementation does.

Also update the stale TestActions expectation encoded as
maskToDefaultNullReturnsDefault, renaming to maskToDefaultNullReturnsNull
and referencing the spec clause in a comment.

Fix the class javadoc on ApplyReadRestrictions that described the rewrite
shape inside-out: it said Filter-above-Project, but the actual code is
Project-above-Filter, which is what the spec requires (row filters see
original values; projections apply after).

No behavior change for any non-null input. No wire-format change.
…me salt vars

Three cleanups from a /simplify sweep:

1. Extract NullSafeFunction base class in Actions.
   11 of the 13 inner function classes started with an identical
   'if (value == null) return null;' preamble enforcing the spec clause
   that requires null pass-through. Lift this into an abstract base
   that declares applyNonNull(Object); each function implements the
   non-null path, the base adds the guard once. ReplaceWithNullFunction
   (always returns null) and ApplyExpressionFunction (always throws)
   stay as direct SerializableFunction impls since their semantics
   predate and skip the guard.

2. Lock in the cross-file invariant with a bind test.
   Actions.bind(Action.Unknown, type) must throw — this is what keeps
   forward-compat parser preservation safe. Previously only documented
   in a comment; now a test in TestActions prevents regressions.

3. Rename salt variables for clarity.
   In ApplyReadRestrictions, the outer query-scoped salt is now
   querySalt and the per-action narrowing is actionSalt. Avoids the
   subtle shadow where salt (query-wide) was threaded through then a
   local salt (for one action) was assigned from it.

No behavior change; 28 core tests and 18 spark-extensions tests green.
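The extracted base in item 1 has roughly this shape (a hypothetical sketch with types erased to Object; the real class is typed per action):

```java
import java.io.Serializable;
import java.util.Locale;

public class NullSafeBase {

  // Hypothetical sketch: the spec's NULL-in/NULL-out guard lives once in
  // the base; subclasses implement only the non-null path.
  public abstract static class NullSafeFunction implements Serializable {
    public final Object apply(Object value) {
      if (value == null) {
        return null; // spec: NULL input MUST produce NULL output
      }
      return applyNonNull(value);
    }

    protected abstract Object applyNonNull(Object value);
  }

  // Example mask built on the base (illustrative, not a real action)
  public static final NullSafeFunction UPPERCASE = new NullSafeFunction() {
    @Override
    protected Object applyNonNull(Object value) {
      return value.toString().toUpperCase(Locale.ROOT);
    }
  };
}
```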
Three blockers on PR apache#16082 CI:

1. api/.../SupportsReadRestrictions.java:30 javadoc error
   @link pointed at org.apache.iceberg.TableUtil#readRestrictions,
   but iceberg-api does not depend on iceberg-core so the reference
   can't resolve. Inline the name as @code text instead.

2. spark-extensions/.../ApplyReadRestrictions.scala:138
   Line exceeded 120 chars in the buildMaskExpression javadoc where
   a fully-qualified @link to Spark's ScalarFunction was inline.
   Break the link onto its own line.

3. spark-extensions/.../ApplyReadRestrictions.scala:96
   scalastyle's 'If block needs braces' on the if-else inside
   throw new IllegalStateException(...). Pull the condition out and
   throw from both branches explicitly.

Also fix scalastyle import order in IcebergRestrictionExpressions.scala
(CodegenContext swapped with CodeGenerator alphabetically).
ReadRestrictions are currently REST-sourced exclusively. The types were
spread across api/restrictions/ (data + marker) and core/restrictions/
(parsers + bind factory), and BaseTable advertised the capability
interface on every table — even Hive/Hadoop-loaded ones that can never
produce restrictions.

Mirror the credentials precedent (Credential, CredentialParser live
entirely in core/rest/credentials/) and collapse everything into a
single core/rest/restrictions/ package:

  api/restrictions/                    ->  core/rest/restrictions/
    Action.java
    ReadRestrictions.java
    SupportsReadRestrictions.java
  core/restrictions/                   ->  core/rest/restrictions/
    Actions.java
    ActionParser.java
    ReadRestrictionsParser.java
  core/src/test/.../restrictions/      ->  core/src/test/.../rest/restrictions/
    TestActions.java
    TestReadRestrictionsParser.java

Narrow the capability:

  BaseTable keeps the ReadRestrictions field (REST callers pass it in
  via the 4-arg constructor) but no longer implements
  SupportsReadRestrictions. Only RESTTable advertises the interface,
  inheriting the readRestrictions() method from BaseTable to satisfy
  it. Instanceof checks now correctly discriminate REST-loaded tables
  from non-REST BaseTable subclasses — closing Anton's "capability
  spread too widely" review finding.

No wire-format change, no test logic change — only import paths in
tests moved to match the new package. All existing tests pass (core
restrictions tests, spark-extensions restriction expression tests,
checkstyle, scalastyle, spotless, api javadoc).
…core root

The marker is a table-capability advertisement: 'this loaded Table instance
carries per-principal restrictions.' Living under core/rest/restrictions/
implies REST-only applicability and buries it where engine authors wouldn't
look. Move it to org.apache.iceberg (core root), alongside how
SupportsDistributedScanPlanning lives at api root — both are Table-shape
capabilities, not transport details.

BaseTable deliberately does NOT implement the marker even though it carries
the field: that's the asymmetry from the earlier 'capability spread too
widely' review — every Hive/Hadoop-loaded BaseTable would otherwise
advertise a capability it can never fulfill. Keep the data on BaseTable as
opaque state so RESTSessionCatalog's LOCAL-mode fallback still threads
restrictions through, but only RESTTable advertises the marker via
'implements SupportsReadRestrictions'. The inherited public
readRestrictions() method on BaseTable satisfies the interface contract
when RESTTable (or future REST-loaded subclasses) declare the marker.

Consumer code now reads:

  if (table instanceof SupportsReadRestrictions r) {
    r.readRestrictions().ifPresent(restrictions -> ...);
  }

The data types (ReadRestrictions, Action, Actions factory, parsers) stay
in core/rest/restrictions/ — they remain REST-authored, mirroring how
Credential and CredentialParser live entirely in core/rest/credentials/.
The previous shape had a semantic lie in the type system: BaseTable carried
a ReadRestrictions field (to support the LOCAL scan-planning mode fallback
in RESTSessionCatalog) but deliberately did not implement
SupportsReadRestrictions. Only RESTTable — constructed in SERVER
scan-planning mode — advertised the marker. The default LOCAL mode
returned BaseTable, so 'instanceof SupportsReadRestrictions' quietly
returned false and restrictions were dropped.

Fix by introducing BaseRESTTable as an intermediate class in core/rest/:

  BaseTable      (core, generic; no restriction concept at all)
    └─ BaseRESTTable      (core/rest; field + implements marker)
         └─ RESTTable     (core/rest; adds scan planning)

BaseTable goes back to its pre-restrictions shape — removing the 4-arg
constructor, the field, and the readRestrictions() method. Non-REST
catalogs (Hadoop, Hive, Glue, JDBC, Nessie) that construct BaseTable
no longer advertise a capability they have no pathway to produce.

BaseRESTTable carries the field and implements SupportsReadRestrictions
directly. RESTSessionCatalog's three fallback sites (line 598 for LOCAL
loadTable, line 747 for commit, line 1020 for another commit path) now
construct BaseRESTTable instead of BaseTable, so restrictions flow
correctly in all modes.

RESTTable extends BaseRESTTable and inherits both the field and marker —
drops the redundant 'implements SupportsReadRestrictions' declaration.

Consumer code is unchanged:
  if (table instanceof SupportsReadRestrictions r) {
    r.readRestrictions().ifPresent(...);
  }

Now returns true only for REST-loaded tables, never for Hive/Hadoop/etc.
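The hierarchy can be sketched with simplified stand-in types (payload reduced to a String; these are not the real Iceberg classes) to show why the instanceof check now discriminates correctly:

```java
import java.util.Optional;

public class CapabilityHierarchy {

  // simplified stand-in for the capability marker
  public interface SupportsReadRestrictions {
    Optional<String> readRestrictions();
  }

  public static class BaseTable {} // generic core table: no restriction concept

  public static class BaseRESTTable extends BaseTable implements SupportsReadRestrictions {
    private final String restrictions; // nullable: the server may send none

    public BaseRESTTable(String restrictions) {
      this.restrictions = restrictions;
    }

    @Override
    public Optional<String> readRestrictions() {
      return Optional.ofNullable(restrictions);
    }
  }

  public static class RESTTable extends BaseRESTTable { // inherits field and marker
    public RESTTable(String restrictions) {
      super(restrictions);
    }
  }
}
```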
/simplify review flagged two things:

1. BaseRESTTable is instantiated only by RESTSessionCatalog and extended
   only by RESTTable — both in the same org.apache.iceberg.rest package.
   Public visibility leaks an implementation detail and enlarges the
   public API surface without reason. Matches RESTTable, RESTTableOperations,
   and RESTTableCache — all package-private. Narrow both class and
   constructor to package-private.

2. SupportsReadRestrictions javadoc referenced 'TableUtil.readRestrictions
   (Table) (also in iceberg-core)' — the parenthetical is redundant since
   any reader is already in iceberg-core. Drop it.

Other flagged items were investigated and skipped: the Optional allocation
and filter lambda in readRestrictions() are query-planning-time, not
per-row, so allocation cost is negligible; the nullable readRestrictions
parameter is intentional and handled correctly by Optional.ofNullable;
no existing utility duplicates were found.
'class MaskAlphanum extends Base' tells you nothing at the call site —
Base of what? Iceberg's convention across the codebase is Base<Noun>
(BaseTable, BaseMetastoreTableOperations, BaseMetadataTable,
BaseSessionCatalog, BaseUpdate, BaseReadOnlyTable, and the just-added
BaseRESTTable). Follow that convention: 'extends BaseAction' reads as
a proper Iceberg base class, and matches how every similarly-abstract
shared-state holder is named in the project.

No external references to Action.Base — the only references were the
10 subclasses inside Action.java itself.
- Defensive-copy salt array in Sha256Function to prevent caller mutation
- Return ByteBuffer.duplicate() per row in MaskToDefaultFunction to
  prevent shared-position corruption across rows
- Validate field-id > 0 at parse boundary in ActionParser
- Fix IcebergRowFilterExpr.nullable to delegate to child.nullable
- Guard ApplyReadRestrictions with TreeNodeTag to prevent double-rewrite
  on fixed-point iterations and ensure salt stability per query
- Single-pass code-point iteration in ShowFirst4/ShowLast4 (drop
  toCodePoints helper)
- Use DateTimeUtil for date truncation functions
- Use MessageDigest.update(ByteBuffer) directly for BINARY sha-256
- Cache maskedFieldIds in ReadRestrictions constructor
- Normalize BaseRESTTable.readRestrictions at construction time
- Remove redundant isEmpty guard in LoadTableResponseParser
- Remove .claude/plan.md from tracked files
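Two of the hardening items above rest on standard JDK behavior and can be sketched standalone (illustrative only, not the PR's code):

```java
import java.nio.ByteBuffer;

public class HardeningSketch {

  private static final ByteBuffer BINARY_DEFAULT = ByteBuffer.wrap(new byte[] {0, 0, 0, 0});

  // duplicate() shares the underlying bytes but gives each row an
  // independent position/limit, so one consumer advancing its position
  // cannot corrupt the view another row receives.
  public static ByteBuffer binaryDefault() {
    return BINARY_DEFAULT.duplicate();
  }

  // Defensive copy: a caller mutating its salt array after construction
  // cannot change the hash the bound function computes.
  public static byte[] copySalt(byte[] salt) {
    return salt.clone();
  }
}
```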
Merges the separate Action (data model) and Actions.bind() (behavior
factory) hierarchies following the Transform<S, T> pattern in
org.apache.iceberg.transforms.

- Action<S, T> is now generic with bind(Type), bind(Type, byte[]) and
  canBind(Type) default methods
- Each concrete action subclass implements its own bind() and carries
  type-specific inner SerializableFunction classes
- Six TruncateTo* classes reduced to two outer actions with three shared
  abstract bases (TruncateDateFn / TruncateTimestampFn / TruncateTimestampNanoFn)
- Sha256Function split into four type-specialized subclasses
  (Sha256String / Sha256Integer / Sha256Long / Sha256Binary) sharing a
  Base that holds the salt and digest ThreadLocal — one switch at bind
  time instead of two per-row switches
- MaskToDefault bind-time dispatch: ByteBuffer-valued types return a
  function that duplicates per call, non-ByteBuffer types return a
  simple constant
- LocalTime.MIDNIGHT replaces the chained withHour/withMinute/withSecond/withNano
- Actions.java shrinks from ~460 lines to ~60 lines of shared helpers
  (NullSafeFunction base + mapCodePoint)
- ReadRestrictions.columnProjections() now returns List<Action<?, ?>>

Callers migrate from Actions.bind(action, type, salt) to
action.bind(type, salt). TestActions, TestReadRestrictionsParser,
TestIcebergRestrictionExpressions, MaskAlphanumFunction, and
ApplyReadRestrictions are updated accordingly.
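The merged hierarchy has roughly this shape (a hypothetical sketch with Type simplified to a String; names and signatures are illustrative, not the PR's). It also shows the fail-closed invariant for Unknown that the bind test locks in:

```java
import java.io.Serializable;
import java.util.function.Function;

public class ActionBindSketch {

  // Each action binds itself to a type and returns a per-row function,
  // following the Transform<S, T> pattern described above.
  public interface Action<S, T> extends Serializable {
    Function<S, T> bind(String type);

    default boolean canBind(String type) {
      return true;
    }
  }

  public static class ReplaceWithNull implements Action<Object, Object> {
    @Override
    public Function<Object, Object> bind(String type) {
      return value -> null; // every input masks to null
    }
  }

  // Forward-compat: parsing preserves unknown actions losslessly, but
  // binding one must fail closed so unmasked data can never slip through.
  public static class Unknown implements Action<Object, Object> {
    private final String rawType;

    public Unknown(String rawType) {
      this.rawType = rawType;
    }

    @Override
    public Function<Object, Object> bind(String type) {
      throw new IllegalStateException("Cannot bind unknown action type: " + rawType);
    }
  }
}
```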
… date util

- ApplyExpression.canBind now returns true to match bind() behavior
  (bind succeeds, apply throws; canBind was contradicting this)
- Sha256String.update casts to String instead of calling toString()
  (value is guaranteed String at this point; documents the invariant)
- TestActions maskToDefaultDate uses DateTimeUtil.daysFromDate for
  consistency with the production code's DATE_DEFAULT computation
Thread server-provided ReadRestrictions through the engine-agnostic
IcebergGenerics.read() path. Row filters are AND'd into the TableScan so
manifest-level pruning still applies; column masks are applied per-record
after reads via the existing Actions.bind() factory.

This provides a reference implementation of spec enforcement for any
consumer of the generics reader (Flink, Trino test harnesses, standalone
tools) without requiring engine-specific Catalyst rules. Top-level fields
only — nested masks fail closed at bind time so unmasked data cannot leak.
The applier now performs row filtering against the original column values
before applying column masks, matching the spec ordering requirement.
Row-level enforcement no longer depends on the surrounding reader honoring
TableScan residuals — IcebergGenerics still pushes the filter into the
scan for partition/stats pruning, but correctness is guaranteed by this
layer regardless of reader behavior.
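The per-row ordering guarantee can be sketched as follows (standalone illustration; in the real applier "dropped" means the record is skipped, which this sketch represents with null):

```java
import java.util.function.Function;
import java.util.function.Predicate;

public class ApplierOrderSketch {

  // The row filter evaluates against the ORIGINAL value; the mask applies
  // only afterwards, so filtering never depends on the surrounding reader
  // honoring scan residuals.
  public static <T> T filterThenMask(T original, Predicate<T> rowFilter, Function<T, T> mask) {
    if (!rowFilter.test(original)) {
      return null; // null stands in for "row dropped" in this sketch
    }
    return mask.apply(original);
  }
}
```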
Dropping the Spark-side Catalyst rule, ScalarFunction, and opaque mask
expressions. This PR now contains only the engine-agnostic core-level
enforcement via ReadRestrictionsApplier in the data module.

The Spark-specific work remains on feature/load-table-return-policy-wip
(PR apache#16082) and can be layered on top once this lands.
Action.bind(Type, byte[]) is now polymorphic per subclass; the central
Actions.bind(Action, Type, byte[]) dispatcher was removed when Actions
was reduced to package-private helpers. Update the generics-reader
applier to call action.bind(type, salt) directly and widen the element
type to Action<?, ?>.