docs: add AI-native lakehouse documentation #18591
yihua wants to merge 2 commits into apache:asf-site
Conversation
…e format Add new "AI & Unstructured Data" docs section and AI Quick Start guide covering VECTOR type, BLOB unstructured data, hudi_vector_search TVF, and Lance file format integration.
…onfigs
- Add variant_type.md covering semi-structured data with VARIANT (DDL for Spark 4.0+/3.x, parse_json, shredding, cross-engine compat)
- Update vector_search.md: element types (FLOAT/DOUBLE/INT8), batch TVF (hudi_vector_search_batch), correct column name (_hudi_distance), algorithm parameter, constraints section
- Update blob_unstructured_data.md: add configuration reference (hoodie.read.blob.inline.mode, batching configs), managed field
- Update ai_overview.md: add VARIANT section and expanded use case table
- Update sidebars.js and overview.mdx to include VARIANT
> :::tip
> The complete demo scripts (DataFrame API, SQL, and BLOB-only variants) are available in the
> [hudi-examples](https://github.com/apache/hudi/tree/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo) repository.
These are not available as of now; let's remind ourselves to prepare the demo as a prerequisite before publishing this to asf-site.
> - Java 11+
> - Python 3.10 – 3.12 (Python 3.13+ is not yet supported)
> - Apache Spark 3.5.x
> - Hudi 1.2.0+ (with Spark bundle)
> ### Python Dependencies
> ```bash
> # Quote the specifiers so the shell doesn't treat `>=` as output redirection.
> pip install "pyspark==3.5.*" "pyarrow>=14.0.0" \
>   "torch>=2.3.0" "torchvision>=0.18.0" "timm>=1.0.9" \
>   "scikit-learn>=1.4.2" "numpy>=1.26.0" "pillow>=10.3.0" "matplotlib>=3.8.0"
> ```
NOTE: This is a non-blocker.
Another idea: we could translate everything in here into a Testcontainers test for validation / E2E.
One problem I usually encounter is that test scripts and the code base are decoupled, i.e. the scripts are written at a point in time that corresponds to a snapshot of the state of our code.
We don't do a sanity check after changing our code, so many things break and quickstarts no longer work after a while.
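For what it's worth, a minimal sketch of that idea with testcontainers-python — the Spark image tag, mounted path, and demo script name below are all hypothetical, not real artifacts:

```python
# Hypothetical E2E harness; image tag, paths, and script name are assumptions.
from testcontainers.core.container import DockerContainer

def test_ai_quickstart_runs():
    container = (
        DockerContainer("apache/spark:3.5.1")                # assumed image
        .with_volume_mapping("./vector_blob_demo", "/demo")  # hypothetical demo dir
        .with_command("sleep infinity")                      # keep alive for exec
    )
    with container:
        # Run the quickstart inside the container; a non-zero exit fails the test.
        exit_code, output = container.get_wrapped_container().exec_run(
            "spark-submit /demo/quickstart.py"
        )
        assert exit_code == 0, output.decode()
```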
> | Property | Default | Description |
> |:---------|:--------|:------------|
> | `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
> | `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap (in bytes) between consecutive byte ranges before they are merged into a single read. Larger values reduce I/O calls at the cost of reading some unused bytes. |
> | `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for batch read detection. Larger values improve batching for sorted data but increase memory usage. |
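For illustration, a hedged PySpark sketch of overriding these defaults at read time, assuming the properties are accepted as ordinary Hudi read options (the table path is made up):

```python
# Sketch only: assumes these blob properties are settable as Hudi read options.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("hudi")
    # Surface (position, size) descriptors instead of materializing inline bytes.
    .option("hoodie.read.blob.inline.mode", "DESCRIPTOR")
    # Coalesce byte ranges up to 1 MiB apart into a single ranged read.
    .option("hoodie.blob.batching.max.gap.bytes", str(1024 * 1024))
    # Buffer more rows to find coalescible ranges in sorted data.
    .option("hoodie.blob.batching.lookahead.size", "200")
    .load("/tmp/hudi/image_table")  # hypothetical table path
)
```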
There are some characteristics/nuances of DESCRIPTOR that I think @yihua summarised very nicely in a table:
**Storage (immutable):**
| Storage type | `data` | `reference` |
|---|---|---|
| `INLINE` | bytes | `null` |
| `OUT_OF_LINE` | `null` | `{path, offset, length, managed}` |
**`SELECT blob_col` (no `read_blob()`) — controlled by `hoodie.read.blob.inline.mode`:**
| Mode | Inline blob behavior | Out-of-line (unaffected) |
|---|---|---|
| **Eager** | Read struct as-is, `data` has bytes | Reference as-is |
| **Lazy** (default) | Prune `data` from read (no I/O), populate `reference` with pos+size | Reference as-is |
**`SELECT read_blob(blob_col)` — always materializes bytes, mode-unaware:**
| Storage type | Behavior |
|---|---|
| `INLINE` | Reader reads full struct (no pruning), `read_blob()` returns `data` bytes directly |
| `OUT_OF_LINE` | `read_blob()` does batched range read via `reference` |
**Merge/compaction**: always runs in eager mode (actual bytes are needed for rewriting).
**Format support for lazy mode:**
- Lance: works now (DESCRIPTOR mode provides pos+size natively)
- Parquet: separate PR to extract pos+size within the Parquet file for the reference transformation
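To make the two access paths above concrete, an illustrative PySpark sketch — the table and column names are hypothetical, and only `read_blob()` itself comes from this PR:

```python
# Illustrative only; assumes a Hudi table `image_table` with a BLOB column `img`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain projection: behavior depends on hoodie.read.blob.inline.mode.
# In lazy mode the reader prunes `data` (no blob I/O) and fills `reference`
# with the blob's position and size.
refs = spark.sql("SELECT id, img FROM image_table")

# read_blob() always materializes bytes, regardless of mode: inline blobs come
# straight from `data`; out-of-line blobs use a batched range read via `reference`.
bytes_df = spark.sql("SELECT id, read_blob(img) AS img_bytes FROM image_table")
```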
Is it possible to make this more visual, e.g. with a Mermaid chart, so it's easier for users to follow? Docusaurus supports Mermaid charts now.
> ## BLOB Type Overview
>
> A BLOB column stores binary data in one of two modes:
It would be good if we can add Hudi's BLOB spec here, so users can easily correlate and see where the INLINE/OUT_OF_LINE types will be used later.
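As a rough illustration of what that could look like — a sketch only, since the exact BLOB DDL syntax is an assumption based on this PR's summary rather than the published spec:

```python
# Hypothetical DDL sketch; the BLOB column syntax is assumed, not confirmed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    CREATE TABLE image_table (
        id   BIGINT,
        name STRING,
        img  BLOB  -- stored INLINE (bytes in `data`) or OUT_OF_LINE (`reference`)
    ) USING hudi
    TBLPROPERTIES (primaryKey = 'id')
""")
```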
> All share the same Hudi catalog, metadata, and tooling.
> ## Known Limitations
We can use the danger/warning admonition for this part.
> 1. Created a Hudi table with **VECTOR** columns (for embeddings) and **BLOB** columns (for raw image bytes)
> 2. Generated image embeddings using a pre-trained neural network (MobileNetV3)
> 3. Written embeddings and images to a Hudi table backed by the **Lance** columnar format
We should change this to say either Parquet or Lance.
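For context, a condensed sketch of what those three steps might look like in PySpark — everything here (the embedding stand-in, column names, and the Lance format option) is illustrative rather than the actual demo script:

```python
# Illustrative recap of the three steps; names, the embedding stand-in, and the
# Lance base-file-format value are assumptions based on this PR's summary.
import numpy as np
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def embed(image_bytes: bytes) -> list:
    # Stand-in for the MobileNetV3 embedding step in the real demo.
    return np.random.rand(576).astype("float32").tolist()

# Steps 1-2: synthetic "images" plus their embeddings.
rows = [Row(id=i, img=bytes([i % 256]) * 1024) for i in range(3)]
rows = [Row(id=r.id, img=r.img, emb=embed(r.img)) for r in rows]
df = spark.createDataFrame(rows)

# Step 3: write to a Hudi table; the Lance format switch is assumed.
(df.write.format("hudi")
    .option("hoodie.table.name", "image_table")
    .option("hoodie.table.base.file.format", "lance")
    .mode("append")
    .save("/tmp/hudi/image_table"))
```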
Describe the issue this Pull Request addresses
Add documentation for Hudi's AI-native capabilities to attract AI/ML users to the project.
Summary and Changelog
Add a new "AI & Unstructured Data" docs section and an AI Quick Start guide:
- `hudi_vector_search` TVF syntax, distance metrics, embedding model compatibility
- `read_blob()` function
- `overview.mdx` and `sidebars.js` to link the new pages
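As a taste of the TVF the new pages document — the argument shape and values below are illustrative; only the TVF name and the `_hudi_distance` column come from this PR:

```python
# Illustrative query; argument names/order are assumptions, not documented syntax.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    SELECT id, _hudi_distance
    FROM hudi_vector_search(
        'image_table',            -- table to search (hypothetical)
        'emb',                    -- VECTOR column (hypothetical)
        array(0.12, 0.48, 0.33),  -- toy 3-dim query embedding
        10                        -- top-k
    )
    ORDER BY _hudi_distance
""").show()
```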
Impact

Documentation-only change. No code changes.
Risk Level
None
Documentation Update
This PR is the documentation update itself. Pages are added to the current (next) version only, not backported to any released version.
Contributor's checklist