
docs: add AI-native lakehouse documentation #18591

Draft
yihua wants to merge 2 commits into apache:asf-site from yihua:docs-ai-native-lakehouse

Conversation


@yihua (Contributor) commented Apr 26, 2026

Describe the issue this Pull Request addresses

Add documentation for Hudi's AI-native capabilities to attract AI/ML users to the project.

Summary and Changelog

Add a new "AI & Unstructured Data" docs section and an AI Quick Start guide (an illustrative sketch of the workflow these pages cover follows the list below):

  • AI Quick Start (under Getting Started): hands-on tutorial for image similarity search using VECTOR, BLOB, and Lance
  • AI-Native Lakehouse Overview: landing page positioning Hudi for AI workloads (RAG, image search, feature stores, etc.)
  • Vector Search: VECTOR type reference, hudi_vector_search TVF syntax, distance metrics, embedding model compatibility
  • Unstructured Data: BLOB type reference covering inline vs out-of-line storage and read_blob() function
  • Lance File Format: Lance integration guide with setup, architecture, and comparison to Parquet for AI workloads
  • Updated overview.mdx and sidebars.js to link the new pages
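
A minimal PySpark sketch of that workflow. The VECTOR/BLOB column syntax and the `hudi_vector_search` arguments below are assumptions inferred from the feature names in this PR, and the table/column names are invented; the new docs pages remain the authoritative reference.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ai-lakehouse-sketch").getOrCreate()

# Hypothetical table mixing raw bytes (BLOB) and embeddings (VECTOR);
# the exact type syntax is defined in the Vector Search / Unstructured Data pages.
spark.sql("""
    CREATE TABLE image_search (
        image_id  STRING,
        image     BLOB,               -- raw image bytes, inline or out-of-line
        embedding VECTOR(FLOAT, 4)    -- tiny dimension purely for illustration
    ) USING hudi
""")

# Nearest-neighbour query via the hudi_vector_search table-valued function;
# the argument names and named-argument syntax are assumed, not confirmed.
spark.sql("""
    SELECT image_id, _hudi_distance
    FROM hudi_vector_search(
        table  => 'image_search',
        column => 'embedding',
        query  => array(0.1, 0.2, 0.3, 0.4),
        k      => 10
    )
    ORDER BY _hudi_distance
""").show()
```

The batch variant (`hudi_vector_search_batch`), the supported distance metrics, and the `algorithm` parameter mentioned in the second commit follow the same pattern.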

Impact

Documentation-only change. No code changes.

Risk Level

None

Documentation Update

This PR is the documentation update itself. Pages are added to the current (next) version only, not backported to any released version.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…e format

Add new "AI & Unstructured Data" docs section and AI Quick Start guide
covering VECTOR type, BLOB unstructured data, hudi_vector_search TVF,
and Lance file format integration.
@rahil-c requested review from rahil-c and voonhous on April 26, 2026 04:32
…onfigs

- Add variant_type.md covering semi-structured data with VARIANT
  (DDL for Spark 4.0+/3.x, parse_json, shredding, cross-engine compat); see the sketch after this list
- Update vector_search.md: element types (FLOAT/DOUBLE/INT8), batch TVF
  (hudi_vector_search_batch), correct column name (_hudi_distance),
  algorithm parameter, constraints section
- Update blob_unstructured_data.md: add configuration reference
  (hoodie.read.blob.inline.mode, batching configs), managed field
- Update ai_overview.md: add VARIANT section and expanded use case table
- Update sidebars.js and overview.mdx to include VARIANT
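
To make the VARIANT addition concrete, here is a hedged sketch assuming Spark 4.0+, where `VARIANT`, `parse_json`, and `variant_get` are available natively; the Spark 3.x fallback DDL and the Hudi-specific details live in variant_type.md, and the table/column names here are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Semi-structured payloads stored as VARIANT in a Hudi table (illustrative only).
spark.sql("""
    CREATE TABLE events (
        event_id STRING,
        payload  VARIANT
    ) USING hudi
""")

spark.sql("""
    INSERT INTO events
    SELECT 'e1', parse_json('{"user": {"id": 42}, "action": "click"}')
""")

# Path extraction from the variant column.
spark.sql("""
    SELECT event_id, variant_get(payload, '$.user.id', 'int') AS user_id
    FROM events
""").show()
```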

:::tip
The complete demo scripts (DataFrame API, SQL, and BLOB-only variants) are available in the
[hudi-examples](https://github.com/apache/hudi/tree/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo) repository.
:::

Member


These are not available as of now; let's remind ourselves to prepare the demo as a prerequisite before publishing this to asf-site.

Comment on lines +27 to +37
- Java 11+
- Python 3.10 – 3.12 (Python 3.13+ is not yet supported)
- Apache Spark 3.5.x
- Hudi 1.2.0+ (with Spark bundle)

### Python Dependencies

```bash
pip install "pyspark==3.5.*" "pyarrow>=14.0.0" \
    "torch>=2.3.0" "torchvision>=0.18.0" "timm>=1.0.9" \
    "scikit-learn>=1.4.2" "numpy>=1.26.0" "pillow>=10.3.0" "matplotlib>=3.8.0"
```
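
For context, one way to launch a PySpark session that satisfies the prerequisites quoted above. These are Hudi's standard Spark session configs; the bundle version is illustrative, and the quick start page itself is authoritative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-ai-quickstart")
    # Hudi Spark bundle matching Spark 3.5.x; use the actual released version.
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.5-bundle_2.12:1.2.0")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
    .getOrCreate()
)
```
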
Member

@voonhous commented Apr 27, 2026


NOTE: This is a non-blocker

Another idea is that we could translate everything in here into a TestContainer test too for validation / E2E.

One problem I usually encounter is that test scripts and the code base are decoupled, i.e. the scripts are written at a point in time that corresponds to a snapshot of the state of our code.

We don't do a sanity check after changing our code, so many things break and quickstarts no longer work after a while.
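
As a lighter-weight stand-in for the Testcontainers idea, a minimal pytest sketch that runs the published quick start script end-to-end in CI so the doc snippets stay in sync with the code. The script filename and the `--base-path` flag are hypothetical; only the demo directory comes from this PR.

```python
import subprocess
import sys

import pytest

# Hypothetical entry point inside the demo directory referenced earlier in this PR.
QUICKSTART = (
    "hudi-examples/hudi-examples-spark/src/test/python/"
    "vector_blob_demo/quickstart.py"
)


@pytest.mark.e2e
def test_ai_quickstart_runs_end_to_end(tmp_path):
    # Run the quick start exactly as a user would, writing to a temporary table path.
    result = subprocess.run(
        [sys.executable, QUICKSTART, "--base-path", str(tmp_path)],
        capture_output=True,
        text=True,
        timeout=1800,
    )
    assert result.returncode == 0, result.stderr
```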

Comment on lines +291 to +295
| Property | Default | Description |
|:---------|:--------|:------------|
| `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
| `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap (in bytes) between consecutive byte ranges before they are merged into a single read. Larger values reduce I/O calls at the cost of reading some unused bytes. |
| `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for batch read detection. Larger values improve batching for sorted data but increase memory usage. |
Member


There are some characteristics/nuances with DESCRIPTOR that I think @yihua summarised very nicely in the tables below:

**Storage (immutable):**

| Storage type | `data` | `reference` |
|---|---|---|
| `INLINE` | bytes | `null` |
| `OUT_OF_LINE` | `null` | `{path, offset, length, managed}` |

**`SELECT blob_col` (no `read_blob()`) — controlled by `hoodie.read.blob.inline.mode`:**

| Mode | Inline blob behavior | Out-of-line (unaffected) |
|---|---|---|
| **Eager** | Read struct as-is, `data` has bytes | Reference as-is |
| **Lazy** (default) | Prune `data` from read (no I/O), populate `reference` with pos+size | Reference as-is |

**`SELECT read_blob(blob_col)` — always materializes bytes, mode-unaware:**

| Storage type | Behavior |
|---|---|
| `INLINE` | Reader reads full struct (no pruning), `read_blob()` returns `data` bytes directly |
| `OUT_OF_LINE` | `read_blob()` does batched range read via `reference` |

**Merge/compaction**: always uses eager mode (the actual bytes are needed for rewriting).

**Format support for lazy mode:**
- Lance: works now (DESCRIPTOR mode provides pos+size natively)
- Parquet: separate PR to extract pos+size within the Parquet file for the reference transformation

Is it possible to make this more visual, i.e. with a Mermaid chart or something, so it's easier for users to follow? Docusaurus supports Mermaid charts now.
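
To complement such a chart, a short PySpark sketch of how the two read paths above surface to users. The config key and `read_blob()` come from this PR; the mode values follow the quoted config table (`CONTENT`/`DESCRIPTOR`, matching the eager/lazy framing above), and the table/column names are invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the image_search table exists

# Descriptor-style (lazy) read: for inline blobs, skip the bytes and surface
# (position, size) references instead of materializing data.
spark.sql("SET hoodie.read.blob.inline.mode = DESCRIPTOR")
refs = spark.sql("SELECT image_id, image FROM image_search")

# Content-style (eager) read: inline blob bytes are materialized in the data field.
spark.sql("SET hoodie.read.blob.inline.mode = CONTENT")
contents = spark.sql("SELECT image_id, image FROM image_search")

# read_blob() always returns the raw bytes, regardless of mode or storage type,
# using batched range reads for out-of-line references.
bytes_df = spark.sql(
    "SELECT image_id, read_blob(image) AS image_bytes FROM image_search"
)
```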

## BLOB Type Overview

A BLOB column stores binary data in one of two modes:

Member


It would be good if we could add Hudi's blob spec here, so users can easily correlate and see where the INLINE/OUT_OF_LINE types will be used later.
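
Until the spec excerpt is added, a rough PySpark sketch of the logical shape implied by the tables in the previous comment; the field names are taken from that comment, not from the blob spec itself, so treat this as illustrative.

```python
from pyspark.sql.types import (
    BinaryType, BooleanType, LongType, StringType, StructField, StructType
)

# Logical blob struct: exactly one of `data` / `reference` is populated,
# depending on whether the blob is stored INLINE or OUT_OF_LINE.
blob_struct = StructType([
    StructField("data", BinaryType()),             # INLINE: raw bytes
    StructField("reference", StructType([           # OUT_OF_LINE: locator
        StructField("path", StringType()),
        StructField("offset", LongType()),
        StructField("length", LongType()),
        StructField("managed", BooleanType()),
    ])),
])
```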

Collaborator


Yea nice point


All share the same Hudi catalog, metadata, and tooling.

## Known Limitations
Member


We can use the danger/warning admonition for this part.

https://docusaurus.io/docs/markdown-features/admonitions


1. Created a Hudi table with **VECTOR** columns (for embeddings) and **BLOB** columns (for raw image bytes)
2. Generated image embeddings using a pre-trained neural network (MobileNetV3)
3. Written embeddings and images to a Hudi table backed by the **Lance** columnar format
Collaborator


We should change this to say either Parquet or Lance.
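
For reference, a hedged sketch of steps 1–3 above: embed one image with a pre-trained MobileNetV3 (via torchvision here; the demo may use timm) and stage the bytes plus embedding for writing. Table/column names and the write call are illustrative; the quick start page defines the real write options and whether the table is Lance- or Parquet-backed.

```python
import torch
from PIL import Image
from pyspark.sql import SparkSession
from torchvision.models import MobileNet_V3_Small_Weights, mobilenet_v3_small

spark = SparkSession.builder.getOrCreate()  # assumes the image_search table exists

# Pre-trained backbone; replacing the classifier head exposes the pooled
# feature vector, which we use as the embedding.
weights = MobileNet_V3_Small_Weights.DEFAULT
model = mobilenet_v3_small(weights=weights).eval()
model.classifier = torch.nn.Identity()
preprocess = weights.transforms()

with open("cat.jpg", "rb") as f:
    raw_bytes = f.read()  # raw image bytes destined for the BLOB column

with torch.no_grad():
    tensor = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
    embedding = model(tensor).squeeze(0).tolist()  # values for the VECTOR column

# Stage a single row and append it to the (hypothetical) image_search table.
df = spark.createDataFrame(
    [("img-001", bytearray(raw_bytes), embedding)],
    ["image_id", "image", "embedding"],
)
df.writeTo("image_search").append()
```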
