docs: add AI-native lakehouse documentation #18591
yihua wants to merge 2 commits into apache:asf-site
Conversation
…e format Add new "AI & Unstructured Data" docs section and AI Quick Start guide covering VECTOR type, BLOB unstructured data, hudi_vector_search TVF, and Lance file format integration.
…onfigs
- Add variant_type.md covering semi-structured data with VARIANT (DDL for Spark 4.0+/3.x, parse_json, shredding, cross-engine compat)
- Update vector_search.md: element types (FLOAT/DOUBLE/INT8), batch TVF (hudi_vector_search_batch), correct column name (_hudi_distance), algorithm parameter, constraints section
- Update blob_unstructured_data.md: add configuration reference (hoodie.read.blob.inline.mode, batching configs), managed field
- Update ai_overview.md: add VARIANT section and expanded use case table
- Update sidebars.js and overview.mdx to include VARIANT
> :::tip
> The complete demo scripts (DataFrame API, SQL, and BLOB-only variants) are available in the
> [hudi-examples](https://github.com/apache/hudi/tree/master/hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo) repository.
These are not available as of now; let's remind ourselves to prepare the demo as a prerequisite before publishing this to asf-site.
> - Java 11+
> - Python 3.10 – 3.12 (Python 3.13+ is not yet supported)
> - Apache Spark 3.5.x
> - Hudi 1.2.0+ (with Spark bundle)
> ### Python Dependencies
> ```bash
> # Quote the specifiers so the shell doesn't treat `>=` as output redirection.
> pip install "pyspark==3.5.*" "pyarrow>=14.0.0" \
>   "torch>=2.3.0" "torchvision>=0.18.0" "timm>=1.0.9" \
>   "scikit-learn>=1.4.2" "numpy>=1.26.0" "pillow>=10.3.0" "matplotlib>=3.8.0"
> ```
NOTE: This is a non-blocker.
Another idea: we could translate everything in here into a Testcontainers test for validation / E2E.
One problem I usually encounter is that test scripts and the code base are decoupled, i.e. the scripts are written at a point in time that corresponds to a snapshot of the state of our code.
We don't do a sanity check after changing our code, so many things break and quickstarts no longer work after a while.
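For what it's worth, a minimal sketch of that idea with testcontainers-python — the Spark image tag, mounted path, and demo script name below are all hypothetical, not real artifacts:

```python
# Hypothetical E2E harness; image tag, paths, and script name are assumptions.
from testcontainers.core.container import DockerContainer

def test_ai_quickstart_runs():
    container = (
        DockerContainer("apache/spark:3.5.1")                # assumed image
        .with_volume_mapping("./vector_blob_demo", "/demo")  # hypothetical demo dir
        .with_command("sleep infinity")                      # keep alive for exec
    )
    with container:
        # Run the quickstart inside the container; a non-zero exit fails the test.
        exit_code, output = container.get_wrapped_container().exec_run(
            "spark-submit /demo/quickstart.py"
        )
        assert exit_code == 0, output.decode()
```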
> | Property | Default | Description |
> |:---------|:--------|:------------|
> | `hoodie.read.blob.inline.mode` | `CONTENT` | Controls how INLINE BLOBs are read. `CONTENT` materializes raw bytes in the `data` column. `DESCRIPTOR` surfaces `(position, size)` coordinates rewritten as OUT_OF_LINE references. |
> | `hoodie.blob.batching.max.gap.bytes` | `4096` | Maximum gap (in bytes) between consecutive byte ranges before they are merged into a single read. Larger values reduce I/O calls at the cost of reading some unused bytes. |
> | `hoodie.blob.batching.lookahead.size` | `50` | Number of rows to buffer for batch read detection. Larger values improve batching for sorted data but increase memory usage. |
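For illustration, a hedged PySpark sketch of overriding these defaults at read time, assuming the properties are accepted as ordinary Hudi read options (the table path is made up):

```python
# Sketch only: assumes these blob properties are settable as Hudi read options.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("hudi")
    # Surface (position, size) descriptors instead of materializing inline bytes.
    .option("hoodie.read.blob.inline.mode", "DESCRIPTOR")
    # Coalesce byte ranges up to 1 MiB apart into a single ranged read.
    .option("hoodie.blob.batching.max.gap.bytes", str(1024 * 1024))
    # Buffer more rows to find coalescible ranges in sorted data.
    .option("hoodie.blob.batching.lookahead.size", "200")
    .load("/tmp/hudi/image_table")  # hypothetical table path
)
```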
There are some characteristics/nuances of DESCRIPTOR that I think @yihua summarised very nicely in a table:
**Storage (immutable):**
| Storage type | `data` | `reference` |
|---|---|---|
| `INLINE` | bytes | `null` |
| `OUT_OF_LINE` | `null` | `{path, offset, length, managed}` |
**`SELECT blob_col` (no `read_blob()`) — controlled by `hoodie.read.blob.inline.mode`:**
| Mode | Inline blob behavior | Out-of-line (unaffected) |
|---|---|---|
| **Eager** | Read struct as-is, `data` has bytes | Reference as-is |
| **Lazy** (default) | Prune `data` from read (no I/O), populate `reference` with pos+size | Reference as-is |
**`SELECT read_blob(blob_col)` — always materializes bytes, mode-unaware:**
| Storage type | Behavior |
|---|---|
| `INLINE` | Reader reads full struct (no pruning), `read_blob()` returns `data` bytes directly |
| `OUT_OF_LINE` | `read_blob()` does batched range read via `reference` |
**Merge/compaction**: always runs in eager mode (actual bytes are needed for rewriting).
**Format support for lazy mode:**
- Lance: works now (DESCRIPTOR mode provides pos+size natively)
- Parquet: separate PR to extract pos+size within the Parquet file for the reference transformation
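To make the two access paths above concrete, an illustrative PySpark sketch — the table and column names are hypothetical, and only `read_blob()` itself comes from this PR:

```python
# Illustrative only; assumes a Hudi table `image_table` with a BLOB column `img`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain projection: behavior depends on hoodie.read.blob.inline.mode.
# In lazy mode the reader prunes `data` (no blob I/O) and fills `reference`
# with the blob's position and size.
refs = spark.sql("SELECT id, img FROM image_table")

# read_blob() always materializes bytes, regardless of mode: inline blobs come
# straight from `data`; out-of-line blobs use a batched range read via `reference`.
bytes_df = spark.sql("SELECT id, read_blob(img) AS img_bytes FROM image_table")
```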
Is it possible to make this more visual, e.g. with a Mermaid chart, so it's easier for users to follow? Docusaurus supports Mermaid charts now.
> ## BLOB Type Overview
>
> A BLOB column stores binary data in one of two modes:
It would be good if we can add Hudi's BLOB spec here, so users can easily correlate and see where the INLINE/OUT_OF_LINE types will be used later.
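As a rough illustration of what that could look like — a sketch only, since the exact BLOB DDL syntax is an assumption based on this PR's summary rather than the published spec:

```python
# Hypothetical DDL sketch; the BLOB column syntax is assumed, not confirmed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    CREATE TABLE image_table (
        id   BIGINT,
        name STRING,
        img  BLOB  -- stored INLINE (bytes in `data`) or OUT_OF_LINE (`reference`)
    ) USING hudi
    TBLPROPERTIES (primaryKey = 'id')
""")
```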
> All share the same Hudi catalog, metadata, and tooling.
> ## Known Limitations
We can use the danger/warning admonition for this part.
> 1. Created a Hudi table with **VECTOR** columns (for embeddings) and **BLOB** columns (for raw image bytes)
> 2. Generated image embeddings using a pre-trained neural network (MobileNetV3)
> 3. Written embeddings and images to a Hudi table backed by the **Lance** columnar format
We should change this to say either Parquet or Lance.
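For context, a condensed sketch of what those three steps might look like in PySpark — everything here (the embedding stand-in, column names, and the Lance format option) is illustrative rather than the actual demo script:

```python
# Illustrative recap of the three steps; names, the embedding stand-in, and the
# Lance base-file-format value are assumptions based on this PR's summary.
import numpy as np
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def embed(image_bytes: bytes) -> list:
    # Stand-in for the MobileNetV3 embedding step in the real demo.
    return np.random.rand(576).astype("float32").tolist()

# Steps 1-2: synthetic "images" plus their embeddings.
rows = [Row(id=i, img=bytes([i % 256]) * 1024) for i in range(3)]
rows = [Row(id=r.id, img=r.img, emb=embed(r.img)) for r in rows]
df = spark.createDataFrame(rows)

# Step 3: write to a Hudi table; the Lance format switch is assumed.
(df.write.format("hudi")
    .option("hoodie.table.name", "image_table")
    .option("hoodie.table.base.file.format", "lance")
    .mode("append")
    .save("/tmp/hudi/image_table"))
```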
Describe the issue this Pull Request addresses
Add documentation for Hudi's AI-native capabilities to attract AI/ML users to the project.
Summary and Changelog
Add a new "AI & Unstructured Data" docs section and an AI Quick Start guide:
- `hudi_vector_search` TVF syntax, distance metrics, embedding model compatibility
- `read_blob()` function
- `overview.mdx` and `sidebars.js` to link the new pages
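As a taste of the TVF the new pages document — the argument shape and values below are illustrative; only the TVF name and the `_hudi_distance` column come from this PR:

```python
# Illustrative query; argument names/order are assumptions, not documented syntax.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
    SELECT id, _hudi_distance
    FROM hudi_vector_search(
        'image_table',            -- table to search (hypothetical)
        'emb',                    -- VECTOR column (hypothetical)
        array(0.12, 0.48, 0.33),  -- toy 3-dim query embedding
        10                        -- top-k
    )
    ORDER BY _hudi_distance
""").show()
```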
Impact

Documentation-only change. No code changes.
Risk Level
None
Documentation Update
This PR is the documentation update itself. Pages are added to the current (next) version only, not backported to any released version.
Contributor's checklist