ALDEN is a multi-modal reinforcement learning framework for Agentic Visually-Rich Document Understanding (A-VRDU). Built upon Qwen2.5-VL, it introduces a novel `fetch` action, a cross-level reward, and a visual semantic anchoring mechanism to enable efficient navigation and reasoning over long, high-resolution documents.
This repository contains the official implementation of our paper: *ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents*.
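For intuition, the `<search>` and `<fetch>` actions surface in the agent's generated text as tagged spans that the environment then executes. The tag grammar below is an assumption for illustration only; the authoritative parsing lives in this repository's code:

```python
import re

# Illustrative parser for the agent's action tags. Only the tag names
# <search>...</search> and <fetch>...</fetch> are taken from this README;
# the exact grammar ALDEN uses is defined in the codebase.
ACTION_RE = re.compile(r"<(search|fetch)>(.*?)</\1>", re.DOTALL)

def parse_actions(model_output: str):
    """Return (action, argument) pairs in the order they were generated."""
    return [(m.group(1), m.group(2).strip()) for m in ACTION_RE.finditer(model_output)]

rollout = "I need evidence. <search>2023 total revenue</search> Now <fetch>page 12</fetch>"
print(parse_actions(rollout))  # [('search', '2023 total revenue'), ('fetch', 'page 12')]
```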
Create the main training environment:

```bash
conda create -n alden python=3.10
conda activate alden
git clone https://github.com/gipplab/ALDEN.git
cd ./ALDEN
pip install -e .
```

Create the single-vector retriever environment:

```bash
conda create -n alden-sv python=3.10
conda activate alden-sv
cd ./ALDEN
pip install -r single-vec_retriever_requirements.txt
cd ./flashrag
pip install -e .
```

Create the multi-vector retriever environment:

```bash
conda create -n alden-mv python=3.10
conda activate alden-mv
cd ./ALDEN
pip install -r multi-vec_retriever_requirements.txt
cd ./flashrag
pip install -e .
```

We provide the processed training corpus on Hugging Face: SkyFishQ/ALDEN.
If you wish to build the corpus from scratch using your own data:

- Modify `raw_data_path` and `target_path` in `rag_serving/build_corpus.py`.
- Run the build script:

```bash
python rag_serving/build_corpus.py
```

We use flashrag to build the dense retrieval index for document images.
```bash
cd ./flashrag/flashrag/retriever
```

vdr-2b-v1 (single-vector, image modality):

```bash
python index_builder.py \
    --retrieval_method vdr-2b-v1 \
    --model_path llamaindex/vdr-2b-v1 \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 512 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal image \
    --sentence_transformer \
    --save_embedding
```

gte-Qwen2-1.5B-instruct (single-vector, text modality):

```bash
python index_builder.py \
    --retrieval_method gte-Qwen2-1.5B-instruct \
    --model_path Alibaba-NLP/gte-Qwen2-1.5B-instruct \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 4096 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal text \
    --sentence_transformer \
    --save_embedding
```

jina-colbert-v2 (multi-vector, text modality):

```bash
python index_builder.py \
    --retrieval_method jina-colbert-v2 \
    --model_path jinaai/jina-colbert-v2 \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 4096 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal text \
    --save_embedding
```

colqwen2-v1.0 (multi-vector, image modality):

```bash
python index_builder.py \
    --retrieval_method colqwen2-v1.0 \
    --model_path vidore/colqwen2-v1.0 \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 4096 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal image \
    --save_embedding
```

Note: Please replace `/path/to/your/...` with your actual file paths.
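For reference, `--faiss_type Flat` builds an exact, brute-force index: every query embedding is scored against every corpus embedding, with no quantization or approximate search. Under the assumption of L2-normalized embeddings scored by inner product, a flat index behaves like this NumPy sketch:

```python
import numpy as np

# Brute-force ("Flat") retrieval: score the query against every corpus
# vector and return the top-k. This mirrors what a faiss Flat index does,
# assuming L2-normalized embeddings scored by inner product.
def flat_search(corpus_emb: np.ndarray, query_emb: np.ndarray, k: int = 3):
    scores = corpus_emb @ query_emb       # one inner product per corpus page
    topk = np.argsort(-scores)[:k]        # indices of the k highest scores
    return topk, scores[topk]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 32)).astype("float32")    # 100 fake page embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # normalize, as assumed above
query = corpus[42]                                       # query identical to page 42
ids, scores = flat_search(corpus, query)
print(ids[0])  # 42 -- the matching page ranks first
```

Exhaustive search is exact but linear in corpus size; for page-level document corpora this is usually affordable, which is why the commands above use `Flat`.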
ALDEN uses a decoupled architecture: the environment (RAG tools) and the agent (RL training) run as separate processes.
First, launch the RAG environment server, which handles the `<search>` and `<fetch>` actions.
1. Get the Server IP:

   ```bash
   hostname --ip-address
   ```

   Take note of this IP address; you will need it when configuring the training script.

2. Start the Service:

   ```bash
   python rag_serving/serving.py \
       --config rag_serving/serving_config_single-vec.yaml \
       --num_retriever 8 \
       --port 42354
   ```

   or

   ```bash
   python rag_serving/serving.py \
       --config rag_serving/serving_config_multi-vec.yaml \
       --num_retriever 8 \
       --port 42354
   ```

We initially set two retrievers per GPU. Adjust the number of GPUs and retrievers in the YAML file to match your specific devices.
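Conceptually, the training process reaches this server over plain HTTP at the IP and port above. The endpoint path and payload schema below are illustrative assumptions only; see `rag_serving/serving.py` for the actual interface:

```python
import json
import urllib.request

# Hypothetical request to the RAG serving process started above.
# The "/search" path and the JSON fields are assumptions for illustration;
# rag_serving/serving.py defines the real endpoints and schema.
def build_search_request(server_ip: str, port: int, query: str, topk: int = 5):
    url = f"http://{server_ip}:{port}/search"
    payload = json.dumps({"query": query, "topk": topk}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

req = build_search_request("10.0.0.5", 42354, "total revenue in 2023")
print(req.full_url)  # http://10.0.0.5:42354/search
```

The training script only needs the resulting server URL; substitute the IP you recorded in Step 1.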
Once the tool server is running, start the training. Ensure the server URL in the training script points to the IP obtained in Step 1.
```bash
bash examples/baselines/qwen2_5_vl_7b_doc_agent_ppo.sh
```

To run inference on test sets:

```bash
bash examples/baselines/qwen2_5_vl_7b_doc_agent_generation.sh
```

To merge the trained actor checkpoint:

```bash
python scripts/model_merger.py \
    --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
```

If you find this project useful, please cite our paper:
```bibtex
@article{yang2025alden,
  title={ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents},
  author={Yang, Tianyu and Ruas, Terry and Tian, Yijun and Wahle, Jan Philip and Kurzawe, Daniel and Gipp, Bela},
  journal={arXiv preprint arXiv:2510.25668},
  year={2025}
}
```
This work is built upon the following excellent open-source projects:
- EasyR1: For the RL infrastructure.
- VAGEN: For visual agent baselines.
- verl: For efficient RL training.
- ReCall: For RAG integration concepts.
We greatly appreciate their valuable contributions to the community.
