ALDEN is a multi-modal reinforcement learning framework for Agentic Visually-Rich Document Understanding (A-VRDU). Built upon Qwen2.5-VL, it introduces a novel `fetch` action, a cross-level reward, and a visual semantic anchoring mechanism to enable efficient navigation and reasoning over long, high-resolution documents.
This repository contains the official implementation of our paper: *ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents*.
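For intuition, the `<search>` and `<fetch>` actions surface in the agent's generated text as tagged spans that the environment then executes. The tag grammar below is an assumption for illustration only; the authoritative parsing lives in this repository's code:

```python
import re

# Illustrative parser for the agent's action tags. Only the tag names
# <search>...</search> and <fetch>...</fetch> are taken from this README;
# the exact grammar ALDEN uses is defined in the codebase.
ACTION_RE = re.compile(r"<(search|fetch)>(.*?)</\1>", re.DOTALL)

def parse_actions(model_output: str):
    """Return (action, argument) pairs in the order they were generated."""
    return [(m.group(1), m.group(2).strip()) for m in ACTION_RE.finditer(model_output)]

rollout = "I need evidence. <search>2023 total revenue</search> Now <fetch>page 12</fetch>"
print(parse_actions(rollout))  # [('search', '2023 total revenue'), ('fetch', 'page 12')]
```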
Create the main training environment:

```bash
conda create -n alden python=3.10
conda activate alden
git clone https://github.com/gipplab/ALDEN.git
cd ./ALDEN
pip install -e .
```

Create the single-vector retriever environment:

```bash
conda create -n alden-sv python=3.10
conda activate alden-sv
cd ./ALDEN
pip install -r single-vec_retriever_requirements.txt
cd ./flashrag
pip install -e .
```

Create the multi-vector retriever environment:

```bash
conda create -n alden-mv python=3.10
conda activate alden-mv
cd ./ALDEN
pip install -r multi-vec_retriever_requirements.txt
cd ./flashrag
pip install -e .
```

We provide the processed training corpus on Hugging Face: SkyFishQ/ALDEN.
If you wish to build the corpus from scratch using your own data:

- Modify `raw_data_path` and `target_path` in `rag_serving/build_corpus.py`.
- Run the build script:

```bash
python rag_serving/build_corpus.py
```

We use flashrag to build the dense retrieval index for document images.
```bash
cd ./flashrag/flashrag/retriever
```

vdr-2b-v1 (single-vector, image modality):

```bash
python index_builder.py \
    --retrieval_method vdr-2b-v1 \
    --model_path llamaindex/vdr-2b-v1 \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 512 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal image \
    --sentence_transformer \
    --save_embedding
```

gte-Qwen2-1.5B-instruct (single-vector, text modality):

```bash
python index_builder.py \
    --retrieval_method gte-Qwen2-1.5B-instruct \
    --model_path Alibaba-NLP/gte-Qwen2-1.5B-instruct \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 4096 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal text \
    --sentence_transformer \
    --save_embedding
```

jina-colbert-v2 (multi-vector, text modality):

```bash
python index_builder.py \
    --retrieval_method jina-colbert-v2 \
    --model_path jinaai/jina-colbert-v2 \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 4096 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal text \
    --save_embedding
```

colqwen2-v1.0 (multi-vector, image modality):

```bash
python index_builder.py \
    --retrieval_method colqwen2-v1.0 \
    --model_path vidore/colqwen2-v1.0 \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 4096 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal image \
    --save_embedding
```

Note: Please replace `/path/to/your/...` with your actual file paths.
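For reference, `--faiss_type Flat` builds an exact, brute-force index: every query embedding is scored against every corpus embedding, with no quantization or approximate search. Under the assumption of L2-normalized embeddings scored by inner product, a flat index behaves like this NumPy sketch:

```python
import numpy as np

# Brute-force ("Flat") retrieval: score the query against every corpus
# vector and return the top-k. This mirrors what a faiss Flat index does,
# assuming L2-normalized embeddings scored by inner product.
def flat_search(corpus_emb: np.ndarray, query_emb: np.ndarray, k: int = 3):
    scores = corpus_emb @ query_emb       # one inner product per corpus page
    topk = np.argsort(-scores)[:k]        # indices of the k highest scores
    return topk, scores[topk]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 32)).astype("float32")    # 100 fake page embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # normalize, as assumed above
query = corpus[42]                                       # query identical to page 42
ids, scores = flat_search(corpus, query)
print(ids[0])  # 42 -- the matching page ranks first
```

Exhaustive search is exact but linear in corpus size; for page-level document corpora this is usually affordable, which is why the commands above use `Flat`.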
ALDEN uses a decoupled architecture: the environment (RAG tools) and the agent (RL training) run as separate processes.
First, launch the RAG environment server, which handles the `<search>` and `<fetch>` actions.
1. Get the Server IP:

   ```bash
   hostname --ip-address
   ```

   Take note of this IP address; you will need it when configuring the training script.

2. Start the Service:

   ```bash
   python rag_serving/serving.py \
       --config rag_serving/serving_config_single-vec.yaml \
       --num_retriever 8 \
       --port 42354
   ```

   or

   ```bash
   python rag_serving/serving.py \
       --config rag_serving/serving_config_multi-vec.yaml \
       --num_retriever 8 \
       --port 42354
   ```

We initially set two retrievers per GPU. Adjust the number of GPUs and retrievers in the YAML file to match your specific devices.
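Conceptually, the training process reaches this server over plain HTTP at the IP and port above. The endpoint path and payload schema below are illustrative assumptions only; see `rag_serving/serving.py` for the actual interface:

```python
import json
import urllib.request

# Hypothetical request to the RAG serving process started above.
# The "/search" path and the JSON fields are assumptions for illustration;
# rag_serving/serving.py defines the real endpoints and schema.
def build_search_request(server_ip: str, port: int, query: str, topk: int = 5):
    url = f"http://{server_ip}:{port}/search"
    payload = json.dumps({"query": query, "topk": topk}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

req = build_search_request("10.0.0.5", 42354, "total revenue in 2023")
print(req.full_url)  # http://10.0.0.5:42354/search
```

The training script only needs the resulting server URL; substitute the IP you recorded in Step 1.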
Once the tool server is running, start the training. Ensure the server URL in the training script points to the IP obtained in Step 1.
```bash
bash examples/baselines/qwen2_5_vl_7b_doc_agent_ppo.sh
```

To run inference on test sets:

```bash
bash examples/baselines/qwen2_5_vl_7b_doc_agent_generation.sh
```

To merge the trained actor checkpoint:

```bash
python scripts/model_merger.py \
    --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
```

If you find this project useful, please cite our paper:
```bibtex
@article{yang2025alden,
  title={ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents},
  author={Yang, Tianyu and Ruas, Terry and Tian, Yijun and Wahle, Jan Philip and Kurzawe, Daniel and Gipp, Bela},
  journal={arXiv preprint arXiv:2510.25668},
  year={2025}
}
```
This work is built upon the following excellent open-source projects:
- EasyR1: For the RL infrastructure.
- VAGEN: For visual agent baselines.
- verl: For efficient RL training.
- ReCall: For RAG integration concepts.
We greatly appreciate their valuable contributions to the community.
