SparseVideoNav: Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
Hai Zhang*
Siqi Liang*
Li Chen
Yuxian Li
Yukuan Xu
Yichao Zhong
Fu Zhang†
Hongyang Li†
The University of Hong Kong
SparseVideoNav introduces video generation models to real-world beyond-the-view vision-language navigation for the first time. It achieves sub-second trajectory inference with a sparse future spanning a 20-second horizon, yielding a remarkable 27× speed-up. Real-world zero-shot experiments show 2.5× higher success rate than state-of-the-art LLM baselines and mark the first realization in challenging night scenes.
Developers: Hai Zhang and Siqi Liang
Important
🌟 Stay up to date at opendrivelab.com!
- 🎉 2026-02-05: Project Page is now available!
- 🎉 2026-02-06: arXiv preprint is now available!
For further inquiries or assistance, please contact zhanghenryhai12138@gmail.com or liangsiqi@connect.hku.hk
- 📖 Introduction
- 📢 News
- 📬 Contact
- 🔥 Highlights
- 📝 TODO List
- 📄 License and Citation
- We investigate beyond-the-view navigation tasks in the real world by introducing video generation models to this field for the first time.
- We pioneer a paradigm shift from continuous to sparse video generation, enabling a longer prediction horizon.
- We achieve sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27× speed-up over the unoptimized counterpart.
- We achieve the first realization of beyond-the-view navigation in challenging night scenes with a 17.5% success rate.
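To make the continuous-to-sparse paradigm shift above concrete, here is a minimal illustrative sketch (not the released implementation): a sparse future covers the same 20-second horizon with far fewer predicted frames than a continuous rollout. The frame rate (10 FPS) and keyframe count (8) are hypothetical choices for illustration, not values from the paper.

```python
def frame_times(horizon_s: float, num_frames: int) -> list[float]:
    """Evenly spaced timestamps spanning [0, horizon_s]."""
    step = horizon_s / (num_frames - 1)
    return [round(i * step, 3) for i in range(num_frames)]

# Continuous generation at a hypothetical 10 FPS over a 20 s horizon:
dense = frame_times(20.0, 201)   # 201 frames to predict
# Sparse generation with a hypothetical 8 keyframes over the same horizon:
sparse = frame_times(20.0, 8)    # 8 frames to predict

print(len(dense), len(sparse))   # 201 8
print(sparse[0], sparse[-1])     # 0.0 20.0 — same horizon, far fewer frames
```

The point of the sketch: both schedules reach the same 20-second horizon, but the sparse one requires generating an order of magnitude fewer frames, which is what makes fast downstream trajectory inference feasible.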
- SparseVideoNav Paper Release.
- arXiv preprint is now available!
- SparseVideoNav Code Release.
- Inference code and model checkpoint of the distilled video generation model (estimated March 2026).
- Inference code and model checkpoint of the continuous action head (estimated Q3 2026).
- SparseVideoNav Dataset Release.
- ~140h of real-world VLN data (estimated Q3 2026).
All data and code in this repo are released under the CC BY-NC-SA 4.0 license.
- Please consider citing our work if it helps your research.
@article{zhang2026sparse,
title={Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation},
author={Zhang, Hai and Liang, Siqi and Chen, Li and Li, Yuxian and Xu, Yukuan and Zhong, Yichao and Zhang, Fu and Li, Hongyang},
journal={arXiv preprint arXiv:2602.05827},
year={2026}
}