SOCIAL MEDIA TITLE TAG

Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning

Zixuan Wang^{1, 2}, Huang Fang¹, Shaoan Wang^{1, 3},

Yuanfei Luo¹, Heng Dong¹, Wei Li^{1, †}, Yiming Gan^{4, †}

¹ByteDance Seed ²CASIA ³PKU ⁴ICT

Abstract

While large vision-language models (VLMs) show promise for object goal navigation, current methods still struggle with low success rates and inefficient localization of unseen objects—failures primarily attributed to weak temporal-spatial reasoning. Meanwhile, recent attempts to inject reasoning into VLM-based agents improve success rates but incur substantial computational overhead. To address both the ineffectiveness and inefficiency of existing approaches, we introduce Hydra-Nav, a unified VLM architecture that adaptively switches between a deliberative "slow system" for analyzing exploration history and formulating high-level plans, and a reactive "fast system" for efficient execution. We train Hydra-Nav through a three-stage curriculum: (i) spatial-action alignment to strengthen trajectory planning, (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long-horizon exploration, and (iii) iterative rejection fine-tuning to enable selective reasoning at critical decision points. Extensive experiments demonstrate that Hydra-Nav achieves state-of-the-art performance on the HM3D, MP3D, and OVON benchmarks, outperforming the second-best methods by 11.1%, 17.4%, and 21.2%, respectively. Furthermore, we introduce SOT (Success weighted by Operation Time), a new metric to measure search efficiency across VLMs with varying reasoning intensity. Results show that adaptive reasoning significantly enhances search efficiency over fixed-frequency baselines.

Overview of Hydra-Nav

Key Features of Hydra-Nav:

🌟 Adaptive Dual-Process Reasoning. Hydra-Nav is a unified VLM architecture that mimics human cognition by adaptively switching between a deliberative "slow system" (for analyzing exploration history and high-level planning) and a reactive "fast system" (for efficient low-level execution), balancing high success rates with computational efficiency.
🌟 Three-Stage Curriculum Learning. We propose a progressive training pipeline: (i) spatial-action alignment to strengthen trajectory planning; (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long horizons; and (iii) iterative rejection fine-tuning (IRFT) to enable the agent to autonomously learn when to trigger for reasoning.
🌟 SOTA Performance & Efficiency. Hydra-Nav achieves state-of-the-art results on HM3D, MP3D, and OVON benchmarks, outperforming second-best methods by 11.1%, 17.4%, and 21.2% respectively. It also demonstrates significantly improved search efficiency under the new SOT (Success weighted by Operation Time) metric.

Architecture of Hydra-Nav

As illustrated in Figure 1, Hydra-Nav unifies high-level planning and low-level control within a single VLM architecture.

🐢 Slow System (Deliberative Reasoning): The cycle initiates here. The model analyzes the global history and constructs a structured long-term memory. It generates a detailed Chain-of-Thought (CoT) reasoning to formulate a high-level navigation plan.
🐇 Fast System (Reactive Execution): Responsible for efficient execution. Crucially, it is guided by the reasoning generated by the slow system (injected as the system prompt). It operates as a multi-turn dialogue to autoregressively decode low level meta actions that strictly follow the high-level plan.
🔄 Adaptive Switching: The transition is self-triggered. When the agent completes a sub-goal or the current observation invalidates the existing plan (e.g., getting stuck), it outputs a special obs token to trigger the slow system for reasoning.

Results

🌟 Robust Zero-Shot Generalization. Hydra-Nav achieves a +21.1% gain on the challenging OVON Val-Unseen benchmark, effectively leveraging VLM knowledge and reasoning capability to handle unseen objects without prior training.

🌟 Effective Curriculum Learning. Hydra-Nav validates a three-stage pipeline (Base → SFT → IRFT) that integrates temporal-spatial reasoning and memory, boosting performance by 33.1% on HM3D during the SFT stage.

🌟 Cost-Efficient Adaptive Reasoning. Unlike existing VLM agents that trigger reasoning at every step (dense reasoning), Hydra-Nav activates "slow thinking" only at critical stagnation points. This selective mechanism reduces reasoning frequency to just 3.0% of per-step baselines while doubling the SOT.

BibTeX

@article{wang2026hydranav, title={Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning}, author={Wang, Zixuan and Fang, Huang and Wang, Shaoan and Luo, Yuanfei and Dong, Heng and Li, Wei and Gan, Yiming}, year={2026}, journal={arXiv pre-print}, url={https://arxiv.org/abs/2602.09972} }

Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning

Real-world Experiments

Visualization in simulator

Abstract

Overview of Hydra-Nav

Key Features of Hydra-Nav:

Architecture of Hydra-Nav

Figure 1. The architecture of Hydra-Nav. Hydra-Nav receives user instruction, long-term memory, and previous image-action pairs, then outputs reasoning (optionally) and meta-actions. Hydra-Nav adaptively switches between the fast and slow systems by outputting the special transition token obs.

Evaluation Results

Figure 2. Performance analysis of multi-turn IRFT across different benchmarks.

Figure 3. Reasoning Frequency analysis of multi-turn IRFT across different.

Results

BibTeX