Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning

Zixuan Wang1, 2, Huang Fang1, Shaoan Wang1, 3,
Yuanfei Luo1, Heng Dong1, Wei Li1, †, Yiming Gan4, †
1ByteDance Seed       2CASIA       3PKU       4ICT      
†Corresponding authors: liwei.85@bytedance.com, ganyiming@ict.ac.cn

Abstract

While large vision-language models (VLMs) show promise for object goal navigation, current methods still struggle with low success rates and inefficient localization of unseen objects—failures primarily attributed to weak temporal-spatial reasoning. Meanwhile, recent attempts to inject reasoning into VLM-based agents improve success rates but incur substantial computational overhead. To address both the ineffectiveness and inefficiency of existing approaches, we introduce Hydra-Nav, a unified VLM architecture that adaptively switches between a deliberative "slow system" for analyzing exploration history and formulating high-level plans, and a reactive "fast system" for efficient execution. We train Hydra-Nav through a three-stage curriculum: (i) spatial-action alignment to strengthen trajectory planning, (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long-horizon exploration, and (iii) iterative rejection fine-tuning to enable selective reasoning at critical decision points. Extensive experiments demonstrate that Hydra-Nav achieves state-of-the-art performance on the HM3D, MP3D, and OVON benchmarks, outperforming the second-best methods by 11.1%, 17.4%, and 21.2%, respectively. Furthermore, we introduce SOT (Success weighted by Operation Time), a new metric to measure search efficiency across VLMs with varying reasoning intensity. Results show that adaptive reasoning significantly enhances search efficiency over fixed-frequency baselines.

Overview of Hydra-Nav

Key Features of Hydra-Nav:

  • 🌟 Adaptive Dual-Process Reasoning. Hydra-Nav is a unified VLM architecture that mimics human cognition by adaptively switching between a deliberative "slow system" (for analyzing exploration history and high-level planning) and a reactive "fast system" (for efficient low-level execution), balancing high success rates with computational efficiency.
  • 🌟 Three-Stage Curriculum Learning. We propose a progressive training pipeline: (i) spatial-action alignment to strengthen trajectory planning; (ii) memory-reasoning integration to enhance temporal-spatial reasoning over long horizons; and (iii) iterative rejection fine-tuning (IRFT) to teach the agent when to autonomously trigger reasoning (a minimal sketch of one IRFT round follows this list).
  • 🌟 SOTA Performance & Efficiency. Hydra-Nav achieves state-of-the-art results on HM3D, MP3D, and OVON benchmarks, outperforming second-best methods by 11.1%, 17.4%, and 21.2% respectively. It also demonstrates significantly improved search efficiency under the new SOT (Success weighted by Operation Time) metric.
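
To make stage (iii) concrete, below is a minimal, hypothetical Python sketch of one IRFT round. The helpers rollout, is_success, and finetune are illustrative assumptions for exposition, not functions from the Hydra-Nav codebase.

# Hypothetical sketch of iterative rejection fine-tuning (IRFT).
# rollout(), is_success(), and finetune() are assumed stand-ins,
# not the paper's actual training code.
def irft(model, episodes, rounds=3, rollouts_per_episode=8):
    for _ in range(rounds):
        accepted = []
        for ep in episodes:
            for _ in range(rollouts_per_episode):
                # A trajectory interleaves meta-actions with the points
                # where the model chose to emit the obs token and reason.
                traj = rollout(model, ep)
                if is_success(traj, ep.goal):
                    accepted.append(traj)  # accept successes, reject failures
        # Fine-tuning on accepted trajectories reinforces reasoning that was
        # triggered at decision points which actually led to success.
        model = finetune(model, accepted)
    return model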

Architecture of Hydra-Nav

As illustrated in Figure 1, Hydra-Nav unifies high-level planning and low-level control within a single VLM architecture.

  • 🐢 Slow System (Deliberative Reasoning): The cycle initiates here. The model analyzes the global history and constructs a structured long-term memory. It generates a detailed Chain-of-Thought (CoT) reasoning to formulate a high-level navigation plan.
  • 🐇 Fast System (Reactive Execution): Responsible for efficient execution. Crucially, it is guided by the reasoning generated by the slow system (injected as the system prompt). It operates as a multi-turn dialogue, autoregressively decoding low-level meta-actions that strictly follow the high-level plan.
  • 🔄 Adaptive Switching: The transition is self-triggered. When the agent completes a sub-goal or the current observation invalidates the existing plan (e.g., getting stuck), it outputs a special obs token to trigger the slow system for reasoning (a minimal pseudocode sketch of this loop follows Figure 1).


Figure 1. The architecture of Hydra-Nav. Hydra-Nav receives user instruction, long-term memory, and previous image-action pairs, then outputs reasoning (optionally) and meta-actions. Hydra-Nav adaptively switches between the fast and slow systems by outputting the special transition token obs.
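
For readers who prefer pseudocode, below is a minimal, hypothetical Python sketch of the adaptive switching loop in Figure 1. OBS_TOKEN, slow_reason, fast_act, summarize, and env are illustrative assumptions, not the paper's actual interfaces.

# Hypothetical sketch of Hydra-Nav's adaptive dual-process loop.
OBS_TOKEN = "<obs>"  # special transition token emitted by the fast system

def navigate(model, env, instruction, max_steps=500):
    memory, history = [], []  # long-term memory; previous image-action pairs
    obs = env.reset()
    # Slow system initiates the cycle: CoT reasoning -> high-level plan.
    plan = slow_reason(model, instruction, memory, history)
    for _ in range(max_steps):
        # Fast system: reactive decoding of a low-level meta-action,
        # conditioned on the slow system's plan (injected as system prompt).
        action = fast_act(model, plan, obs, history)
        if action == OBS_TOKEN:
            # Self-triggered switch: sub-goal completed or plan invalidated.
            memory.append(summarize(history))  # update long-term memory
            plan = slow_reason(model, instruction, memory, history)
            continue
        obs, done = env.step(action)
        history.append((obs, action))
        if done:
            return True
    return False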

Evaluation Results

We evaluate Hydra-Nav on three standard benchmarks: HM3D and MP3D for object goal navigation, and OVON for open-vocabulary object navigation tasks. Following standard protocols, we report performance using Success Rate (SR) and Success weighted by Path Length (SPL).
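
For reference, SPL is the standard efficiency-weighted success metric (Anderson et al., 2018); SOT follows the same pattern but weights success by operation time rather than path length. The SOT expression below is a hedged reading based on the metric's name and the SPL analogy, not the paper's verbatim definition.

% SPL (standard definition): S_i is the success indicator for episode i,
% \ell_i the shortest-path length to the goal, p_i the path actually taken.
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)}

% SOT (assumed analogous form; the paper's exact definition may differ):
% T_i is the total operation time of episode i (including reasoning),
% T_i^{*} a reference time for the episode.
\mathrm{SOT} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{T_i^{*}}{\max(T_i, T_i^{*})}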



Table 1. Comparison with state-of-the-art methods on the HM3D, MP3D, and OVON benchmarks. Attribute columns: Low = low-level action output; High = high-level planning/action output; Dual = dual-system architecture; RGB/Depth = uses RGB/Depth observations. The best and second-best results are shown in bold and underline, respectively.




Figure 2. Performance analysis of multi-turn IRFT across different benchmarks.




Figure 3. Reasoning frequency analysis of multi-turn IRFT across different benchmarks.


BibTeX

@article{wang2026hydranav,
  title={Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning},
  author={Wang, Zixuan and Fang, Huang and Wang, Shaoan and Luo, Yuanfei and Dong, Heng and Li, Wei and Gan, Yiming},
  year={2026},
  journal={arXiv preprint arXiv:2602.09972},
  url={https://arxiv.org/abs/2602.09972}
}