Pano-Native Spatial Supersensing

PanoWorld: Towards Spatial Supersensing in 360° Panorama World

A pano-native multimodal learning framework that teaches VLMs to perceive and reason directly over complete 360° ERP panoramas as continuous observer-centered worlds.

Changpeng Wang¹, Xin Lin², Junhan Liu¹, Yuheng Liu³, Zhen Wang¹, Donglian Qi¹, Yunfeng Yan¹, Xi Chen⁴
¹Zhejiang University; ²University of California, San Diego; ³University of California, Irvine; ⁴The University of Hong Kong

PanoWorld teaser figure

Existing MLLMs reason over fragmented local views, which makes it difficult to associate spatial cues across the full 360° surround. We introduce pano-native supersensing, which teaches VLMs to perceive and reason directly over 360° panoramas, providing a unified full-surround representation for downstream tasks such as human-centric visual search, omnidirectional 3D spatial reasoning, and panoramic navigation.

Core Idea

From perspective-view exploration to pano-native supersensing.

PanoWorld treats an ERP panorama as one continuous, observer-centered world. Instead of rotating through partial perspective views, the model can associate objects, directions, depth, and navigation cues across the full 360° field of view.

Before: Perspective-view exploration

Sequential local views make global direction and seam context hard to maintain.

PanoWorld: Full-surround ERP reasoning

Spherical geometry aligns the visual stream with 360° space.

After: Downstream supersensing

Search, spatial reasoning, and navigation share one world model.

Abstract

Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360° panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. We define key abilities for pano-native understanding, build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and introduce PanoWorld with Spherical Spatial Cross-Attention. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms proprietary and open-source baselines on PanoSpace-Bench, H*Bench, and R2R-CE Val-Unseen.

What Enables Supersensing

The PanoWorld Supersensing Stack

01. Pano-Native Ability Foundation

We formalize the core abilities required for panoramic understanding: semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning over observer-centered ERP space.

02. Verifiable 570K ERP Data Engine

A metadata-driven construction pipeline converts mixed-source panoramas into geometry-aware, language-grounded, and depth-aware supervision through geometric and semantic verification.

03. PanoSpace-Bench Diagnostic Suite

A benchmark designed for ERP-native spatial reasoning, measuring spherical grounding, reference-frame transformation, 3D relations, seam continuity, and topology-sensitive understanding.

04. PanoWorld Geometry-Aware MLLM

PanoWorld injects spherical geometry into the visual stream via Spherical Spatial Cross-Attention, enabling pano-aware reasoning while preserving the pretrained vision-language backbone.

Dataset and Benchmark

PanoWorld combines a large-scale training resource with a benchmark designed specifically for ERP-native spatial reasoning.

PanoWorld dataset pipeline
Verifiable metadata construction pipeline. We collect mixed-source ERP panoramas, perform perspective-view detection followed by ERP reprojection and cross-view geometric verification, and enrich verified entities with semantic metadata through MLLM annotation and description-guided referring re-detection. Depth cues are then associated with each entity to build a structured metadata graph, from which both training data and PanoSpace-Bench are derived.
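The ERP reprojection step mentioned in this caption can be illustrated with a short sketch. It is not the authors' implementation: it assumes a pinhole virtual camera defined by a horizontal field of view and a view yaw and pitch, an ERP image whose center column corresponds to yaw 0, and pitch positive upward. The function name `perspective_pixel_to_erp` and all parameters are hypothetical.

```python
import math

def perspective_pixel_to_erp(px, py, view_w, view_h, fov_deg,
                             view_yaw, view_pitch, erp_w, erp_h):
    """Map a pixel of a virtual perspective view to ERP pixel coordinates.

    Assumed conventions (not taken from the paper): camera axes are
    x right, y up, z forward; angles are in radians; yaw 0 is the ERP center.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    f = (view_w / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    # Ray through the pixel in camera coordinates (image y grows downward).
    x, y, z = px - view_w / 2.0, -(py - view_h / 2.0), f
    # Tilt by the view pitch: rotation about the x axis, positive = up.
    y, z = (y * math.cos(view_pitch) + z * math.sin(view_pitch),
            -y * math.sin(view_pitch) + z * math.cos(view_pitch))
    # Turn by the view yaw: rotation about the vertical axis, positive = right.
    x, z = (x * math.cos(view_yaw) + z * math.sin(view_yaw),
            -x * math.sin(view_yaw) + z * math.cos(view_yaw))
    # Convert the rotated ray to spherical angles.
    norm = math.sqrt(x * x + y * y + z * z)
    yaw = math.atan2(x, z)
    pitch = math.asin(y / norm)
    # Convert angles to ERP pixel coordinates, wrapping u at the seam.
    u = ((yaw / (2.0 * math.pi) + 0.5) * erp_w) % erp_w
    v = (0.5 - pitch / math.pi) * erp_h
    return u, v
```

Applying the same mapping to the corners of a detected box from several overlapping views gives ERP-space boxes that could then be compared for cross-view agreement.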

Why PanoWorld Data Matters

| Resource | Panoramas | Depth / 3D | Entity Metadata | Scalable Annotation | Verified Graph |
|---|---|---|---|---|---|
| Dense360 | 160K | No | Yes | Yes | Partial |
| OSR-Bench | 4.1K | Partial | Partial | No | No |
| PanoEnv | 595 | Yes | Partial | No | No |
| PanoWorld | 570K | Yes | Yes | Yes | Yes |

PanoWorld Architecture

PanoWorld adapts Qwen3.5-VL into a pano-aware MLLM by injecting spherical geometry into the visual stream before deep visual encoding.

PanoWorld architecture
The architecture of PanoWorld. After patch embedding, visual tokens query spherical spatial tokens derived from ERP patch centers, producing a geometry-aware signal that is fused through a gated residual update. The enhanced tokens are then fed into the remaining pretrained visual encoder, enabling pano-aware spatial reasoning while preserving the original backbone.
Spherical token construction
$s_i = \mathrm{MLP}(\gamma(\lambda_i, \phi_i)) \in \mathbb{R}^{d}$

Each ERP patch center $(u_i, v_i)$ is mapped to a yaw-pitch direction $(\lambda_i, \phi_i)$ and encoded as a spherical spatial token.
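As a rough illustration of this construction, the sketch below maps patch centers to yaw and pitch and applies a sinusoidal encoding. Several details are assumptions rather than facts from the source: the angle convention (yaw 0 at the image center, pitch positive upward), the choice of Fourier features for γ, and the helper names. The learned MLP is omitted.

```python
import math

def patch_centers(width, height, patch):
    """Pixel coordinates of the center of every patch, in row-major order."""
    return [((col + 0.5) * patch, (row + 0.5) * patch)
            for row in range(height // patch)
            for col in range(width // patch)]

def erp_to_yaw_pitch(u, v, width, height):
    """Map ERP pixel coordinates to (yaw, pitch) in radians.

    Assumed convention: yaw in [-pi, pi) with 0 at the image center,
    pitch in [-pi/2, pi/2] with positive values above the horizon.
    """
    yaw = (u / width - 0.5) * 2.0 * math.pi
    pitch = (0.5 - v / height) * math.pi
    return yaw, pitch

def fourier_encode(yaw, pitch, num_freqs=4):
    """Hypothetical stand-in for gamma: sin/cos features at octave frequencies."""
    feats = []
    for k in range(num_freqs):
        freq = 2.0 ** k
        for angle in (yaw, pitch):
            feats.append(math.sin(freq * angle))
            feats.append(math.cos(freq * angle))
    return feats

# One encoded direction per patch; a learned MLP would map each to dimension d.
W, H, P = 64, 32, 8
encoded = [fourier_encode(*erp_to_yaw_pitch(u, v, W, H))
           for u, v in patch_centers(W, H, P)]
```

Because the frequencies are integers, the yaw features take the same values at -π and +π, so this particular encoding is continuous across the left and right image borders.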

Spherical token stack
$S = [s_1, \ldots, s_N] \in \mathbb{R}^{N \times d}$

The resulting sequence preserves observer-centered geometry aligned with the ERP representation.

SSCA fusion
$A = \mathrm{MHA}(\mathrm{LN}(H^{(0)}), \mathrm{LN}(S), \mathrm{LN}(S))$

Visual patch tokens query spherical spatial tokens through cross-attention to retrieve geometry-aware signals.

Gated residual update
$\tilde{H}^{(0)} = H^{(0)} + \alpha \odot A$

A learnable gate controls how much spherical geometry is injected before the remaining visual blocks.
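The cross-attention and gated update described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the PanoWorld code: it uses a single attention head, no learned query/key/value projections, layer normalization without learnable scale or shift, and a per-channel gate vector `alpha`. All function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (simplification of MHA)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def ssca_update(h0, s, alpha):
    """Gated residual update: h0 + alpha * attention(LN(h0), LN(s), LN(s))."""
    q = [layer_norm(h) for h in h0]
    kv = [layer_norm(t) for t in s]
    a = cross_attention(q, kv, kv)
    return [[h_j + g * a_j for h_j, g, a_j in zip(h, alpha, a_row)]
            for h, a_row in zip(h0, a)]
```

With `alpha` set to zeros the update returns the visual tokens unchanged, which is one way a gate of this kind can leave a pretrained backbone's behavior intact at initialization.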

Main Results

PanoWorld substantially improves panoramic spatial reasoning on the proposed benchmark and transfers to downstream panoramic and navigation tasks.

PanoSpace-Bench Overall: 56.5, vs. 30.8 for the Qwen3.5-9B panoramic baseline.

H*Bench ERP Overall: 70.1 with H* SFT, improving both holistic object sensing (HOS) and holistic position sensing (HPS).

R2R-CE Success Rate: 54.3 on Val-Unseen, vs. 50.2 for StreamVLN, the strongest listed baseline.

R2R-CE SPL: 52.1 on Val-Unseen, vs. 47.1 for StreamVLN, indicating better path efficiency while preserving high navigation success.

PanoSpace-Bench

ERP-native spatial reasoning across spherical localization, 3D relations, and seam-aware perception.

Diagnostic benchmark
| Method | Overall | Abs. Dir. | BFOV | Spherical Relation Avg. | 3D Spatial Avg. | Seam |
|---|---|---|---|---|---|---|
| GPT-4o | 31.8 | 37.2 | 17.7 | 29.3 | 36.4 | 37.6 |
| Mimo-v2.5 | 37.2 | 26.8 | 0.74 | 42.3 | 37.6 | 45.6 |
| Qwen3.5-9B | 30.8 | 25.2 | 1.41 | 26.1 | 36.9 | 41.2 |
| Qwen3.5-9B + visual prompt | 36.4 | 55.2 | 4.9 | 33.1 | 36.1 | 46.5 |
| PanoWorld | 56.5 | 93.7 | 73.3 | 47.4 | 49.8 | 65.5 |
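One way to see why seam cases are a distinct difficulty: two points near opposite image borders are far apart in pixels but close on the sphere. The sketch below computes the angular distance between two ERP locations. The angle convention (yaw 0 at the image center) and the helper names are assumptions for illustration, not the benchmark's scoring code.

```python
import math

def column_to_yaw(u, width):
    """ERP column to yaw in radians, with yaw 0 at the image center (assumed)."""
    return (u / width - 0.5) * 2.0 * math.pi

def angular_distance(yaw_a, pitch_a, yaw_b, pitch_b):
    """Great-circle angle between two viewing directions, in radians."""
    cos_angle = (math.sin(pitch_a) * math.sin(pitch_b)
                 + math.cos(pitch_a) * math.cos(pitch_b) * math.cos(yaw_a - yaw_b))
    # Clamp against rounding error before taking the arc cosine.
    return math.acos(max(-1.0, min(1.0, cos_angle)))

# Two points on the horizon, 8 columns inside the left and right borders
# of a 2048-pixel-wide panorama: 2032 columns apart in the image.
width = 2048
gap = angular_distance(column_to_yaw(8, width), 0.0,
                       column_to_yaw(width - 8, width), 0.0)
gap_degrees = math.degrees(gap)  # about 2.8 degrees on the sphere
```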

H*Bench

Holistic panorama sensing under both perspective-view and ERP panorama evaluation settings.

Holistic sensing
| Method | Overall | HOS | HPS | Yaw | Pitch |
|---|---|---|---|---|---|
| GPT-4o ERP | 30.1 | 39.1 | 17.1 | 38.5 | 64.2 |
| Gemini-2.5-Pro ERP | 46.9 | 55.3 | 34.3 | 52.5 | 71.6 |
| Qwen3.5-9B ERP | 19.4 | 26.2 | 9.3 | 23.5 | 46.5 |
| Qwen3.5 + visual prompt | 40.4 | 46.0 | 32.0 | 43.5 | 52.0 |
| PanoWorld + H* SFT | 70.1 | 73.1 | 64.2 | 74.1 | 85.5 |

Navigation Transfer

R2R-CE Val-Unseen results show that pano-native visual representations transfer to embodied navigation.

R2R-CE Val-Unseen
| Method | NE ↓ | OSR ↑ | SR ↑ | SPL ↑ |
|---|---|---|---|---|
| HPN + DN | 6.31 | 40.0 | 36.0 | 34.0 |
| GridMM | 5.11 | 61.0 | 49.0 | 41.0 |
| StreamVLN | 5.73 | 56.4 | 50.2 | 47.1 |
| NaVIDA | 5.72 | 57.4 | 47.7 | 41.5 |
| PanoWorld-VLN | 4.98 | 59.3 | 54.3 | 52.1 |
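For readers unfamiliar with the SR and SPL columns, the sketch below computes both from per-episode records using the standard definition of Success weighted by Path Length. The episode tuple layout and function names are hypothetical, the example episodes are made up for illustration, and shortest-path lengths are assumed to be positive.

```python
def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)

def spl(episodes):
    """Success weighted by Path Length.

    Each episode is (success, shortest_path_length, agent_path_length).
    A successful episode contributes shortest / max(agent, shortest),
    so detours lower the score; failed episodes contribute 0.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Illustrative episodes only, not data from the paper.
episodes = [
    (True, 10.0, 10.0),   # success along the shortest path
    (True, 10.0, 20.0),   # success with a path twice as long
    (False, 10.0, 5.0),   # failure
]
```

Since each successful episode contributes at most 1, SPL can never exceed SR, which matches every row of the table.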

Task Demos and Visual Results

Qualitative examples across PanoSpace-Bench, H*Bench, and navigation show how pano-native learning supports localization, holistic sensing, and embodied transfer.

PanoSpace-Bench Spatial Reasoning

The benchmark probes spherical localization, 3D spatial relations, viewpoint transformation, and object reorientation.

PanoWorld 3D relation reasoning result
Representative PanoSpace-Bench 3D relation case. The model must reason over an observer-centered ERP panorama to compare target-object positions and infer relative depth-aware spatial relations rather than relying on a cropped local view.
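The geometry behind a depth-aware relation of this kind can be sketched by lifting each object's direction and depth into observer-centered 3D coordinates. The axis convention (x right, y up, z forward), the treatment of depth as range along the viewing ray, and the function names are assumptions for illustration.

```python
import math

def to_cartesian(yaw, pitch, depth):
    """Observer-centered 3D point from yaw, pitch (radians) and range.

    Assumes depth is the Euclidean distance along the viewing ray.
    """
    x = depth * math.cos(pitch) * math.sin(yaw)
    y = depth * math.sin(pitch)
    z = depth * math.cos(pitch) * math.cos(yaw)
    return x, y, z

def compare_objects(a, b):
    """Compare two (yaw, pitch, depth) observations of different objects."""
    pa, pb = to_cartesian(*a), to_cartesian(*b)
    return {
        "a_is_nearer": a[2] < b[2],          # nearer to the observer
        "a_is_higher": pa[1] > pb[1],        # higher in the observer frame
        "separation": math.dist(pa, pb),     # straight-line gap between objects
    }

# Object A straight ahead at 1 m, object B directly to the right at 2 m.
relation = compare_objects((0.0, 0.0, 1.0), (math.pi / 2, 0.0, 2.0))
```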
PanoWorld camera rotation reasoning result
Camera-rotation reasoning case. Given a hypothetical change in observer orientation, PanoWorld predicts where the target would appear in the transformed reference frame, testing observer-centered spherical reasoning.
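For a rotation about the vertical axis, the expected answer in such a case reduces to a wrapped yaw shift. The sketch below shows that arithmetic. It assumes yaw 0 at the ERP center, a positive turn meaning the observer rotates to the right, and no change in pitch; the function names are hypothetical.

```python
import math

def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def rotate_observer_yaw(target_yaw, observer_turn):
    """Yaw of a fixed target after the observer turns by observer_turn."""
    return wrap_angle(target_yaw - observer_turn)

def yaw_to_column(yaw, width):
    """ERP column for a yaw angle, with yaw 0 at the image center."""
    return ((yaw / (2.0 * math.pi) + 0.5) * width) % width

# A target 170 degrees to the right; the observer then turns 20 degrees left.
new_yaw = rotate_observer_yaw(math.radians(170.0), math.radians(-20.0))
new_yaw_degrees = math.degrees(new_yaw)  # crosses the seam to about -170
```

The target ends up on the opposite side of the image border, which is why a hypothetical rotation can move an object across the ERP seam even though nothing in the scene has moved.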
PanoWorld object reorientation result
Object-conditioned reorientation case. The model tracks how a target person's direction changes under a new observer-facing frame, requiring consistent spatial grounding across ERP distortion and full-surround context.

H*Bench Holistic Sensing

H*Bench examples test holistic object sensing and holistic position sensing on panoramic scenes.

Perspective-view search compared with pano-native H star reasoning
Case comparison on H*Bench. Perspective-view iterative search is inefficient and may fail due to fragmented local observations, whereas direct ERP input enables holistic reasoning and correct prediction in one step.
PanoWorld H star holistic sensing result
Human-centric visual search example. PanoWorld directly reasons over the full ERP panorama to infer the next movement direction, supporting practical 360° search without decomposing the scene into fragmented local views.
PanoWorld H star position sensing result
Human-centric visual search example. Full-surround perception lets the model use global layout and target-position cues to select the next action from a single panoramic observation.

Navigation Transfer

Compared with RGB perspective-view navigation, panoramic input gives the agent full-surround context at each step. This reduces blind-spot exploration and helps ground instructions in global scene layout.

Panoramic navigation demo. The agent observes the full surrounding scene and the top-down trajectory simultaneously, allowing it to ground language instructions with fewer blind-spot ambiguities than narrow RGB perspective-view navigation.
Panoramic navigation demo. Full-surround ERP observations expose long-range layout cues and route alternatives in a single view, supporting more efficient instruction following in continuous environments.

Citation

@article{panoworld2026,
  title   = {PanoWorld: Towards Spatial Supersensing in 360° Panorama World},
  author  = {Wang, Changpeng and Lin, Xin and Liu, Junhan and Liu, Yuheng and Wang, Zhen and Qi, Donglian and Yan, Yunfeng and Chen, Xi},
  journal = {arXiv preprint arXiv:2605.13169},
  year    = {2026}
}