PanoWorld Project Page

Existing MLLMs reason over fragmented local views, which makes it difficult to associate spatial cues across the full 360° surround. We introduce pano-native supersensing, which teaches MLLMs to perceive and reason directly over 360° panoramas, providing a unified full-surround representation for downstream tasks such as human-centric visual search, omnidirectional 3D spatial reasoning, and panoramic navigation.

Core Idea

From perspective-view exploration to pano-native supersensing.

PanoWorld treats an equirectangular projection (ERP) panorama as one continuous, observer-centered world. Instead of rotating through partial perspective views, the model can associate objects, directions, depth, and navigation cues across the full 360° field of view.

Before: Perspective-view exploration

Sequential local views make global direction and seam context hard to maintain.

→

PanoWorld: Full-surround ERP reasoning

Spherical geometry aligns the visual stream with 360° space.

→

After: Downstream supersensing

Search, spatial reasoning, and navigation share one world model.

Abstract

Multimodal large language models still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360° panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. We define key abilities for pano-native understanding, build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and introduce PanoWorld with Spherical Spatial Cross-Attention. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms proprietary and open-source baselines on PanoSpace-Bench, H*Bench, and R2R-CE Val-Unseen benchmarks.

What Enables Supersensing

The PanoWorld Supersensing Stack

01

Pano-Native Ability Foundation

We formalize the core abilities required for panoramic understanding: semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning over observer-centered ERP space.

02

Verifiable 570K ERP Data Engine

A metadata-driven construction pipeline converts mixed-source panoramas into geometry-aware, language-grounded, and depth-aware supervision through geometric and semantic verification.

03

PanoSpace-Bench Diagnostic Suite

A benchmark designed for ERP-native spatial reasoning, measuring spherical grounding, reference-frame transformation, 3D relations, seam continuity, and topology-sensitive understanding.

04

PanoWorld Geometry-Aware MLLM

PanoWorld injects spherical geometry into the visual stream via Spherical Spatial Cross-Attention, enabling pano-aware reasoning while preserving the pretrained vision-language backbone.

Dataset and Benchmark

PanoWorld combines a large-scale training resource with a benchmark designed specifically for ERP-native spatial reasoning.

PanoWorld dataset pipeline — Verifiable metadata construction pipeline. We collect mixed-source ERP panoramas, perform perspective-view detection followed by ERP reprojection and cross-view geometric verification, and enrich verified entities with semantic metadata through MLLM annotation and description-guided referring re-detection. Depth cues are then associated with each entity to build a structured metadata graph, from which both training data and PanoSpace-Bench are derived.
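The reprojection step in this pipeline can be illustrated with a small geometric sketch: a pixel detected in a virtual perspective view is turned into a ray, rotated by the view's orientation, and read back as a yaw-pitch direction and an ERP pixel. This is not the project's code. The function names, the pinhole model, the axis convention (x right, y up, z forward, positive yaw to the right), and the ERP layout (yaw 0 at the image center) are all assumptions made for illustration.

```python
import math

def perspective_pixel_to_erp(x, y, width, height, hfov_deg,
                             view_yaw_deg, view_pitch_deg):
    """Map a perspective-view pixel to a (yaw, pitch) direction in degrees.

    Hypothetical helper: assumes a pinhole view rendered from the panorama
    with the given horizontal field of view and view orientation.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    focal = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    # Ray in the camera frame: x right, y up, z forward.
    rx, ry, rz = x - width / 2.0, -(y - height / 2.0), focal
    norm = math.sqrt(rx * rx + ry * ry + rz * rz)
    rx, ry, rz = rx / norm, ry / norm, rz / norm
    # Tilt the ray up by the view pitch (rotation about the x axis).
    p = math.radians(view_pitch_deg)
    ry, rz = (ry * math.cos(p) + rz * math.sin(p),
              -ry * math.sin(p) + rz * math.cos(p))
    # Turn the ray right by the view yaw (rotation about the up axis).
    t = math.radians(view_yaw_deg)
    rx, rz = (rx * math.cos(t) + rz * math.sin(t),
              -rx * math.sin(t) + rz * math.cos(t))
    yaw = math.degrees(math.atan2(rx, rz))
    pitch = math.degrees(math.asin(max(-1.0, min(1.0, ry))))
    return yaw, pitch

def erp_direction_to_pixel(yaw_deg, pitch_deg, erp_width, erp_height):
    """Map a (yaw, pitch) direction to ERP pixel coordinates (u right, v down)."""
    u = (yaw_deg / 360.0 + 0.5) * erp_width
    v = (0.5 - pitch_deg / 180.0) * erp_height
    # Columns wrap around the seam; rows are clamped at the poles.
    return u % erp_width, min(max(v, 0.0), float(erp_height))

# Center of a 90-degree-FOV view that looks 90 degrees to the right.
yaw, pitch = perspective_pixel_to_erp(320, 240, 640, 480, 90, 90, 0)
u, v = erp_direction_to_pixel(yaw, pitch, 2048, 1024)
```

Reprojecting the corners of a detected box this way puts detections from overlapping views into one spherical frame, which is the kind of common frame a cross-view consistency check needs.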

Why PanoWorld Data Matters

Resource	Panoramas	Depth / 3D	Entity Metadata	Scalable Annotation	Verified Graph
Dense360	160K	No	Yes	Yes	Partial
OSR-Bench	4.1K	Partial	Partial	No	No
PanoEnv	595	Yes	Partial	No	No
PanoWorld	570K	Yes	Yes	Yes	Yes

PanoWorld Architecture

PanoWorld adapts Qwen3.5-VL into a pano-aware MLLM by injecting spherical geometry into the visual stream before deep visual encoding.

Spherical token construction

s_i = MLP(γ(λ_i, φ_i)) ∈ ℝ^d

Each ERP patch center (u_i, v_i) is mapped to a yaw-pitch direction (λ_i, φ_i), which is encoded by γ and projected by an MLP into a d-dimensional spherical spatial token s_i.
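As a rough illustration of this mapping, the sketch below converts an ERP pixel position into yaw and pitch and applies a sinusoidal encoding in place of γ. The page does not specify γ or the angle convention, so both are assumptions here: yaw 0 at the image center, pitch positive upward, and Fourier features at frequencies 1, 2, 4, and so on. The MLP that projects the features to ℝ^d is omitted.

```python
import math

def erp_patch_to_yaw_pitch(u, v, width, height):
    """Map ERP pixel coords (u right, v down) to (yaw, pitch) in radians.

    Assumed convention: yaw in [-pi, pi) with 0 at the image center,
    pitch in [-pi/2, pi/2] with +pi/2 at the top row.
    """
    yaw = (u / width - 0.5) * 2.0 * math.pi
    pitch = (0.5 - v / height) * math.pi
    return yaw, pitch

def fourier_features(yaw, pitch, num_bands=4):
    """Hypothetical stand-in for gamma: sin/cos of each angle at 2**k frequencies."""
    feats = []
    for k in range(num_bands):
        freq = 2.0 ** k
        for angle in (yaw, pitch):
            feats.append(math.sin(freq * angle))
            feats.append(math.cos(freq * angle))
    return feats

# Direction and encoding for the center of a 2048 x 1024 panorama.
yaw, pitch = erp_patch_to_yaw_pitch(1024, 512, 2048, 1024)
features = fourier_features(yaw, pitch)
```

With integer frequencies, the yaw features are periodic in 2π, so patches on either side of the left-right seam get nearly identical encodings. That is one reason an angular encoding can suit ERP better than raw pixel coordinates.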

Spherical token stack

S = [s₁, ..., s_N] ∈ ℝ^{N×d}

The resulting sequence preserves observer-centered geometry aligned with the ERP representation.

SSCA fusion

A = MHA(LN(H⁽⁰⁾), LN(S), LN(S))

The visual patch tokens H⁽⁰⁾ act as queries, and the spherical spatial tokens S serve as keys and values, so each patch retrieves geometry-aware signals through cross-attention.

Gated residual update

H̃⁽⁰⁾ = H⁽⁰⁾ + α ⊙ A

A learnable gate α controls how much of the attention output A is added to the patch tokens before the remaining visual blocks.
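The fusion and gating equations above can be traced with a deliberately simplified sketch. It uses a single attention head with identity projections instead of multi-head attention with learned weights, a layer norm without learned scale or bias, and a per-channel gate vector. These simplifications, the gate shape, and all function names are assumptions for illustration, not the released implementation.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (identity projections)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(len(values[0]))])
    return out

def ssca_update(h0, s, alpha):
    """Compute H0 + alpha * A with A = Attn(LN(H0), LN(S), LN(S))."""
    h_norm = [layer_norm(row) for row in h0]
    s_norm = [layer_norm(row) for row in s]
    a = cross_attention(h_norm, s_norm, s_norm)
    return [[h + g * x for h, g, x in zip(h_row, alpha, a_row)]
            for h_row, a_row in zip(h0, a)]

# Two patch tokens, three spherical tokens, feature width 4.
h0 = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
s = [[0.1, 0.2, 0.3, 0.5], [1.0, 0.0, -1.0, 2.0], [0.0, 1.0, 0.0, -1.0]]
unchanged = ssca_update(h0, s, alpha=[0.0, 0.0, 0.0, 0.0])
fused = ssca_update(h0, s, alpha=[1.0, 1.0, 1.0, 1.0])
```

With the gate at zero the update returns the patch tokens unchanged, so a gate initialized near zero would start from the pretrained behavior. Whether PanoWorld initializes α this way is not stated on this page.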

Main Results

PanoWorld substantially improves panoramic spatial reasoning on the proposed benchmark and transfers to downstream panoramic and navigation tasks.

PanoSpace-Bench Overall

56.5

vs. 30.8 for the Qwen3.5-9B panoramic baseline.

H*Bench ERP Overall

70.1

with H* SFT, improving holistic object and position sensing.

R2R-CE Success Rate

54.3

stronger panoramic navigation transfer on Val-Unseen.

R2R-CE SPL

52.1

better path efficiency while preserving high navigation success.

PanoSpace-Bench

ERP-native spatial reasoning across spherical localization, 3D relations, and seam-aware perception.

Diagnostic benchmark

Method	Overall	Abs. Dir.	BFOV	Spherical Relation Avg.	3D Spatial Avg.	Seam
GPT-4o	31.8	37.2	17.7	29.3	36.4	37.6
Mimo-v2.5	37.2	26.8	0.74	42.3	37.6	45.6
Qwen3.5-9B	30.8	25.2	1.41	26.1	36.9	41.2
Qwen3.5-9B + visual prompt	36.4	55.2	4.9	33.1	36.1	46.5
PanoWorld	56.5	93.7	73.3	47.4	49.8	65.5
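The Seam column concerns objects near the left and right borders of the ERP image, which are adjacent on the sphere even though they are far apart in pixels. The sketch below shows the wrap-around arithmetic involved. The function names and the yaw range of [-180°, 180°) are assumptions for illustration and are not taken from the benchmark code.

```python
def wrap_yaw_deg(yaw):
    """Wrap an angle in degrees into [-180, 180)."""
    return (yaw + 180.0) % 360.0 - 180.0

def yaw_gap_deg(a, b):
    """Smallest absolute yaw difference between two directions, in degrees."""
    return abs(wrap_yaw_deg(a - b))

def seam_aware_column_gap(u1, u2, width):
    """Horizontal pixel distance on an ERP image, allowing wrap-around at the seam."""
    d = abs(u1 - u2) % width
    return min(d, width - d)

# Two objects on opposite sides of the seam are only 2 degrees apart.
gap = yaw_gap_deg(179.0, -179.0)
# In a 2048-pixel-wide panorama, columns 5 and 2043 are 10 pixels apart.
pixels = seam_aware_column_gap(5, 2043, 2048)
```

A naive subtraction would report 358° and 2038 pixels for these two cases, which is the kind of error a seam-aware evaluation is meant to expose.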

H*Bench

Holistic panorama sensing under both perspective-view and ERP panorama evaluation settings.

Holistic sensing

Method	Overall	HOS	HPS	Yaw	Pitch
GPT-4o ERP	30.1	39.1	17.1	38.5	64.2
Gemini-2.5-Pro ERP	46.9	55.3	34.3	52.5	71.6
Qwen3.5-9B ERP	19.4	26.2	9.3	23.5	46.5
Qwen3.5 + visual prompt	40.4	46.0	32.0	43.5	52.0
PanoWorld + H* SFT	70.1	73.1	64.2	74.1	85.5

Navigation Transfer

R2R-CE Val-Unseen results show that pano-native visual representations transfer to embodied navigation.

R2R-CE Val-Unseen

Method	NE ↓	OSR ↑	SR ↑	SPL ↑
HPN + DN	6.31	40.0	36.0	34.0
GridMM	5.11	61.0	49.0	41.0
StreamVLN	5.73	56.4	50.2	47.1
NaVIDA	5.72	57.4	47.7	41.5
PanoWorld-VLN	4.98	59.3	54.3	52.1
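For readers unfamiliar with the SR and SPL columns, the sketch below computes both from per-episode records. It uses the commonly cited definition of Success weighted by Path Length, in which each successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path. The episode format and function name are hypothetical, and the page does not state that its numbers were computed with exactly this code path.

```python
def success_rate_and_spl(episodes):
    """Return (SR, SPL) as percentages.

    episodes: list of (success, shortest_path_len, agent_path_len) tuples,
    with positive path lengths in the same unit (for example meters).
    """
    n = len(episodes)
    sr = sum(1 for ok, _, _ in episodes if ok) / n
    # Failed episodes contribute 0; successes are discounted by path efficiency.
    spl = sum(shortest / max(taken, shortest)
              for ok, shortest, taken in episodes if ok) / n
    return 100.0 * sr, 100.0 * spl

# Four toy episodes: two efficient successes, one detour, one failure.
episodes = [
    (True, 10.0, 10.0),
    (True, 10.0, 20.0),
    (False, 10.0, 5.0),
    (True, 8.0, 8.0),
]
sr, spl = success_rate_and_spl(episodes)
```

Because every successful episode contributes at most 1, SPL can never exceed SR. A small gap between the two, as in the last row of the table above, indicates that successful episodes followed near-shortest paths.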

Task Demos and Visual Results

Qualitative examples across PanoSpace-Bench, H*Bench, and navigation show how pano-native learning supports localization, holistic sensing, and embodied transfer.

PanoSpace-Bench Spatial Reasoning

The benchmark probes spherical localization, 3D spatial relations, viewpoint transformation, and object reorientation.

PanoWorld 3D relation reasoning result — Representative PanoSpace-Bench 3D relation case. The model must reason over an observer-centered ERP panorama to compare target-object positions and infer relative depth-aware spatial relations rather than relying on a cropped local view.

PanoWorld camera rotation reasoning result — Camera-rotation reasoning case. Given a hypothetical change in observer orientation, PanoWorld predicts where the target would appear in the transformed reference frame, testing observer-centered spherical reasoning.

PanoWorld object reorientation result — Object-conditioned reorientation case. The model tracks how a target person's direction changes under a new observer-facing frame, requiring consistent spatial grounding across ERP distortion and full-surround context.
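The camera-rotation case has a simple geometric reference answer when the observer only turns about the vertical axis. The sketch below computes it under assumed conventions: yaw is in degrees, positive yaw is to the observer's right, and pitch is unaffected by a pure yaw turn. The function names are hypothetical and this is not the benchmark's scoring code.

```python
def wrap_deg(angle):
    """Wrap an angle in degrees into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def direction_after_rotation(target_yaw_deg, target_pitch_deg, observer_turn_deg):
    """Direction of a fixed target after the observer turns by observer_turn_deg."""
    return wrap_deg(target_yaw_deg - observer_turn_deg), target_pitch_deg

def erp_column_after_rotation(u, width, observer_turn_deg):
    """ERP column of the same target after the turn (yaw 0 at the image center)."""
    shift = observer_turn_deg / 360.0 * width
    return (u - shift) % width

# A target 30 degrees to the right ends up 60 degrees to the left
# once the observer turns 90 degrees to the right.
new_yaw, new_pitch = direction_after_rotation(30.0, 10.0, 90.0)
new_column = erp_column_after_rotation(1024, 2048, 90.0)
```

In ERP terms a yaw-only turn is a circular horizontal shift of the image, so the target can cross the seam during the transformation. Turns that also change pitch or roll require a full 3D rotation and are not covered by this sketch.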

H*Bench Holistic Sensing

H*Bench examples test holistic object sensing and holistic position sensing on panoramic scenes.

Perspective-view search compared with pano-native H* reasoning — Case comparison on H*Bench. Perspective-view iterative search is inefficient and may fail due to fragmented local observations, whereas direct ERP input enables holistic reasoning and correct prediction in one step.

PanoWorld H* holistic sensing result — Human-centric visual search example. PanoWorld directly reasons over the full ERP panorama to infer the next movement direction, supporting practical 360° search without decomposing the scene into fragmented local views.

PanoWorld H* position sensing result — Human-centric visual search example. Full-surround perception lets the model use global layout and target-position cues to select the next action from a single panoramic observation.

Navigation Transfer

Compared with RGB perspective-view navigation, panoramic input gives the agent full-surround context at each step. This reduces blind-spot exploration and helps ground instructions in global scene layout.

Panoramic navigation demo. The agent observes the full surrounding scene and the top-down trajectory simultaneously, allowing it to ground language instructions with fewer blind-spot ambiguities than narrow RGB perspective-view navigation.

Panoramic navigation demo. Full-surround ERP observations expose long-range layout cues and route alternatives in a single view, supporting more efficient instruction following in continuous environments.

Citation

@article{panoworld2026,
  title   = {PanoWorld: Towards Spatial Supersensing in 360° Panorama World},
  author  = {Wang, Changpeng and Lin, Xin and Liu, Junhan and Liu, Yuheng and Wang, Zhen and Qi, Donglian and Yan, Yunfeng and Chen, Xi},
  journal = {arXiv preprint arXiv:2605.13169},
  year    = {2026}
}