HeteroArch-GS: Aerial-Ground Mesh-guided Gaussian Splatting for Heterogeneous Architectural Landmarks with a Real-World Dataset

Junfan Wang1, Han Hu*1, Zhihao Jia1, Yang Jia2, Bo Xiang2, Jiwei Deng3, Qing Zhu1

1 Faculty of Geosciences and Engineering, Southwest Jiaotong University · 2 Sichuan Highway Planning, Survey, Design and Research Institute Ltd. · 3 China Railway Design Corporation

preprint · 2026

Abstract

Fusing aerial and ground imagery to reconstruct heterogeneous architectural landmarks poses significant challenges for 3D Gaussian Splatting (3DGS). Beyond the vast differences in viewing angles and spatial scales, the highly irregular topologies and complex structures of such landmarks exacerbate optimization conflicts, often resulting in severe geometric distortions and blurring during naive joint training. In this paper, we propose HeteroArch-GS, a novel framework that resolves these cross-view fusion issues by integrating the multi-view geometric consistency of mesh models with the photorealistic rendering of 3DGS. By employing oblique photogrammetric meshes as robust geometric and textural priors, we utilize mesh-guided initialization, geometric regularization, and pseudo-view supervision to explicitly constrain Gaussian primitives to complex physical manifolds, ensuring high-fidelity rendering across aerial and ground viewpoints. To support this, we construct AGC Landmarks, a novel real-world RGB dataset capturing diverse heterogeneous landmarks with aerial, ground, and object-centric perspectives, alongside an efficient lazy loading strategy for processing large-scale data. Extensive experiments demonstrate that our method achieves state-of-the-art rendering quality and geometric correctness, robustly preserving intricate architectural details particularly for out-of-distribution object-centric views.

Background and Challenges

City-scale aerial oblique photogrammetry has produced large textured mesh assets for many urban areas, supporting real-world 3D city modeling. These assets are usually reconstructed from aerial oblique images through SfM, MVS, and texture mapping. Landmark buildings often require higher fidelity than ordinary urban blocks because they contain distinctive facades, roofs, hollow structures, decorations, and complex materials. Aerial-derived meshes frequently miss or distort lower facades, overhangs, thin components, and intricate details, especially when viewed from the ground. Supplementary ground imagery can recover missing facade information and improve visual realism.

Recent neural rendering methods, especially 3D Gaussian Splatting (3DGS), make it possible to move from mesh refinement toward photorealistic aerial-ground rendering. In practical workflows, existing aerial photogrammetric meshes can serve as geometric priors for reconstructing heterogeneous architectural landmarks. However, the following challenges arise when fusing aerial and ground imagery:

  • Naive aerial-ground 3DGS optimization is unstable: aerial-only and ground-only settings may work within their own view domains, but joint optimization can degrade both domains. Heterogeneous landmarks amplify this conflict because overhangs, hollow spaces, thin components, and non-Lambertian materials generate ambiguous or inconsistent photometric gradients.

  • Real-world aerial-ground benchmarks remain insufficient: existing evaluations still rely heavily on synthetic cities, where aerial and street-level views share simplified geometry and controlled radiometric states. Practical landmark captures instead involve temporal illumination changes, local occlusions, irregular structures, complex materials, noise, and dynamic disturbances, leaving a substantial gap between benchmark performance and deployable aerial-ground reconstruction.

Failure of naive aerial-ground joint training in 3DGS. We consider three viewpoint types for a target building: (A) aerial, (B) object-centric, and (C) ground views. Models trained only on aerial or ground images render their respective in-distribution views well, whereas naive joint training on aerial and ground images degrades even in-distribution results. All settings also struggle on out-of-distribution object-centric views.

Contributions

  • We construct AGC Landmarks, a real-world optical image dataset for heterogeneous architectural landmarks, explicitly covering aerial, ground, and object-centric perspectives. Beyond benchmarking neural rendering on complex topologies, it exposes real illumination changes across unconstrained outdoor captures and provides a concrete basis for studying lighting models in the wild.

  • We propose HeteroArch-GS, a mesh-guided framework for aerial-ground 3DGS. It turns the existing oblique photogrammetric mesh into a strong geometric prior through anchor initialization, surface-aware regularization, and pseudo-view supervision. This keeps Gaussian primitives tied to plausible physical surfaces and substantially improves rendering from out-of-distribution object-centric views.

  • We introduce an Efficient Lazy Loading Strategy that removes the memory bottleneck of massive image sets, enabling scalable 3DGS training under limited computational resources.

AGC Landmarks

We collected 10 building-scale landmarks, with an emphasis on architectural diversity and structural heterogeneity. These targets cover six representative types: hollow buildings, large venues, irregular envelopes, ancient buildings, castles, and sculptures. Together, these scenes stress reconstruction in multiple ways. Their hollow layouts and dense decorative details create severe self-occlusion, while reflective glass, curved roofs, lakeside backgrounds, and cliff-side terrain further challenge robust geometry and appearance modeling. For each landmark, we acquire three complementary image subsets:

  • Aerial imagery: captured along oblique photogrammetry flight paths to cover the entire survey area;

  • Ground imagery: captured horizontally and at slight upward angles from a specific height above the ground to simulate pedestrian viewpoints;

  • Object-centric imagery: captured via dense, close-range photogrammetry orbiting the target to ensure complete capture of intricate details from multiple perspectives.

Visualization of the camera frustums and sparse point clouds of the proposed dataset. The dataset consists of 10 individual building scenes, labeled (a) to (j). For each scene, high-resolution RGB images are captured from three distinct perspectives: aerial (red), ground (green), and object-centric (blue).

HeteroArch-GS

HeteroArch-GS starts from an existing photogrammetric mesh and oriented oblique aerial images. To complement the aerial observations, we additionally capture ground images with RTK positioning. These ground images are further refined through PPK correction in DJI Terra and integrated with the aerial images, yielding a unified aerial-ground triangulation result. The mesh then serves as an explicit proxy for scene geometry and appearance, guiding three core components: Mesh-Guided Anchor Initialization, Mesh-Guided Geometric Regularization, and Mesh-Guided Pseudo-View Supervision. During optimization, color supervision and geometric regularization jointly guide the Gaussian primitives to recover faithful appearance while staying close to the underlying scene structure. To train under limited hardware memory, we introduce an on-demand data fetching mechanism that keeps only a fixed-capacity CPU cache and loads images as needed for fast retrieval.
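The text above does not spell out how Mesh-Guided Anchor Initialization places its anchors. One common way to seed primitives on a mesh is area-weighted surface sampling, so that large and small triangles receive a uniform point density. The following is a minimal sketch under that assumption only; `sample_anchors_on_mesh`, `triangle_area`, and their parameters are hypothetical names for illustration and are not taken from HeteroArch-GS.

```python
import math
import random
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Face = Tuple[int, int, int]


def triangle_area(a: Vec3, b: Vec3, c: Vec3) -> float:
    """Area of triangle abc: half the norm of the cross product of two edges."""
    ux, uy, uz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    vx, vy, vz = c[0] - a[0], c[1] - a[1], c[2] - a[2]
    cx = uy * vz - uz * vy
    cy = uz * vx - ux * vz
    cz = ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def sample_anchors_on_mesh(
    vertices: Sequence[Vec3],
    faces: Sequence[Face],
    num_anchors: int,
    seed: int = 0,
) -> List[Vec3]:
    """Draw anchor positions that lie exactly on the mesh surface.

    Assumption (not from the source): faces are picked with probability
    proportional to their area, then a point is drawn uniformly inside
    the chosen triangle via barycentric coordinates.
    """
    rng = random.Random(seed)
    areas = [triangle_area(vertices[i], vertices[j], vertices[k]) for i, j, k in faces]
    chosen = rng.choices(range(len(faces)), weights=areas, k=num_anchors)

    anchors: List[Vec3] = []
    for f in chosen:
        a, b, c = (vertices[i] for i in faces[f])
        # The square root on r1 makes the barycentric sample uniform in area.
        r1, r2 = rng.random(), rng.random()
        s = math.sqrt(r1)
        w0, w1, w2 = 1.0 - s, s * (1.0 - r2), s * r2
        anchors.append(
            (
                w0 * a[0] + w1 * b[0] + w2 * c[0],
                w0 * a[1] + w1 * b[1] + w2 * c[1],
                w0 * a[2] + w1 * b[2] + w2 * c[2],
            )
        )
    return anchors
```

For example, calling `sample_anchors_on_mesh(vertices, faces, 100_000)` on the vertex and face lists of a photogrammetric mesh would return 100,000 surface points that could serve as initial Gaussian centers. Because every sample is a convex combination of triangle vertices, the anchors start on the surface rather than on a sparse SfM point cloud.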

Rendering Results

We evaluate our method against several state-of-the-art baselines including 3DGS, Scaffold-GS, Octree-GS, and Horizon-GS. Our method consistently produces sharper details, fewer artifacts, and better geometric fidelity.

Training Efficiency and Hardware Overhead

To break the capacity-throughput trade-off and efficiently handle unbounded dataset sizes, we introduce a streaming architecture termed efficient lazy loading. It significantly reduces training time and lowers both GPU and CPU memory consumption.
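The on-demand fetching idea, a fixed-capacity CPU cache that loads images only when a training step requests them, can be illustrated with a small cache wrapper. This is a sketch under stated assumptions rather than the authors' implementation: the least-recently-used eviction policy, the `LazyImageCache` class, and the `loader` callback are all hypothetical choices, since the source only specifies that the cache has fixed capacity.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LazyImageCache:
    """Fixed-capacity cache that loads items on first access.

    Assumption (not from the source): the least recently used entry is
    evicted when the cache is full.
    """

    def __init__(self, loader: Callable[[Hashable], Any], capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._loader = loader          # e.g. a function that decodes an image file
        self._capacity = capacity      # maximum number of items held in memory
        self._cache: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        """Return the item for `key`, loading it only if it is not cached."""
        if key in self._cache:
            self._cache.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._loader(key)
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # drop the least recently used item
        return value

    def __len__(self) -> int:
        return len(self._cache)
```

In a training loop, `loader` would typically read and decode one image from disk, and `capacity` would be chosen so that the cached images fit in the available CPU memory. Memory use then stays bounded by `capacity` regardless of how many images the dataset contains.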

This experiment is conducted with the 3DGS model, optimized for 10k iterations on the Museum dataset, which contains 1016 training images.

BibTeX

@article{Wang2026HeteroArchGS,
  title={HeteroArch-GS: Aerial-Ground Mesh-guided Gaussian Splatting for Heterogeneous Architectural Landmarks with a Real-World Dataset},
  author={Junfan Wang and Han Hu and Zhihao Jia and Yang Jia and Bo Xiang and Jiwei Deng and Qing Zhu},
  year={2026},
}

Acknowledgements

This study was supported in part by the National Natural Science Foundation of China (Project No. U25A20772, 42230102) and the Natural Science Foundation of Sichuan Province under Grant 2026NSFSCZY0054.