Satellite Foundation Stereo Model for DSM Reconstruction via Multi-Scale Cascaded Adaptation

Liupeng Su¹, Han Hu¹,*, Yuhao Ye¹, Zeyuan Dai²,³, Junfan Wang¹, Zhihao Jia¹, Qianrui Guo⁴, Heyi Li⁴, Qing Zhu¹

¹ Faculty of Geosciences and Engineering, Southwest Jiaotong University · ² Department of Military Oceanography and Hydrography and Cartography, Dalian Naval Academy · ³ Key Laboratory of Hydrographic Surveying and Mapping of PLA, Dalian Naval Academy · ⁴ Institute of Remote Sensing Satellite, China Academy of Space Technology

preprint · 2026


Abstract

Foundation stereo models built on Vision Foundation Models (VFMs) have shown strong robustness to radiometric variation, structural boundaries, and occlusion in dense reconstruction. These properties are especially valuable for satellite stereo matching and Digital Surface Model (DSM) reconstruction, where image pairs may contain off-nadir occlusions, tile-boundary inconsistencies, shadows, weak texture, and multi-temporal appearance changes. Direct transfer remains unreliable, however, because satellite stereo involves extremely wide disparity ranges spanning both positive and negative values, together with a domain gap caused by different imaging geometry, viewing conditions, and scene distributions.

SatFS adapts FoundationStereo into a satellite foundation stereo framework through multi-scale cascaded adaptation. It keeps the VFM backbone frozen and injects VFM features into each pyramid level through lightweight side-tuning, so the model can use foundation representations from coarse wide-range matching to fine geometric refinement. The framework further decomposes global disparity search into tractable cascade stages, modulates local search ranges with pixel-wise uncertainty, preserves structural boundaries through VFM-guided bilateral cost-volume upsampling, and replaces dense pre-computed correlation with lightweight on-demand geometry encoding. This combination supports bidirectional satellite disparities while reducing correlation memory from O(H * W * W) to O(H * W * R).
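The memory claim above can be checked with simple arithmetic. The sketch below counts float32 bytes for a dense all-pairs volume, O(H * W * W), against a local volume that keeps only R candidates per pixel, O(H * W * R). The image size and R = 64 are illustrative assumptions, not values from the paper, and `correlation_bytes` is a hypothetical helper name.

```python
from typing import Optional

BYTES_PER_FLOAT32 = 4


def correlation_bytes(height: int, width: int, candidates: Optional[int] = None) -> int:
    """Bytes needed for one float32 correlation volume.

    candidates=None models the dense all-pairs volume, where every pixel is
    compared with all `width` positions on the same row. Otherwise only
    `candidates` disparity hypotheses are stored per pixel.
    """
    per_pixel = width if candidates is None else candidates
    return height * width * per_pixel * BYTES_PER_FLOAT32


if __name__ == "__main__":
    # Hypothetical tile size and candidate count, chosen only for illustration.
    h, w, r = 1024, 1024, 64
    dense = correlation_bytes(h, w)
    local = correlation_bytes(h, w, r)
    print(f"dense: {dense / 2**30:.2f} GiB")       # dense: 4.00 GiB
    print(f"local: {local / 2**30:.2f} GiB")       # local: 0.25 GiB
    print(f"saving: {100 * (1 - local / dense):.2f}%")  # saving: 93.75%
```

The ratio depends only on R / W, so the saving grows with image width. This count covers the correlation volume alone and is not meant to reproduce the end-to-end memory figures reported for SatFS.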

Contributions

Multi-scale VFM side-tuning. SatFS brings frozen VFM features into every matching scale through lightweight adapters. Unlike single-scale VFM injection, this design uses foundation representations throughout the matching process, from coarse-level wide-range ambiguity reduction to fine-level geometric refinement.
Prior-guided Bilateral Upsampling (PBU). PBU uses VFM-derived monocular depth estimates and feature maps as bilateral-grid guidance during cost-volume upsampling. This preserves high-fidelity structural boundaries without requiring explicit edge detection, reducing the oversmoothing that often appears near rooftops, vegetation, and building edges.
Lightweight Geometry-Aware Encoding Volume (LGEV). LGEV redesigns IGEV-style iterative refinement by sampling only the disparity candidates needed during inference instead of storing a full pre-computed correlation volume. It reduces correlation memory from O(H * W * W) to O(H * W * R), achieves up to 86% memory reduction with less than 8% runtime overhead, and natively supports both positive and negative disparities.
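The on-demand idea behind LGEV can be illustrated on a single scanline. This is a minimal sketch under stated assumptions, not the paper's implementation: similarity is a plain dot product, the disparity convention is x_right = x_left - d, and the names `sample_correlation` and `best_disparity` are hypothetical. Correlations are computed only for the signed candidates in a window around a current estimate, so nothing of size W per pixel is ever stored.

```python
from typing import List, Sequence

Feature = Sequence[float]


def sample_correlation(left_row: Sequence[Feature], right_row: Sequence[Feature],
                       x: int, center: int, radius: int) -> List[float]:
    """Correlate left pixel x with right pixels x - d for d in [center - radius, center + radius].

    Candidates may be negative or positive. Positions that fall outside the
    right image receive a score of 0.0.
    """
    feat_left = left_row[x]
    width = len(right_row)
    scores = []
    for d in range(center - radius, center + radius + 1):
        xr = x - d  # assumed convention: positive d shifts the match to the left
        if 0 <= xr < width:
            scores.append(sum(a * b for a, b in zip(feat_left, right_row[xr])))
        else:
            scores.append(0.0)
    return scores


def best_disparity(left_row: Sequence[Feature], right_row: Sequence[Feature],
                   x: int, center: int, radius: int) -> int:
    """Return the signed candidate with the highest correlation score."""
    scores = sample_correlation(left_row, right_row, x, center, radius)
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return center - radius + best_index


def one_hot(i: int, n: int = 6) -> List[float]:
    """Toy feature vector: 1.0 at index i, zeros elsewhere (all zeros if i is out of range)."""
    return [1.0 if j == i else 0.0 for j in range(n)]


# Toy scanline where every right pixel xr shows the content of left pixel xr - 2,
# so the true disparity under the assumed convention is -2.
left = [one_hot(i) for i in range(6)]
right = [one_hot(i - 2) for i in range(6)]
print(best_disparity(left, right, x=1, center=0, radius=3))  # -2
```

Each lookup costs 2 * radius + 1 dot products, which is the time-for-memory trade-off the LGEV description refers to.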

Key Results

SatFS is evaluated from five complementary perspectives: in-distribution stereo accuracy, component ablation, cross-dataset generalisation, geographically diverse real-world DSM reconstruction, and remaining limitations. On WHU-Stereo, it achieves the best reported accuracy among the evaluated methods, with 9.57% D1 and 1.33 px EPE, outperforming both satellite-specific networks and the FoundationStereo baseline. On US3D, SatFS reaches 5.75% D1 and 1.01 px EPE, closely following FoundationStereo, whose RGB-oriented VFM pretraining is already well aligned with the RGB US3D imagery.
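For readers unfamiliar with the two stereo metrics quoted above, the sketch below shows one common way to compute them. EPE is the mean absolute disparity error over valid pixels. D1 is taken here as the percentage of valid pixels whose error exceeds a threshold. The 3 px default is an assumption (KITTI's D1 additionally requires the error to exceed 5% of the true disparity), and the paper's exact definition may differ. `epe_and_d1` is a hypothetical helper name.

```python
import math
from typing import Sequence, Tuple


def epe_and_d1(pred: Sequence[float], gt: Sequence[float],
               threshold: float = 3.0) -> Tuple[float, float]:
    """Return (EPE in pixels, D1 in percent) over pixels with valid ground truth.

    Ground-truth values of NaN mark invalid pixels and are skipped.
    The threshold default of 3 px is an assumed convention.
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if not math.isnan(g)]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    epe = sum(errors) / len(errors)
    d1 = 100.0 * sum(e > threshold for e in errors) / len(errors)
    return epe, d1


# Made-up disparities, including a negative value and one invalid pixel.
pred = [1.0, -4.0, 10.0, 0.0]
gt = [1.5, -4.0, 5.0, float("nan")]
epe, d1 = epe_and_d1(pred, gt)
print(f"EPE = {epe:.2f} px, D1 = {d1:.2f}%")  # EPE = 1.83 px, D1 = 33.33%
```

Because the error is an absolute difference, the same code handles positive and negative disparities without modification.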

Generalisation experiments show that the adaptation is not limited to the training distribution. On WHU-SSIDE, SatFS obtains 10.30% D1 and 1.63 px EPE, improving over the second-best method by 10.0% in D1 and 19.0% in EPE under zero-shot cross-domain evaluation. The paper also evaluates SatFS on SatStereo, which contains multi-date WorldView imagery, to test robustness under cross-temporal appearance changes.

For DSM reconstruction, SatFS translates disparity accuracy into metrically reliable 3D surfaces across unseen sensors and regions. On the GF-7 Hawaii scene, it achieves 2.47 m RMSE and 1.26 m MAE, improving over G3D-SAT by 7.5% in RMSE and 5.3% in MAE. On the WorldView UCSD scene, it reaches 3.33 m RMSE and 2.37 m MAE, reducing RMSE by 22.9% and MAE by 12.5% relative to the strongest competing methods. Qualitatively, SatFS preserves sharper structural boundaries, richer rooftop details, and more faithful vegetation geometry across BJ-3, WV-3, and GeoEye-1 imagery. Three limitations remain: severe off-nadir occlusion gaps, rooftop noise in the absence of explicit surface regularisation, and benchmark label noise from temporal misalignment between satellite imagery and LiDAR reference data.
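The DSM figures above combine two error statistics with relative reductions against a baseline. The sketch below shows the standard definitions of RMSE and MAE over valid height samples, and reads a relative reduction as (baseline - ours) / baseline. That reading of the percentages is an assumption, since the page does not spell out the formula. The function names and the sample heights are hypothetical.

```python
import math
from typing import Sequence, Tuple


def rmse_mae(pred: Sequence[float], ref: Sequence[float]) -> Tuple[float, float]:
    """Return (RMSE, MAE) in the units of the inputs, skipping NaN samples."""
    diffs = [p - r for p, r in zip(pred, ref)
             if not (math.isnan(p) or math.isnan(r))]
    if not diffs:
        raise ValueError("no valid height samples")
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return rmse, mae


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline` (assumed formula)."""
    return 100.0 * (baseline - ours) / baseline


# Made-up heights in metres; the NaN models a gap in the reconstructed DSM.
dsm = [10.0, 12.0, float("nan"), 7.0]
lidar = [11.0, 10.0, 5.0, 7.0]
rmse, mae = rmse_mae(dsm, lidar)
print(f"RMSE = {rmse:.2f} m, MAE = {mae:.2f} m")  # RMSE = 1.29 m, MAE = 1.00 m
print(f"{relative_reduction(4.0, 3.0):.1f}%")     # 25.0%
```

RMSE weights large errors more heavily than MAE, which is why the two can improve by different percentages on the same scene.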

Interactive Comparison

[Interactive slider: left image vs. SatFS (Ours). Scene: BJ-3 · 0.3 m · Urumqi]

Interactive Comparison (with LiDAR Ground Truth)

[Interactive slider: left image vs. SatFS (Ours). Scene: GF-7 · ~0.8 m · Hawaii ROI 1]

BibTeX

@misc{su2026satfs,
  title={Satellite Foundation Stereo Model for DSM Reconstruction via Multi-Scale Cascaded Adaptation},
  author={Liupeng Su and Han Hu and Yuhao Ye and Zeyuan Dai and Junfan Wang and Zhihao Jia and Qianrui Guo and Heyi Li and Qing Zhu},
  note={preprint},
  year={2026}
}

Acknowledgements

This study was supported in part by the National Natural Science Foundation of China (Project Nos. U25A20772 and 42230102) and by the Natural Science Foundation of Sichuan Province under Grant 2026NSFSCZY0054.