E, WENKE (2026) On Deep Learning for Deployable Bird’s-Eye-View Perception Using On-Vehicle 360-Degree Vision. Doctoral thesis, Durham University.
Full text not available from this repository. Author-imposed embargo until 26 February 2027.
Abstract
Bird’s-eye-view (BEV) scene understanding provides a natural spatial interface for autonomous driving by aligning perception outputs with the ground-plane structure of road environments. However, making BEV perception deployable remains challenging when the sensing stack is constrained to a minimal camera configuration, particularly a single on-vehicle 360-degree camera whose limited pixel budget must cover the full panorama. Under this setting, BEV prediction must resolve an under-constrained perspective-to-metric transformation in the presence of depth ambiguity and long-range sparsity, while remaining robust to temporal instability induced by ego-motion, dynamic objects and occlusions. This thesis addresses these coupled challenges by developing a coherent pathway towards practical camera-centric BEV segmentation under realistic deployment constraints.
This thesis first establishes a problem formulation and benchmark setting that makes progress measurable under single-camera constraints. It introduces a real-world dataset and evaluation protocol for BEV vehicle segmentation from a single 360-degree panoramic camera, together with an end-to-end modelling pipeline that maps spherical panoramic imagery into metrically consistent BEV representations. By formalising the task setting and providing reproducible baselines tailored to spherical imaging geometry, this contribution clarifies where and why panoramic BEV mapping fails in realistic driving scenes, and supplies the experimental foundation required to study subsequent training-time and temporal improvements in a controlled manner.
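The abstract does not state the exact view transformation used by the pipeline. As a minimal sketch, assuming an equirectangular 360-degree image and a flat-ground (inverse-perspective) assumption, a panoramic-to-BEV lookup could be built roughly as follows; all names, the camera-height parameter and the projection convention are illustrative assumptions, not the thesis implementation.

```python
import torch

def panorama_to_bev_grid(bev_size, bev_res, cam_height, img_h, img_w):
    """Build a sampling grid that pulls equirectangular pixels into BEV cells.

    bev_size:   number of BEV cells per side (square grid, camera at centre)
    bev_res:    metres per BEV cell
    cam_height: camera height above the ground plane in metres
    Returns a (bev_size, bev_size, 2) grid of normalised (u, v) coordinates
    suitable for torch.nn.functional.grid_sample.
    """
    half = bev_size * bev_res / 2.0
    xs = torch.linspace(-half, half, bev_size)   # lateral offset in metres
    ys = torch.linspace(half, -half, bev_size)   # longitudinal offset in metres
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")

    # Flat-ground assumption: every BEV cell lies on the road plane, so its
    # viewing direction is fully determined by azimuth and the (negative)
    # elevation angle from the camera down to the ground.
    rng = torch.sqrt(gx ** 2 + gy ** 2).clamp(min=1e-3)
    azimuth = torch.atan2(gx, gy)                # 0 rad = straight ahead
    elevation = torch.atan2(torch.full_like(rng, -cam_height), rng)

    # Equirectangular projection: azimuth maps to columns, elevation to rows.
    u = (azimuth / (2 * torch.pi) + 0.5) * img_w
    v = (0.5 - elevation / torch.pi) * img_h

    # Normalise to [-1, 1] for grid_sample.
    return torch.stack([u / img_w * 2 - 1, v / img_h * 2 - 1], dim=-1)
```

Under these assumptions, BEV features would then be gathered with `torch.nn.functional.grid_sample(pano_feats, grid.expand(batch, -1, -1, -1))`, with the understanding that the flat-ground approximation breaks down for elevated or distant structure, which is precisely the failure mode the benchmark is designed to expose.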
Building on this benchmark, this thesis then investigates how richer sensing can be exploited only during training to strengthen a camera-only BEV model at deployment. The thesis proposes a cross-modality knowledge distillation framework in which a LiDAR–camera fusion teacher transfers geometric and semantic BEV knowledge to a lightweight panoramic camera student, while explicitly addressing teacher–student mismatch under large modality discrepancies. Key to this design is a unified panoramic representation that supports stable alignment across modalities, together with a voxel-aligned view transformation pipeline and soft-gated fusion mechanism that preserve geometric fidelity while remaining compatible with efficient camera backbones. In this way, the framework improves single-camera BEV performance without increasing inference-time sensor requirements, directly targeting deployability.
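To make the training-only distillation idea concrete, the sketch below shows one plausible form of a soft-gated BEV distillation loss, in which a per-cell gate predicted from the frozen teacher's features down-weights regions where the teacher is unreliable. The class name, the gating signal and the loss form are assumptions for illustration, not the thesis architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftGatedBEVDistillation(nn.Module):
    """Illustrative training-time distillation head (not the thesis code).

    A frozen LiDAR-camera fusion teacher and a panoramic camera-only student
    produce BEV feature maps on the same grid; a learned soft gate suppresses
    BEV cells where the teacher itself is uncertain, so a large modality gap
    does not force the student to imitate unreliable targets.
    """

    def __init__(self, teacher_channels, student_channels):
        super().__init__()
        # Project student features into the teacher's channel space.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
        # Per-cell gate predicted from the teacher's own features.
        self.gate = nn.Sequential(
            nn.Conv2d(teacher_channels, 1, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, student_bev, teacher_bev):
        # teacher_bev: (B, Ct, H, W) from the fusion teacher, kept frozen
        # student_bev: (B, Cs, H, W) from the camera-only student
        teacher_bev = teacher_bev.detach()
        aligned = self.proj(student_bev)
        gate = self.gate(teacher_bev)  # (B, 1, H, W), values in [0, 1]
        per_cell = F.mse_loss(aligned, teacher_bev, reduction="none").mean(1, keepdim=True)
        # Soft-gated feature imitation: only confident teacher regions contribute.
        return (gate * per_cell).sum() / gate.sum().clamp(min=1e-6)
```

At inference the teacher and this loss are discarded entirely, which is what keeps the deployed sensing stack limited to the single panoramic camera.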
Furthermore, this thesis addresses a complementary limitation that persists even with strong training-time supervision: camera-based BEV prediction is often temporally brittle when operating on single frames. To improve robustness in dynamic scenes, the thesis introduces a spatial-temporal Mixture-of-Experts framework for adaptive multi-frame aggregation in BEV space. Rather than applying a homogeneous temporal rule everywhere, the approach employs specialised experts aligned with different motion regimes (static, slow-motion, and fast-motion), guided by a lightweight motion-aware routing mechanism, and modulates temporal fusion with a decay strategy that suppresses outdated or misleading history. This design aims to convert a fixed temporal context window into more consistent BEV predictions while maintaining low computational overhead suitable for deployment, and it further provides interpretable evidence of expert specialisation aligned with scene dynamics.
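As a rough sketch of how motion-routed temporal aggregation with history decay can be wired together, the example below mixes per-cell expert outputs under a softmax router and applies an exponential decay over ego-motion-aligned history frames. The expert count, router design and decay schedule are illustrative assumptions rather than the thesis configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalMoEBEV(nn.Module):
    """Sketch of motion-routed temporal aggregation over a BEV history window.

    Several experts (nominally static / slow-motion / fast-motion) each fuse
    the current BEV features with aggregated history; a lightweight router
    predicts per-cell expert weights, and an exponential decay down-weights
    older frames so stale observations cannot dominate the fusion.
    """

    def __init__(self, channels, num_experts=3, decay=0.7):
        super().__init__()
        self.decay = decay
        self.experts = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
             for _ in range(num_experts)]
        )
        # Router sees the current frame concatenated with the fused history.
        self.router = nn.Conv2d(2 * channels, num_experts, kernel_size=1)

    def forward(self, current_bev, history_bevs):
        # current_bev:  (B, C, H, W)
        # history_bevs: list of (B, C, H, W), most recent first, already warped
        #               into the current ego frame.
        weights = [self.decay ** (i + 1) for i in range(len(history_bevs))]
        history = sum(w * h for w, h in zip(weights, history_bevs)) / sum(weights)

        x = torch.cat([current_bev, history], dim=1)
        route = F.softmax(self.router(x), dim=1)                       # (B, E, H, W)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, C, H, W)
        # Per-cell mixture of expert outputs.
        return (route.unsqueeze(2) * expert_out).sum(dim=1)
```

Because routing and decay are computed per BEV cell, static background can lean on long history while fast-moving regions fall back towards the current frame, which is the behaviour the motion-regime experts are intended to capture.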
Across these three contributions, the thesis connects benchmark design, training-only multimodal supervision and efficient temporal modelling into a unified narrative for deployable 360-degree vision-based BEV segmentation. Extensive experimental evaluations and qualitative analyses, spanning the proposed real-world benchmark, cross-dataset validation and temporal modelling studies on a standard multi-view driving benchmark, consistently demonstrate improved accuracy–efficiency trade-offs and increased robustness relative to prior state-of-the-art approaches, without compromising the minimal sensing assumptions that motivate deployment. Taken together, the thesis moves camera-centric BEV perception significantly closer to practical autonomous driving by systematically resolving what to measure, how to learn from richer sensors without deploying them, and how to remain stable over time in dynamic scenes.
| Item Type: | Thesis (Doctoral) |
|---|---|
| Award: | Doctor of Philosophy |
| Keywords: | Deep learning, Autonomous driving, BEV |
| Faculty and Department: | Faculty of Science > Computer Science, Department of |
| Thesis Date: | 2026 |
| Copyright: | Copyright of this thesis is held by the author |
| Deposited On: | 26 Feb 2026 12:26 |