

Durham e-Theses

Advancing Representation Learning and Generative Models for Deep Learning-Based Image Inpainting

CHEN, SHUANG (2025) Advancing Representation Learning and Generative Models for Deep Learning-Based Image Inpainting. Doctoral thesis, Durham University.

PDF (42Mb), available under License Creative Commons Attribution 3.0 (CC BY).

Abstract

Image inpainting is a computer vision task that aims to reconstruct an image from the visible pixels of a damaged or corrupted image with missing regions. Its applications span image processing and computer vision tasks such as photo editing, object removal and depth completion. However, significant challenges remain, particularly in compensating for the insufficient information and capturing long-range dependencies to improve overall fidelity and visual quality.
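As a minimal illustration of the task setup described above (not the thesis code), the corrupted input can be formed by zeroing out the missing region so a model only sees the visible pixels; the variable names and the [0, 1] value range are assumptions for this sketch.

```python
import numpy as np

def make_corrupted_input(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Illustrative inpainting setup: `image` is an H x W x 3 array in [0, 1];
    `mask` is an H x W binary array where 1 marks visible pixels and 0 marks
    the missing region. Zeroing the missing region leaves only visible pixels."""
    return image * mask[..., None]

# An inpainting model f is then trained to predict the full image,
#   restored = f(corrupted, mask),
# with losses typically computed on the missing region and on the whole image
# to encourage global coherence.
```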

We first explore the use of geometric features to guide the insufficient visual features and improve the fidelity of facial image inpainting, and the feasibility of using image inpainting in cleft lip surgery to support surgeons as an adjunct for adjusting surgical technique and improving surgical results. To realise this idea, we collect two real-world cleft lip datasets and conduct experiments with a proposed single-stage multi-task image inpainting framework that is capable of covering a cleft lip and generating a lip and nose without a cleft. The results are assessed by expert cleft lip surgeons to demonstrate the feasibility of the proposed methods. Additionally, we package this framework as software and release it on CodeOcean and GitHub, making it convenient and equally accessible for both patients and surgeons.

Although insufficient information can be supplemented by landmark points, such an approach only works for facial image inpainting and cannot be transferred to natural or architectural scenes, which are more common in real-world scenarios. To use information more effectively while maintaining fidelity more adaptively, we propose an end-to-end High-quality INpainting Transformer, abbreviated as HINT, which consists of a novel Mask-aware Pixel-shuffle Downsampling (MPD) module that preserves the visible information extracted from the corrupted image while maintaining the integrity of the information available for high-level inference within the model. Moreover, we propose a Spatially-activated Channel Attention Layer (SCAL), an efficient self-attention mechanism incorporating spatial awareness to model the corrupted image at multiple scales. To further enhance the effectiveness of SCAL, motivated by recent advances in speech recognition, we introduce a sandwich structure that places feed-forward networks before and after the SCAL module. We demonstrate the superior performance of HINT compared to contemporary state-of-the-art models on four datasets: CelebA, CelebA-HQ, Places2, and Dunhuang.
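The sandwich structure mentioned above can be sketched as follows. This is an assumption-laden illustration, not the HINT implementation: the SCAL module is not specified in the abstract, so standard multi-head self-attention stands in for it, and the half-step residual scaling mirrors the Macaron-style blocks used in speech models.

```python
import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """Illustrative sandwich block: a feed-forward network before and after the
    attention layer. Plain multi-head attention is a placeholder for SCAL."""
    def __init__(self, dim: int, heads: int = 4, ffn_mult: int = 4):
        super().__init__()
        def ffn():
            return nn.Sequential(
                nn.LayerNorm(dim),
                nn.Linear(dim, dim * ffn_mult),
                nn.GELU(),
                nn.Linear(dim * ffn_mult, dim),
            )
        self.ffn_pre = ffn()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in for SCAL
        self.ffn_post = ffn()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, dim) token sequence
        x = x + 0.5 * self.ffn_pre(x)   # half-step FFN before attention (assumed, Macaron-style)
        y = self.norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + 0.5 * self.ffn_post(x)  # half-step FFN after attention
        return x
```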

Furthermore, capturing global contextual understanding is a crucial challenge in restoring missing regions of images with semantically coherent content. Recent advancements have incorporated transformers, leveraging their ability to model global interactions. However, these methods face computational inefficiencies and struggle to maintain fine-grained details. To overcome these challenges, we introduce ${M\times T}$, composed of the proposed Hybrid Module (HM), which combines Mamba with the transformer in a synergistic manner. The Selective State Space Model (SSM), known as Mamba, is adept at efficiently processing long sequences with linear computational cost, making it an ideal complement to the transformer for handling long-range data interactions. However, directly adopting the vanilla SSM does not resolve its inherent limitation of scanning the data unidirectionally, which leaves it without 2D spatial awareness. This insight introduces two key challenges: (i) how to maintain the continuity and consistency of pixel adjacency for pixel-level dependency learning during the SSM recurrence; and (ii) how to effectively integrate 2D spatial awareness into the predominantly linear recurrent SSMs. To solve these challenges, we spatially enhance the SSM and propose SEM-Net, which performs efficient pixel modelling for image inpainting through the Snake Mamba Block (SMB) and the Spatially-Enhanced Feedforward Network. These innovations enable SEM-Net to outperform state-of-the-art inpainting methods in capturing long-range dependencies and enhancing spatial consistency.
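Challenge (i) above concerns how a 2D feature map is flattened into the 1D sequence that an SSM processes recurrently. The Snake Mamba Block is not detailed in the abstract; the sketch below shows one common serpentine scan that keeps horizontally adjacent pixels adjacent in the sequence, offered only as an assumed illustration of the idea.

```python
import torch

def snake_flatten(x: torch.Tensor) -> torch.Tensor:
    """Illustrative serpentine (snake) scan: flatten a (B, C, H, W) feature map into
    a (B, H*W, C) sequence, reversing every other row so that pixels adjacent in the
    image remain adjacent in the 1D sequence fed to the recurrent SSM."""
    b, c, h, w = x.shape
    x = x.permute(0, 2, 3, 1).clone()        # (B, H, W, C)
    x[:, 1::2] = x[:, 1::2].flip(dims=[2])   # reverse the scan direction of odd rows
    return x.reshape(b, h * w, c)
```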


We validate the effectiveness of our methods through extensive experiments and qualitative analysis. Our approaches surpass the state-of-the-art (SoTA) in Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), L1 error, and Fréchet Inception Distance (FID). All contributions have been accepted by peer-reviewed conferences or journals.
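Two of the reported metrics, PSNR and L1 error, can be computed directly as sketched below for images in [0, 1]; SSIM, LPIPS, and FID require dedicated implementations (e.g. a structural similarity routine, a learned perceptual network, and a pretrained Inception network) and are omitted here.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio for images with values in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def l1_error(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error between the restored and ground-truth images."""
    return float(np.mean(np.abs(pred - target)))
```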

Item Type: Thesis (Doctoral)
Award: Doctor of Philosophy
Keywords: Deep Learning, Generative Model, Image Processing, Image Inpainting
Faculty and Department: Faculty of Science > Department of Computer Science
Thesis Date: 2025
Copyright: Copyright of this thesis is held by the author
Deposited On: 10 Nov 2025 10:03
