
Durham e-Theses

Scalable Methodologies and Analyses for Modality Bias and Feature Exploitation in Language-Vision Multimodal Deep Learning

WINTERBOTTOM, THOMAS,IAIN (2023) Scalable Methodologies and Analyses for Modality Bias and Feature Exploitation in Language-Vision Multimodal Deep Learning. Doctoral thesis, Durham University.

PDF (Thesis) - Accepted Version
Available under License Creative Commons Attribution Share Alike 3.0 (CC BY-SA).



Multimodal machine learning benchmarks have grown exponentially in both capability and popularity over the last decade. Language-vision question-answering tasks such as Visual Question Answering (VQA) and Video Question Answering (video-QA) have, thanks to their high difficulty, become a particularly popular means through which to develop and test new modelling designs and methodology for multimodal deep learning. The challenging nature of VQA and video-QA tasks leaves plenty of room for innovation at every component of the deep learning pipeline, from dataset to modelling methodology. Such circumstances are ideal for innovating in the space of language-vision multimodality. Furthermore, the wider field is currently undergoing an incredible period of growth and increasing interest. I therefore aim to contribute to multiple key components of the VQA and video-QA pipeline, but specifically in a manner such that my contributions remain relevant, ‘scaling’ with the revolutionary new benchmark models and datasets of the near future instead of being rendered obsolete by them. The work in this thesis: highlights and explores the disruptive and problematic presence of language bias in the popular TVQA video-QA dataset, and proposes a dataset-invariant method to identify subsets that respond to different modalities; thoroughly explores the suitability of bilinear pooling as a language-vision fusion technique in video-QA, offering experimental and theoretical insight, and highlighting the parallels in multimodal processing with neurological theories; explores the nascent visual equivalent of language modelling (‘visual modelling’) in order to boost the power of visual features; and proposes a dataset-invariant neurolinguistically-inspired labelling scheme for use in multimodal question-answering. I explore the positive and negative results that my experiments across this thesis yield. I conclude by discussing the limitations of my contributions and proposing future directions of study in the areas I contribute to.

Item Type: Thesis (Doctoral)
Award: Doctor of Philosophy
Keywords: Deep Learning, Multimodal Deep Learning, Multimodality, Modality Bias, Textual Bias, VQA, Video-QA, Generative Pretraining, Neurolinguistic Machine Learning
Faculty and Department: Faculty of Science > Computer Science, Department of
Thesis Date: 2023
Copyright: Copyright of this thesis is held by the author
Deposited On: 29 Mar 2023 16:02
