XIAO, CHENGHAO (2026) Contrastive Sentence Representation Learning: Retrieval, Reasoning and Perception. Doctoral thesis, Durham University.
Abstract
Representation learning has been central to many NLP and multimodal tasks. This thesis presents a humble but systematic exploration of a specific form of ideal representations, i.e., representations that enable similarity search in Euclidean space.
Central to the training of this form of representation model is a technique called contrastive learning, whose spirit is to push instances that ought to have similar semantics closer together in the representation space, while pulling irrelevant ones away. Representation models trained with this technique facilitate a wide range of language-only and multimodal applications, such as search engines.
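To make this objective concrete, below is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss of the kind commonly used to train such representation models. The function name, the temperature value, and the in-batch-negatives setup are illustrative assumptions, not the thesis's exact training recipe.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor_emb, positive_emb, temperature=0.05):
    """Minimal in-batch contrastive (InfoNCE-style) loss sketch.

    anchor_emb, positive_emb: (batch_size, dim) sentence embeddings;
    row i of each is assumed to form a positive pair, and all other
    rows in the batch act as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the positives.
    logits = anchor @ positive.T / temperature

    # Each anchor should "retrieve" its own positive (index i),
    # pulling it closer while pushing the other rows away.
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```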
This thesis studies representation models trained by contrastive learning from three progressive perspectives:
It first visits the fundamentals of contrastive learning (Chapter 4, Chapter 5), focusing on the mechanism of why it works, through theoretical properties such as isotropy, contextualization, and learning dynamics. It also studies how these properties connect to further behaviors, such as models' length generalization. Using these insights, an unsupervised contrastive learning approach is presented, which reached state-of-the-art performance on information retrieval benchmarks at the time of its release.
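As one illustration of the kind of geometric property examined in these chapters, the sketch below computes a common isotropy proxy: the mean pairwise cosine similarity of a set of embeddings. The function name and the choice of this particular proxy are illustrative assumptions, not the thesis's exact diagnostic.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of embeddings.

    A value near 0 suggests a roughly isotropic (directionally uniform)
    embedding space; a value near 1 indicates the anisotropic "narrow
    cone" geometry often reported for off-the-shelf language-model
    embeddings before contrastive training.
    """
    emb = F.normalize(embeddings, dim=-1)            # (n, dim)
    sims = emb @ emb.T                               # (n, n) cosine matrix
    n = sims.size(0)
    off_diag = sims.sum() - sims.diagonal().sum()    # exclude self-similarity
    return off_diag / (n * (n - 1))
```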
As expectations of representation models move beyond traditional capabilities such as retrieval, new challenges have emerged, especially in the context of collaborating with LLMs through paradigms such as retrieval-augmented generation (RAG). This calls into question embedding models' generalization to out-of-distribution (OOD) tasks (e.g., can they understand reasoning-level expressions?) and their instruction-following capabilities. At that time, my co-authors and I were the first to call for measuring the reasoning capabilities of embedding models, proposing a visionary paradigm we termed Reasoning as Retrieval (Chapter 6), which was later widely adopted by the field.
From benchmarking reasoning and instruction-following capabilities, I saw the massive potential of generative models' representational power, as opposed to the previously widespread adoption of BERT-based and CLIP-based representation models. In March 2024 I came to the belief that "training representation models is aligning their representational capabilities with their generative capabilities", and my research interest grew into proving that the upper bound of this alignment increases as models are grounded in more and more modalities (Chapter 7, Chapter 8), e.g., going from LLMs to MLLMs. We proposed pixel sentence representation learning, a unified framework for modeling the semantics of visual texts, and trained Pixel Linguist, a powerful model that can understand visual representations of text. We then built the largest image and multimodal benchmark, MIEB (Massive Image Embedding Benchmark), defined by 8 capability categories we see as necessary to measure in multimodal representation models in the new era, incorporating 130 image/multimodal tasks in 38 languages. In the context of multimodality, we again saw concrete evidence that generative models' representational capabilities can be activated with orders-of-magnitude less alignment training than the CLIP paradigm requires, revealing a latent alignment built during generative pretraining that only needs to be activated to become similarity-matchable. We also saw early signs of a scaling law between models' performance on generative benchmarks and their representational upper bound after contrastive learning. Such topics are being actively studied at the time of writing.
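To illustrate what using a generative model as a representation model can mean in practice, here is a minimal, assumption-laden sketch: a decoder-only language model (accessed through the Hugging Face transformers API; the checkpoint name below is a placeholder) is turned into a sentence encoder by mean-pooling its hidden states, and the resulting embeddings would then be refined with a contrastive objective like the one sketched earlier. This is a generic recipe for intuition, not the thesis's exact method.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint name; any decoder-only LLM whose hidden
# states are accessible would play the same role.
MODEL_NAME = "some-decoder-only-llm"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
# Many LLM tokenizers ship without a pad token; a common workaround is:
# tokenizer.pad_token = tokenizer.eos_token

def embed(sentences):
    """Pool a generative LM's hidden states into sentence embeddings.

    Mean pooling over non-padding tokens is one common choice; the
    vectors produced here would subsequently be aligned with a
    contrastive objective such as the InfoNCE-style loss above.
    """
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state     # (b, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)      # (b, seq, 1)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts                            # (b, dim)
```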
The above marks the very beginning of a journey toward the ultimate form of a generalizable, omni-modal representation model.
| Item Type: | Thesis (Doctoral) |
|---|---|
| Award: | Doctor of Philosophy |
| Keywords: | contrastive learning, representation learning |
| Faculty and Department: | Faculty of Science > Department of Computer Science |
| Thesis Date: | 2026 |
| Copyright: | Copyright of this thesis is held by the author |
| Deposited On: | 25 Feb 2026 09:14 |



