Ambroise Odonnat
In front of TUM in Munich
I am a second-year Ph.D. student in Paris, working jointly between Huawei Noah’s Ark Lab and Inria, supervised by Romain Tavenard, Laetitia Chapel, and Ievgen Redko.
I am interested in better understanding transformers through theoretical analysis and large-scale experiments on:
- Large language models (e.g., here and here)
- Transformers training and finetuning (e.g., here and here)
- Out-of-distribution generalization (e.g., here, here and here)
- Vision transformers and time series forecasting (see here)
I was lucky to receive an ICML Oral Award, an ICASSP Oral Award, and a QBIN Best Flash Talk Award for my research in these areas. On a more amusing (and surprising 🙃) note, one of my recent articles was featured in Forbes.
I enjoy working both in small collaborations and as part of larger teams, contributing to open-source libraries and communicating about my research. I maintain a research blog, logB, and have had the privilege of presenting my research at leading institutions such as EPFL, Cohere, ENS Ulm, and Criteo.
I graduated from École des Ponts ParisTech in 2023 and hold a master’s degree in Mathematics, Vision, and Machine Learning (MVA) from ENS Paris-Saclay.
Don’t hesitate to reach out for possible collaborations or questions regarding my research!
news
| Date | News |
|---|---|
| Jan 30, 2025 | 📑 New preprint on the training dynamics in Transformers: Clustering Heads. |
| Jan 22, 2025 | 🥳 DICL was accepted @ICLR 2025. |
| Dec 18, 2024 | 🥳 Easing Optimization Paths: A Circuit Perspective was accepted @ICASSP 2025. |
| Nov 12, 2024 | 🎉 Very happy to see Large Language Models as Markov Chains featured in Forbes! |
| Oct 02, 2024 | 📑 New preprint: Large Language Models as Markov Chains. |