On Inductive Biases for Machine Learning in Data-Constrained Settings


Learning with limited data is one of the biggest problems of deep learning. Currently popular approaches to this issue consist in training models on huge amounts of data, labelled or not, before re-training the model on a smaller dataset of interest from the same modality. Intuitively, this technique allows the model to first learn a general representation for some kind of data, such as images. Less data should then be required to learn a specific task for this particular modality. While this approach, known as transfer learning, is very effective in domains such as computer vision or natural language processing, it does not solve common problems of deep learning such as model interpretability or the overall need for data. This thesis explores a different answer to the problem of learning expressive models in data-constrained settings. Instead of relying on big datasets to learn the parameters of a neural network, we replace some of them with known functions reflecting the structure of the data. Very often, these functions are drawn from the rich literature of kernel methods. Indeed, many kernels can be interpreted and/or allow for learning with little data. Our approach falls under the umbrella of inductive biases, which can be defined as hypotheses on the data at hand that restrict the space of models to explore during learning. In the first two chapters of the thesis, we demonstrate the effectiveness of this approach in the context of sequences, such as sentences in natural language or protein sequences, and graphs, such as molecules. We also highlight the relationship between our work and recent advances in deep learning. The last chapter of this thesis focuses on convex machine learning models. Here, rather than proposing new models, we ask what proportion of the samples in a dataset is really needed to learn a good model.
More precisely, we study the problem of safe sample screening, i.e., executing simple tests to discard uninformative samples from a dataset even before fitting a machine learning model, without affecting the optimal model. Such techniques can be used to compress datasets or mine for rare samples.

PhD Thesis