Sensitive data are often owned and stored in a decentralised fashion, and jointly learning from such data often requires sharing sensitive information across servers. A typical example is healthcare data, which are sensitive and collected by different medical institutes. Learning from decentralised data in a private way poses at least two challenges: 1) balancing the privacy-utility tradeoff, and 2) handling the heterogeneity among different servers. In this talk, I will discuss federated differential privacy, which is designed specifically for this challenging task. I will start with general background, followed by a few of my papers on federated differential privacy.
Learning from imperfect information remains a fundamental challenge for the reliable and safe deployment of machine learning (ML) systems. In this talk, we present an overview of our work on three key directions: weakly supervised learning, adaptation under distribution shift, and reward modeling for reinforcement learning. We discuss methodological advances in each area and highlight how these approaches contribute to improving the robustness and trustworthiness of ML systems.
Semi-supervised learning (SSL) is the problem of inferring missing labels in a partially labelled data set. The heuristic one uses is that "similar feature vectors should have similar labels". The notion of similarity between feature vectors explored in this talk comes from a graph-based geometry in which an edge is placed between feature vectors that are closer than some connectivity radius. A natural variational solution to SSL is to minimise a Dirichlet energy built from the graph topology, and a natural question is what happens as the number of feature vectors goes to infinity. In this talk I will give results on the asymptotics of graph-based SSL using an optimal transport topology. The results will include a lower bound on the number of labels needed for consistency and, time permitting, some recent extensions to infinite-dimensional settings.
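As a minimal sketch of the variational approach described above (a hypothetical helper, not the speaker's implementation; it assumes Euclidean feature vectors and real-valued labels with unknowns marked `nan`), one can build the radius graph and minimise the Dirichlet energy `u^T L u` with the known labels held fixed, which reduces to a linear system in the unlabelled entries:

```python
import numpy as np

def graph_ssl(X, labels, radius):
    """Label propagation by Dirichlet energy minimisation on a radius graph.
    labels: float array with known values and np.nan for unknown entries."""
    # Adjacency: edge between feature vectors closer than the connectivity radius
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    W = (D < radius).astype(float)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W      # unnormalised graph Laplacian
    known = ~np.isnan(labels)
    # Minimising u^T L u with u fixed on labelled nodes gives the system
    #   L_uu u_u = -L_uk u_k
    u = labels.copy()
    L_uu = L[np.ix_(~known, ~known)]
    L_uk = L[np.ix_(~known, known)]
    u[~known] = np.linalg.solve(L_uu, -L_uk @ labels[known])
    return u
```

The sketch assumes every unlabelled node is connected, through the graph, to at least one labelled node; otherwise the system is singular, which is precisely why lower bounds on the number of labels matter.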
The problem of predicting unobserved entries of a binary data matrix is known as 1-bit matrix completion. We develop an empirical Bayes method for 1-bit matrix completion motivated by the Efron–Morris estimator for a normal mean matrix, a matrix generalization of the James–Stein estimator that shrinks the singular values towards zero. The proposed method exploits an underlying low-rank structure of binary matrices, similarly to multidimensional item response theory. Simulation studies and real-data applications demonstrate that the proposed method performs well in terms of both prediction accuracy and uncertainty quantification.
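To illustrate the singular-value shrinkage mentioned above, here is a sketch in the spirit of the Efron–Morris estimator for an n x p normal mean matrix (the positive-part rule and the shrinkage constant c = n - p - 1 are one standard choice, assumed here for illustration; this is not the abstract's 1-bit method itself):

```python
import numpy as np

def efron_morris_shrink(X):
    """Shrink the singular values of X towards zero, Efron–Morris style.
    Assumes X is n x p with n > p + 1."""
    n, p = X.shape
    c = n - p - 1
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Each singular value sigma is mapped to max(sigma - c/sigma, 0),
    # so small singular values are set to zero and large ones barely move.
    s_shrunk = np.maximum(s - c / s, 0.0)
    return U @ np.diag(s_shrunk) @ Vt
```

Because small singular values are thresholded to zero, the estimator automatically favours low-rank fits, which is the structure the abstract's binary-matrix method exploits.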
As interaction with the physical world is inevitable in robot learning, ensuring safety becomes a critical concern. Consequently, efficient data collection and robust model training are essential to minimize risks to humans, robots, and the surrounding environment. In this talk, I will present our recent work on sample‑efficient learning methods and robust training strategies that advance safe and reliable robot learning.
In many high-risk applications, reliable probability estimates of predictions from machine learning models are crucial, and calibration is a standard measure of this reliability. In this talk, I will introduce basic concepts of calibration and present our recent work on generalization error analysis of calibration measures and their connections to boosting and neural networks.
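As a concrete instance of a calibration measure, the following sketch computes the binned expected calibration error (ECE) for binary predictions: the frequency-weighted gap between predicted confidence and empirical accuracy within each confidence bin. This is a standard textbook estimator, assumed here for illustration rather than the specific measures analysed in the talk:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary outcomes: sum over bins of
    (fraction of points in bin) * |mean confidence - empirical accuracy|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi], with the first bin closed at 0
        in_bin = (probs > lo) & (probs <= hi) if lo > 0 else (probs <= hi)
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

A perfectly calibrated predictor has ECE zero; a predictor that is confidently wrong everywhere has ECE close to one.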
Synthetic data are increasingly used in computational statistics and machine learning, with applications ranging from privacy protection to data augmentation and method development; a particular interest lies in anomaly detection. Synthetic data should reflect the underlying distribution of the real data, being faithful while also showing some variability. In this talk we focus on networks as a data type, such as networks of transactions between agents; this data type poses additional challenges due to the complex dependence it often represents. The talk will present a new idea for synthetic network generation, together with a statistical method for assessing the quality of the generated networks. Theoretical guarantees for both the quality assessment and the data generation are based on Stein's method, and the talk will touch on these guarantees. It will conclude with some ideas for non-network data generation. This talk is based on joint work with Wenkai Xu.