Xiaoyu Lu

I'm a machine learning scientist at Amazon, working on a range of topics in machine learning and statistical modelling. Recently I have been working on explainable models to emulate complex supply chain systems. I am proficient in Python and AWS tools, including cloud computing, SageMaker, and state machines, and I am also familiar with data analytics tools such as data pipelines and SQL.

Prior to Amazon, I did my PhD in probabilistic machine learning at the University of Oxford, supervised by Prof. Yee Whye Teh in the Machine Learning group at the Department of Statistics, with research experience in generative models, Gaussian processes, MCMC, Bayesian inference, deep learning, recommender systems, and reinforcement learning. Before my PhD, I completed my undergraduate degree in Mathematics and Statistics at the University of Oxford (MMath), during which I topped the department in both the bachelor's year (3rd year) and the master's year (4th year). My fourth-year thesis was on recommender systems for movie recommendation using collaborative filtering.

I spent a summer at Microsoft Research, Cambridge as a research intern, working on a reinforcement learning project: we used a latent variable model in imitation learning to learn different playstyles in games. I also interned at Amazon, working on Bayesian optimisation in non-Euclidean spaces. I have also enjoyed internships as a quantitative researcher in the financial industry, where I worked on model validation and financial derivative pricing models.

Curriculum Vitae (last updated: Feb 2024)

Google Scholar page

LinkedIn

Personal E-mail: luxiaoyu644@gmail.com; Work E-mail: luxiaoyu@amazon.com

Publications

Daisee: Adaptive Importance Sampling by Balancing Exploration and Exploitation

Abstract: We study adaptive importance sampling (AIS) as an online learning problem and argue for the importance of the trade-off between exploration and exploitation in this adaptation. Borrowing ideas from the online learning literature, we propose Daisee, a partition-based AIS algorithm. We further introduce a notion of regret for AIS and show that Daisee has O(√T(logT)^(3/4)) cumulative pseudo-regret, where T is the number of iterations. We then extend Daisee to adaptively learn a hierarchical partitioning of the sample space for more efficient sampling and confirm the performance of both algorithms empirically.

Xiaoyu Lu, Tom Rainforth, Yee Whye Teh
Scandinavian Journal of Statistics, 2023.
pdf
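
As a flavour of the idea, here is a toy partition-based adaptive importance sampler in the spirit of Daisee, not the paper's exact algorithm: the space [0, 1] is split into K cells, and the cell-selection probabilities are adapted with a UCB-style exploration bonus. All names and the specific bonus are illustrative.

    import numpy as np

    def partition_ais(f, K=8, T=5000, seed=0):
        # Estimate mu = integral of f over [0, 1] with an adaptive proposal:
        # pick a cell by a UCB-style score, sample uniformly inside it, and
        # importance-weight against the induced proposal density.
        rng = np.random.default_rng(seed)
        edges = np.linspace(0.0, 1.0, K + 1)
        widths = np.diff(edges)
        counts = np.zeros(K)
        reward_sums = np.zeros(K)                 # observed |weighted f| per cell
        running = 0.0
        for t in range(1, T + 1):
            means = reward_sums / np.maximum(counts, 1.0)               # exploit
            bonus = np.sqrt(np.log(t + 1.0) / np.maximum(counts, 1.0))  # explore
            scores = means + bonus
            probs = scores / scores.sum()         # cell-selection distribution
            k = rng.choice(K, p=probs)
            x = rng.uniform(edges[k], edges[k] + widths[k])
            w = widths[k] / probs[k]              # weight: 1 / proposal density
            counts[k] += 1
            reward_sums[k] += abs(w * f(x))
            running += (w * f(x) - running) / t   # running estimate of mu
        return running

    print(partition_ais(lambda x: np.exp(-50 * (x - 0.3) ** 2)))  # ~ 0.25
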
Additive Gaussian Processes Revisited

Abstract: Gaussian Process (GP) models are a class of flexible non-parametric models that have rich representational power. By using a Gaussian process with additive structure, complex responses can be modelled whilst retaining interpretability. Previous work showed that additive Gaussian process models require high-dimensional interaction terms. We propose the orthogonal additive kernel (OAK), which imposes an orthogonality constraint on the additive functions, enabling an identifiable, low-dimensional representation of the functional relationship. We connect the OAK kernel to functional ANOVA decomposition, and show improved convergence rates for sparse computation methods. With only a small number of additive low-dimensional terms, we demonstrate the OAK model achieves similar or better predictive performance compared to black-box models, while retaining interpretability.

Xiaoyu Lu, Alexis Boukouvalas, James Hensman
ICML 2022.
pdf | bibtex | github
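
The core device is easy to state: each one-dimensional base kernel is projected so that GP samples integrate to zero under the input measure. A minimal numerical sketch (the paper derives closed forms for Gaussian measures; here Monte Carlo quadrature stands in, and all names are illustrative):

    import numpy as np

    def rbf(a, b, ell=1.0):
        # Squared-exponential kernel between two 1-D point sets.
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

    def orthogonal_rbf(x, y, ell=1.0, n_quad=500):
        # k_orth(x, y) = k(x, y) - E_s[k(x, s)] E_s[k(y, s)] / E_{s,s'}[k(s, s')],
        # with s ~ N(0, 1) approximated by Monte Carlo quadrature nodes.
        s = np.random.default_rng(0).standard_normal(n_quad)
        mx = rbf(x, s, ell).mean(axis=1)
        my = rbf(y, s, ell).mean(axis=1)
        mss = rbf(s, s, ell).mean()
        return rbf(x, y, ell) - np.outer(mx, my) / mss

    def additive_kernel(X, Y, ell=1.0):
        # First-order additive kernel: one constrained 1-D kernel per dimension.
        return sum(orthogonal_rbf(X[:, i], Y[:, i], ell) for i in range(X.shape[1]))
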
Causal Bayesian Optimization

Abstract: This paper studies the problem of globally optimizing a variable of interest that is part of a causal model in which a sequence of interventions can be performed. This problem arises in biology, operational research, communications and, more generally, in all fields where the goal is to optimize an output metric of a system of interconnected nodes. Our approach combines ideas from causal inference, uncertainty quantification and sequential decision making. In particular, it generalizes Bayesian optimization, which treats the input variables of the objective function as independent, to scenarios where causal information is available. We show how knowing the causal graph significantly improves the ability to reason about optimal decision making strategies, decreasing the optimization cost while avoiding suboptimal solutions. We propose a new algorithm called Causal Bayesian Optimization (CBO). CBO automatically balances two trade-offs: the classical exploration-exploitation and the new observation-intervention, which emerges when combining real interventional data with the estimated intervention effects computed via do-calculus. We demonstrate the practical benefits of this method in a synthetic setting and in two real-world applications.

Virginia Aglietti, Xiaoyu Lu, Andrei Paleyes, Javier González
AISTATS 2020.
pdf | bibtex
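
A toy rendition of the observation-intervention trade-off, purely illustrative and far simpler than CBO itself: on an unconfounded one-edge graph (where observational data is directly informative about the causal effect), the budget shifts from cheap passive observation towards targeted interventions as evidence accumulates.

    import numpy as np

    def toy_cbo(T=120, noise=0.1, seed=0):
        # Optimise E[Y | do(X = x)] over a grid for a one-edge graph X -> Y.
        # With no confounding, P(Y | X) = P(Y | do(X)), so observational and
        # interventional (x, y) pairs update the same per-bin estimates.
        rng = np.random.default_rng(seed)
        effect = lambda x: -(x - 1.0) ** 2            # unknown causal effect
        grid = np.linspace(-2.0, 2.0, 41)
        mean = np.zeros_like(grid)                    # per-bin running mean
        count = np.zeros_like(grid)
        for t in range(1, T + 1):
            if rng.random() < 1.0 / np.sqrt(t):       # observe: passive sample
                x = rng.normal()
            else:                                     # intervene: UCB over grid
                ucb = mean + 1.0 / np.sqrt(count + 1.0)
                x = grid[int(np.argmax(ucb))]
            y = effect(x) + noise * rng.normal()
            k = int(np.argmin(np.abs(grid - x)))      # nearest grid bin
            count[k] += 1
            mean[k] += (y - mean[k]) / count[k]
        visited = np.where(count > 0, mean, -np.inf)
        return grid[int(np.argmax(visited))]

    print(toy_cbo())   # typically a grid point near the true optimum x = 1.0
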
Structure Mapping for Transferability of Causal Models

Abstract: Human beings learn causal models and constantly use them to transfer knowledge between similar environments. We use this intuition to design a transfer-learning framework using object-oriented representations to learn the causal relationships between objects. A learned causal dynamics model can be used to transfer between variants of an environment with exchangeable perceptual features among objects but with the same underlying causal dynamics. We adapt continuous optimization for structure learning techniques to explicitly learn the cause and effects of the actions in an interactive environment and transfer to the target domain by categorization of the objects based on causal knowledge. We demonstrate the advantages of our approach in a gridworld setting by combining a causal model-based approach with a model-free approach in reinforcement learning.

Purva Pruthi, Javier González, Xiaoyu Lu, Madalina Fiterau
ICML Workshop on Inductive Biases, Invariances and Generalization in Reinforcement Learning, 2020.
pdf | bibtex
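
The continuous structure-learning machinery the paper adapts rests on a smooth acyclicity penalty over weighted adjacency matrices (the NOTEARS device of Zheng et al., 2018): h(W) = tr(exp(W ∘ W)) − d, which is zero exactly when W encodes a DAG. A minimal check:

    import numpy as np
    from scipy.linalg import expm

    def acyclicity(W):
        # h(W) = tr(exp(W * W)) - d, with * the elementwise (Hadamard) product.
        # h(W) = 0 iff the weighted graph W is a DAG, and h is differentiable,
        # so it can serve as a constraint in gradient-based structure learning.
        return np.trace(expm(W * W)) - W.shape[0]

    W_dag = np.array([[0.0, 1.0], [0.0, 0.0]])   # edge 0 -> 1: acyclic
    W_cyc = np.array([[0.0, 1.0], [1.0, 0.0]])   # 0 <-> 1: cyclic
    print(acyclicity(W_dag), acyclicity(W_cyc))  # 0.0 and > 0
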
FactoredRL: Leveraging factored graphs for deep reinforcement learning

Abstract: We propose a simple class of deep reinforcement learning (RL) methods, called FactoredRL, that can leverage factored environment structures to improve the sample efficiency of existing model-based and model-free RL algorithms. In tabular and linear approximation settings, the factored Markov decision process literature has shown exponential improvements in sample efficiency by leveraging factored environment structures. We extend this to deep RL algorithms that use neural networks. For model-based algorithms, we use the factored structure to inform the state transition network architecture, and for model-free algorithms we use the factored structure to inform the Q network or the policy network architecture. We demonstrate that doing this significantly improves sample efficiency in both discrete and continuous state-action space settings.

Bharathan Balaji, Petros Christodoulou, Xiaoyu Lu, Byungsoo Jeon, Jordan Bell-Masterson
NeurIPS Workshop on Deep Reinforcement Learning, 2020.
pdf | bibtex
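
The architectural idea is simple to sketch: rather than one monolithic Q(s, a), each action dimension gets a head that only sees the state variables that are its parents in the factored MDP. A minimal illustration (the parent sets, sizes, and class name are hypothetical):

    import torch
    import torch.nn as nn

    class FactoredQNet(nn.Module):
        # One Q head per action dimension, wired to that dimension's parent
        # state variables only, so each head learns from a smaller input.
        def __init__(self, parents, n_actions_per_dim=4, hidden=32):
            super().__init__()
            self.parents = list(parents.values())    # e.g. [[0, 1], [2]]
            self.heads = nn.ModuleList(
                nn.Sequential(nn.Linear(len(p), hidden), nn.ReLU(),
                              nn.Linear(hidden, n_actions_per_dim))
                for p in self.parents)

        def forward(self, state):                    # state: (batch, state_dim)
            return [head(state[:, p]) for p, head in zip(self.parents, self.heads)]

    q = FactoredQNet({0: [0, 1], 1: [2]})
    print([tuple(t.shape) for t in q(torch.randn(5, 3))])   # [(5, 4), (5, 4)]
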
Structured Variationally Auto-encoded Optimization

Abstract: We tackle the problem of optimizing a black-box objective function defined over a highly-structured input space. This problem is ubiquitous in science and engineering. In machine learning, inferring the structure of a neural network or the Automatic Statistician (AS), where the optimal kernel combination for a Gaussian process is selected, are two important examples. We use the AS as a case study to describe our approach, which can easily be generalized to other domains. We propose a Structure Generating Variational Auto-encoder (SG-VAE) to embed the original space of kernel combinations into a low-dimensional continuous manifold where Bayesian optimization (BO) ideas are used. This is possible when structural knowledge of the problem is available, which can be given via a simulator or any other form of generating potentially good solutions. The right exploration-exploitation balance is imposed by propagating into the search the uncertainty of the latent space of the SG-VAE, which is computed using variational inference. The key aspect of our approach is that the SG-VAE can be used to bias the search towards relevant regions, making it suitable for transfer learning tasks. Several experiments in various application domains are used to illustrate the utility and generality of the approach described in this work.

Xiaoyu Lu, Javier González, Zhenwen Dai, Neil D. Lawrence
ICML 2018.
pdf | bibtex
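
A toy of the overall recipe, with a fixed random projection standing in for the trained SG-VAE encoder and a UCB acquisition standing in for the paper's uncertainty propagation (all choices here are illustrative):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    rng = np.random.default_rng(0)
    structures = rng.integers(0, 2, size=(200, 10))      # discrete candidates
    objective = lambda s: -np.abs(s.sum(axis=1) - 5.0)   # black-box score
    encoder = rng.normal(size=(10, 2))                   # stand-in for SG-VAE
    Z = structures @ encoder                             # continuous embeddings

    idx = list(rng.choice(200, size=5, replace=False))   # initial design
    for _ in range(20):                                  # BO in latent space
        gp = GaussianProcessRegressor().fit(Z[idx], objective(structures[idx]))
        mu, sd = gp.predict(Z, return_std=True)
        ucb = mu + sd
        ucb[idx] = -np.inf                               # never re-evaluate
        idx.append(int(np.argmax(ucb)))                  # decode = look up row
    best = structures[idx[int(np.argmax(objective(structures[idx])))]]
    print(best.sum())                                    # close to 5
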
On Exploration, Exploitation and Learning in Adaptive Importance Sampling

Abstract: We study adaptive importance sampling (AIS) as an online learning problem and argue for the importance of the trade-off between exploration and exploitation in this adaptation. Borrowing ideas from the bandits literature, we propose Daisee, a partition-based AIS algorithm. We further introduce a notion of regret for AIS and show that Daisee has O(√T(logT)^(3/4)) cumulative pseudo-regret, where T is the number of iterations. We then extend Daisee to adaptively learn a hierarchical partitioning of the sample space for more efficient sampling and confirm the performance of both algorithms empirically.

Xiaoyu Lu, Tom Rainforth, Yuan Zhou, Jan-Willem van de Meent, Yee Whye Teh
arXiv 2018.
pdf | bibtex
Inference trees: Adaptive inference with exploration

Abstract: We introduce inference trees (ITs), a new class of inference methods that build on ideas from Monte Carlo tree search to perform adaptive sampling in a manner that balances exploration with exploitation, ensures consistency, and alleviates pathologies in existing adaptive methods. ITs adaptively sample from hierarchical partitions of the parameter space, while simultaneously learning these partitions in an online manner. This enables ITs to not only identify regions of high posterior mass, but also maintain uncertainty estimates to track regions where significant posterior mass may have been missed. ITs can be based on any inference method that provides a consistent estimate of the marginal likelihood. They are particularly effective when combined with sequential Monte Carlo, where they capture long-range dependencies and yield improvements beyond proposal adaptation alone.

Tom Rainforth, Yuan Zhou, Xiaoyu Lu, Yee Whye Teh, Frank Wood, Hongseok Yang, Jan-Willem van de Meent
arXiv 2018.
pdf | bibtex
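
A compact caricature of the mechanism, far simpler than ITs proper: a partition of [0, 1] whose cells are sampled with an exploration bonus and split once they are well-sampled, so sampling effort and resolution both concentrate where posterior mass is found (all thresholds and names are illustrative):

    import numpy as np

    def inference_tree_sketch(log_post, T=3000, seed=0):
        rng = np.random.default_rng(seed)
        cells = [(0.0, 1.0)]                   # leaf cells partitioning [0, 1]
        sums, counts = [0.0], [0]
        for t in range(1, T + 1):
            est = np.array(sums) / np.maximum(counts, 1)   # cell mass estimates
            bonus = np.sqrt(np.log(t + 1) / np.maximum(counts, 1))
            k = int(np.argmax(est + bonus))    # exploit mass, explore new cells
            a, b = cells[k]
            x = rng.uniform(a, b)
            sums[k] += np.exp(log_post(x)) * (b - a)   # one-sample cell mass
            counts[k] += 1
            if counts[k] >= 100:               # refine well-sampled cells
                cells[k] = (a, (a + b) / 2)
                sums[k], counts[k] = 0.0, 0
                cells.append(((a + b) / 2, b))
                sums.append(0.0)
                counts.append(0)
        masses = np.array(sums) / np.maximum(counts, 1)
        return cells, masses / masses.sum()

    cells, masses = inference_tree_sketch(lambda x: -100 * (x - 0.7) ** 2)
    print(cells[int(np.argmax(masses))])       # a small cell near x = 0.7
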
Relativistic Monte Carlo

Abstract: Hamiltonian Monte Carlo (HMC) is a popular Markov chain Monte Carlo (MCMC) algorithm that generates proposals for a Metropolis-Hastings algorithm by simulating the dynamics of a Hamiltonian system. However, HMC is sensitive to large time discretizations and performs poorly if there is a mismatch between the spatial geometry of the target distribution and the scales of the momentum distribution. In particular, the mass matrix of HMC is hard to tune well. In order to alleviate these problems we propose relativistic Hamiltonian Monte Carlo, a version of HMC based on relativistic dynamics that introduce a maximum velocity on particles. We also derive stochastic gradient versions of the algorithm and show that the resulting algorithms bear interesting relationships to gradient clipping, RMSprop, Adagrad and Adam, popular optimisation methods in deep learning. Based on this, we develop relativistic stochastic gradient descent by taking the zero-temperature limit of relativistic stochastic gradient Hamiltonian Monte Carlo. In experiments we show that the relativistic algorithms perform better than classical Newtonian variants and Adam.

Xiaoyu Lu, Valerio Perrone, Leonard Hasenclever, Yee Whye Teh, Sebastian J. Vollmer
AISTATS 2017.
pdf | bibtex
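
The key equation is the relativistic kinetic energy K(p) = m c² √(p²/(m²c²) + 1), whose gradient gives a velocity with norm capped at c, so a single bad gradient cannot fling the sampler arbitrarily far. A sketch of the corresponding leapfrog integrator (a full sampler would wrap this in a Metropolis-Hastings correction):

    import numpy as np

    def relativistic_leapfrog(x, p, grad_U, eps=0.1, L=10, m=1.0, c=1.0):
        # Velocity dK/dp = p / (m * sqrt(|p|^2 / (m^2 c^2) + 1)) has norm < c,
        # which is what makes the dynamics robust to large time steps and
        # mismatched scales, and relates the method to gradient clipping.
        for _ in range(L):
            p = p - 0.5 * eps * grad_U(x)             # half momentum step
            v = p / (m * np.sqrt(np.sum(p ** 2) / (m ** 2 * c ** 2) + 1.0))
            x = x + eps * v                           # position step, |v| < c
            p = p - 0.5 * eps * grad_U(x)             # half momentum step
        return x, p

    # Standard normal target: U(x) = x^2 / 2, grad_U(x) = x.
    x, p = relativistic_leapfrog(np.array([3.0]), np.array([0.5]), lambda x: x)
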
Collaborative Filtering with Side Information: a Gaussian Process Perspective

Abstract: We tackle the problem of collaborative filtering (CF) with side information, through the lens of Gaussian Process (GP) regression. Driven by the idea of using the kernel to explicitly model user-item similarities, we formulate the GP in a way that allows the incorporation of low-rank matrix factorisation, arriving at our model, the Tucker Gaussian Process (TGP). Consequently, TGP generalises classical Bayesian matrix factorisation models, and goes beyond them to give a natural and elegant method for incorporating side information, giving enhanced predictive performance for CF problems. Moreover, we show that it is a novel model for regression, especially well-suited to grid-structured data and problems where the dependence on covariates is close to being separable.

Hyunjik Kim, Xiaoyu Lu, Seth Flaxman, Yee Whye Teh
arXiv 2016.
pdf | bibtex
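
The kernel view is the easiest way to see the connection to matrix factorisation: the prior covariance between two ratings factorises into a user kernel times an item kernel, each built from low-rank embeddings plus optional side information. A minimal sketch (shapes and the linear kernels are illustrative; the paper's Tucker construction is more general):

    import numpy as np

    rng = np.random.default_rng(0)
    U = rng.normal(size=(50, 4))       # user embeddings (low-rank factors)
    V = rng.normal(size=(80, 4))       # item embeddings
    S = rng.normal(size=(80, 3))       # item side information, e.g. genres

    users = np.array([0, 1, 2])
    items = np.array([10, 10, 20])
    # Covariance between ratings (users[i], items[i]) and (users[j], items[j]):
    # a user kernel multiplied by an item kernel that also sees side information.
    K = (U[users] @ U[users].T) * (V[items] @ V[items].T + S[items] @ S[items].T)
    print(K.shape)                     # (3, 3); GP regression on observed
                                       # ratings then generalises Bayesian MF
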