Posts by Collection




INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving

Published in International Conference on Learning Representations (ICLR 2021), 2021

In learning-assisted theorem proving, one of the most critical challenges is to gener- alize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark designed to test agents’ generalization ability. INT is based on a theorem generator, which provides theoretically infinite data and allows us to measure 6 different types of generalization, each reflecting a distinct challenge, characteristic of automated theorem proving. In addition, INT provides a fast theorem proving environment with sequence-based and graph-based interfaces, conducive to performing learning-based research. We introduce base- lines with architectures including transformers and graph neural networks (GNNs) for INT. Using INT, we find that transformer-based agents achieve stronger test performance for most of the generalization tasks, despite having much larger out- of-distribution generalization gaps than GNNs. We further find that the addition of Monte Carlo Tree Search (MCTS) at test time helps to prove new theorems.

Download here

LISA: Language models of ISAbelle proofs

Published in 6th Conference on Artificial Intelligence and Theorem Proving (AITP 2021), 2021

We introduce an environment that allows interaction with an Is- abelle server in an incremental manner. With this environment, we mined the Isabelle standard library and the Archive of Formal Proofs (AFP) and extracted 183K lemmas and theorems. We built language models on this large corpus and showed their effectiveness in proving AFP theorems.

Download here

Can Network Flatness Explain the Training Speed-Generalisation Connection?

Published in Bayesian Deep Learning Workshop at the Thirty-fifth Conference on Neural Information Processing Systems (Neurips 2021), 2021

Recent work has shown that training speed, as estimated by the sum over training loss, is predictive of generalization performance. From a Bayesian perspective, this metric can be theoretically linked to marginal likelihood in linear models. However, it is unclear why the relationship holds for DNNs and what the underlying mechanisms are. We hypothesise that this relationship holds in DNNs because of network flatness, which causes both fast training speed and good generalization. We also investigated the hypothesis in varying settings and found that it might hold when the variance in the stochastic gradient estimation is moderate, with either logit averaging, or no data transformation at all. This paper specifies the conditions future works should impose when investigating the connecting mechanism.

Download here

Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers

Published in NeurIPS 2022, 2022

In theorem proving, the task of selecting useful premises from a large library to unlock the proof of a given conjecture is crucially important. This presents a challenge for all theorem provers, especially the ones based on language models, due to their relative inability to reason over huge volumes of premises in text form. This paper introduces Thor, a framework integrating language models and automated theorem provers to overcome this difficulty. In Thor, a class of methods called hammers that leverage the power of automated theorem provers are used for premise selection, while all other tasks are designated to language models. Thor increases a language model’s success rate on the PISA dataset from 39% to 57%, while solving 8.2% of problems neither language models nor automated theorem provers are able to solve on their own. Furthermore, with a significantly smaller computational budget, Thor can achieve a success rate on the MiniF2F dataset that is on par with the best existing methods. Thor can be instantiated for the majority of popular interactive theorem provers via a straightforward protocol we provide.

Download here

Autoformalization with Large Language Models

Published in NeurIPS 2022, 2022

Autoformalization is the process of automatically translating from natural lan- guage mathematics to formal specifications and proofs. A successful autoformal- ization system could advance the fields of formal verification, program synthesis, and artificial intelligence. While the long-term goal of autoformalization seemed elusive for a long time, we show large language models provide new prospects towards this goal. We make the surprising observation that LLMs can correctly translate a significant portion (25.3%) of mathematical competition problems per- fectly to formal specifications in Isabelle/HOL. We demonstrate the usefulness of this process by improving a previously introduced neural theorem prover via training on these autoformalized theorems. Our methodology results in a new state-of-the-art result on the MiniF2F theorem proving benchmark, improving the proof rate from 29.6% to 35.2%.

Download here

Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

Published in Arxiv preprint, 2022

The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and only accessible to a few experts. While previous studies to automate formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3% on a collection of mathematical competition problems.

Download here


Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.