Thoughts

I want to achieve mathematical superintelligence, so why am I doing LLMs now?

10 December 2024

I’ve recently defended my PhD, which is on AI for formal mathematics. Now I’m a full-time researcher at Mistral AI, working on reasoning. A significant shift in my work is that now I spend my working hours on informal reasoning rather than formal mathematics inside proof assistants, although I maintain some connections with academia and advise on formal mathematics projects.

To justify this shift in focus (mainly for myself), let me try to lay down some thoughts.

How is mathematics done differently by humans and AIs

The three stages of mathematics

One can roughly divide mathematical education into three stages:
The “pre-rigorous” stage, in which mathematics is taught in an informal, intuitive manner, based on examples, fuzzy notions, and hand-waving. (For instance, calculus is usually first introduced in terms of slopes, areas, rates of change, and so forth.) The emphasis is more on computation than on theory. This stage generally lasts until the early undergraduate years.
The “rigorous” stage, in which one is now taught that in order to do maths “properly”, one needs to work and think in a much more precise and formal manner (e.g. re-doing calculus by using epsilons and deltas all over the place). The emphasis is now primarily on theory; and one is expected to be able to comfortably manipulate abstract mathematical objects without focusing too much on what such objects actually “mean”. This stage usually occupies the later undergraduate and early graduate years.
The “post-rigorous” stage, in which one has grown comfortable with all the rigorous foundations of one’s chosen field, and is now ready to revisit and refine one’s pre-rigorous intuition on the subject, but this time with the intuition solidly buttressed by rigorous theory. (For instance, in this stage one would be able to quickly and accurately perform computations in vector calculus by using analogies with scalar calculus, or informal and semi-rigorous use of infinitesimals, big-O notation, and so forth, and be able to convert all such calculations into a rigorous argument whenever required.) The emphasis is now on applications, intuition, and the “big picture”. This stage usually occupies the late graduate years and beyond.
— Terence Tao, There’s more to mathematics than rigour and proofs

During the PhD viva, one of my examiners asked something in the vein of “Isn’t there a danger of doing futile work by teaching formal mathematics to AI when the final form of humans mathematics is post-rigorous?”. I answered by asking my two examiners which stage they think current SoTA AI is at. They agreed that it’s somewhere between pre-rigorous and rigorous. “So”, I smiled confidently, “formal mathematics is the necessary path to post-rigorous artificial intelligence.”

There is a huge issue with this answer - it assumed that humans and AIs take the same path to mathematical intelligence. Let me now take no bias nor presumption and contrast the two paths analytically.

The spice: signal

There is no question that a good signal is the key to intelligence. One can almost say that intelligence is proportional to (maybe $\log$) data quatity $\times$ signal quality. A good signal is one that is accurate, informative, and diverse. Signal quality also matters more in later stages of model training.

Take the classical regime of LLM training for example:

Pretraining: huge amount of diverse, relatively unfiltered data. The signal is the one-hot encoding of the next token (assuming no distillation). Sometimes the signal is just straight bad, e.g., you train on the hundreds of math crank papers on arxiv.
Fine-tuning: a smaller amount of highly focused data. Due to the smaller quantity, you are able to examine the data more closely and the model is usually only trained on the response part (you don’t need to allocate capacity to model the query part).
Reward tuning: a fairly small ammount of data. The signal is the reward or preference signal. Depending on the collection process, the signal can have varying quality but there’s more room for calibration than the previous stages.

From the pedagogical perspective, the best way of teaching reasoning works a little bit like this:

When the student is correct, the teacher helps the student to extract generalisable techniques.
When the student is wrong, the teacher helps the student to understand where and why they made the mistake, and how to prevent it.
When the student cannot produce an answer, the teacher helps the student to understand the problem better, give hints, and guide the student to progress.

From the simple boolean variable of right/wrong to the rich thought-provoking guides, the teacher provides a wide range of signals. As social animals, humans are trained to digest signals into usable knowledge and skills. But for machines, they need a more defined scheme of how they are supposed to learn. For language models, the design of objectives is to turn the rich scheme into a single scalar that is a function of logits over discrete tokens. Why is that so hard?

The AI disadvantage: autoregression

Autoregression has one direction, one shape. Want to optimise something? Better transform it to such a format that all the relevant context is rendered in order and the thinking process is also single-directional. Want to think before you speak? Put those Chain of Thoughts tokens before the final answer. Want to use tools to help you solve a problem? Define tools by their function name and arguments, and wrap the output from tools into result blocks. Want to backtrack when things are no longer promising? Do not erase the history and just concatenate the jumping history into a sequence of node traversals.

Christian Szegedy has a nice talk titled the inverse mindset of machine learning: instead of designing architectures to fit different data modalities, just take pretrained transformers and change how the data is presented. Pretrained transformers are very strong, and this mindset is the fundamental argument for LLMs’ efficiency. There is a single thing that people can concentrate efforts to improve, and everything else benefits from having a better starting LLM. The second stage of tweaking data can be done differently for different domains, and optimised easily since it’s so cheap.

In some way, an LLM is like a NAND gate. You can simulate any logical gate with a NAND gate so its upper bound is incredibly high. But the piping to create a circuit is non-trivial and not that cheap either (you need 4 NANDs plus wires to simulate 1 XOR) if you want to create matheamtical superintelligence instead of a customer support bot.

Just think about how you can teach an LLM to find the part of the answer that is wrong in a proof. You can do it but damn it’s hard and requires quite some engineering to ensure correctness. We’re very much in the land of flow engineering and not prompt engineering.

The human advantage: credit assignment

There is one thing that human experts or even non-experts are consistently better than AI at: credit assignment, finding the parts in a solution that are key to its success or failure. Humans often do it verbally and socially, without writing too much of this process online. Maybe that is precisely why AIs are not very good at it yet.

The AI advantage: scale

Obviously it is very hard for a human to learn about all of mathematics ever invented in history. It is also hard for an AI, but to a less extent. It is conceivable that an AI model can read all arxiv math papers in a very short time. To understand is a different matter.

Suppose we want to train AI models to learn a policy and a value function for the theorem proving MDP. The required compute is significant: for the value function to stay on-policy, the model needs to generate trajectories which is not easily massively parallelisable.

The unique advantage of AI is that scaling to multiple devices is always possible and therefore all learning can be on-policy in theory. Each human intelligence has only one copy and the transfer between any two intelligences is through a slow engaged absorption akin to distillation (for spirits, not models).

The myth: The generator-verifier gap and scaling

I don’t know when it started but the idea of a generator-verifier gap that one can exploit has possessed the minds at work in AI for maths. The ideal is simple: for mathematics, it is often easy to verify a proof than it is to generate one. You can even prove a difference of polynomial versus combinatorial complexity for verification versus generation. The gap certainly exists for humans, and we have been trying to exploit it. The practice is called mathematical education and research.

The common argument to extend this exploitation to computers is that if we can generate a lot of data with AI(s), and use verification to perform rejection sampling, then we end up with better, purer data than we started with. This data can be used for expert iteration and hence self-improvement.

To me, this thinking also has some flaws. The reasoning is anthropological and mixes up its logic across different paradigms. The computational complexity argument is very theoretical - humans very often (probabilistically too often if the complexity is really combinatorial) find proofs that are hundreds of pages long. There must be a special class of problems which humans are just very efficient at solving. The complexity of verification is indeed polynomial in some formalisms, but for AIs, due to the shortage of verification training data compared to generation, the former might be as hard as, or even harder than the latter. We might not have a big gap to start with.

The path of least resistance

There is no precedent for a swarm-like human intelligence, where each individual component is stupid but very scalable, to achieve any mathematical significance. So how can we even imagine there being one based on silicon?

The question of whether it can be achieved boils down to whether signals of enough quality and quantity can be acquired. Quantity is easy but quality and diversity do not come easy for AI. To ensure high enough quality, at one point people might have to bite the bullet and implement a formal system with two-way bridges from and to natural language instead of spending combinatorially much compute. The pain of doing so is large but constant, which is preferrable to a flexible, dynamic, and extraodinary pain that is like a ticking time bomb.

So why informal mathematics now?

The incumbent frontier model developers are not going to do things the formal way, before they have the path of least resistance (the formal turn) in their horizon. But we need frontier models both in terms of absolute quality and pareto frontier to do well in maths. It is perfectly fine to explore formal mathematics without frontier models, but you need to remind yourself very often whether your research will be rendered obsolete by newer frontier models, in this regime where scaling informally has far from saturated. This has happened to many traditional NLP tasks by BERT and GPT-3 and many reasoning tasks by O1.

Being fortunate enough to work in a language modelling team with a strong connection to the AI4Math community, pushing as hard as I can down the informal route seems like a smart move. If I can accelerate the realisation that formal tools are crucial, I will consider my mission accomplished. If I find informal mathematics alone to able to bring about mathematical superintelligence, I will also (very happily) accept that I was once wrong about the necessity of formality.

Thoughts after NeurIPS

18 Dec 2023

This is my first NeurIPS - last year I had three papers and no visa.

The general atmosphere is contagious: the echo chamber of AI is indeed comfortable and exciting. When so many people get together to have low-latency conversations that plastically shape their views, an electrifying air ensues. I have been enlightened countless times during a mere week. However, it’s worthwhile to remember that it is still an echo chamber and we should ground ourselves firmly in reality >95% of the time to not be intoxicated and completely detached.

Two main themes dominate the conference for me: AI for maths and language models. Here I want to discuss them, first individually and then how they relate to each other.

AI for maths:

More people than ever are getting interested in AI for maths. The tutorial on machine learning for theorem proving captivated a room of ~300 people; For the first time I heard “theorem proving” being mentioned in the halls of a machine learning conference venue by random people.

Behind the popularity are two seemingly contradictory views: one is that AI is finally good enough that we have a glimpse of what they can do for maths; the other is that AI is still extremely stupid in reasoning that mathematical abilities are something that we desperately need to improve on. They are not really contradictory. Moravec’s paradox rules over mathematics too: while being very helpful in certain tasks such as embedding the entire formal library and enabling searching over them with hybrid natural and formal languages, even the most advanced LLMs by themselves can’t reliably verify abstract reasoning steps (even the very simple ones).

So, what are the promising directions?

There are many and my opinions here are likely to be quite limited and should not deter you from carrying out research in completely different directions.

Build a vibrant community for people working on AI for maths research.
- Lower the entry barrier for people. MATH-AI is still a hazardous and messy place
  - Where are the entry points? What are the benchmarks?
  - Why do I need to do metaprogramming in Lean or Scala + ML to get anything to work?
  - To grow MATH-AI as a scientific field, these central practices need defining.
- Make basic research tools more available and low-cost
  - Running basic proof search that was SOTA in 2022 still seems elusive for academics
    - The HTPS project had 400 GPUs training for 2 weeks, a dedicated CPU cluster to host Lean servers, and a stellar team to build the search so it was an engineering feast.
    - But 1.5 years later, we have: better and more efficient pre-trained models, better quantisation techniques, Lean4 that is much faster in performance and hence doesn’t hurt if it runs on cpus attached to GPUs, and a much larger community.
    - We should be able to reproduce it and have something similar that runs on an M1 or even cpus.
Train a verifier that works beyond simple arithmetic reasoning.
- It’s unclear what data/architecture is needed for a good verifier. Even within some company very focused on mathematical reasoning, there are different opinions about what the right way to do it is, formal or informal.
- Formal signals are nice because they are guaranteed to be “signals”, though very conservative and not flexible.
- I have less confidence than before the conference that fine-tuning a (currently existing) pretrained LLM can result in a good verifier. This is because most popular LLMs are pretrained on and imitating large internet corpora, which cannot be expected to induce mathematical reasoning capabilities. Fine-tuning can inject more knowledge into the LLMs, or even shift the LLM into the more maths-loving mode. But picking up complicated novel skills such as maths is probably quite hard.
- Just pick your preferred way and do it! We need to try out both at some stage!
Get data for “how the sausage is made” for mathematical proofs.
- Mathematicians often demolish the scaffolding after the house is built. This means there are only proofs without any trial-and-error in the literature. This is not great for reasoning.
- Use LLMs to construct the missing traces.
- Maybe encourage mathematicians to elaborate (as advocated by Timothy Gowers at some point) is easier?
Elicit and utilise abstract knowledge/skills.
- There are some abstract skills and lots of data can be seen as instantiations of them. The question is how to extract and apply them.
- The auto-regressive decoding objective is not great for this. Changing the objective to be more “abstract” at fine-tuning time (while ensuring the high utilisation of GPUs!) can be an interesting topic to explore for a new PhD student.

I leave with more questions than when I came.

Language models

More thoughts on language models are a bit more fragmented:

It’s very sad to see some labs only publish things that don’t work (or the second-best methods). People should be extra cautious about what they choose to believe in, even for NeurIPS posters, spotlights, and orals. It’s ok if you don’t feel like sharing your trade secrets, just don’t mislead people.
The sentence that is stuck in my head is Paige Bailey’s great motto: “Keep it simple, make it scale”. I don’t believe in extremely fancy and brittle prompting techniques to ever be of much use, but the motto extends beyond just promoting.
The reason I start working for Mistral AI (apart from that it’s a great company with great people) is that I really don’t like the choice between subpar open models and api calls in my research. So I want to push open models to as good as the best closed models. At this day and age, the reality is that there are orgs serving different purposes and working on language models that have different levels of openness. Science in the open; open science; open-sourced; open-weights, etc.

Both together

If it wasn’t clear 2 years ago, it should be very clear now that language models are an essential component of building general AI mathematicians/assistants and mathematics is a great area to test language models in. It’s time we push hard in both directions.