Thoughts

I want to achieve mathematical superintelligence, so why am I doing LLMs now?

10 December 2024

I’ve recently defended my PhD, which was on AI for formal mathematics. I am now a full-time researcher at Mistral AI, working on reasoning. A significant shift is that I now spend my working hours on informal reasoning rather than on formal mathematics inside proof assistants, although I maintain some connections with academia and advise on formal-mathematics projects.

To justify this shift in focus (mainly for myself), let me try to lay down some thoughts.

How mathematics is done differently by humans and AIs

The three stages of mathematics

One can roughly divide mathematical education into three stages:

  1. The “pre-rigorous” stage, in which mathematics is taught in an informal, intuitive manner, based on examples, fuzzy notions, and hand-waving. (For instance, calculus is usually first introduced in terms of slopes, areas, rates of change, and so forth.) The emphasis is more on computation than on theory. This stage generally lasts until the early undergraduate years.
  2. The “rigorous” stage, in which one is now taught that in order to do maths “properly”, one needs to work and think in a much more precise and formal manner (e.g. re-doing calculus by using epsilons and deltas all over the place). The emphasis is now primarily on theory; and one is expected to be able to comfortably manipulate abstract mathematical objects without focusing too much on what such objects actually “mean”. This stage usually occupies the later undergraduate and early graduate years.
  3. The “post-rigorous” stage, in which one has grown comfortable with all the rigorous foundations of one’s chosen field, and is now ready to revisit and refine one’s pre-rigorous intuition on the subject, but this time with the intuition solidly buttressed by rigorous theory. (For instance, in this stage one would be able to quickly and accurately perform computations in vector calculus by using analogies with scalar calculus, or informal and semi-rigorous use of infinitesimals, big-O notation, and so forth, and be able to convert all such calculations into a rigorous argument whenever required.) The emphasis is now on applications, intuition, and the “big picture”. This stage usually occupies the late graduate years and beyond.

— Terence Tao, There’s more to mathematics than rigour and proofs

During the PhD viva, one of my examiners asked something in the vein of “Isn’t there a danger of doing futile work by teaching formal mathematics to AI, when the final form of human mathematics is post-rigorous?”. I answered by asking my two examiners which stage they think current SoTA AI is at. They agreed that it’s somewhere between pre-rigorous and rigorous. “So”, I smiled confidently, “formal mathematics is the necessary path to post-rigorous artificial intelligence.”

There is a huge issue with this answer: it assumed that humans and AIs take the same path to mathematical intelligence. Let me now set aside bias and presumption, and contrast the two paths analytically.

The spice: signal

There is no question that a good signal is the key to intelligence. One can almost say that intelligence is proportional to (maybe the $\log$ of) data quantity $\times$ signal quality. A good signal is one that is accurate, informative, and diverse. Signal quality also matters more in the later stages of model training.

Take the classical regime of LLM training for example: pretraining consumes trillions of loosely filtered tokens, supervised fine-tuning uses far fewer but much cleaner demonstrations, and preference tuning relies on a comparatively tiny amount of carefully elicited human judgement.

From the pedagogical perspective, the best way of teaching reasoning works a little bit like this:

From the simple boolean signal of right/wrong to rich, thought-provoking guidance, the teacher provides a wide range of signals. As social animals, humans are trained to digest such signals into usable knowledge and skills. Machines, however, need a more precisely defined scheme for how they are supposed to learn. For language models, objective design means turning this rich scheme into a single scalar that is a function of logits over discrete tokens. Why is that so hard?
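To make the point concrete, here is a minimal sketch (not any particular lab’s training code) of that single scalar: the next-token cross-entropy, computed from logits over discrete tokens. The shapes and names are illustrative assumptions.

```python
import numpy as np

def next_token_loss(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Cross-entropy over discrete tokens: the entire teaching signal, one scalar.

    logits: (seq_len, vocab_size) scores from the model.
    target_ids: (seq_len,) the observed next tokens.
    """
    # Log-softmax over the vocabulary at each position (numerically stable).
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability the model assigned to each token that actually came next.
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    # Everything the "teacher" wanted to convey has to fit into this one number.
    return float(-picked.mean())
```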

The AI disadvantage: autoregression

Autoregression has one direction, one shape. Want to optimise something? Better transform it into a format where all the relevant context is rendered in order and the thinking process is also single-directional. Want to think before you speak? Put those chain-of-thought tokens before the final answer. Want to use tools to help you solve a problem? Define tools by their function name and arguments, and wrap the output from tools into result blocks. Want to backtrack when things are no longer promising? Do not erase the history; just concatenate the jumps into a sequence of node traversals.
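As a toy illustration of this flattening (the markup below is entirely made up, not any real chat template), every kind of interaction ends up as one left-to-right token stream:

```python
def render_turn(thought, tool_name, tool_args, tool_result, visited, answer):
    """Serialise thinking, a tool call, and a backtracking trace into one flat string."""
    return "".join([
        f"[THINK]{thought}[/THINK]",                    # think before you speak
        f"[TOOL:{tool_name}]{tool_args}[/TOOL]",        # tool defined by name + arguments
        f"[RESULT]{tool_result}[/RESULT]",              # tool output wrapped in a result block
        "[TRACE]" + " -> ".join(visited) + "[/TRACE]",  # backtracking kept as a linear trace
        f"[ANSWER]{answer}[/ANSWER]",
    ])

print(render_turn(
    thought="try induction on n",
    tool_name="calculator", tool_args="2**10", tool_result="1024",
    visited=["root", "lemma_A", "root", "lemma_B"],     # a jump back to root, not an erasure
    answer="1024",
))
```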

Christian Szegedy has a nice talk on the inverse mindset of machine learning: instead of designing architectures to fit different data modalities, just take pretrained transformers and change how the data is presented. Pretrained transformers are very strong, and this mindset is the fundamental argument for LLMs’ efficiency. There is a single thing that people can concentrate their efforts on improving, and everything else benefits from having a better starting LLM. The second stage of tweaking the data can be done differently for different domains, and optimised easily since it is so cheap.

In some way, an LLM is like a NAND gate. You can simulate any logic gate with NAND gates, so its upper bound is incredibly high. But the piping to create a circuit is non-trivial and not that cheap either (you need four NANDs plus wires to simulate one XOR) if you want to create mathematical superintelligence instead of a customer support bot.
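For the record, the gate count checks out; a quick sanity check in Python:

```python
def nand(a: bool, b: bool) -> bool:
    return not (a and b)

def xor(a: bool, b: bool) -> bool:
    """XOR built from exactly four NAND gates plus wiring."""
    c = nand(a, b)                       # gate 1
    return nand(nand(a, c), nand(b, c))  # gates 2, 3, 4

assert all(xor(a, b) == (a != b) for a in (False, True) for b in (False, True))
```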

Just think about how you would teach an LLM to find the part of a proof that is wrong. You can do it, but damn it’s hard, and it requires quite some engineering to ensure correctness. We’re very much in the land of flow engineering, not prompt engineering.
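Here is a rough sketch of what I mean by a flow, assuming a hypothetical text-in/text-out `llm` callable: instead of asking for a verdict on the whole proof in one prompt, the flow checks steps one at a time and returns the first one the checker rejects.

```python
def first_wrong_step(llm, problem: str, proof_steps: list[str]):
    """Localise the first step of a proof that the (LLM-based) checker rejects."""
    accepted = []
    for i, step in enumerate(proof_steps):
        prompt = (
            f"Problem: {problem}\n"
            "Steps accepted so far:\n" + "\n".join(accepted) + "\n"
            f"Candidate next step: {step}\n"
            "Does this step follow from the problem and the accepted steps? Answer YES or NO."
        )
        if llm(prompt).strip().upper().startswith("NO"):
            return i        # index of the first rejected step
        accepted.append(step)
    return None             # no step flagged
```

Even this naive flow needs extra machinery in practice (sampling and voting, cross-checking against a formal checker) before its verdicts can be trusted, which is exactly the point.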

The human advantage: credit assignment

There is one thing that human experts, and even non-experts, are consistently better at than AI: credit assignment, finding the parts of a solution that are key to its success or failure. Humans often do it verbally and socially, without writing much of this process down online. Maybe that is precisely why AIs are not very good at it yet.

The AI advantage: scale

Obviously it is very hard for a human to learn all of the mathematics ever invented in history. It is also hard for an AI, but to a lesser extent. It is conceivable that an AI model could read all arXiv maths papers in a very short time. Understanding them is a different matter.

Suppose we want to train AI models to learn a policy and a value function for the theorem-proving MDP. The required compute is significant: for the value function to stay on-policy, the model needs to keep generating fresh trajectories, which is not easily massively parallelisable.
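A minimal sketch of where that compute goes, with hypothetical `env` and `policy` objects and a made-up reward convention: to keep the value function on-policy, the rollouts below must be regenerated with the current policy after every update.

```python
def collect_trajectory(env, policy, max_steps: int = 64):
    """Roll out the *current* policy on one proof goal until QED, failure, or timeout."""
    states, rewards = [], []
    state = env.reset()                       # initial proof state
    for _ in range(max_steps):
        tactic = policy.sample_tactic(state)  # on-policy action
        state, reward, done = env.step(tactic)
        states.append(state)
        rewards.append(reward)                # e.g. +1 only when the proof is finished
        if done:
            break
    return states, rewards

def value_targets(rewards, gamma: float = 0.99):
    """Monte-Carlo returns for each visited state, used as value-function targets."""
    targets, ret = [], 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
        targets.append(ret)
    return list(reversed(targets))
```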

The unique advantage of AI is that scaling to multiple devices is always possible, and therefore all learning can, in theory, be on-policy. Each human intelligence exists in only one copy, and transfer between any two intelligences happens through a slow, engaged absorption akin to distillation (of spirits, not models).

The myth: The generator-verifier gap and scaling

I don’t know when it started, but the idea of a generator-verifier gap that one can exploit has possessed the minds at work in AI for maths. The ideal is simple: in mathematics, it is often easier to verify a proof than it is to generate one. You can even prove a difference of polynomial versus combinatorial complexity for verification versus generation. The gap certainly exists for humans, and we have been trying to exploit it. The practice is called mathematical education and research.

The common argument for extending this exploitation to computers is that if we generate a lot of data with AI(s) and use verification to perform rejection sampling, we end up with better, purer data than we started with. This data can then be used for expert iteration and hence self-improvement.
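Concretely, the argument amounts to something like the following loop, sketched with hypothetical `generator` and `verifier` objects:

```python
def rejection_sample(problems, generator, verifier, samples_per_problem: int = 16):
    """Keep only generated solutions that pass verification."""
    accepted = []
    for problem in problems:
        for _ in range(samples_per_problem):
            solution = generator.sample(problem)
            if verifier.check(problem, solution):   # e.g. a proof checker or answer match
                accepted.append((problem, solution))
                break                               # one verified solution per problem
    return accepted

# Expert iteration: fine-tune the generator on `accepted`,
# then repeat the loop with the improved generator.
```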

To me, this thinking also has some flaws. The reasoning is anthropological and mixes up its logic across different paradigms. The computational complexity argument is very theoretical: humans very often (probabilistically too often, if the complexity were really combinatorial) find proofs that are hundreds of pages long. There must be a special class of problems that humans are just very efficient at solving. The complexity of verification is indeed polynomial in some formalisms, but for AIs, due to the shortage of verification training data compared to generation data, the former might be as hard as, or even harder than, the latter. We might not have a big gap to start with.

The path of least resistance

There is no precedent of a swarm-like human intelligence, in which each individual component is stupid but very scalable, achieving anything of mathematical significance. So how can we even imagine one based on silicon doing so?

The question of whether it can be achieved boils down to whether signals of sufficient quality and quantity can be acquired. Quantity is easy, but quality and diversity do not come easily for AI. To ensure high enough quality, at some point people might have to bite the bullet and implement a formal system with two-way bridges to and from natural language, instead of spending combinatorially much compute. The pain of doing so is large but constant, which is preferable to a flexible, dynamic, and extraordinary pain that is like a ticking time bomb.

So why informal mathematics now?

The incumbent frontier-model developers are not going to do things the formal way before the path of least resistance (the formal turn) appears on their horizon. But we need frontier models, both in terms of absolute quality and of the Pareto frontier, to do well in maths. It is perfectly fine to explore formal mathematics without frontier models, but you need to remind yourself very often whether your research will be rendered obsolete by newer frontier models, in this regime where informal scaling is far from saturated. This has happened to many traditional NLP tasks with BERT and GPT-3, and to many reasoning tasks with o1.

Since I am fortunate enough to work in a language-modelling team with a strong connection to the AI4Math community, pushing as hard as I can down the informal route seems like a smart move. If I can accelerate the realisation that formal tools are crucial, I will consider my mission accomplished. If I find that informal mathematics alone is able to bring about mathematical superintelligence, I will also (very happily) accept that I was once wrong about the necessity of formality.

Thoughts after NeurIPS

18 Dec 2023

This is my first NeurIPS: last year I had three papers and no visa.

The general atmosphere is contagious: the echo chamber of AI is indeed comfortable and exciting. When so many people get together to have low-latency conversations that plastically shape their views, an electrifying air ensues. I have been enlightened countless times during a mere week. However, it is worth remembering that it is still an echo chamber, and we should ground ourselves firmly in reality >95% of the time so as not to become intoxicated and completely detached.

Two main themes dominate the conference for me: AI for maths and language models. Here I want to discuss them, first individually and then how they relate to each other.

AI for maths

More people than ever are getting interested in AI for maths. The tutorial on machine learning for theorem proving captivated a room of ~300 people; for the first time, I heard “theorem proving” mentioned in the halls of a machine learning conference venue by random people.

Behind the popularity are two seemingly contradictory views: one is that AI is finally good enough that we have a glimpse of what it can do for maths; the other is that AI is still so extremely stupid at reasoning that mathematical abilities are something we desperately need to improve. They are not really contradictory. Moravec’s paradox rules over mathematics too: while very helpful for certain tasks, such as embedding the entire formal library and enabling search over it with hybrid natural and formal language queries, even the most advanced LLMs by themselves can’t reliably verify abstract reasoning steps (even very simple ones).

So, what are the promising directions?

There are many, and my opinions here are likely to be quite limited; they should not deter you from carrying out research in completely different directions.

I leave with more questions than when I came.

Language models

My thoughts on language models are a bit more fragmented:

Both together

If it wasn’t clear two years ago, it should be very clear now that language models are an essential component of building general AI mathematicians/assistants, and that mathematics is a great area in which to test language models. It’s time we push hard in both directions.