ProofWiki problem 39
Let \(X\) be a random variable. Assume \(\mathsf{E} (X) = \mu\) for some \(\mu \in \mathbb{R}\) and \(\mathsf{var} (X) = \sigma^2\) for some \(\sigma^2 \in \mathbb{R}_{> 0}\). Show that for all \(k > 0\):
\(\Pr \left({\left| {X - \mu}\right| \geq k \sigma}\right) \leq \dfrac {1}{k^2}.\)
Given this problem statement, the assistant noted that Chebyshev's inequality applies, and then restated the problem almost verbatim but with different variable names. This demonstrates a degree of variable unification (matching a goal against a known theorem by consistently renaming variables), an important concept in automated theorem proving.
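As a quick sanity check (our own sketch, not part of the assistant's transcript), Chebyshev's bound can be verified empirically. The choice of a normal distribution below is an arbitrary assumption; the inequality holds for any distribution with finite variance.

```python
import random

# Empirically check Pr(|X - mu| >= k*sigma) <= 1/k^2 on a normal sample.
# The normal distribution is chosen only for convenience of sampling.
random.seed(0)
mu, sigma, n = 5.0, 2.0, 100_000
xs = [random.gauss(mu, sigma) for _ in range(n)]

for k in (1.5, 2.0, 3.0):
    tail = sum(1 for x in xs if abs(x - mu) >= k * sigma) / n
    bound = 1 / k**2
    assert tail <= bound  # Chebyshev's bound is loose, so this holds easily
    print(f"k={k}: empirical tail {tail:.4f} <= bound {bound:.4f}")
```

For a normal distribution the true tail probabilities are far below the Chebyshev bound, which is why the bound is satisfied with a wide margin here.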
Variable instantiation
We then checked whether the assistant could instantiate variables by asking whether the proof still holds when \(k\) takes the concrete values \(2\), \(\sqrt{2}\), \(\sqrt{2}-1\), \(\sqrt{2}-2\), and \((\sqrt{2}-2)^2\). Human inspection found the assistant's behaviour to be correct in every case.
The assistant clearly handles concrete calculations even when \(k\) is a relatively complicated number (e.g., \(\sqrt{2} - 1\)). It also recognises that the derivation cannot be carried out when \(k = \sqrt{2} - 2\), a negative number.
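The validity of each instantiation is easy to check mechanically; the following sketch (our own, not the assistant's) evaluates every candidate value of \(k\) and tests the positivity condition:

```python
import math

# The five concrete values of k from the experiment.
sqrt2 = math.sqrt(2)
candidates = {
    "2": 2.0,
    "sqrt(2)": sqrt2,
    "sqrt(2)-1": sqrt2 - 1,
    "sqrt(2)-2": sqrt2 - 2,             # negative: Chebyshev needs k > 0
    "(sqrt(2)-2)^2": (sqrt2 - 2) ** 2,  # ~0.3431, strictly positive
}
for name, k in candidates.items():
    if k > 0:
        # For k < 1 the bound 1/k^2 exceeds 1 and is trivially satisfied.
        print(f"k = {name} = {k:.4f}: valid, bound 1/k^2 = {1/k**2:.4f}")
    else:
        print(f"k = {name} = {k:.4f}: invalid, Chebyshev requires k > 0")
```

Note that for \(k = (\sqrt{2}-2)^2 \approx 0.343\) the bound \(1/k^2\) exceeds \(1\), so the inequality holds trivially, but the instantiation itself is perfectly legal.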
An interesting failure arose when we probed the assistant's confidence in its own derivations: we asked, "are you sure \((\sqrt{2} - 2)^2 > 0\)?"
The answer should be affirmative, but the assistant began to apologise and revised its previously correct calculation, saying "When \(k = (\sqrt{2} - 2)^2\), the value of \(k\) is indeed non-negative, but it is actually equal to 0, not greater than 0." When we pressed again with "Why do you say your previous statement was incorrect and \(k=0\)? I don't understand.", the assistant corrected this mistake and produced the right evaluation.
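For the record, the question we posed can be settled exactly:
\[
(\sqrt{2} - 2)^2 = 2 - 4\sqrt{2} + 4 = 6 - 4\sqrt{2} > 0,
\]
since \(4\sqrt{2} = \sqrt{32} < \sqrt{36} = 6\). Numerically the value is about \(0.343\), so the assistant's original computation was correct and its apology was unwarranted.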
Overall, we found the assistant quite capable at variable instantiation and at evaluating moderately complex expressions, with occasional mistakes made under low confidence. We hypothesise that this failure mode is an artefact of RLHF training: the human feedback is largely assumed to be correct, so when the feedback questions a true fact, the assistant concurs and alters its own (correct) response.