
Can Large Language Models Prove Undergraduate and Graduate-Level Mathematics?

Small Language Models in Mathematics

Benchmark Performance

Recent studies introduce rigorous benchmarks for university mathematics (linear algebra, calculus, real analysis, abstract algebra, discrete math). For example, U-MATH comprises 1,100 university-level problems (text and multimodal) across six core subjects. Even the strongest LLMs score modestly: GPT-4, the top performer, reaches only about 63% on U-MATH's text-based tasks.

Specialized small models can outperform much larger ones. On U-MATH text problems, the 7B-parameter Qwen2.5-Math model achieved 45.2% accuracy, surpassing the 70B-parameter LLaMA-3.1 model. By contrast, a generic 8B LLaMA-3.1 scored only 26.1%. These results highlight that careful training (e.g., math-specific tuning) greatly helps small models.

Other benchmarks echo this trend: on MATH-500 (a 500-problem subset of the Hendrycks MATH dataset), a 1.5B Qwen model reached 83.9% accuracy after distillation, and a 7B Qwen2.5-Math model reached 90.0%, exceeding much larger baselines.

Overall, with modern training techniques, small models (3–8B parameters) can achieve performance comparable to or better than some larger models on math problem-solving benchmarks.

Small LLMs on Formal Proofs

The MiniF2F benchmark (formal Olympiad-level problems translated into Lean) tests rigorous proof generation. Here, too, small LLMs can do well with assistance. For instance, Leanabell-Prover-V2 (7B) uses Lean 4 feedback and reinforcement learning to achieve 78.2% accuracy at pass@128 on MiniF2F.
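For context, MiniF2F problems are stated as Lean theorems that a model must close with a machine-checkable proof. The toy Lean 4 example below only illustrates that input/output format; real MiniF2F statements are Olympiad-level.

```lean
-- Illustrative only: a toy Lean 4 theorem standing in for the kind of formal
-- goal a prover receives; the prover must supply the `by ...` proof script.
theorem toy_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```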

Similarly, the Apollo pipeline (7B), which repeatedly invokes a small LLM, fixes errors via the Lean compiler, and recombines verified subproofs, reaches 75.0% accuracy, placing it in the same range as earlier 7B baselines (~76%).
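As a rough sketch of this propose-verify-repair pattern (not Apollo's actual implementation), the loop below assumes hypothetical generate, verify, and repair callables wrapping the small LLM and the Lean compiler:

```python
from typing import Callable, Optional, Tuple

def prove_with_repair(
    statement: str,
    generate: Callable[[str], str],                  # LLM: statement -> draft proof
    verify: Callable[[str, str], Tuple[bool, str]],  # Lean compiler: (accepted, error log)
    repair: Callable[[str, str], str],               # LLM: (proof, errors) -> patched proof
    max_rounds: int = 4,
) -> Optional[str]:
    """Draft a proof, check it with the compiler, and patch failing steps."""
    proof = generate(statement)
    for _ in range(max_rounds):
        accepted, errors = verify(statement, proof)
        if accepted:
            return proof               # compiler-verified proof
        proof = repair(proof, errors)  # feed error messages back to the LLM
    return None                        # no verified proof within the round budget
```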

Thus, with verification and repair, 7B models solve most MiniF2F problems, rivaling much larger models. On less formal contest math (AIME/AMC and MATH), enhanced small models also fare well: for example, the rStar-Math method enabled a 3.8B model to solve 86.4% of MATH-500 problems and outperform GPT-4 in average accuracy.

Performance Tables

Model (Params)                       MATH-500 Accuracy
Qwen-1.5B (distilled)                83.9%
Phi-4-Mini-Reasoning (3.8B)          86.4%
Qwen2.5-Math (7B, rStar-Math)        90.0%
LLaMA-3 (8B, fine-tuned)             89.1%

Model (Params)                       MiniF2F (pass@128)
Leanabell-Prover-V2 (7B)             78.2%
Apollo (7B pipeline)                 75.0%

Techniques and Fine-tuning

Off-the-shelf small LLMs have limited innate reasoning, but their math abilities can be dramatically improved by specialized training. Chain-of-thought (CoT) prompting helps LLMs list intermediate steps, but models under ~100B parameters generally need fine-tuning on CoT data to benefit.
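As a minimal illustration of chain-of-thought prompting (the worked exemplar below is invented for this sketch, not drawn from any benchmark), the prompt shows one solution with intermediate steps before posing the target question:

```python
# Minimal chain-of-thought prompt builder. The exemplar problem and its worked
# solution are illustrative placeholders only.

COT_PROMPT = """\
Q: A rectangle has perimeter 20 and width 3. What is its area?
A: Let's think step by step.
   The perimeter is 2 * (length + width) = 20, so length + width = 10.
   The width is 3, so the length is 7, and the area is 3 * 7 = 21.
   The answer is 21.

Q: {question}
A: Let's think step by step.
"""

def build_cot_prompt(question: str) -> str:
    """Insert the target question after the worked exemplar."""
    return COT_PROMPT.format(question=question)
```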

On the fine-tuning side, instruction-tuning Flan-T5 (3B) with 1.8M CoT examples boosted its reasoning accuracy by ~4%. Similarly, supervised fine-tuning on large synthetic CoT datasets (via knowledge distillation from bigger models) has been key: in Phi-4-Mini-Reasoning, two stages of distillation plus RL on answer correctness yielded a 3.8B model that outperforms 7–8B baselines.
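A minimal sketch of that distillation step, assuming a hypothetical teacher_solve call to the larger model and an extract_final_answer helper; only teacher solutions whose final answers match the known ground truth are kept as fine-tuning targets:

```python
from typing import Callable, Dict, Iterable, List

def build_cot_sft_data(
    problems: Iterable[Dict[str, str]],          # each item: {"question": ..., "answer": ...}
    teacher_solve: Callable[[str], str],         # hypothetical teacher LLM: question -> CoT solution
    extract_final_answer: Callable[[str], str],  # hypothetical parser: CoT text -> final answer
) -> List[Dict[str, str]]:
    """Collect teacher CoT solutions whose final answers are verifiably correct."""
    data: List[Dict[str, str]] = []
    for p in problems:
        solution = teacher_solve(p["question"])
        if extract_final_answer(solution) == p["answer"]:  # keep only correct reasoning traces
            data.append({"prompt": p["question"], "completion": solution})
    return data
```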

Other methods include self-consistency (ensemble voting over sampled solutions), Monte Carlo Tree Search, and multi-stage reasoning. The rStar-Math approach iteratively generates large CoT datasets via self-play and an SLM-based reward model; after four rounds it boosted a 7B model to state-of-the-art MATH performance without relying on a teacher model.
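Of these, self-consistency is the simplest to sketch: sample several independent chain-of-thought solutions and keep the majority answer. The sample_answer callable below is a hypothetical stand-in for an LLM call that returns only the final answer:

```python
from collections import Counter
from typing import Callable, List

def self_consistency(
    question: str,
    sample_answer: Callable[[str], str],  # hypothetical LLM call returning a final answer
    n_samples: int = 16,
) -> str:
    """Sample several chain-of-thought runs and return the most common answer."""
    answers: List[str] = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```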

In formal math, combining RL with formal verifier feedback is powerful: Leanabell-Prover-V2 integrates Lean 4 checks during training so that the 7B model "learns" to avoid unprovable steps. Throughout, few-shot prompting alone is insufficient; targeted fine-tuning on high-quality reasoning data (often mined from stronger models) is critical for small models to solve complex problems.
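A minimal sketch of what verifier feedback can look like as a training reward, with lean_check as a hypothetical wrapper around Lean 4 compilation (an illustration of the idea, not Leanabell-Prover-V2's actual training code):

```python
from typing import Callable, Tuple

def verifier_reward(
    statement: str,
    proof: str,
    lean_check: Callable[[str, str], Tuple[bool, str]],  # hypothetical: (accepted, error log)
) -> float:
    """Binary RL reward: only proofs accepted by the verifier earn credit."""
    accepted, _ = lean_check(statement, proof)
    return 1.0 if accepted else 0.0
```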

Limitations and Need for Verification

Despite this progress, small LLMs still lack full rigor and robust abstract reasoning. They excel at formulaic calculation and standard exercises (U-MATH results are highest in Precalculus/Algebra) but struggle with deep conceptual or visual problems (integral calculus was the hardest subject).

Hallucinations are common: LLMs may fabricate intermediate steps or misunderstand definitions. Undergraduate math often demands strict proofs and formal language, which LLMs do not inherently produce. Formal verification studies emphasize this: a proof assistant will simply mark an incorrect LLM output as wrong, and naive approaches require sampling thousands of attempts to find a valid proof.
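The naive sampling strategy mentioned above amounts to a best-of-n loop like the sketch below (generate and verify are hypothetical stand-ins for the LLM and the proof assistant); its cost grows linearly with the sampling budget:

```python
from typing import Callable, Optional, Tuple

def sample_until_verified(
    statement: str,
    generate: Callable[[str], str],      # hypothetical LLM: statement -> candidate proof
    verify: Callable[[str, str], bool],  # hypothetical proof assistant check
    budget: int = 1024,
) -> Tuple[Optional[str], int]:
    """Draw independent proof attempts until one verifies or the budget runs out."""
    for attempt in range(1, budget + 1):
        proof = generate(statement)
        if verify(statement, proof):
            return proof, attempt        # verified proof and attempts used
    return None, budget                  # nothing verified within the budget
```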

Thus, small LLMs rely heavily on external checks or human oversight. Modern pipelines fix LLM errors via compilers or automated solvers, but without such scaffolding a small model's solution is not guaranteed to be correct. In domains like abstract algebra or real analysis, evidence of success is scarce, suggesting these models have yet to master truly rigorous proofs in those fields.

In summary, small LLMs can often generate plausible math reasoning, but verifying and refining that reasoning remains necessary.

Conclusion

Small language models (on the order of 1–10B parameters) have made notable strides in undergraduate math reasoning through advanced training. On many benchmarks they solve a large fraction of problems – sometimes rivaling far bigger models – when fine-tuned with chain-of-thought data, reinforcement learning, or search-based techniques.

However, their outputs are not inherently reliable. They struggle with abstraction and formal correctness, so independent proof generation generally requires external verification (e.g. theorem provers) or human guidance.

In short, while small LLMs can tackle many calculus, algebra, and discrete problems given the right training, truly verifiable proofs and deep conceptual understanding remain beyond their unaided scope.