Benchmark Performance
Recent studies have introduced rigorous benchmarks for university mathematics (linear algebra, calculus, real analysis, abstract algebra, discrete math). For example, U-MATH comprises 1,100 university-level problems (text and multimodal) spanning six core subjects. Even the strongest LLMs score only modestly: GPT-4 tops out at roughly 63% on U-MATH's text-only tasks.
Specialized small models can outperform much larger ones. On U-MATH text problems, the 7B-parameter Qwen2.5-Math model achieved 45.2% accuracy, surpassing the far larger LLaMA-3.1 70B, while a generic LLaMA-3.1 8B scored only 26.1%. These results highlight that careful training (e.g. math-specific tuning) greatly helps small models.
Overall, with modern training techniques, small models (3–8B) can match or even beat some larger models on math problem-solving benchmarks.
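To make the headline numbers concrete, the snippet below sketches how final-answer accuracy is typically scored on such benchmarks. It is a minimal illustration, not the official U-MATH scorer: the `normalize` and `accuracy` helpers are assumptions for this sketch, and real evaluation harnesses use more robust answer matching or an LLM judge.

```python
def normalize(answer: str) -> str:
    """Crude normalization: drop whitespace, trailing periods, and case."""
    return answer.strip().rstrip(".").replace(" ", "").lower()

def accuracy(predictions: list[str], references: list[str]) -> float:
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)

preds = ["x = 2", "3/4", "Diverges"]
refs  = ["x=2", "0.75", "diverges"]
print(f"accuracy = {accuracy(preds, refs):.2%}")
# -> 66.67%: the naive matcher misses '3/4' vs '0.75', which is exactly why
#    real benchmarks need more careful equivalence checking or judging.
```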
Small LLMs on Formal Proofs
The MiniF2F benchmark (formal Olympiad problems translated into Lean) tests rigorous proof generation. Here too small LLMs can do well with assistance. For instance, Leanabell-Prover-V2 (7B) uses Lean4 feedback and reinforcement learning to achieve 78.2% pass@128 accuracy on MiniF2F.
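For context on the metric, pass@k results like this are usually reported with the standard unbiased estimator sketched below; the sample counts in the final line are illustrative only, not figures from the cited work.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n generations, c of which are correct,
    is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 200 proof attempts, 5 verified, k = 128.
print(f"pass@128 = {pass_at_k(n=200, c=5, k=128):.3f}")
```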
Similarly, the Apollo pipeline (7B) – which repeatedly invokes a small LLM, repairs errors using feedback from the Lean compiler, and recombines verified subproofs – reaches 75.0% accuracy. These results put lightweight pipelines on par with, or above, the strongest earlier 7B baselines (~76%).
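A rough sketch of this generate-check-repair pattern is shown below (it is not the authors' code): the model is re-prompted with the compiler's error message until a proof checks or a retry budget runs out. `call_prover` and `check_with_lean` are hypothetical stand-ins for the prover model and the Lean toolchain, stubbed here so the sketch runs.

```python
# Hypothetical stand-ins: a real pipeline would query a 7B prover model and
# compile candidates with Lean; both are stubbed for illustration.
def call_prover(prompt: str) -> str:
    return "theorem demo : 1 + 1 = 2 := by rfl"

def check_with_lean(candidate: str) -> tuple[bool, str]:
    return False, "error: simulated compiler failure"

def prove_with_repair(statement: str, max_rounds: int = 4) -> str | None:
    """Generate-check-repair loop: feed compiler errors back to the model."""
    prompt = f"Prove the following statement in Lean 4:\n{statement}"
    for _ in range(max_rounds):
        candidate = call_prover(prompt)
        ok, error = check_with_lean(candidate)
        if ok:
            return candidate  # a formally verified proof
        # Re-prompt with the error so the next attempt can repair it.
        prompt = (
            f"This proof failed to compile:\n{candidate}\n"
            f"Compiler error:\n{error}\n"
            "Please produce a corrected proof."
        )
    return None  # unsolved within the attempt budget

print(prove_with_repair("theorem demo : 1 + 1 = 2"))
```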
Thus, with verification and repair, 7B models solve most MiniF2F problems, rivaling much larger models. On informal competition math (MATH-500 and AIME/AMC-style problems), enhanced small models also fare well: for example, the search-based rStar-Math method enabled a 3.8B model to solve 86.4% of MATH-500 problems and to outperform GPT-4 in average accuracy.
Performance Tables
| Model (Params) | MATH-500 Accuracy |
|---|---|
| Qwen-1.5B (distilled) | 83.9% |
| Phi-4-Mini-Reasoning (3.8B) | 86.4% |
| Qwen2.5-Math (7B, rStar) | 90.0% |
| LLaMA-3 (8B, fine-tuned) | 89.1% |

| Model (Params) | MiniF2F (pass@128) |
|---|---|
| Leanabell-Prover-V2 (7B) | 78.2% |
| Apollo (7B pipeline) | 75.0% |
Techniques and Fine-tuning
Off-the-shelf small LLMs have limited innate reasoning, but their math abilities can be dramatically improved by specialized training. Chain-of-thought (CoT) prompting helps LLMs list intermediate steps, but models under ~100B parameters generally need fine-tuning on CoT data to benefit.
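For illustration, a chain-of-thought example of the kind used in such prompting and fine-tuning data might look like the following; the problem and wording are invented here, and exact formats vary across datasets.

```python
# One worked example in chain-of-thought style, of the kind found in CoT
# fine-tuning datasets (illustrative format only).
cot_example = {
    "question": "What is the derivative of f(x) = x^3 + 2x?",
    "reasoning": "Differentiate term by term: d/dx[x^3] = 3x^2 and "
                 "d/dx[2x] = 2, so f'(x) = 3x^2 + 2.",
    "final_answer": "3x^2 + 2",
}

# Few-shot CoT prompting simply prepends such worked solutions to a new
# question so the model imitates the step-by-step format.
prompt = (
    f"Q: {cot_example['question']}\n"
    f"A: {cot_example['reasoning']} Final answer: {cot_example['final_answer']}\n\n"
    "Q: What is the derivative of g(x) = 5x^2?\n"
    "A:"
)
print(prompt)
```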
For example, instruction-tuning Flan-T5 (3B) with 1.8M CoT examples boosted its reasoning accuracy by ~4%. Similarly, supervised fine-tuning on large synthetic CoT datasets (generated by knowledge distillation from bigger models) has been key: in Phi-4-Mini, two stages of distillation plus RL rewarding answer correctness yielded a 3.8B model that outperforms 7–8B baselines.
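A minimal sketch of this distillation recipe is shown below, under the assumption of hypothetical `sample_from_teacher` and `extract_final_answer` helpers: sample several solutions from a stronger teacher and keep only those whose final answer matches the reference, then use the survivors as supervised fine-tuning pairs.

```python
# Hypothetical placeholders standing in for a real teacher model and an
# answer parser; stubbed so the sketch runs end to end.
def sample_from_teacher(question: str, n_samples: int = 4) -> list[str]:
    return ["Multiply: 6 * 7 = 42. Final answer: 42"] * n_samples

def extract_final_answer(solution: str) -> str:
    return solution.split("Final answer:")[-1].strip()

def build_sft_dataset(problems: list[dict]) -> list[dict]:
    """Rejection-sampled distillation: keep only teacher solutions whose
    final answer matches the reference answer."""
    dataset = []
    for item in problems:
        for solution in sample_from_teacher(item["question"]):
            if extract_final_answer(solution) == item["answer"]:
                dataset.append({"prompt": item["question"],
                                "completion": solution})
    return dataset

problems = [{"question": "Compute 6 * 7.", "answer": "42"}]
print(len(build_sft_dataset(problems)))  # 4 filtered training pairs
```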
In formal math, combining reinforcement learning with formal-verifier feedback is powerful: Leanabell-Prover-V2 integrates Lean4 checks during training so that the 7B model "learns" to avoid unprovable steps. Throughout, few-shot prompting alone is insufficient – targeted fine-tuning on high-quality reasoning data (often mined from stronger models) is critical for small models to solve complex problems.
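To make the verifier-in-the-loop idea concrete, here is a minimal sketch of the kind of binary, verifiable reward such RL setups use; `compiles_in_lean` is a hypothetical stand-in for actually invoking the Lean toolchain, not code from the cited systems.

```python
# Hypothetical stand-in for compiling a candidate proof with Lean; a real
# implementation would shell out to the Lean toolchain and parse its output.
def compiles_in_lean(proof_source: str) -> bool:
    return "sorry" not in proof_source  # stub heuristic for this sketch

def proof_reward(proof_source: str) -> float:
    """Binary verifiable reward: no partial credit for proofs that fail to check."""
    return 1.0 if compiles_in_lean(proof_source) else 0.0

print(proof_reward("theorem t : 2 + 2 = 4 := rfl"))    # 1.0 under the stub
print(proof_reward("theorem t : 2 + 2 = 4 := sorry"))  # 0.0 under the stub
```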
Limitations and Need for Verification
Despite progress, small LLMs still lack full rigor or abstract reasoning. They excel at formulaic calculation or standard exercises (U-MATH results are highest in Precalculus/Algebra) but struggle with deep conceptual or visual problems (integral calculus was the hardest subject).
Hallucinations are common: LLMs may fabricate intermediate steps or misunderstand definitions. Undergraduate math often demands strict proofs and formal language, which LLMs do not inherently produce. Formal verification studies underscore this: a proof assistant will simply mark an incorrect LLM output as wrong, and naive approaches require sampling thousands of attempts to find a valid proof.
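A tiny Lean 4 example illustrates why: the kernel only accepts terms that genuinely prove the stated theorem, so a hallucinated step fails to compile rather than slipping through.

```lean
-- Accepted: Nat.add_comm really does prove the goal, so Lean's kernel certifies it.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Rejected if uncommented: `rfl` cannot prove commutativity for arbitrary a and b,
-- so an LLM that hallucinated this step would get a compiler error, not a pass.
-- theorem bad_proof (a b : Nat) : a + b = b + a := rfl
```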
Thus, small LLMs rely heavily on external checks or human oversight. Modern pipelines fix LLM errors via compilers or automated solvers, but without such scaffolding a small model's solution is not guaranteed to be correct. In domains such as abstract algebra and real analysis, evidence is scarce, suggesting these models have yet to master truly rigorous proofs in those fields.
Conclusion
Small language models (on the order of 1–10B parameters) have made notable strides in undergraduate math reasoning through advanced training. On many benchmarks they solve a large fraction of problems – sometimes rivaling far bigger models – when fine-tuned with chain-of-thought data, reinforcement learning, or search-based techniques.
However, their outputs are not inherently reliable. They struggle with abstraction and formal correctness, so independent proof generation generally requires external verification (e.g. theorem provers) or human guidance.
In short, while small LLMs can tackle many calculus, algebra, and discrete problems given the right training, truly verifiable proofs and deep conceptual understanding remain beyond their unaided scope.