The China-based AI lab DeepSeek has released a new open-weight model, DeepSeekMath-V2.
According to the lab, the model demonstrates strong theorem-proving capabilities in mathematics and achieved a gold-level score at the International Mathematical Olympiad (IMO) 2025, solving 5 of the 6 problems.
“Imagine owning the brain of one of the best mathematicians in the world for free,” said Clement Delangue, co-founder and CEO of Hugging Face, in a post on X.
“As far as I know, there isn’t any chatbot or API that gives you access to an IMO 2025 gold-medalist model,” he added.
In July, an advanced version of Google DeepMind’s Gemini model and an experimental reasoning model from OpenAI also achieved gold-medal status at the IMO 2025. Like DeepSeek’s new model, both solved 5 of the 6 problems; they were the first AI models to reach gold-level scores at the competition.
The IMO is often regarded as the world’s toughest high-school mathematics contest. Of the 630 students who participated in IMO 2025, 72 earned gold medals.
Beyond the IMO 2025, DeepSeekMath-V2 also achieved top-tier performance on China’s toughest national contest, the China Mathematical Olympiad (CMO), and posted near-perfect results on the undergraduate Putnam competition.
“On Putnam 2024, the preeminent undergraduate mathematics competition, our model solved 11 of 12 problems completely and the remaining problem with minor errors, scoring 118/120 and surpassing the highest human score of 90,” stated DeepSeek.
DeepSeek argues that recent AI models excel at producing correct final answers on math benchmarks such as AIME and HMMT but often lack sound reasoning.
“Many mathematical tasks like theorem proving require rigorous step-by-step derivation rather than numerical answers, making final answer rewards inapplicable,” the team writes.
To address this, DeepSeek emphasises the need for models that can judge and refine their own reasoning. The team argues that “self-verification is particularly important for scaling test-time compute, especially for open problems without known solutions.”
For context, test-time compute refers to allocating large amounts of computation during inference — not training — to let the model reason longer, explore multiple solutions, and refine its answers.
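As a rough illustration of the idea (not DeepSeek’s actual pipeline), test-time compute scaling can be as simple as sampling several candidate solutions and keeping the one a scoring function rates highest. In the sketch below, `generate` and `score` are hypothetical stand-ins for a model call and a quality check.

```python
# Minimal best-of-n sketch of test-time compute scaling.
# `generate` and `score` are hypothetical stand-ins for a model call
# and a solution-quality check; they are not DeepSeek APIs.
from typing import Callable

def best_of_n(problem: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Spend more inference compute by sampling n candidate solutions
    and returning the highest-scoring one."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda solution: score(problem, solution))
```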
DeepSeek’s approach trains a dedicated verifier that scores the quality of proofs, not answers, and then uses this verifier to guide a separate proof-generation model. The generator is rewarded only when it fixes its own mistakes, not when it hides them.
As the paper explains, they “train a proof generator using the verifier as the reward model, and incentivise the generator to identify and resolve as many issues as possible in their own proofs before finalising them.”
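A minimal sketch of what such a training loop could look like is below, assuming a policy-gradient style update. The `generator`, `verifier`, and `optimiser` objects and the reward shaping are illustrative placeholders, not DeepSeek’s released code.

```python
# Illustrative sketch of verifier-guided proof-generator training.
# `generator`, `verifier`, and `optimiser` are hypothetical objects;
# this is not DeepSeek's published implementation.

def training_step(problems, generator, verifier, optimiser):
    rewards, logprobs = [], []
    for problem in problems:
        # The generator drafts a proof, then revises it after
        # critiquing its own work.
        draft, draft_logprob = generator.sample(problem)
        revised, revise_logprob = generator.revise(problem, draft)

        # The verifier scores proof quality (rigour of the derivation),
        # not whether a final numeric answer matches.
        draft_score = verifier.score(problem, draft)
        revised_score = verifier.score(problem, revised)

        # The final proof's quality is the base reward; a bonus for
        # improving on the draft pays the generator for finding and
        # fixing its own issues rather than hiding them.
        reward = revised_score + max(0.0, revised_score - draft_score)

        rewards.append(reward)
        logprobs.append(draft_logprob + revise_logprob)

    # Simple policy-gradient style update: weight each trajectory's
    # log-probability by its reward.
    loss = -sum(r * lp for r, lp in zip(rewards, logprobs)) / len(rewards)
    optimiser.step(loss)
```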
To prevent the system from overfitting to its own checker, DeepSeek continually makes verification harder: it scales up verification compute and automatically labels difficult proofs, so the verifier improves alongside the generator.
In their words, this allows them “to scale verification compute to automatically label new hard-to-verify proofs, creating training data to further improve the verifier.”
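One way to picture that loop, as a sketch under assumptions rather than DeepSeek’s published implementation: run the verifier many times on proofs it finds hard, keep only high-agreement verdicts as new labels, and use them as further training data. The helpers `verifier.judge` and the agreement threshold below are assumptions.

```python
# Sketch of scaling verification compute to auto-label hard proofs.
# `verifier.judge` and the thresholds are assumed for illustration,
# not DeepSeek's actual API.
from collections import Counter

def auto_label_hard_proofs(proofs, verifier, samples: int = 32,
                           agreement: float = 0.9):
    """Spend extra compute (many independent verification passes) on
    hard-to-verify proofs; keep only high-consensus verdicts as labels."""
    labelled = []
    for problem, proof in proofs:
        verdicts = [verifier.judge(problem, proof) for _ in range(samples)]
        label, count = Counter(verdicts).most_common(1)[0]
        if count / samples >= agreement:
            labelled.append((problem, proof, label))
    # The labelled set becomes training data to further improve the verifier.
    return labelled
```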
The model’s weights can be downloaded from Hugging Face. “That’s democratisation of AI and knowledge at its best, literally,” said Delangue.
DeepSeek rose to prominence after releasing a low-cost, open-source model that rivalled US AI systems. The launch of its DeepSeek-R1 reasoning model sparked questions about whether open models could erode the commercial edge of closed products, briefly rattling investor confidence in AI giants such as NVIDIA.

