The International Mathematical Olympiad (IMO) has long stood as a prestigious test of mathematical excellence for the world’s most talented high schoolers.

Now, it’s also emerging as a benchmark for cutting-edge AI systems, challenging them to demonstrate advanced reasoning and problem-solving skills in mathematics.

This year marked a historic milestone as AI models from two leading research labs, Google DeepMind and OpenAI, achieved gold medal performance.

Timeline

  • Friday afternoon: I heard the leaked news that DeepMind had achieved gold medal performance

  • Saturday 1 am: OpenAI announced their results ahead of official confirmation, capturing the media’s attention

  • Later: it emerged that Google and the IMO needed additional time for thorough verification

  • People realized that OpenAI had not formally engaged with the IMO for official verification

  • Monday: DeepMind officially confirmed its gold medal status with solutions that were more elegant and fully verified by IMO officials

Technical Breakdown of AI Solutions

Last year, AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages such as Lean, and to translate the resulting proofs back into natural language. The process also took 2-3 days of computation.
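
To make that formalization step concrete, here is a toy sketch of what “translating into Lean” involves. The theorem below is a deliberately simple stand-in, not an actual AlphaProof input, and it assumes a standard Lean 4 + Mathlib setup.

```lean
-- Toy illustration of the formalization gap (not an actual AlphaProof input):
-- the natural-language claim "the sum of two even integers is even",
-- restated and machine-checked in Lean 4 (Mathlib assumed for `obtain`/`ring`).
theorem even_add_even (a b : Int)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha  -- a = 2 * m
  obtain ⟨n, hn⟩ := hb  -- b = 2 * n
  exact ⟨m + n, by rw [hm, hn]; ring⟩
```

Even for a one-sentence claim like this, both the statement and the proof have to be written in formal syntax before the system can verify them, which is exactly the overhead this year’s natural-language models avoided.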

This year, both Gemini and OpenAI models operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions, all within the 4.5-hour competition time limit.

However, there are interesting differences in these approaches.

At a high level, OpenAI’s results are logically correct, but the writing often lacks clarity:

  • Overuses shorthand and sentence fragments

  • Introduces new terms without definition, such as “forbidden” and “sunny partners”. For example: “List all unordered pairs of points in S and check forbidden condition: Forbidden if x equal or y equal or sum equal.” (The forbidden check is decoded in the sketch after this list.)

  • Lacks structural clarity. “Good lemma for S3” is just dropped mid-proof with no setup

  • Duplicates terminology. “Forbidden” and “Non-sunny” are used interchangeably without explanation

  • Way too verbose. A human writer would have spotted the key lemma, handled the n = 3 base case, built examples for 0/1/3, and finished the problem in roughly 10% of the length
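
For readers trying to decode the quoted shorthand, here is one interpretation in plain code: a pair of points is “forbidden” (i.e., it determines a non-sunny line) when the two points share an x-coordinate, a y-coordinate, or the sum x + y. This is my reading of the quoted sentence, not OpenAI’s actual code, and the helper names are made up for illustration.

```python
from itertools import combinations

def is_forbidden(p, q):
    """One reading of the quoted check: a pair is 'forbidden' if the points
    share an x-coordinate, a y-coordinate, or the sum x + y
    (i.e., they lie on a horizontal, vertical, or slope -1 line)."""
    (x1, y1), (x2, y2) = p, q
    return x1 == x2 or y1 == y2 or x1 + y1 == x2 + y2

def forbidden_pairs(S):
    """List all unordered pairs of points in S that fail the check."""
    return [(p, q) for p, q in combinations(S, 2) if is_forbidden(p, q)]

# Example: (1, 2) and (1, 3) share an x-coordinate, so that pair is forbidden.
print(forbidden_pairs([(1, 2), (1, 3), (2, 4)]))  # [((1, 2), (1, 3))]
```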

Gemini’s solutions are better structured and more elegant. Presented next to human write-ups, they would be hard to distinguish from the work of a strong human contestant. IMO graders found Gemini's solutions to be "clear, precise, and most of them easy to follow," as noted by IMO President Prof. Dr. Gregor Dolinar.

Problem 2, the plane geometry problem, highlighted perhaps the most significant difference between the two approaches. OpenAI employed brute-force analytic geometry, producing a technically correct but lengthy 442-line solution that offered little mathematical insight.

In contrast, DeepMind's Gemini used angle chasing techniques and Sylvester's theorem relating the circumcenter and orthocenter, demonstrating a more elegant mathematical approach that resembled human reasoning.
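
For context, the relation in question (commonly cited as Sylvester’s relation) ties the orthocenter H to the circumcenter O and the vertices of a triangle ABC:

```latex
% Sylvester's relation: O is the circumcenter, H the orthocenter of triangle ABC
\vec{OH} = \vec{OA} + \vec{OB} + \vec{OC}
```

Identities like this let a solver swap pages of coordinate computation for a few short vector and angle arguments, which is a large part of why the resulting proof reads more like a human one.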

Key Innovations Behind the Achievement

Google DeepMind’s Gemini uses parallel thinking, which lets the model concurrently explore and integrate several candidate solutions before settling on a final answer, instead of following a single, linear chain of thought. Additionally, novel reinforcement learning methods leverage extensive multi-step reasoning, problem-solving, and theorem-proving datasets. The model also benefits from a curated collection of high-quality mathematical solutions, supplemented by general hints and strategies for tackling IMO problems.
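
As a rough mental model only, and not DeepMind’s actual implementation, parallel thinking can be pictured as sampling several independent reasoning paths and keeping whichever one a scoring or verification step prefers, rather than committing to the first chain of thought. The function names below are invented for illustration.

```python
import random

def generate_candidate(problem, seed):
    """Stand-in for one independent reasoning path; a real system would
    sample a full chain of thought from the model here."""
    rng = random.Random(seed)
    return {"solution": f"candidate-{seed} for {problem}", "score": rng.random()}

def parallel_thinking(problem, n_paths=8):
    """Toy sketch of 'explore several solutions, then settle on one':
    generate n_paths candidates (concurrently in a real system) and keep
    the one the scoring step likes best."""
    candidates = [generate_candidate(problem, seed) for seed in range(n_paths)]
    return max(candidates, key=lambda c: c["score"])

print(parallel_thinking("toy IMO-style problem"))
```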

OpenAI is advancing general-purpose reinforcement learning and scaling test-time compute, moving beyond narrow, task-specific methods to achieve this level of capability.

What This Means for the Future of Mathematics

We have started to see AI solve math competition problems from contests like AIME, USAMO, and the IMO. AI has improved faster than most people predicted, but it still has a long way to go to reach the level of research mathematicians. As with any exam, competition problems alone cannot measure the real capabilities of AI in a domain. Real mathematical research demands far more than solving high-school problems: logical reasoning, deep understanding of mathematical theory, abstract problem-solving, rapid assimilation of new knowledge, and even the creation of entirely new problems and solutions. To reach that ultimate goal, we still need breakthrough innovations. My bet is on designing precise reward functions that target distinct mathematical skills and using reinforcement learning to optimize each one.
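
To make that bet slightly more concrete, here is a hypothetical sketch, entirely my own framing rather than a known recipe from either lab, of what skill-specific reward functions might look like; every name and weight below is invented for illustration.

```python
# Hypothetical sketch: each reward targets a distinct skill that RL could
# optimize separately. None of this reflects a known recipe from either lab.

def correctness_reward(proof, verifier):
    """1.0 if an external checker (e.g. a formal verifier) accepts the proof."""
    return 1.0 if verifier(proof) else 0.0

def clarity_reward(proof_text, max_len=2000):
    """Crude proxy for concise writing: the reward shrinks as the proof grows."""
    return max(0.0, 1.0 - len(proof_text) / max_len)

def combined_reward(proof, proof_text, verifier, w_correct=0.8, w_clarity=0.2):
    """Weighted mix of distinct skills; each term could also drive its own RL run."""
    return (w_correct * correctness_reward(proof, verifier)
            + w_clarity * clarity_reward(proof_text))

# Toy usage with a stand-in verifier that accepts everything.
print(combined_reward(proof="...", proof_text="A short, clear proof.",
                      verifier=lambda p: True))
```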

But who knows? We might just see a wild-card approach that surprises everyone and takes AI math capabilities to the next level.

As AI systems continue to advance in mathematical reasoning, the demand for GPUs will only grow. Hyperbolic’s mission to provide accessible, affordable GPUs becomes even more critical for researchers pushing the boundaries of AI capabilities in domains like mathematics.

Check out the full podcast here:

About Hyperbolic

Hyperbolic is the on-demand AI cloud made for developers. We provide fast, affordable access to compute, inference, and AI services. Over 195,000 developers use Hyperbolic to train, fine-tune, and deploy models at scale.

Our platform has quickly become a favorite among AI researchers, including Andrej Karpathy. We collaborate with teams at Hugging Face, Vercel, Quora, Chatbot Arena, LMSYS, OpenRouter, Black Forest Labs, Stanford, Berkeley, and beyond.

Founded by AI researchers from UC Berkeley and the University of Washington, Hyperbolic is built for the next wave of AI innovation—open, accessible, and developer-first.

Website | X | Discord | LinkedIn | YouTube | GitHub | Documentation