OpenAI Breaks New Ground: Introducing GPT-OSS, the Latest Open-Source AI Models

In a surprising move that could reshape the landscape of artificial intelligence, OpenAI has unveiled two new open-source language models: gpt-oss-120b and gpt-oss-20b. Announced on August 5, 2025, these models represent a significant shift for the company, traditionally known for its proprietary models, towards greater openness and accessibility in AI development.[Introducing gpt-oss, OpenAI]

The Models: Power and Efficiency Combined

The flagship model, gpt-oss-120b, boasts 117 billion parameters in total, with a mixture-of-experts (MoE) architecture that activates only 5.1 billion parameters per token. This design allows it to run efficiently on a single 80 GB GPU, making it feasible for a wider range of developers and organizations without massive computational resources. Its smaller sibling, gpt-oss-20b, features 21 billion total parameters (activating 3.6 billion per token) and can operate on edge devices with just 16 GB of memory, enabling on-device inference and local applications.[Introducing gpt-oss, OpenAI]

Both models support context lengths up to 128,000 tokens and are trained primarily on English text data focused on STEM, coding, and general knowledge. They utilize the open-sourced o200k_harmony tokenizer and are post-trained on the harmony prompt format, with rendering tools available in Python and Rust for easy integration. [Introducing gpt-oss, OpenAI]
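Since the models are post-trained on the harmony format, prompts should be rendered into it before inference. Below is a minimal sketch that assumes the Hugging Face tokenizer for the gpt-oss checkpoints ships a harmony chat template; the repository name is an assumption, and the dedicated Python and Rust renderers mentioned above remain the reference path.

```python
# Sketch: rendering a conversation into the harmony prompt format via the
# Hugging Face tokenizer. Assumes the "openai/gpt-oss-20b" repository ships a
# harmony chat template; the dedicated harmony renderers are the reference
# implementation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize mixture-of-experts in one sentence."},
]

# Produce the harmony-formatted prompt string the model expects at inference.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```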

Standout Capabilities: From Reasoning to Real-World Applications

What sets these models apart is their emphasis on practical, agentic workflows. They excel in tool use, few-shot function calling, and Chain-of-Thought (CoT) reasoning, with adjustable effort levels (low, medium, high) to balance speed and depth. For instance, they can handle web searches, Python code execution, and structured outputs, making them ideal for building intelligent agents.[Introducing gpt-oss, OpenAI]
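To make the tool-use claim concrete, here is a minimal function-calling sketch against an OpenAI-compatible endpoint serving gpt-oss. The base URL, API key, model identifier, and the get_weather tool are placeholders for illustration, not part of the official release.

```python
# Sketch: function calling with a gpt-oss model behind an OpenAI-compatible
# server. base_url, api_key, the model name, and the get_weather tool schema
# are all placeholders -- substitute what your serving stack actually exposes.
from openai import OpenAI

client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss-120b",  # placeholder model identifier
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)

# If the model decides to call the tool, the structured call appears here.
print(response.choices[0].message.tool_calls)
```

Reasoning effort is typically selected in the system prompt (low, medium, or high) under the harmony format, letting the same deployment trade latency for depth per request.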

In specialized domains, the models shine: they outperform proprietary counterparts like OpenAI's o1 and GPT-4o on HealthBench for health-related queries. Their instruction-following abilities and customizability further enhance their utility for developers looking to fine-tune for specific tasks.[Introducing gpt-oss, OpenAI]

Benchmark Performance: Competing with the Best

OpenAI's benchmarks paint an impressive picture. The gpt-oss-120b achieves near-parity with o4-mini on core reasoning tasks and surpasses it in areas like competition mathematics (AIME 2024 & 2025) and health queries. It also edges out o3-mini on tool calling (TauBench) and general problem-solving (MMLU, HLE).[Introducing gpt-oss, OpenAI]

Even the more compact gpt-oss-20b holds its own, matching or exceeding o3-mini across similar evaluations while being optimized for lower-resource environments. Safety-wise, both models perform comparably to frontier systems on internal benchmarks, and adversarially fine-tuned variants were tested under OpenAI's Preparedness Framework (see pages 3-4 of the safety paper).[OSS Safety Paper]

For context among open-weight peers, gpt-oss-120b also compares favorably to models like DeepSeek-R1, outperforming it in agentic tasks like TauBench while being on par in math and code benchmarks.[DeepSeek-R1 Benchmarks]

To further highlight the competitive landscape, the two graphs below compare the performance of the gpt-oss models against DeepSeek-R1 and other leading models across key benchmarks.

Complementing these visuals, OpenAI's Table 3 provides comprehensive evaluations across multiple benchmarks and reasoning levels (low, medium, high) for both gpt-oss-120b and gpt-oss-20b. Key highlights include gpt-oss-120b's high-effort scores: 95.8% on AIME 2024 (no tools), 97.9% on AIME 2025 (with tools), 90.0% on MMLU, and 62.4% on SWE-Bench Verified, demonstrating robust performance in math, science, and software engineering.

Evaluating GPT-OSS from a Compute Perspective

From a compute standpoint, the GPT-OSS models stand out for their efficiency, thanks to the mixture-of-experts (MoE) architecture, which activates only 5.1 billion parameters per token for gpt-oss-120b and 3.6 billion for gpt-oss-20b despite total parameter counts of 117 billion and 21 billion, respectively. This design drastically reduces memory footprint and computational overhead, enabling gpt-oss-120b to run on a single 80 GB GPU (such as an NVIDIA H100) and gpt-oss-20b on devices with just 16 GB of memory, including consumer hardware like AMD Radeon GPUs.[NVIDIA Blog, NVIDIA][AMD Blog, AMD]
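A rough back-of-envelope illustrates why the active-parameter count matters; the ~2 FLOPs per active parameter per token figure used below is a common rule of thumb, not an official number.

```python
# Back-of-envelope: forward-pass compute per token scales with *active*
# parameters (~2 FLOPs per active parameter per token is a common rule of
# thumb), so MoE routing is far cheaper than a dense model of the same size.
def flops_per_token(active_params: float) -> float:
    return 2.0 * active_params

dense_117b = flops_per_token(117e9)  # hypothetical dense model at 117B params
moe_120b = flops_per_token(5.1e9)    # gpt-oss-120b: 5.1B active per token
moe_20b = flops_per_token(3.6e9)     # gpt-oss-20b: 3.6B active per token

print(f"dense 117B  : {dense_117b:.2e} FLOPs/token")
print(f"gpt-oss-120b: {moe_120b:.2e} FLOPs/token (~{dense_117b / moe_120b:.0f}x less)")
print(f"gpt-oss-20b : {moe_20b:.2e} FLOPs/token")
```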

Both models are natively quantized in MXFP4 format, further optimizing memory usage and inference speed without significant performance degradation, though MXFP4 requires NVIDIA Hopper (e.g., H100) or Blackwell architectures for full support; on older hardware, it may default to BF16, increasing memory needs. For instance, gpt-oss-120b fits comfortably within 80 GB on compatible GPUs, allowing for efficient deployment, while gpt-oss-20b's low requirements make it ideal for edge computing and on-device applications.[Introducing gpt-oss, OpenAI]
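For intuition on why the 120b model fits in 80 GB, here is a rough weight-memory estimate assuming roughly 4.25 bits per parameter for MXFP4 once shared block scales are included; this approximation ignores higher-precision layers, activations, and the KV cache.

```python
# Rough weight-memory estimate under MXFP4 (~4.25 bits/parameter including
# shared block scales -- an approximation). Ignores higher-precision layers,
# activations, and KV cache, so treat it as a lower bound, not a sizing guide.
def weight_gb(total_params: float, bits_per_param: float = 4.25) -> float:
    return total_params * bits_per_param / 8 / 1e9

print(f"gpt-oss-120b weights: ~{weight_gb(117e9):.0f} GB (vs. an 80 GB GPU)")
print(f"gpt-oss-20b weights : ~{weight_gb(21e9):.0f} GB (vs. 16 GB-class devices)")
```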

In terms of inference performance, early benchmarks highlight impressive throughput: on NVIDIA's Blackwell GB200 NVL72 system, gpt-oss-120b can achieve up to 1.5 million tokens per second, showcasing scalability for cloud-scale operations. On a single H100, users can expect robust speeds for real-time applications, though exact tokens-per-second figures vary with context length and reasoning effort (low, medium, high). The training runs themselves underscore the models' compute efficiency: gpt-oss-120b required over 2.1 million GPU hours on H100s, with gpt-oss-20b needing about ten times less, demonstrating optimized pre-training pipelines.[NVIDIA Blog, NVIDIA][Cerebras News, Cerebras]
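When deploying on your own hardware, a quick streaming smoke test gives a feel for single-request throughput. The endpoint and model name below are placeholders, and real numbers depend heavily on batching, context length, and reasoning effort.

```python
# Sketch: crude single-request tokens/sec check against any OpenAI-compatible
# endpoint serving gpt-oss. base_url and model are placeholders; production
# throughput depends on batching, context length, and reasoning effort.
import time

from openai import OpenAI

client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

start = time.time()
pieces = 0
stream = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder model identifier
    messages=[{"role": "user", "content": "Explain KV caching in three sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        pieces += 1  # roughly one token per streamed delta

elapsed = time.time() - start
print(f"~{pieces / elapsed:.1f} tokens/s (single request, unbatched)")
```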

For developers, this translates to lower barriers: fine-tuning or inference doesn't demand massive clusters. Platforms like Hyperbolic democratize access further by offering on-demand H100 rentals at $1.49 per hour, enabling cost-effective experimentation without upfront hardware investments. This is particularly advantageous for MoE models like these, where vLLM or Ollama can leverage the GPU for high-throughput serving, supporting long contexts up to 128k tokens.
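As a concrete example of the vLLM path, here is a minimal offline-generation sketch; the model identifier and engine arguments are assumptions, and the recommended flags for MXFP4 and 128k contexts should be taken from the vLLM and gpt-oss documentation.

```python
# Sketch: offline generation with vLLM on a single rented GPU. The model id and
# engine arguments are assumptions -- check the vLLM and gpt-oss docs for the
# recommended settings, especially for MXFP4 support and long contexts.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-20b",  # the 120b variant needs an 80 GB-class GPU
    max_model_len=32768,         # raise toward 128k if memory allows
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Write a haiku about open-weight models."], params)
print(outputs[0].outputs[0].text)
```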

Overall, GPT-OSS models redefine GPU-efficient AI, balancing frontier-level capabilities with practical compute demands, making advanced reasoning accessible to a broader audience.

Release and Accessibility: Open for All on Hyperbolic

The models are now available for serverless inference on Hyperbolic, enabling easy access via API for chat completions. Developers can integrate them using the standard OpenAI-compatible chat completions structure (a minimal example follows the links below). Interactive experimentation is also possible through the playground interfaces at the following model pages:

For the 120b parameter model: https://app.hyperbolic.ai/models/gpt-oss-120b

For the 20b parameter model: https://app.hyperbolic.ai/models/gpt-oss-20b
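Here is a minimal sketch of the API route using the standard OpenAI Python client; the base URL and model identifier below are assumptions based on the usual pattern for OpenAI-compatible providers, so verify both against Hyperbolic's documentation.

```python
# Sketch: calling gpt-oss-120b through Hyperbolic's OpenAI-compatible chat
# completions API. The base_url and model name are assumptions -- confirm the
# exact values in Hyperbolic's documentation before use.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.hyperbolic.xyz/v1",  # assumed endpoint
    api_key="YOUR_HYPERBOLIC_API_KEY",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give one use case for an open-weight 120B model."},
    ],
)
print(response.choices[0].message.content)
```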

This release democratizes access to high-performance AI, potentially accelerating innovation in fields like healthcare, education, and software development. By open-sourcing these models, OpenAI invites the global community to build upon them, fostering a more collaborative future for AI. As the dust settles on this morning's drop, one thing is clear: the era of accessible, powerful open-source AI has arrived.

References

[Introducing gpt-oss, OpenAI] https://openai.com/index/introducing-gpt-oss/

[OSS Safety Paper] https://cdn.openai.com/pdf/231bf018-659a-494d-976c-2efdfc72b652/oai_gpt-oss_Model_Safety.pdf

[NVIDIA Blog, NVIDIA] https://developer.nvidia.com/blog/delivering-1-5-m-tps-inference-on-nvidia-gb200-nvl72-nvidia-accelerates-openai-gpt-oss-models-from-cloud-to-edge/

[AMD Blog, AMD] https://www.amd.com/en/blogs/2025/how-to-run-openai-gpt-oss-20b-120b-models-on-amd-ryzen-ai-radeon.html

[Cerebras News, Cerebras] https://www.cerebras.ai/blog/cerebras-launches-openai-s-gpt-oss-120b-at-a-blistering-3-000-tokens-sec

[DeepSeek-R1 Benchmarks] https://huggingface.co/deepseek-ai/DeepSeek-R1

About Hyperbolic

Hyperbolic is the on-demand AI cloud made for developers. We provide fast, affordable access to compute, inference, and AI services. Over 195,000 developers use Hyperbolic to train, fine-tune, and deploy models at scale.

Our platform has quickly become a favorite among AI researchers, including Andrej Karpathy. We collaborate with teams at Hugging Face, Vercel, Quora, Chatbot Arena, LMSYS, OpenRouter, Black Forest Labs, Stanford, Berkeley, and beyond.

Founded by AI researchers from UC Berkeley and the University of Washington, Hyperbolic is built for the next wave of AI innovation—open, accessible, and developer-first.

Website | X | Discord | LinkedIn | YouTube | GitHub | Documentation