Research by Shan Yu, LMSYS Contributor: Optimizing Multi-LLM Serving with Hyperbolic

Shan Yu, in collaboration with researchers from UCLA and LMSYS, has studied how to optimize multi-LLM serving infrastructure using Hyperbolic Labs' GPU clusters. The work characterizes real-world LLM deployment patterns and introduces a system that addresses the challenge of using GPU resources efficiently.

Prism: A Novel Approach to Multi-LLM Serving

Shan and colleagues analyzed 4 months of production data from Hyperbolic's infrastructure, which serves 24 different models. Their research identified several key challenges in multi-LLM serving environments and led to Prism, a dynamic GPU sharing and memory coordination system that significantly improves resource utilization.

Key Findings from Hyperbolic's Workload

Long-tail model popularity: A small number of models receive most requests, while many models get minimal traffic (<100 requests/hour)
Frequent idle periods: Over 60% of models experience more than 1,000 idle periods (>10 seconds each), with some models idle for at least 27% of the time
Rapid workload fluctuations: Request rates can change by more than 5× within just one minute
Resource underutilization: Using traditional dedicated GPU allocation, Hyperbolic's workload would require ~120 H100 GPUs but achieve only 1.3% compute utilization and 28.4% memory utilization on average

These findings highlight a critical problem in multi-LLM serving: with traditional static partitioning, resources are significantly underutilized, leading to inefficient GPU usage and higher operational costs.
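
As a rough illustration of how the idle-period and fluctuation statistics above could be derived from a request log, here is a minimal Python sketch. It assumes a per-model list of request arrival times in seconds, and it treats any gap between consecutive arrivals longer than 10 seconds as an idle period, ignoring the time spent processing each request. The function names are hypothetical and this is not the analysis code used in the paper.

```python
from collections import Counter

IDLE_GAP_SECONDS = 10.0  # idle threshold quoted in the findings above


def idle_periods(timestamps, min_gap=IDLE_GAP_SECONDS):
    """Return (start, end) pairs for arrival gaps longer than min_gap seconds."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > min_gap]


def idle_fraction(timestamps, min_gap=IDLE_GAP_SECONDS):
    """Fraction of the trace's time span that falls inside idle periods."""
    ts = sorted(timestamps)
    if len(ts) < 2 or ts[-1] == ts[0]:
        return 0.0
    idle = sum(b - a for a, b in idle_periods(ts, min_gap))
    return idle / (ts[-1] - ts[0])


def max_minute_to_minute_change(timestamps):
    """Largest ratio between request counts in adjacent one-minute buckets.

    Pairs where either minute has zero requests are skipped, since the
    ratio is undefined there (an assumption made for this sketch).
    """
    counts = Counter(int(t // 60) for t in timestamps)
    if not counts:
        return 1.0
    best = 1.0
    for minute in range(min(counts), max(counts)):
        a, b = counts.get(minute, 0), counts.get(minute + 1, 0)
        if a > 0 and b > 0:
            best = max(best, max(a, b) / min(a, b))
    return best


# Toy trace for one model: sparse traffic in minute 0, a burst in minute 1.
trace = [0, 5, 20, 25] + list(range(60, 72))
print(idle_periods(trace))                 # gaps longer than 10 s
print(round(idle_fraction(trace), 3))      # share of the span spent idle
print(max_minute_to_minute_change(trace))  # burstiness between adjacent minutes
```

Running the same functions over each model's trace would give per-model idle counts and idle fractions that can be compared against the thresholds listed above.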

Prism's Performance Improvements

When tested with Hyperbolic's trace data, Prism demonstrated substantial performance improvements:

2.3× more requests supported than MuxServe++ while maintaining 99% SLO attainment
3.5× more requests than static partitioning approaches
Ability to achieve 99% TTFT SLO attainment at higher request rates than existing systems

Metric	Value
Production Data Period	4 months
Models Analyzed	24
Requests Supported vs MuxServe++	2.3×
Requests Supported vs Static Partitioning	3.5×
TTFT SLO Attainment	99%
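
To make the SLO figures concrete: SLO attainment is the fraction of requests that meet a latency target, here on time to first token (TTFT). The Python sketch below shows one plausible way to compute it and to find the highest request rate that still reaches 99% attainment. The 1-second TTFT target, the sample measurements, and the function names are assumptions for illustration, not values or code from the paper.

```python
def slo_attainment(ttfts, slo_seconds):
    """Fraction of requests whose TTFT is within the SLO target."""
    if not ttfts:
        raise ValueError("need at least one TTFT measurement")
    met = sum(1 for t in ttfts if t <= slo_seconds)
    return met / len(ttfts)


def max_rate_meeting_target(results, slo_seconds, target=0.99):
    """Highest request rate whose attainment is at least `target`.

    `results` maps a request rate (e.g. requests/second) to the list of
    TTFTs measured at that rate. Returns None if no rate qualifies.
    """
    passing = [
        rate
        for rate, ttfts in results.items()
        if slo_attainment(ttfts, slo_seconds) >= target
    ]
    return max(passing) if passing else None


# Hypothetical measurements for one serving system at three request rates.
measurements = {
    10: [0.1] * 100,              # every request meets the target
    20: [0.1] * 99 + [5.0],       # exactly 99% meet the target
    30: [0.1] * 90 + [5.0] * 10,  # only 90% meet the target
}
print(max_rate_meeting_target(measurements, slo_seconds=1.0))
```

Under this reading, a claim such as "2.3× more requests" would correspond to the ratio of the highest passing rates measured for two systems at the same attainment target.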

Impact of Hyperbolic Labs' Infrastructure

The research leveraged Hyperbolic's production environment to analyze real-world LLM serving patterns at scale. This unique access to production data enabled the researchers to identify the challenges of multi-LLM serving and develop Prism as an effective solution.

Looking Ahead

For a service provider like Hyperbolic, Prism's approach could:

Significantly reduce the number of GPUs needed to serve the same workload
Better handle the bursty, unpredictable nature of its multi-model serving patterns
Improve cost efficiency while maintaining service quality

The paper uses Hyperbolic as a representative example of real-world multi-LLM serving challenges and demonstrates how dynamic GPU sharing and memory coordination can address the inefficiencies inherent in its workload patterns.

"By leveraging Hyperbolic Labs’ production environment, we were able to study real-world multi-LLM serving patterns at scale. Our research identified critical inefficiencies in traditional GPU allocation, and through Prism we demonstrated how dynamic GPU sharing and memory coordination can dramatically improve utilization while maintaining service quality. Hyperbolic’s infrastructure provided the foundation for this work, making it possible to translate research into actionable solutions for the future of large-scale LLM deployment." — Shan Yu, PhD student at UCLA, LMSYS Contributor

About Hyperbolic

Hyperbolic is the on-demand AI cloud made for developers. We provide fast, affordable access to compute, inference, and AI services. Over 195,000 developers use Hyperbolic to train, fine-tune, and deploy models at scale.

Our platform has quickly become a favorite among AI researchers, including Andrej Karpathy. We collaborate with teams at Hugging Face, Vercel, Quora, Chatbot Arena, LMSYS, OpenRouter, Black Forest Labs, Stanford, Berkeley, and beyond.

Founded by AI researchers from UC Berkeley and the University of Washington, Hyperbolic is built for the next wave of AI innovation—open, accessible, and developer-first.
