Shan Yu, in collaboration with researchers from UCLA and LMSYS, has studied how to optimize multi-LLM serving infrastructure using Hyperbolic Labs' GPU clusters. The work characterizes real-world LLM deployment patterns and introduces a system that uses GPU compute and memory more efficiently.
Prism: A Novel Approach to Multi-LLM Serving
Across multiple experiments, Shan and colleagues analyzed 4 months of production data from Hyperbolic's infrastructure serving 24 different models. Their research identified several key challenges in multi-LLM serving environments and developed Prism—a dynamic GPU sharing and memory coordination system that significantly improves resource utilization.
Key Findings from Hyperbolic's Workload
Long-tail model popularity: A small number of models receive most requests, while many models get minimal traffic (<100 requests/hour)
Frequent idle periods: Over 60% of models experience more than 1,000 idle periods (>10 seconds each), with some models idle for at least 27% of the time
Rapid workload fluctuations: Request rates can change by more than 5× within just one minute
Resource underutilization: Using traditional dedicated GPU allocation, Hyperbolic's workload would require ~120 H100 GPUs but achieve only 1.3% compute utilization and 28.4% memory utilization on average
These findings highlight a critical problem in multi-LLM serving: with traditional static partitioning, resources are significantly underutilized, leading to inefficient GPU usage and higher operational costs.
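To make the idle-period and fluctuation metrics above concrete, here is a minimal Python sketch of how they could be computed from one model's request timestamps. The function names and the sample trace are hypothetical, and the paper may define these metrics differently. The 10-second gap and one-minute window are taken from the findings above.

```python
from collections import Counter


def idle_periods(timestamps, min_gap=10.0):
    """Return the gaps (in seconds) between consecutive requests longer than min_gap."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:]) if b - a > min_gap]


def idle_fraction(timestamps, min_gap=10.0):
    """Share of the observed time span spent in idle periods."""
    ts = sorted(timestamps)
    span = ts[-1] - ts[0]
    if span <= 0:
        return 0.0
    return sum(idle_periods(ts, min_gap)) / span


def max_rate_change(timestamps, window=60.0):
    """Largest ratio between request counts in two consecutive non-empty windows."""
    counts = Counter(int(t // window) for t in timestamps)
    ratios = []
    for w, current in counts.items():
        following = counts.get(w + 1, 0)
        if following:
            ratios.append(max(current, following) / min(current, following))
    return max(ratios, default=1.0)


# Hypothetical trace: 5 requests in the first minute, 25 in the second.
trace = [0.0, 2.0, 30.0, 31.0, 50.0] + [60.0 + i for i in range(25)]

print(idle_periods(trace))     # gaps longer than 10 seconds
print(idle_fraction(trace))    # fraction of the span spent idle
print(max_rate_change(trace))  # minute-over-minute change in request count
```

On this toy trace the request count changes by 5× between the first and second minute, which is the kind of swing the workload analysis describes.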
Prism's Performance Improvements
When tested with Hyperbolic's trace data, Prism demonstrated substantial performance improvements:
2.3× more requests supported than MuxServe++ while maintaining 99% SLO attainment
3.5× more requests than static partitioning approaches
99% time-to-first-token (TTFT) SLO attainment at higher request rates than existing systems
| Metric | Value |
|---|---|
| Production data period | 4 months |
| Models analyzed | 24 |
| Requests supported vs. MuxServe++ | 2.3× |
| Requests supported vs. static partitioning | 3.5× |
| TTFT SLO attainment | 99% |
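For readers unfamiliar with the metric, SLO attainment is the fraction of requests that meet a latency target. The sketch below shows one way to compute TTFT SLO attainment and to find the highest request rate that still reaches 99%. The 1-second TTFT target, the request rates, and the latency samples are all hypothetical. The source does not state the SLO threshold used in the evaluation.

```python
def slo_attainment(ttfts, slo_seconds):
    """Fraction of requests whose time to first token is within the SLO."""
    if not ttfts:
        raise ValueError("no requests to evaluate")
    met = sum(1 for t in ttfts if t <= slo_seconds)
    return met / len(ttfts)


def max_rate_meeting_target(results, slo_seconds, target=0.99):
    """Highest request rate whose attainment reaches the target, or None.

    `results` maps a request rate to the TTFT samples measured at that rate.
    """
    passing = [
        rate
        for rate, ttfts in results.items()
        if slo_attainment(ttfts, slo_seconds) >= target
    ]
    return max(passing, default=None)


# Hypothetical measurements: TTFT samples (seconds) at three request rates.
results = {
    10: [0.2] * 200,               # every request meets the SLO
    20: [0.2] * 199 + [3.0],       # 199 of 200 meet the SLO
    40: [0.2] * 180 + [3.0] * 20,  # 180 of 200 meet the SLO
}

print(max_rate_meeting_target(results, slo_seconds=1.0))
```

Comparing two serving systems on this basis means running the same sweep for each and comparing the highest rate that passes.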
Impact of Hyperbolic Labs' Infrastructure
The research leveraged Hyperbolic's production environment to analyze real-world LLM serving patterns at scale. This unique access to production data enabled the researchers to identify the challenges of multi-LLM serving and develop Prism as an effective solution.
Looking Ahead
For a service provider like Hyperbolic, Prism's approach could:
Significantly reduce the number of GPUs needed to serve the same workload
Better handle the bursty, unpredictable nature of its multi-model serving patterns
Improve cost efficiency while maintaining service quality
The paper uses Hyperbolic's workload as a representative example of real-world multi-LLM serving and demonstrates how dynamic GPU sharing and memory coordination can address the inefficiencies in that workload.
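The intuition for why sharing reduces GPU count can be shown with a toy calculation. This is not Prism's algorithm: it ignores model loading, switching costs, and memory coordination, which are the hard parts a real system has to handle. It only compares provisioning each model for its own peak against provisioning a shared pool for the peak of total demand. The model names and demand figures are hypothetical.

```python
def gpus_static(demand):
    """Static partitioning: each model is provisioned for its own peak demand."""
    return sum(max(series) for series in demand.values())


def gpus_shared(demand):
    """Shared pool: provision for the peak of total demand across all models."""
    return max(sum(step) for step in zip(*demand.values()))


# Hypothetical GPU demand per model over four time steps.
demand = {
    "model_a": [4, 1, 0, 0],
    "model_b": [0, 3, 1, 0],
    "model_c": [0, 0, 2, 1],
}

print(gpus_static(demand))  # sum of per-model peaks: 4 + 3 + 2
print(gpus_shared(demand))  # peak of the per-step totals: 4, 4, 3, 1
```

Because the models peak at different times, the shared pool needs fewer GPUs than the sum of per-model peaks. The gap grows when demand is bursty and long-tailed, as in the workload described above.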
"By leveraging Hyperbolic Labs’ production environment, we were able to study real-world multi-LLM serving patterns at scale. Our research identified critical inefficiencies in traditional GPU allocation, and through Prism we demonstrated how dynamic GPU sharing and memory coordination can dramatically improve utilization while maintaining service quality. Hyperbolic’s infrastructure provided the foundation for this work, making it possible to translate research into actionable solutions for the future of large-scale LLM deployment." — Shan Yu, PhD student at UCLA, LMSYS Contributor
About Hyperbolic
Hyperbolic is the on-demand AI cloud made for developers. We provide fast, affordable access to compute, inference, and AI services. Over 195,000 developers use Hyperbolic to train, fine-tune, and deploy models at scale.
Our platform has quickly become a favorite among AI researchers, including Andrej Karpathy. We collaborate with teams at Hugging Face, Vercel, Quora, Chatbot Arena, LMSYS, OpenRouter, Black Forest Labs, Stanford, Berkeley, and beyond.
Founded by AI researchers from UC Berkeley and the University of Washington, Hyperbolic is built for the next wave of AI innovation—open, accessible, and developer-first.