Sijun Tan, a researcher at UC Berkeley (previously project lead at Agentica), led the development of LLoCO (Learning Long Contexts Offline), a novel approach for efficient long-context processing in large language models. By combining context compression with in-domain parameter-efficient finetuning, LLoCO enables a LLaMA2-7B model with a 4k-token window to handle contexts of up to 128k tokens while using 30× fewer tokens at inference.

Hyperbolic Labs’ GPU infrastructure, specifically a week on 8× NVIDIA H100 SXM GPUs, was integral to training and evaluating LLoCO on long-context benchmarks, where it achieved both performance gains and significant cost reductions.

LLoCO: Compress, Finetune, and Serve

The Challenge

Transformer-based LLMs incur quadratic compute and memory costs as context length increases. This makes processing 100k+ token documents prohibitively expensive in both latency and compute cost. Existing context extension methods often suffer from degraded performance or high VRAM usage.
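To give a rough sense of that scaling (assuming attention compute grows with the square of sequence length), a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope illustration; assumption: self-attention cost scales ~ n^2.
short_ctx, long_ctx = 4_000, 128_000
ratio = (long_ctx / short_ctx) ** 2
print(f"~{ratio:.0f}x")  # a 128k window costs ~1024x the attention compute of a 4k window
```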

The Solution

LLoCO addresses this by:

  1. Offline Context Compression – Long documents are chunked and processed by a context encoder (AutoCompressor) to produce “summary embeddings” 30× shorter than the original.

  2. In-Domain LoRA Finetuning – These compressed embeddings are used to finetune LoRA adapters specific to document domains (e.g., academic papers, narratives), aligning the LLM to interpret and reason over compressed contexts.

  3. RAG-Compatible Serving – At inference, retrieved compressed embeddings are prepended to the LLM prompt with the matching LoRA adapter applied, enabling fast, accurate retrieval-augmented generation (a rough code sketch of this flow follows below).
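To make the flow concrete, here is a minimal sketch of LLoCO-style serving built on a Hugging Face transformers + peft stack. This is not the authors' implementation: `compress_document` is a toy stand-in for the AutoCompressor encoder, and the adapter path is hypothetical.

```python
# Minimal sketch of the LLoCO-style serving flow (not the authors' code).
# Assumptions: compress_document is a toy stand-in for the AutoCompressor
# encoder, and "path/to/domain-lora" is a hypothetical adapter location.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, "path/to/domain-lora")  # hypothetical domain adapter

def compress_document(chunks: list[str]) -> torch.Tensor:
    """Toy stand-in for the offline compression step: mean-pool each chunk's
    token embeddings into one vector, so k chunks -> k 'summary' embeddings.
    The real AutoCompressor produces far richer learned summaries."""
    emb = model.get_input_embeddings()
    vecs = []
    for chunk in chunks:
        ids = tokenizer(chunk, return_tensors="pt").input_ids
        vecs.append(emb(ids).mean(dim=1).squeeze(0))
    return torch.stack(vecs)  # shape: (k, hidden_size)

def answer(question: str, summary_embeds: torch.Tensor) -> str:
    # Embed the question normally, then prepend the compressed context
    # embeddings where the raw document tokens would otherwise go.
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    q_embeds = model.get_input_embeddings()(q_ids)
    inputs_embeds = torch.cat([summary_embeds.unsqueeze(0).to(q_embeds.dtype), q_embeds], dim=1)
    out = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Offline: compress once and store the embeddings; online: answer queries against them.
summaries = compress_document(["chunk 1 of a long document...", "chunk 2..."])
print(answer("What is the main finding?", summaries))
```

In LLoCO proper, step 2 finetunes the LoRA adapter on compressed embeddings from the target domain, which is what teaches the model to reason over these summaries; that training loop is omitted from the sketch above.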

Results

Performance Gains

Across long-document QA datasets (QuALITY, Qasper, NarrativeQA, HotpotQA, and QMSum), LLoCO:

  • Outperformed both uncompressed and retrieval-based baselines on all datasets.

  • Matched or exceeded the performance of LLaMA2-7B-32k while using 30× fewer tokens.

  • Achieved a +13.64 average score improvement over AutoCompressor without finetuning.

Efficiency Gains

Using Hyperbolic’s H100 SXM infrastructure:

  • Inference Latency: Up to 7.62× speedup for long-context sequences (128k tokens) compared to baseline.

  • Finetuning Throughput: Up to 11.52× higher throughput than full-context finetuning on NarrativeQA.

  • Enabled processing of 128k-token sequences within the VRAM limits of a single H100, whereas uncompressed models ran out of memory beyond 32k.

Experimental Setup

  • GPUs Used: 8× NVIDIA H100 SXM

  • Base Model: LLaMA2-7B

  • Max Effective Context: 128,000 tokens

  • Compression Ratio: 30×

  • Datasets: QuALITY, Qasper, NarrativeQA, HotpotQA, QMSum

  • Finetuning Method: LoRA (rank=8)
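For reference, a rank-8 LoRA setup like the one above can be expressed with the peft library roughly as follows; only r=8 comes from the setup, while the alpha, dropout, and target modules are our assumptions rather than values reported by the LLoCO authors.

```python
# Rough peft configuration matching the rank-8 LoRA in the setup above.
# Only r=8 comes from the table; the other hyperparameters are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=8,                                   # rank from the experimental setup
    lora_alpha=16,                         # assumption: common default
    target_modules=["q_proj", "v_proj"],   # assumption: typical LLaMA attention projections
    lora_dropout=0.05,                     # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only the adapter weights are trainable
```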

Impact of Hyperbolic Labs' Infrastructure

LLoCO’s training required stable, high-bandwidth GPUs to:

  • Maintain persistent environments for inner-loop/outer-loop optimization.

  • Perform long-sequence compression and finetuning without hitting memory bottlenecks.

  • Run large-batch experiments for retrieval-augmented generation compatibility testing.

Hyperbolic’s on-demand H100 compute allowed the team to iterate quickly on compression ratios, LoRA configurations, and retrieval pipelines, significantly accelerating the research cycle.

About Hyperbolic

Hyperbolic is the on-demand AI cloud made for developers. We provide fast, affordable access to compute, inference, and AI services. Over 195,000 developers use Hyperbolic to train, fine-tune, and deploy models at scale.

Our platform has quickly become a favorite among AI researchers, including Andrej Karpathy. We collaborate with teams at Hugging Face, Vercel, Quora, Chatbot Arena, LMSYS, OpenRouter, Black Forest Labs, Stanford, Berkeley, and beyond.

Founded by AI researchers from UC Berkeley and the University of Washington, Hyperbolic is built for the next wave of AI innovation—open, accessible, and developer-first.

Website | X | Discord | LinkedIn | YouTube | GitHub | Documentation