Scaling Small Language Models to Million-Token Contexts
ChatGPT Has a Limited Memory Capacity!
Did you think ChatGPT remembers everything you said to it? This is absolutely not the case for long contexts (over 100k words or 150 images). If you input more than this (uploading a file the length of one or two scientific journal articles will do it), then ChatGPT silently truncates your input to keep it under that limit. This limitation breaks millions of users' expectations: users naturally assume the model has access to the entirety of their input.



Let's say a user uploads a book to the platform and asks the model to summarize it. If the platform silently truncates parts of the book before passing it to the AI model because of context length limits, the model cannot summarize the book correctly and may leave out information crucial to the plot. Worse, the user has no idea that the model never saw the entire context.
Since this is a core technical limitation of the current state-of-the-art transformer-based language model architecture, the same holds for every other commercial language model API currently available, including OpenAI's GPT-4, Google's Gemini, and Microsoft's Copilot.
This constraint significantly limits how much information can be used during text generation.
Why Do We Need an LLM with a Larger Working Memory?
You might ask, "Who needs that much context length anyway?" To which we say, "Everyone!" Cheap, unlimited context length for LLMs enables endless new possibilities that were previously impossible with short-context models. For example, retrieval-augmented generation (RAG), which is already being used to make generative AI models hallucinate less and provide more useful information, can benefit massively from increased context lengths. Many-shot in-context learning (ICL), a technique that adapts the model to a specific task by giving it many examples, finally becomes possible for multi-modal inputs such as images, videos, and a mix of both.
Our Framework Enables Any LLM to Handle Unlimited Working Memory. 🧞
Time is Valuable. Faster Long-context Decoding with Efficient Attention
We introduce our state-of-the-art, training-free, sub-quadratic-cost HiP attention. We challenge the slow quadratic complexity of the attention mechanism with Hierarchically Pruned Attention (HiP), which exploits attention score locality to accelerate pre-trained Transformer models. Simply put, HiP gets rid of redundant computations by cheaply and dynamically finding which parts of the long context are important, relying on a characteristic of natural language: sentences in close proximity share more similarity than distant ones. Using our LLM serving framework, every user can accelerate end-to-end throughput by at least 2x and decoding throughput by up to 5x compared to FlashAttention, with minimal performance degradation in long contexts.
Truly Training-free Long-context. Unlimited Context Length Using Rolling RoPE
Accelerating the inference speed of language models is only half of the story. Existing language models only produce good responses up to the context length of the data they were trained on, which is around 100k words for ChatGPT and around 1 million for Gemini Pro.
This is why we not only accelerate inference but also extend the usable context length by extending the positional embeddings with the rolling RoPE indexing first proposed in StreamingLLM. Unlike StreamingLLM, we apply the rolling RoPE indexing after dynamically selecting relevant context representations. Therefore, the usable context length is extended without discarding any part of the past context, unlike StreamingLLM or KV eviction policies.
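As a minimal sketch of the idea (assuming the pruning step has already picked k keys, and with names that are ours for illustration rather than our actual API), rolling RoPE re-indexes the selected keys by rank so that every rotary position stays inside the pre-trained range:

```python
import torch

def rolling_rope_positions(selected_idx: torch.Tensor) -> torch.Tensor:
    """Re-index the keys chosen by the pruning step so RoPE only ever sees
    positions inside the pre-trained range, no matter how far back the keys
    actually sit in the original sequence. (Illustrative helper, not our API.)

    selected_idx: (k,) absolute indices of the selected keys, sorted ascending.
    Returns:      (k,) relative distances (query minus rolled key position) to
                  feed into the rotary embedding for this decoding step.
    """
    k = selected_idx.numel()
    # Keys get consecutive "rolled" positions 0..k-1 by rank, not by absolute index.
    rolled_key_pos = torch.arange(k, device=selected_idx.device)
    # The current query is placed right after the last selected key.
    rolled_query_pos = k
    return rolled_query_pos - rolled_key_pos  # distances lie in [1, k]
```

For instance, keys originally at absolute positions 12, 40,031, and 918,554 are rotated as if they sat at positions 0, 1, and 2, with the query at position 3, so no rotation angle ever exceeds what the model saw during training.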
Memory is the Real Problem. Smaller KV Caches are the Solution
Extending the context to millions of words is not without downsides: more memory space is needed to store the intermediate results (KV cache) for each inference. However, we get around this problem by using our efficient KV cache offloading scheme.
This allows us to extend the serving context length up to an unprecedented 512k tokens on a single 80GB GPU. We are currently integrating it with our language model serving framework. Our HiP attention and the DeepAuto.ai Model Acceleration framework will soon make it possible to serve extremely long contexts on much cheaper GPUs such as the NVIDIA L40.
We Look Forward to Contributing to the World
All members of our research team are natural-born frontier researchers from KAIST. Not only are we excited to build novel end-to-end LLM infrastructure, but we are also excited to contribute our excellent research results to the community by open-sourcing our core algorithm and framework. We believe that communicating with the research and developer community by sharing our codebase is extremely important for an AI tech venture, because it allows us to test our algorithm and framework in more extreme and varied settings. We value any and all feedback from users and researchers, so feel free to open GitHub issues and pull requests!
Our Long-Context Benchmarks
Latency Evaluation






18.95x Faster Decoding, 12x Longer Context. A dashed line means that SRT (SGLang Runtime) and HiP without KV cache offloading cannot handle contexts longer than 128K on a single GPU. Our HiP Extend (InfiniteHiP) can extend the context length up to the CPU memory limit while providing decoding throughput tens of times higher than the estimated SRT decoding throughput.
Performance Evaluation






Extending the context window without limit improves performance by about 20%. Our method works with any pretrained LLM without further training.
Introducing HiP: The Log-Linear Leverage for Next-Gen Model Serving
The AI world is obsessed with scaling Large Language Models (LLMs). We're cramming them with more parameters, feeding them terabytes of data, and pushing them to reason over longer and longer contexts. This quest for expansive understanding unlocks exciting possibilities — chatbots that devour entire PDFs, retrieval-augmented reasoning engines that sift through colossal knowledge bases, and code assistants that analyze sprawling repositories in the blink of an eye.
But there's a catch: traditional self-attention, the beating heart of these models, scales quadratically with context length ($O(T^2)$). This means serving massive LLMs, especially those designed for long contexts, can become a computational nightmare.
Enter HiP (Hierarchically Pruned) Attention: an innovative technique that turbocharges long-context LLMs, enabling unprecedented efficiency and scale with minimal performance degradation. HiP re-imagines how attention tokens are selected, slashing complexity to $O(T \log T)$ – a logarithmic leap that unleashes the true potential of LLMs grappling with extensive text. And the best part? It's completely training-free and seamlessly integrates with existing serving frameworks. There is no retraining, no specialized hardware, just pure performance gains.
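For a rough sense of scale, the back-of-the-envelope snippet below counts token-pair scores per attention layer under the two complexity classes. Constant factors are ignored, so the numbers are an order-of-magnitude illustration rather than measured speedups.

```python
import math

# Back-of-the-envelope comparison of attention cost per layer (token pairs scored),
# ignoring constant factors; numbers are illustrative, not measured.
for T in (32_000, 128_000, 1_000_000):
    quadratic = T * T
    log_linear = T * math.log2(T)
    print(f"T={T:>9,}  O(T^2)={quadratic:.2e}  O(T log T)={log_linear:.2e}  "
          f"ratio={quadratic / log_linear:,.0f}x")
# At T = 1M, the quadratic term is roughly 50,000x larger than the log-linear one.
```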
Why HiP? Speed That Doesn't Falter When the Text Gets Longer
Blazing-Fast Prefill: HiP dramatically accelerates the initial processing of long contexts, ensuring your LLM is ready to respond in record time. This is achieved through its hierarchical pruning algorithm, which quickly identifies the most relevant tokens without exhaustive comparisons.
Efficient Decoding: Generate text from long-context LLMs with unprecedented speed and responsiveness. HiP's logarithmic complexity ensures that decoding remains efficient even as the generated text grows longer.
Reduced Memory Footprint: HiP intelligently manages memory by offloading less frequently accessed tokens to system DRAM, allowing you to handle massive contexts even on resource-constrained hardware.
Effortless Integration: No need to retrain your models or invest in exotic hardware. HiP is a plug-and-play solution that delivers instant performance gains.
Deconstructing the Complexity: How HiP Works
HiP's secret weapon is a hierarchical pruning algorithm that transforms how we identify the most important tokens in a long sequence. Instead of exhaustively comparing every query token to all $T$ key tokens (an $O(T^2)$ operation), HiP employs a divide-and-conquer strategy that strategically narrows down the search space.
Hierarchical Pruning: A Treasure Map to the Top-k Important Key Tokens



Divide and Conquer: HiP starts by dividing the sequence of $T$ tokens into $2k$ equal-sized segments, where $k$ is the desired number of key tokens. Each segment is represented by its "center token" – a clever trick that exploits the inherent "locality" of attention, where nearby tokens tend to have similar importance.
Approximate Chunk Importance: For each query token, HiP calculates attention scores against only these center tokens. This quick scan pinpoints the important segments worth investigating further. It's like consulting a treasure map – you first identify the promising regions before digging for gold.
Iterative Refinement: HiP then zooms in on the top-$k$ segments (half of them) and divides each again, repeating the process with increasing granularity until each chosen segment collapses to a single token. This iterative refinement, akin to navigating a hierarchical tree, converges on the top-$k$ key tokens in logarithmic time ($\log T$).
[Image: A diagram illustrating the iterative refinement process, with each level zooming in on the most important segments]
The result? A dramatic reduction in complexity from $O(T^2)$ to $O(T \log T)$ without significant performance drop.
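For readers who prefer code, here is a simplified, single-query rendition of the selection loop (assuming the sequence holds at least $2k$ tokens); the production version is a block-wise Triton kernel, and the helper name hip_topk_keys is ours for illustration only.

```python
import torch

def hip_topk_keys(q: torch.Tensor, K: torch.Tensor, k: int) -> torch.Tensor:
    """Simplified, single-query sketch of hierarchical top-k key selection.
    The production version runs block-wise on the GPU as a Triton kernel; this
    loop only mirrors the segment-halving logic described above.

    q: (d,)   query vector
    K: (T, d) key matrix
    k: number of key tokens to keep (assumes T >= 2 * k)
    """
    T = K.shape[0]
    assert T >= 2 * k, "sketch assumes the sequence has at least 2k tokens"
    # Start with 2k equal-sized segments covering the whole sequence.
    bounds = torch.arange(2 * k + 1) * T // (2 * k)
    segments = list(zip(bounds[:-1].tolist(), bounds[1:].tolist()))

    while True:
        # Score every segment through its center ("representative") token only.
        centers = torch.tensor([(s + e) // 2 for s, e in segments], device=K.device)
        scores = K[centers] @ q                               # (num_segments,)
        keep = scores.topk(min(k, len(segments))).indices     # keep the best half
        kept = [segments[i] for i in keep.tolist()]
        if all(e - s <= 1 for s, e in kept):
            # Every surviving segment is a single token: selection is done.
            return torch.tensor(sorted(s for s, _ in kept))
        # Split each surviving segment in two and repeat at finer granularity.
        segments = []
        for s, e in kept:
            m = (s + e) // 2
            segments += [(s, m), (m, e)] if e - s > 1 else [(s, e)]
```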
Attention Locality: The Key to Efficiency
HiP's effectiveness hinges on the principle of "attention locality" – the observation that if a token in a neighborhood scores highly, its neighbors likely do too. Empirically observed in many transformer models, this property allows HiP to efficiently prune the search space by relying on center tokens as reliable proxies for their surrounding neighborhoods.
Taming the Beast with Hardware-Aware Optimizations



HiP doesn't stop at algorithmic ingenuity. It's also engineered to harness the raw power of modern GPUs:
Block-Wise Tiled Computation: HiP's core operations are structured to exploit the parallel matrix-multiplication units of modern GPUs (e.g., Tensor Cores). This ensures maximum throughput and minimizes expensive memory transfers.
OpenAI Triton Implementation: HiP leverages Triton, a high-performance GPU programming framework, to achieve both flexibility and speed. As a result, we achieve a 4.31x end-to-end decoding speedup at 32k context length with minimal performance degradation (Table 5). A simplified, non-Triton reference of the block-sparse computation is sketched below.
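The reference sketch spells out the math that the kernel tiles and fuses on the GPU; the function name, arguments, and block layout are illustrative assumptions rather than our kernel's interface.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, K, V, block_idx, block_size=64):
    """Reference (non-Triton) sketch of block-sparse attention: a query attends
    only to the key/value blocks selected by the pruning step. The production
    kernel is written in Triton and operates on tiles to use Tensor Cores; this
    dense-gather version only illustrates the math.

    q:         (d,) single query vector
    K, V:      (T, d) key / value matrices
    block_idx: (num_blocks,) LongTensor of selected key-block indices
    """
    d = q.shape[-1]
    # Gather the selected blocks and flatten them into one token axis.
    offsets = torch.arange(block_size, device=block_idx.device)
    token_idx = (block_idx[:, None] * block_size + offsets[None, :]).reshape(-1)
    Ks, Vs = K[token_idx], V[token_idx]
    # Standard scaled dot-product attention over the gathered subset only.
    scores = (Ks @ q) / d ** 0.5
    probs = F.softmax(scores, dim=-1)
    return probs @ Vs
```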
Memory Efficiency: Handling Massive Contexts



Long contexts can overwhelm even the most powerful GPUs. HiP tackles this challenge with a two-tiered memory strategy:
On-GPU Cache: Only the most frequently accessed tokens, identified by HiP's top-k estimation, reside in precious GPU memory.
CPU Offloading: Less frequently used tokens are strategically offloaded to system DRAM, ready to be fetched when needed.
[Image: A diagram illustrating the two-level memory hierarchy with GPU cache and CPU offloading]
This dynamic memory management allows HiP to scale to massive context lengths without requiring a fleet of GPUs. On a single RTX 4090, HiP can comfortably handle 64k tokens, and with an A100, it can even reach 512k tokens, all with minimal throughput loss.
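As a rough mental model of the two tiers (an illustration under assumed names, not our actual implementation), the full KV history can live in pinned host memory while only the hot set chosen by the top-k estimator is fetched to the GPU each step:

```python
import torch

class TwoTierKVCache:
    """Illustrative sketch of two-tier KV management: a small GPU working set
    plus the full KV history in pinned CPU memory, fetched on demand."""

    def __init__(self, max_tokens, dim, gpu_budget, device="cuda"):
        # Full KV history in host (CPU) memory, pinned for fast async transfers.
        self.k_cpu = torch.empty(max_tokens, dim, pin_memory=True)
        self.v_cpu = torch.empty(max_tokens, dim, pin_memory=True)
        self.gpu_budget = gpu_budget          # size of the on-GPU working set
        self.device = device
        self.length = 0

    def append(self, k, v):
        # Every generated token's KV pair is written to host memory.
        self.k_cpu[self.length].copy_(k.cpu())
        self.v_cpu[self.length].copy_(v.cpu())
        self.length += 1

    def fetch(self, indices):
        """Bring only the selected (hot) tokens onto the GPU for this step.
        indices: 1-D LongTensor of hot token positions."""
        idx = indices[: self.gpu_budget].cpu()
        k = self.k_cpu[idx].to(self.device, non_blocking=True)
        v = self.v_cpu[idx].to(self.device, non_blocking=True)
        return k, v
```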
HiP: Unleashing the Power of Long-Context LLMs



HiP is more than just an algorithm; it's a complete framework for serving long-context LLMs with unparalleled efficiency:
Hierarchical Pruning: Logarithmic complexity for lightning-fast inference.
Hardware Optimization: Exploiting GPU parallelism for maximum throughput.
Intelligent Offloading: Dynamic memory management for extreme scalability.
The result? A seamless, training-free, and high-performance solution for long-context LLM serving.
Ready to Turbocharge Your Language Model Pipeline?
If your AI ambitions include long-context reasoning, multi-document summarization, or advanced retrieval-augmented inference, HiP is your weapon of choice. It empowers you to:
Scale effortlessly: Handle massive contexts without breaking the bank.
Boost performance: Deliver real-time responsiveness to your users.
Stay ahead of the curve: Embrace the future of long-context AI.
LLM Context Length is Still Limited
Our journey is not yet finished. HiP lets us extend the context length without significant degradation of processing throughput. However, an LLM's context memory is still limited, because performance degrades in long contexts: every LLM is pretrained up to a certain limit, such as 128K tokens for OpenAI's ChatGPT. To tackle this issue and further improve both performance and latency, we propose InfiniteHiP.
Introducing InfiniteHiP: Dynamically Extending Language Model Context Beyond Millions
In the ever-evolving world of artificial intelligence, the ability of large language models (LLMs) to process long contexts has been a game changer. However, pushing context lengths into the millions has been constrained by high memory costs, slow inference speeds, and the inherent limitations of pre-trained models. Enter InfiniteHiP, a revolutionary framework designed to overcome these challenges without additional training.



The Challenge of Long Contexts
Modern LLMs rely on attention mechanisms to establish relationships across input tokens. This computation grows quadratically with context length, leading to inefficiencies. Additionally, pre-trained models struggle to handle sequences longer than their training limit, limiting their utility in real-world applications like summarization, multi-document analysis, or code generation.
InfiniteHiP: A Paradigm Shift
InfiniteHiP introduces a training-free approach that dynamically extends LLM context lengths beyond millions of tokens, addressing three key challenges:
Efficient Long Context Processing: Using a modular, hierarchical token pruning system, InfiniteHiP prunes irrelevant tokens in real time without losing valuable context. This speeds up both the prefill and decoding stages.
Out-of-Length Generalization: InfiniteHiP adjusts positional embeddings, allowing models to generalize beyond their pre-trained context limits.
Memory Optimization: The framework reduces GPU memory usage by offloading the key-value (KV) cache to host memory, enabling long-context processing on standard hardware.
How Does It Work?



The magic lies in InfiniteHiP's threefold innovation:
Hierarchical Token Pruning: Irrelevant tokens are pruned stepwise, dramatically reducing computational overhead while maintaining essential context (a toy sketch of this cascade of pruning stages follows this list).
Paged Block Sparse Attention: Only the most relevant token blocks are processed, achieving efficiency without compromising accuracy.
Unified Memory Management: By dynamically offloading "cold" tokens to host memory, InfiniteHiP bypasses GPU VRAM limitations and can process up to 3 million tokens on a single GPU.
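To make the first two components concrete, the toy sketch below shows how a cascade of pruning stages could whittle candidate key blocks down to a small budget whose indices are then fed to the block-sparse attention step shown earlier; the function name, stage budgets, and shapes are illustrative assumptions, not the InfiniteHiP implementation.

```python
import torch

def staged_block_pruning(q, K, stage_budgets, block_size=64):
    """Hypothetical sketch of stepwise block pruning: a cascade of stages, each
    keeping a shrinking budget of key blocks scored by representative tokens.

    q:             (d,) current query
    K:             (T, d) keys, with T assumed to be a multiple of block_size
    stage_budgets: e.g. [4096, 512, 64] blocks kept after each stage
    """
    num_blocks = K.shape[0] // block_size
    candidates = torch.arange(num_blocks, device=K.device)   # start from every block
    for budget in stage_budgets:
        if len(candidates) <= budget:
            continue
        # Score each surviving block by its center token, as in HiP.
        centers = candidates * block_size + block_size // 2
        scores = K[centers] @ q
        keep = scores.topk(budget).indices
        candidates = candidates[keep]                         # prune to this stage's budget
    return candidates.sort().values                           # block indices for sparse attention
```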
Real-World Impact
InfiniteHiP has been implemented within the SGLang framework, showing up to 18.95x speedups in 1-million-token context decoding. Benchmarked against state-of-the-art methods like FlashAttention and InfLLM, InfiniteHiP consistently outperforms them in speed and scalability while preserving accuracy.
Why It Matters
InfiniteHiP is not just a technical achievement but a step forward in making LLMs more practical for tasks requiring extensive context comprehension. Whether it's long-form summarization, multi-modal data processing, or complex query generation, InfiniteHiP equips LLMs with the tools to excel.
Try It Today!
InfiniteHiP is seamlessly compatible with existing Transformer-based LLMs, offering developers and researchers an easy-to-adopt solution. Join the revolution and explore how InfiniteHiP can redefine the limits of language modeling.
Dive deeper into the technical details
Cite our Paper!
@misc{lee2024_hip_attention,
title={A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention},
author={Heejun Lee and Geon Park and Youngwan Lee and Jaduk Suh and Jina Kim and Wonyoung Jeong and Bumsik Kim and Hyemin Lee and Myeongjae Jeon and Sung Ju Hwang},
year={2024},
eprint={2406.09827},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.09827},
}
Cascading KV Cache: Linear Time Exponential Context Extension
Our previously described HiP attention mechanism lowers the overall attention complexity to $O(T \log T)$; however, when computational resources are limited, it may be desirable to achieve an even lower memory and time complexity of $O(T)$. Sliding window attention, in the form of StreamingLLM, achieves such linear complexity, but it comes with one crucial downside: once a token reaches the end of the sliding window, it is evicted from the cache and can no longer influence future token predictions. Simply put, all memory of the token is erased, so the LLM cannot remember information that falls too far outside the sliding window.
We overcome this obstacle with a KV cache that operates as a collection of sub-windows, each of which accepts tokens at a different rate, extending the range of historical tokens stored in the KV cache. The effect is that less important tokens are evicted in order to keep more important historical tokens in working memory, which effectively extends the memory of a sliding window while maintaining the same linear complexity as StreamingLLM.
Cascading Cache Windows



In StreamingLLM (top, sink attention), a few static tokens are kept at the beginning of the sequence, in addition to a fixed-size sliding window (blue tokens). However, when a new token arrives (purple token), one of the blue tokens must be evicted from the end of the cache to make room for the new purple token.
Our Cascading KV Cache (bottom), in contrast, uses multiple sub-caches, depicted by the different token colors (blue, green, yellow, orange, red). Like StreamingLLM, we maintain a fixed number of initial tokens, but instead of a single exit route from the cache, tokens may also exit between the sub-caches. In effect, this evicts a newer token and preferentially keeps an older one, which is how we extend the working memory of the LLM. Additionally, each sub-cache accepts tokens on a different schedule: we use an exponential schedule in which each cache accepts half of the tokens evicted from the previous one.
Cascading Cache Windows



Our goal is to keep important historical tokens in the cache for as long as possible. To this end, we also implement a dynamic eviction policy that favors keeping tokens in the cache according to their historical attention scores.
Case 1: Cache is not Full
If the cache is not full, we eagerly add evicted tokens to the next sub-cache unconditionally in order to make effective use of the empty space.
Case 2: Cache is full, and next cache is accepting tokens
If the cache is full and the next sub-cache is accepting tokens according to the previously described acceptance schedule, we evict a token and add it to the next cache, which may in turn cascade into another eviction into the third cache, and so on. (This is where the term ‘cascade’ in ‘Cascading KV Cache’ comes from.)
Case 3: Cache is full, the unimportant token is being evicted
Here, we must evict a token from the current cache, but the next sub-cache will not accept tokens according to the predefined acceptance schedule. We compare the token being evicted from the current cache with the newest token in the next cache: if the evicted token has the lower attention score, we discard it and keep the higher-scoring token in the next cache.
Case 4: Cache is full, the important token is being evicted
This is the opposite of case 3: we must evict a token from the current cache, but the evicted token is more important than the newest token in the next cache. In this case, we keep the more important token and discard the newest token in the next cache. The toy sketch below walks through all four cases.
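The following toy, Python-level sketch ties the four cases together; the real implementation stores key/value tensors in circular buffers inside a Triton kernel, and the class name, scoring, and schedule details here are simplifications for illustration.

```python
from collections import deque

class CascadingKVCache:
    """Toy sketch of the cascading eviction logic described above. Each sub-cache
    holds (token_id, attention_score) pairs; sub-cache i accepts incoming tokens
    on an exponential schedule, roughly every 2**i steps."""

    def __init__(self, num_caches=4, cache_size=4):
        self.caches = [deque() for _ in range(num_caches)]
        self.cache_size = cache_size
        self.step = 0

    def add(self, token_id, score):
        """Add a freshly generated token to the first sub-cache."""
        self.step += 1
        self._push(level=0, token=(token_id, score))

    def _push(self, level, token):
        if level >= len(self.caches):
            return                                    # fell past the last sub-cache: forgotten
        cache = self.caches[level]
        if len(cache) < self.cache_size:              # Case 1: room left, accept unconditionally
            cache.append(token)
            return
        cache.append(token)
        evicted = cache.popleft()                     # the oldest token must leave this sub-cache
        if self.step % (2 ** (level + 1)) == 0:       # Case 2: next cache is accepting this step,
            self._push(level + 1, evicted)            #         pass the token down (may cascade further)
            return
        nxt = self.caches[level + 1] if level + 1 < len(self.caches) else None
        if nxt and evicted[1] > nxt[-1][1]:           # Case 4: evicted token outscores the next
            nxt.pop()                                 #         cache's newest token, so it replaces it
            nxt.append(evicted)
        # Case 3: otherwise the less-important evicted token is dropped for good.
```

Because each sub-cache accepts tokens only half as often as the one before it, the span of history covered by a fixed total cache size grows roughly exponentially with the number of sub-caches.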
Strided Batched Prefill Strategy
A key limitation of prior work on linear KV caching is the prefill strategy it uses. StreamingLLM performs prefill one token at a time, which effectively slows prefill down to decoding speed, far slower than the prefill latency FlashAttention achieves at realistic context lengths. Other recent works, such as SnapKV and H2O, only provide linear-time decoding attention while requiring full quadratic attention during the prefill stage. We target both linear-time prefill and decoding by using a batched prefill strategy that performs attention for a large chunk of tokens at once, adds the chunk to the cache, and then performs attention for the next chunk in the sequence.
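The control flow of this strategy can be pictured with the sketch below; `model_attn` and `cache` are placeholder interfaces assumed for illustration (one computes attention for a chunk of queries against the current cache contents, the other folds the chunk's KV pairs into the fixed-size cache), not names from our codebase.

```python
import torch

def chunked_prefill(model_attn, cache, tokens_kv, chunk_size=2048):
    """Sketch of the strided (batched) prefill control flow under assumed
    interfaces; the real kernels are fused, this only shows the loop structure.

    tokens_kv: (T, ...) precomputed per-token KV entries of the prompt
    """
    outputs = []
    for start in range(0, tokens_kv.shape[0], chunk_size):
        chunk = tokens_kv[start:start + chunk_size]
        # Attention for the whole chunk at once, against a bounded-size cache:
        # cost per chunk is O(chunk_size * cache_size), so total prefill is O(T).
        outputs.append(model_attn(chunk, cache))
        # Only then is the chunk folded into the (fixed-size) cascading cache.
        cache.append(chunk)
    return torch.cat(outputs, dim=0)
```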



The effects of this on long contexts are enormous. For context lengths of one million tokens with a total cache size of 16K, we can process a one-million-token prefill in only 15% of the wall-clock time of FlashAttention on an NVIDIA A6000 GPU (subfigure b below)! Additionally, we created a custom Triton kernel for our cache that uses circular buffers to avoid tensor concatenation during caching, which greatly reduces the caching latency as well (subfigure a below).
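As a rough illustration of the circular-buffer idea (this hypothetical class is not our Triton kernel), pre-allocating a fixed-size buffer and overwriting the oldest slot removes the per-step tensor concatenation and reallocation entirely:

```python
import torch

class CircularKVBuffer:
    """Minimal sketch of a circular KV buffer: allocate once, overwrite the
    oldest slot instead of concatenating tensors on every decoding step."""

    def __init__(self, capacity, dim, device="cuda"):
        self.k = torch.zeros(capacity, dim, device=device)
        self.v = torch.zeros(capacity, dim, device=device)
        self.capacity = capacity
        self.count = 0                         # total tokens ever written

    def append(self, k, v):
        slot = self.count % self.capacity      # wrap around: no torch.cat, no realloc
        self.k[slot] = k
        self.v[slot] = v
        self.count += 1

    def valid(self):
        """Return the currently stored keys/values; once the buffer has wrapped,
        chronological order rotates around index count % capacity."""
        n = min(self.count, self.capacity)
        return self.k[:n], self.v[:n]
```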



Memory Extension With a Fixed Cache Size
To showcase the ability to extend memory given a fixed cache size, we evaluated our model against StreamingLLM on passkey retrieval up to 1M tokens with a fixed cache size of 65K and eight cascading sub-caches. The x-axis shows the context length, and the y-axis shows the insertion location of the passkey. As StreamingLLM naively evicts tokens from the 65K window once the cache is full, there is an immediate drop in accuracy at the deepest insertion location when the context length doubles to 131K. At 131K, however, our model still maintains 100% accuracy at the deepest insertion location. Even after four doublings of the context length to 1M tokens, our Cascading KV Cache still outperforms StreamingLLM, indicating that useful memories are still being retrieved from the cache even when the context far exceeds the 65K tokens it can store.



Dive deeper into the technical details
Cite our Paper!
@article{willette2024training,
title={Training-Free Exponential Context Extension via Cascading KV Cache},
author={Willette, Jeffrey and Lee, Heejun and Lee, Youngwan and Jeon, Myeongjae and Hwang, Sung Ju},
journal={arXiv preprint arXiv:2406.17808},
year={2024}
}