Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. This repository hosts its technical report and the open-sourced traces.

More will come - perhaps not very soon, but stay tuned!

🔥 Updates

July 9, 2024: We open-sourced the trace as a JSONL file!
June 27, 2024: We published a series of Chinese blog posts with more discussion on Zhihu: 1, 2, 3, 4.
June 26, 2024: Initial technical report release.

🎉 Overview

Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache.

The core of Mooncake is its KVCache-centric scheduler, which maximizes overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake must cope with highly overloaded scenarios. To mitigate overload, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake achieves up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake’s innovative architecture enables Kimi to handle 75% more requests.
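To make the idea of prediction-based early rejection concrete, here is a minimal sketch. It is not Mooncake's scheduler: the function names, the list of per-request finish times, and the single concurrency limit are all hypothetical simplifications. The point it illustrates is that admission is decided from the decode load predicted for the moment the new request's prefill would finish, rather than from the load at arrival time.

```python
from typing import Sequence


def predicted_decode_load(
    finish_times: Sequence[float], now: float, prefill_seconds: float
) -> int:
    """Count decode requests expected to still be running when prefill ends.

    finish_times holds the (assumed known or estimated) completion time of
    each request currently decoding. This load model is an assumption made
    for illustration only.
    """
    ready_at = now + prefill_seconds
    return sum(1 for t in finish_times if t > ready_at)


def admit(
    finish_times: Sequence[float],
    now: float,
    prefill_seconds: float,
    max_concurrent_decodes: int,
) -> bool:
    """Reject early if the predicted decode load would exceed capacity.

    max_concurrent_decodes is a hypothetical stand-in for whatever limit
    keeps decoding latency within its SLO.
    """
    load_after_prefill = predicted_decode_load(finish_times, now, prefill_seconds)
    return load_after_prefill + 1 <= max_concurrent_decodes


if __name__ == "__main__":
    running = [5.0, 12.0, 20.0]  # estimated finish times of active decodes
    # Three requests are decoding now, but one finishes before the new
    # request's 10-second prefill completes, so a limit of 3 still admits it.
    print(admit(running, now=0.0, prefill_seconds=10.0, max_concurrent_decodes=3))
    print(admit(running, now=0.0, prefill_seconds=10.0, max_concurrent_decodes=2))
```

A check based only on the current load would reject the first case, since three requests are already decoding at arrival time. Rejecting before prefill starts also avoids spending prefill compute on a request that would later be dropped.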

📦 Open Source Trace

{
    "timestamp": 27482,
    "input_length": 6955,
    "output_length": 52,
    "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]
}
{
    "timestamp": 30535,
    "input_length": 6472,
    "output_length": 26,
    "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]
}

The two records above are samples from our trace dataset. Each record contains the request arrival time (timestamp), the number of input tokens (input_length), the number of output tokens (output_length), and the remapped block hashes (hash_ids). To protect our customers' privacy, we applied several mechanisms to remove user-related information while preserving the dataset's utility for simulated evaluation. More details about the trace (e.g., a cache hit ratio of up to 50%) can be found in Section 4 of Version 3 of the paper.
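As an illustration of how the trace might be used in a simulation, the sketch below parses JSONL records and estimates a block-level prefix cache hit ratio. The helper names are hypothetical. It assumes an unbounded cache with no eviction, and it counts a block as a hit only if it belongs to the leading run of hash_ids already seen in earlier requests. A real evaluation would also model cache capacity and eviction.

```python
import json
from typing import Iterable

# The two sample records from above, written one JSON object per line.
SAMPLE = """\
{"timestamp": 27482, "input_length": 6955, "output_length": 52, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]}
{"timestamp": 30535, "input_length": 6472, "output_length": 26, "hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2366]}
"""


def parse_trace(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL text, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def prefix_hit_ratio(requests: list[dict]) -> float:
    """Fraction of blocks served from cache, assuming an unbounded cache.

    For each request (in arrival order), only the leading blocks whose
    hash IDs were seen before count as hits; matching stops at the first
    miss because a prefix cache cannot reuse blocks after a divergence.
    """
    seen: set[int] = set()
    hits = 0
    total = 0
    for req in sorted(requests, key=lambda r: r["timestamp"]):
        block_ids = req["hash_ids"]
        total += len(block_ids)
        for block_id in block_ids:
            if block_id not in seen:
                break
            hits += 1
        seen.update(block_ids)
    return hits / total if total else 0.0


if __name__ == "__main__":
    requests = parse_trace(SAMPLE.splitlines())
    print(f"requests: {len(requests)}")
    print(f"prefix hit ratio: {prefix_hit_ratio(requests):.3f}")
```

To run this against the full dataset, pass an open file handle for mooncake_trace.jsonl to parse_trace instead of the sample string.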

📑 Citation

Please cite our paper if you find the paper or the trace useful:

@article{qin2024mooncake,
  title        = {Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving},
  author       = {Ruoyu Qin and Zheming Li and Weiran He and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu},
  year         = {2024},
  url          = {https://arxiv.org/abs/2407.00079}
}
