Mooncake's design philosophy can be summarized as "Trading More Storage for Less Computation" [15]. It was born, and has been battle-tested, in a hyper-scale production environment that runs under sustained overload (the Kimi assistant service). Its foremost design goal is therefore to maximize effective throughput while strictly honoring SLOs [14].
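To make the "storage for computation" trade concrete, the sketch below shows the bookkeeping behind prefix KV reuse: block-aligned token prefixes are hashed and indexed so a later request that shares a prefix skips that portion of prefill entirely. This is a toy Python illustration of the general idea only; the class, block size, and hashing scheme are simplifications of ours, not Mooncake's actual data structures or transfer engine.

```python
import hashlib

BLOCK = 4  # tokens per cache block; real systems use larger, page-sized blocks

class PrefixKVCache:
    """Toy prefix cache mapping hashed token prefixes to stored KV blobs."""

    def __init__(self):
        self._store = {}  # prefix hash -> opaque KV blob

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(str(tokens).encode()).hexdigest()

    def put(self, tokens, kv_blob):
        # Index every block-aligned prefix so later requests can reuse partial matches.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            self._store[self._key(tokens[:end])] = kv_blob  # real systems store per-block KV

    def longest_cached_prefix(self, tokens):
        """Length (in tokens) of the longest block-aligned prefix already cached."""
        hit = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if self._key(tokens[:end]) in self._store:
                hit = end
            else:
                break
        return hit


def prefill(tokens, cache):
    """Prefill that recomputes only the uncached suffix of the prompt."""
    reused = cache.longest_cached_prefix(tokens)
    # The attention cost over the reused prefix is skipped: storage (the
    # cached KV) is spent to avoid that computation.
    computed = tokens[reused:]
    cache.put(tokens, kv_blob="<kv tensors>")  # placeholder for real KV data
    return reused, len(computed)


if __name__ == "__main__":
    cache = PrefixKVCache()
    system_prompt = list(range(8))  # shared prefix, e.g. a system prompt
    print(prefill(system_prompt + [101, 102, 103, 104], cache))  # (0, 12): cold start
    print(prefill(system_prompt + [201, 202, 203, 204], cache))  # (8, 4): prefix reused
```

The second request recomputes only 4 of its 12 tokens; at production scale, that saved prefill compute is exactly what the extra storage tier buys.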
Dynamo: an emerging platform. Although young, it enjoys NVIDIA's full backing, and NVIDIA has an unmatched track record of building successful ecosystems (e.g., CUDA, Triton). Dynamo's integration with NVIDIA AI Enterprise and NIMs points to strong enterprise-market prospects [23]. While no production users of Dynamo itself have been announced yet, its predecessor Triton is already widely used by Amazon, Microsoft, Snap, and other large companies [22].
Evaluation strategy: any team considering these technologies should begin with a proof of concept (PoC) on its own specific workloads. The evaluation should focus not merely on raw throughput, but on performance under concrete SLOs (e.g., time to first token, TTFT, and time between tokens, TBT) for the team's common use cases (e.g., long-context RAG versus short interactive chat). Tools such as GenAI-Perf, mentioned in the Dynamo documentation, can be used to build realistic benchmarks [11]; a minimal measurement sketch follows this paragraph. The final choice should not be about which technology is "best" in a vacuum, but about which is "most suitable" for the specific production workload and organizational context.
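As a concrete starting point for such a PoC, below is a minimal Python sketch that measures TTFT and TBT against an OpenAI-compatible streaming chat endpoint (the API that vLLM- and Dynamo-based deployments typically expose). The URL, model name, and prompt are placeholders for your own deployment, and this is a single-request probe, not a load generator; note that TBT is measured here as inter-chunk time, which matches per-token time only when the server emits one token per chunk.

```python
import json
import time

import requests

# Placeholders: point these at your own OpenAI-compatible deployment.
URL = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "your-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize KV cache offloading."}],
    "max_tokens": 256,
    "stream": True,
}

def measure_once():
    """Return (TTFT in seconds, list of inter-token gaps) for one streamed request."""
    t0 = time.perf_counter()
    ttft, gaps, last = None, [], None
    with requests.post(URL, json=PAYLOAD, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue  # skip SSE keep-alives and non-data lines
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            if not chunk.get("choices"):
                continue  # e.g. trailing usage-only chunk
            if not chunk["choices"][0]["delta"].get("content"):
                continue  # skip role-only / empty deltas
            now = time.perf_counter()
            if ttft is None:
                ttft = now - t0          # time to first generated token
            elif last is not None:
                gaps.append(now - last)  # time between successive tokens (TBT)
            last = now
    return ttft, gaps

if __name__ == "__main__":
    ttft, gaps = measure_once()
    gaps.sort()
    p99 = gaps[int(0.99 * (len(gaps) - 1))] if gaps else float("nan")
    # Judge these against your SLO targets rather than raw throughput alone.
    print(f"TTFT: {ttft * 1000:.1f} ms, p99 TBT: {p99 * 1000:.1f} ms")
```

For sustained load across concurrency levels and input/output-length distributions, GenAI-Perf automates this kind of measurement and reports TTFT and inter-token-latency percentiles.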
References
What Is an AI Factory? | Supermicro, accessed September 6, 2025, https://www.supermicro.com/en/glossary/ai-factory
What is an AI Factory? | NVIDIA Glossary, accessed September 6, 2025, https://www.nvidia.com/en-us/glossary/ai-factory/
What Is an AI Factory? - F5, accessed September 6, 2025, https://www.f5.com/company/blog/defining-an-ai-factory
LLM Inference Series: 4. KV caching, a deeper look | by Pierre Lienhart | Medium, accessed September 6, 2025, https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8
Accelerate Large-Scale LLM Inference and KV Cache Offload with CPU-GPU Memory Sharing | NVIDIA Technical Blog, accessed September 6, 2025, https://developer.nvidia.com/blog/accelerate-large-scale-llm-inference-and-kv-cache-offload-with-cpu-gpu-memory-sharing/
Huawei Launches AI Inference Technology (UCM) to Address ..., accessed September 6, 2025, https://en.eeworld.com.cn/mp/AIxintianxia/a405991.jspx
KV Cache Offloading - When is it Beneficial? - NetApp Community, accessed September 6, 2025, https://community.netapp.com/t5/Tech-ONTAP-Blogs/KV-Cache-Offloading-When-is-it-Beneficial/ba-p/462900
A Survey on Large Language Model Acceleration based on KV Cache Management - OpenReview, accessed September 6, 2025, https://openreview.net/pdf?id=z3JZzu9EA3
ai-dynamo/dynamo: A Datacenter Scale Distributed ... - GitHub, accessed September 6, 2025, https://github.com/ai-dynamo/dynamo
KV Caching in LLMs, explained visually - Daily Dose of Data Science, accessed September 6, 2025, https://www.dailydoseofds.com/p/kv-caching-in-llms-explained-visually/
Research team from Department of Computer Science and Technology wins Best Paper Award at FAST 2025, accessed September 6, 2025, https://www.cs.tsinghua.edu.cn/csen/info/1084/4580.htm
Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot - USENIX, accessed September 6, 2025, https://www.usenix.org/system/files/fast25-qin.pdf
On August 12, Huawei officially released UCM, an innovative AI inference technology (a KV Cache-centric inference acceleration suite) - Webull, accessed September 6, 2025, https://www.webull.com/news/13321810914796544
Huawei preps AI SSD to ease GPU memory bottlenecks - Blocks and Files, accessed September 6, 2025, https://blocksandfiles.com/2025/08/26/huawei-ai-ssd/
Inside Huawei's Breakthrough in AI Software - AI Magazine, accessed September 6, 2025, https://aimagazine.com/news/how-huaweis-ucm-software-boosts-ai-memory-efficiency
Huawei launches UCM algorithm as reliable alternative to HBM chips, accessed September 6, 2025, https://www.huaweicentral.com/huawei-launches-ucm-algorithm-as-reliable-alternative-to-hbm-chips/
Huawei Unveils UCM: The Gambit Raises Stakes In The AI Chip War — And Could Complicate Nvidia's China Deals - Tekedia, accessed September 6, 2025, https://www.tekedia.com/huawei-unveils-ucm-the-gambit-raises-stakes-in-the-ai-chip-war-and-could-complicate-nvidias-china-deals/
[News] Huawei Unveils UCM Algorithm to Cut HBM Reliance, Reportedly Goes Open-Source in September - TrendForce, accessed September 6, 2025, https://www.trendforce.com/news/2025/08/13/news-huawei-unveils-ucm-algorithm-to-cut-hbm-reliance-reportedly-goes-open-source-in-september/
NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models | NVIDIA Technical Blog, accessed September 6, 2025, https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/
Diving into Nvidia Dynamo: AI Inference at Scale - Gradient Flow, accessed September 6, 2025, https://gradientflow.com/ai-inference-nvidia-dynamo-ray-serve/
NVIDIA Dynamo: The Future Of High-Speed AI Inference - All Tech Magazine, accessed September 6, 2025, https://alltechmagazine.com/nvidia-dynamo-the-future-of-high-speed-ai-inference/
How to configure the KV Cache Manager to connect to external storage? #423 - GitHub, accessed September 6, 2025, https://github.com/ai-dynamo/dynamo/issues/423
[Paper Review] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving - Moonlight, accessed September 6, 2025, https://www.themoonlight.io/de/review/mooncake-a-kvcache-centric-disaggregated-architecture-for-llm-serving
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving - arXiv, accessed September 6, 2025, https://arxiv.org/pdf/2407.00079
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving - arXiv, accessed September 6, 2025, https://arxiv.org/html/2407.00079v2
Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot | USENIX, accessed September 6, 2025, https://www.usenix.org/conference/fast25/presentation/qin
LMCache/LMCache: Supercharge Your LLM with the ... - GitHub, accessed September 6, 2025, https://github.com/LMCache/LMCache
raw.githubusercontent.com, accessed September 6, 2025, https://raw.githubusercontent.com/LMCache/LMCache/dev/README.md
Dynamo Inference Framework - NVIDIA Developer, accessed September 6, 2025, https://developer.nvidia.com/dynamo
About Us | LMCache blog website, accessed September 6, 2025, https://blog.lmcache.ai/aboutme/
How Huawei's UCM Software Boosts AI Memory Efficiency | Telco Magazine, accessed September 6, 2025, https://telcomagazine.com/news/how-huaweis-ucm-software-boosts-ai-memory-efficiency
Releases · LMCache/LMCache - GitHub, accessed September 6, 2025, https://github.com/LMCache/LMCache/releases
LMCache/lmcache-tests - GitHub, accessed September 6, 2025, https://github.com/LMCache/lmcache-tests
arXiv:2504.03775v1 [cs.DC] 3 Apr 2025, accessed September 6, 2025, https://www.arxiv.org/pdf/2504.03775
Huawei launches UCM algorithm as reliable alternative to HBM chips : r/Sino - Reddit, accessed September 6, 2025, https://www.reddit.com/r/Sino/comments/1momwxc/huawei_launches_ucm_algorithm_as_reliable/
Moonshot AI · GitHub, accessed September 6, 2025, https://github.com/moonshotai