Dynamo supports multi-tier storage management and KV cache, and storage vendors have recently been advertising support for Dynamo-style technology. So what exactly is Dynamo, and how do you get started?
This article covers Dynamo's architecture, installation, basic usage, distributed serving, and the development workflow. The hands-on content follows the official Dynamo documentation on GitHub.
I. What is NVIDIA Dynamo?
Dynamo is NVIDIA's open-source, high-performance inference framework for multi-node, multi-GPU deployment of large language model (LLM) inference:
Goal: overcome the memory and compute limits of a single GPU, so that multiple GPUs across multiple nodes cooperate on inference like one large accelerator.
Features:
- High throughput and low latency
- Support for multiple LLM inference engines (e.g., TRT-LLM, vLLM, SGLang)
- Efficient KV cache management and routing
- Dynamic GPU resource scheduling
II. Dynamo Architecture and Key Features
Disaggregated prefill & decode
Splits inference into separate prefill and decode stages so that GPU resources are used more fully, raising throughput.
Dynamic GPU scheduling
Automatically allocates and reclaims GPU resources as request load changes.
Smart KV cache routing
Routes requests to workers that already hold the relevant cache, avoiding recomputation and improving cache utilization.
Accelerated data transfer (NIXL)
Uses an efficient communication layer to cut inter-node latency.
Tiered KV cache offloading
Supports multiple storage tiers (e.g., GPU memory, host memory), extending the model sizes and context lengths that can be served.
Multi-language implementation
Rust for performance, Python for extensibility.
III. Installation and Environment Setup
Recommended environment: Ubuntu 24.04 on x86_64, with the NVIDIA GPU driver and CUDA already installed.
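Before installing, it is worth confirming the driver and CUDA toolchain are actually visible; a quick check with the standard NVIDIA tools (nothing Dynamo-specific):
nvidia-smi # should list your GPUs plus the driver and CUDA versions
nvcc --version # if the CUDA toolkit is installed, prints the compiler version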
1. Install system dependencies
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
2. Create a Python virtual environment
python3 -m venv venv
source venv/bin/activate
3. Install Dynamo and its dependencies
pip install "ai-dynamo[all]"
4. Using Dynamo
After step 3, the dynamo command is available:
dynamo -h
Its help output:
Usage: dynamo [OPTIONS] COMMAND [ARGS]...
The Dynamo CLI is a CLI for serving, containerizing, and deploying Dynamo applications. It takes inspiration from and leverages core pieces of the BentoML deployment stack.
At a high level, you use `serve` to run a set of dynamo services locally, `build` and `containerize` to package them up for deployment, and then `cloud` and `deploy` to deploy them to a K8s cluster
running the Dynamo Cloud
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --version -v Show the application version and exit. │
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the installation. │
│ --help -h Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ env Display information about the current environment. │
│ serve Locally serve a Dynamo pipeline. │
│ run Execute dynamo-run with any additional arguments. │
│ deploy Deploy a Dynamo pipeline (same as deployment create). │
│ build Packages Dynamo service for deployment. Optionally builds a docker container. │
│ get Display Dynamo pipeline details. │
│ deployment Deploy Dynamo applications to Dynamo Cloud Kubernetes Platform │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
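The subcommands map onto the workflow the help describes. For example, dynamo env (listed above) prints details about the current environment, which is a useful first check before serving anything:
dynamo env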
IV. Running Locally and Interacting
1. Direct inference with dynamo run (I did not get this working)
The official tutorial uses the vLLM backend and a HuggingFace model:
dynamo run out=vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Model files are downloaded automatically from HuggingFace; the default cache path is ~/.cache/huggingface:
..el-00002-of-000002.safetensors [00:44:58] [████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████] 6.89 GiB/6.89 GiB 3.41 MiB/s (0s)
model.safetensors.index.json [00:00:02] [█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████] 23.67 KiB/23.67 KiB 15.62 KiB/s (0s)
tokenizer.json [00:00:04] [██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████] 8.66 MiB/8.66 MiB 3.38 MiB/s (0s)
tokenizer_config.json [00:00:01] [███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████] 3.00 KiB/3.00 KiB 6.09 KiB/s (0s)
2025-07-01T10:59:54.703Z INFO dynamo_run::input::common: Waiting for remote model..
It sat at this point for a long time with no progress. I am not sure which configuration I got wrong, but fortunately the distributed serving setup below did work...
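One thing that can be done separately is the download step: the weights can be pre-fetched with huggingface_hub's standard CLI into the same ~/.cache/huggingface cache, so a later dynamo run skips the download entirely. This is generic HuggingFace tooling, not Dynamo-specific, and the /data/hf-cache path below is just an example:
pip install -U huggingface_hub
export HF_HOME=/data/hf-cache # optional: relocate the cache (huggingface_hub honors HF_HOME)
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-8B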
2. Launching the distributed inference service
Start the monitoring/metrics services:
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
docker compose -f deploy/metrics/docker-compose.yml up -d
This pulls and starts the base containers:
[+] Running 4/4
✔ etcd-server Pulled 4.7s
✔ nats-server Pulled 12.5s
✔ 6c51dc8c9584 Pull complete 4.8s
✔ ea3c38ba4a87 Pull complete
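Before moving on, you can confirm the infrastructure containers are healthy. A quick check, assuming the compose file maps etcd's client port (2379) and the NATS monitoring port (8222) to the host at their defaults:
docker compose -f deploy/metrics/docker-compose.yml ps
curl -s localhost:2379/health # etcd should report {"health":"true"}
curl -s localhost:8222/healthz # NATS monitoring endpoint should return ok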
Start the inference service:
./container/build.sh --framework vllm # build the image
./container/run.sh -it --framework vllm # run the image; pass -h to list options, e.g. for changing mount paths
# Run the following inside the container
cd examples/llm
vim configs/agg.yaml # edit the config: set model to the corresponding local path
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
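Once the graph reports it is ready, a quick sanity probe before sending a full request (this assumes the frontend exposes the usual OpenAI-compatible model listing on the same port 8000 used below):
curl -s localhost:8000/v1/models | jq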
API request example:
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"stream": false,
"max_tokens": 300
}' | jq
response:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1122 100 949 100 173 328 59 0:00:02 0:00:02 --:--:-- 387
{
"id": "fbf93e67-1b37-4c1e-ae3a-2d2e8c1d26de",
"choices": [
{
"index": 0,
"message": {
"content": "Alright, the user greeted me with \"Hello, how are you?\" and I responded by saying I'm just a program, so I don't have feelings. Now, I need to figure out what they might be asking or need help with. Maybe they're looking for assistance with something specific, or they just wanted to start a conversation. I should keep the response friendly and open, encouraging them to ask anything they need help with. I'll make sure to keep it natural and not too robotic.\n\n\nI'm just a program, so I don't have feelings, but thanks for asking! How can I assist you today?",
"refusal": null,
"tool_calls": null,
"role": "assistant",
"function_call": null,
"audio": null
},
"finish_reason": "stop",
"logprobs": null
}
],
"created": 1751426570,
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"service_tier": null,
"system_fingerprint": null,
"object": "chat.completion",
"usage": null
The logs show the token generation rate. I only used a single card, which presumably explains the modest rate of about 40 tokens/s:
2025-07-02T03:30:24.146Z INFO worker.generate: [VllmWorker:1] Prefilling locally for request 7377f732-6e38-4435-a26a-783b57f304b4 with length 193
2025-07-02T03:30:24.147Z INFO engine._handle_process_request: Added request 7377f732-6e38-4435-a26a-783b57f304b4.
2025-07-02T03:30:27.525Z INFO metrics.log: Avg prompt throughput: 38.5 tokens/s, Avg generation throughput: 30.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
2025-07-02T03:30:27.525Z INFO metrics.log: Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%
2025-07-02T03:30:32.536Z INFO metrics.log: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
2025-07-02T03:30:32.536Z INFO metrics.log: Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%
2025-07-02T03:30:37.539Z INFO metrics.log: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
2025-07-02T03:30:37.539Z INFO metrics.log: Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%
All in all, deploying Dynamo is fairly straightforward. If you have the GPUs, give it a try, and feel free to share your experience!