The Road to Awakening

Dynamo supports multi-tier storage management and KV cache, and storage vendors have recently been advertising compatibility with Dynamo-like technology. So what exactly is Dynamo, and how do you get started?

This post covers Dynamo's architecture, installation, basic usage, distributed serving, and development workflow. The hands-on steps follow the official Dynamo documentation on GitHub.

I. What is NVIDIA Dynamo?

Dynamo is NVIDIA's open-source, high-performance inference framework for multi-node, multi-GPU deployment of large language models (LLMs):

Goal: overcome single-GPU memory and compute bottlenecks so that multiple GPUs and nodes cooperate on inference like one large accelerator.

Features:

- High throughput and low latency
- Support for mainstream LLM inference engines (e.g., TRT-LLM, vLLM, SGLang)
- Efficient KV cache management and routing
- Dynamic GPU resource scheduling

II. Architecture and Key Features

Disaggregated Prefill & Decode

The prefill and decode phases are split apart at inference time so each can make full use of GPU resources, improving throughput.

Dynamic GPU Scheduling

GPU resources are automatically allocated and reclaimed as request load changes.

Smart KV Cache Routing

Requests are routed to workers that already hold matching KV cache, avoiding redundant computation and improving cache utilization (see the sketch below).
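To make the idea concrete, here is a minimal Python sketch of prefix-aware routing: the prompt is hashed in fixed-size token blocks with chained hashes, and the request goes to the worker holding the longest matching cached prefix. The block size, hashing scheme, and data structures are illustrative assumptions of mine, not Dynamo's actual implementation.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative choice)

def block_hashes(token_ids):
    """Chain-hash the prompt in fixed-size blocks, so each hash
    identifies the entire prefix ending at that block."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        prev = hashlib.sha256(prev + str(token_ids[i:i + BLOCK_SIZE]).encode()).digest()
        hashes.append(prev)
    return hashes

def pick_worker(prompt_token_ids, worker_caches):
    """worker_caches maps worker_id -> set of block hashes that worker holds.
    Return the worker with the longest cached prefix of this prompt."""
    prompt = block_hashes(prompt_token_ids)

    def cached_prefix_len(cache):
        n = 0
        for h in prompt:
            if h not in cache:
                break  # prefix match ends at the first miss
            n += 1
        return n

    return max(worker_caches, key=lambda w: cached_prefix_len(worker_caches[w]))
```

A request whose prompt shares a long prefix with earlier traffic thus lands on the worker that can skip recomputing that prefix entirely.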

Accelerated Data Transfer (NIXL)

An efficient communication layer reduces inter-node transfer latency.

Tiered KV Cache Offloading

Multiple storage tiers (e.g., GPU memory, host memory) are supported, extending the model sizes and context lengths that can be served; a toy sketch follows below.
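Purely as a conceptual illustration (Dynamo's real offloading is far more involved), a two-tier cache can be sketched as an LRU structure that demotes blocks from a small "GPU" tier to a larger "CPU" tier, and promotes them back on access. All names and capacities here are invented for the sketch:

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: a small 'GPU' tier backed by a larger 'CPU' tier.
    Capacities are counted in blocks, not bytes."""

    def __init__(self, gpu_capacity=4, cpu_capacity=16):
        self.gpu = OrderedDict()  # block_id -> kv data, most recent last
        self.cpu = OrderedDict()
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    def put(self, block_id, kv):
        self.gpu[block_id] = kv
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            demoted_id, demoted_kv = self.gpu.popitem(last=False)  # evict LRU
            self.cpu[demoted_id] = demoted_kv                      # demote to host tier
            while len(self.cpu) > self.cpu_capacity:
                self.cpu.popitem(last=False)  # drop entirely when CPU tier is full

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:
            kv = self.cpu.pop(block_id)
            self.put(block_id, kv)  # promote back to the GPU tier
            return kv
        return None  # miss on both tiers: the block must be recomputed
```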

Multi-Language Implementation

Rust for performance, Python for extensibility.

III. Installation and Environment Setup

Recommended environment: Ubuntu 24.04 on x86_64, with the NVIDIA GPU driver and CUDA already installed.

1. Install system dependencies

sudo apt-get update

sudo DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0

2. Create a Python virtual environment

python3 -m venv venv

source venv/bin/activate

3. Install Dynamo and its dependencies

pip install "ai-dynamo[all]"

4. Use the dynamo CLI

After step 3, the dynamo command is available:

dynamo -h

Its help output:

Usage: dynamo [OPTIONS] COMMAND [ARGS]...

The Dynamo CLI is a CLI for serving, containerizing, and deploying Dynamo applications. It takes inspiration from and leverages core pieces of the BentoML deployment stack.

At a high level, you use `serve` to run a set of dynamo services locally, `build` and `containerize` to package them up for deployment, and then `cloud` and `deploy` to deploy them to a K8s cluster running the Dynamo Cloud

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮

│ --version -v Show the application version and exit. │

│ --install-completion Install completion for the current shell. │

│ --show-completion Show completion for the current shell, to copy it or customize the installation. │

│ --help -h Show this message and exit. │

╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮

│ env Display information about the current environment. │

│ serve Locally serve a Dynamo pipeline. │

│ run Execute dynamo-run with any additional arguments. │

│ deploy Deploy a Dynamo pipeline (same as deployment create). │

│ build Packages Dynamo service for deployment. Optionally builds a docker container. │

│ get Display Dynamo pipeline details. │

│ deployment Deploy Dynamo applications to Dynamo Cloud Kubernetes Platform │

╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

IV. Running and Interacting Locally

1. Direct inference with dynamo run (I could not get this working)

The official tutorial uses the vLLM backend with a Hugging Face model as an example:

dynamo run out=vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B

The model files are downloaded automatically from Hugging Face; the default cache path is ~/.cache/huggingface:

..el-00002-of-000002.safetensors [00:44:58] [████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████] 6.89 GiB/6.89 GiB 3.41 MiB/s (0s)

model.safetensors.index.json [00:00:02] [█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████] 23.67 KiB/23.67 KiB 15.62 KiB/s (0s)

tokenizer.json [00:00:04] [██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████] 8.66 MiB/8.66 MiB 3.38 MiB/s (0s)

tokenizer_config.json [00:00:01] [███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████] 3.00 KiB/3.00 KiB 6.09 KiB/s (0s)

2025-07-01T10:59:54.703Z INFO dynamo_run::input::common: Waiting for remote model..

It hung here for a long time with no response. I am not sure which configuration was wrong, but fortunately the distributed serving setup below worked…
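Incidentally, if the download itself is the slow part, you can pre-fetch the model into the Hugging Face cache with the huggingface_hub package (a standard Hugging Face API, independent of Dynamo), so that later runs start without network stalls:

```python
from huggingface_hub import snapshot_download

# Download (or resume) all files for the model into the default
# ~/.cache/huggingface cache before invoking dynamo run.
snapshot_download("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
```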

2. Start the distributed inference service

Start the monitoring/metrics services

git clone https://github.com/ai-dynamo/dynamo.git

cd dynamo

docker compose -f deploy/metrics/docker-compose.yml up -d

This pulls and starts the following base containers:

[+] Running 4/4

✔ etcd-server Pulled 4.7s

✔ nats-server Pulled 12.5s

✔ 6c51dc8c9584 Pull complete 4.8s

✔ ea3c38ba4a87 Pull complete

Start the inference service

./container/build.sh --framework vllm # build the container image

./container/run.sh -it --framework vllm # run the container; pass -h to list options, e.g. for changing mount paths

# inside the container:

cd examples/llm

vim configs/agg.yaml # edit the config: set model to your local model path

dynamo serve graphs.agg:Frontend -f configs/agg.yaml
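The serve command takes a while as workers warm up. A small readiness check can tell you when the API is up; this sketch assumes the frontend exposes the usual OpenAI-style /v1/models route on port 8000, which is an assumption on my part rather than documented behavior:

```python
import time
import requests

BASE_URL = "http://localhost:8000"  # Dynamo frontend started by dynamo serve

for attempt in range(60):
    try:
        # /v1/models is assumed here because the frontend mirrors the OpenAI API.
        resp = requests.get(f"{BASE_URL}/v1/models", timeout=2)
        if resp.ok:
            print("Frontend is up:", resp.json())
            break
    except requests.RequestException:
        pass  # server not listening yet
    time.sleep(5)
else:
    raise SystemExit("Frontend never became ready")
```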

Example API request

curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "messages": [{"role": "user", "content": "Hello, how are you?"}],
  "stream": false,
  "max_tokens": 300
}' | jq
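The same request via the openai Python client, pointed at the local frontend; the api_key is a placeholder, which local OpenAI-compatible servers typically ignore:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Dynamo frontend.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=300,
)
print(resp.choices[0].message.content)
```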

Response:


{
  "id": "fbf93e67-1b37-4c1e-ae3a-2d2e8c1d26de",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "Alright, the user greeted me with \"Hello, how are you?\" and I responded by saying I'm just a program, so I don't have feelings. Now, I need to figure out what they might be asking or need help with. Maybe they're looking for assistance with something specific, or they just wanted to start a conversation. I should keep the response friendly and open, encouraging them to ask anything they need help with. I'll make sure to keep it natural and not too robotic.\n\n\nI'm just a program, so I don't have feelings, but thanks for asking! How can I assist you today?",
        "refusal": null,
        "tool_calls": null,
        "role": "assistant",
        "function_call": null,
        "audio": null
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "created": 1751426570,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "service_tier": null,
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": null
}

The logs also show the token generation rate. I only used a single GPU, which presumably explains the modest rate of roughly 40 tokens/s:

2025-07-02T03:30:24.146Z INFO worker.generate: [VllmWorker:1] Prefilling locally for request 7377f732-6e38-4435-a26a-783b57f304b4 with length 193

2025-07-02T03:30:24.147Z INFO engine._handle_process_request: Added request 7377f732-6e38-4435-a26a-783b57f304b4.

2025-07-02T03:30:27.525Z INFO metrics.log: Avg prompt throughput: 38.5 tokens/s, Avg generation throughput: 30.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.

2025-07-02T03:30:27.525Z INFO metrics.log: Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%

2025-07-02T03:30:32.536Z INFO metrics.log: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.

2025-07-02T03:30:32.536Z INFO metrics.log: Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%

2025-07-02T03:30:37.539Z INFO metrics.log: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.

2025-07-02T03:30:37.539Z INFO metrics.log: Prefix cache hit rate: GPU: 80.00%, CPU: 0.00%

All in all, deploying Dynamo is fairly straightforward. If you have GPUs, give it a try, and let's compare notes!