Deployment Guide¶
Deploy SAGE applications and the sageLLM service stack (LLM / Embedding / Gateway) in a variety of environments.
Quick Start: sage llm serve¶
The sage CLI ships with one-command start/stop of the LLM services, suitable for development and small-scale deployments:
# Start the default model (LLM)
sage llm serve
# Explicitly specify the model and port
sage llm serve \
--model Qwen/Qwen2.5-7B-Instruct \
--port 8901
# Start LLM + Embedding together
sage llm serve --with-embedding \
--model Qwen/Qwen2.5-7B-Instruct \
--embedding-model BAAI/bge-m3
# Check status / follow logs / stop
sage llm status
sage llm logs --follow
sage llm stop
Internally, sage llm serve always resolves ports through SagePorts, so never hardcode port numbers in your code. The relevant ports are listed below:
💡 Use sage llm model list-remote to browse the officially recommended models, and combine it with sage llm model download to pre-warm the cache.
| Constant | Port | Purpose |
|---|---|---|
| `SagePorts.GATEWAY_DEFAULT` | 8000 | OpenAI-compatible Gateway |
| `SagePorts.LLM_DEFAULT` | 8001 | vLLM inference service |
| `SagePorts.BENCHMARK_LLM` | 8901 | WSL2 / benchmark fallback |
| `SagePorts.EMBEDDING_DEFAULT` | 8090 | Embedding service |
| `SagePorts.STUDIO_BACKEND` | 8080 | Studio backend |
| `SagePorts.STUDIO_FRONTEND` | 5173 | Studio frontend |
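As a minimal sketch, prefer the constants (and helpers such as `get_recommended_llm_port()`, mentioned in the troubleshooting notes below) over literal port numbers; the import path here is an assumption, so adjust it to wherever SagePorts lives in your installation:

```python
# Assumed import path -- adjust to match your SAGE installation.
from sage.common.config.ports import SagePorts

# Build service URLs from the shared constants instead of hardcoding 8001/8090.
llm_port = SagePorts.get_recommended_llm_port()
llm_base_url = f"http://localhost:{llm_port}/v1"
embedding_base_url = f"http://localhost:{SagePorts.EMBEDDING_DEFAULT}/v1"
```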
Dynamic Engine Management¶
The Control Plane can start and stop inference engines dynamically at runtime and automatically tracks GPU memory:
Engine Commands¶
# List the currently running engines
sage llm engine list
# Start an LLM engine (supports tensor/pipeline parallelism)
sage llm engine start Qwen/Qwen2.5-7B-Instruct --tensor-parallel 2
# Start an Embedding engine
sage llm engine start BAAI/bge-m3 --engine-kind embedding --port 8095
# Stop a specific engine
sage llm engine stop <engine_id>
# Check GPU status
sage llm gpu
Preset Orchestration¶
Use presets to deploy a combination of engines in one step instead of starting each one manually:
# List the built-in presets
sage llm preset list
# Show the details of a preset
sage llm preset show -n qwen-mini-with-embeddings
# Apply a preset (add --dry-run to preview)
sage llm preset apply -n qwen-mini-with-embeddings
# Use a custom YAML file
sage llm preset apply --file ./my-preset.yaml
Example preset YAML:
version: 1
name: qwen-mini-with-embeddings
engines:
  - name: chat
    kind: llm
    model: Qwen/Qwen2.5-1.5B-Instruct
    tensor_parallel: 1
    label: chat-qwen15b
  - name: embed
    kind: embedding
    model: BAAI/bge-small-zh-v1.5
    label: embedding-bge
Deploy Individual Services¶
1. LLM Service (vLLM)¶
SAGE_MODEL="Qwen/Qwen2.5-7B-Instruct"
# Use sage llm serve (recommended)
sage llm serve --model "$SAGE_MODEL" --port 8901
# Health check
curl http://localhost:8901/v1/models
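Once the service is up, you can send a quick test request; this sketch assumes the standard OpenAI-compatible chat completions route exposed by vLLM:

```bash
# Minimal smoke test against the OpenAI-compatible chat endpoint
curl http://localhost:8901/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```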
2. Embedding Service¶
# Start alongside the LLM via sage llm serve
sage llm serve --with-embedding --embedding-model BAAI/bge-m3 --embedding-port 8090
# Or start the Embedding service on its own
python -m sage.common.components.sage_embedding.embedding_server \
--model BAAI/bge-m3 \
--port 8090
# Health check
curl http://localhost:8090/v1/models
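For an end-to-end check, request a single embedding from the /v1/embeddings endpoint (the same route referenced in the troubleshooting notes below); this assumes the standard OpenAI-compatible request format:

```bash
# Request one embedding to verify the service end to end
curl http://localhost:8090/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "BAAI/bge-m3", "input": "hello world"}'
```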
3. Using the Client¶
from sage.llm import UnifiedInferenceClient
# Auto-detect local services (recommended)
client = UnifiedInferenceClient.create()
# Or explicitly configure a connection to a specific service
client = UnifiedInferenceClient.create(
    control_plane_url="http://localhost:8901/v1",
    default_llm_model="Qwen/Qwen2.5-7B-Instruct",
)
Deployment Options¶
1. Local Development¶
For development and testing:
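A minimal sketch, mirroring the environment variables used in the other modes:

```bash
# Run everything in-process on your machine
export SAGE_EXECUTION_MODE=local
python my_app.py
```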
Best for: Development, testing, small-scale experiments
2. Single Server Deployment¶
Deploy on a single machine with multiple workers:
# Use Ray for distributed execution on single machine
export SAGE_EXECUTION_MODE=ray
export RAY_NUM_CPUS=8
python my_app.py
Best for: Medium-scale workloads, production with limited resources
3. Distributed Cluster¶
Deploy across multiple machines:
# On head node
ray start --head --port=6379
# On worker nodes
ray start --address='head_node_ip:6379'
# Run application
export SAGE_EXECUTION_MODE=distributed
python my_app.py
Best for: Large-scale production workloads
4. Kubernetes Deployment¶
Deploy SAGE on Kubernetes:
# sage-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sage-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sage
  template:
    metadata:
      labels:
        app: sage
    spec:
      containers:
        - name: sage
          image: sage:latest
          env:
            - name: SAGE_EXECUTION_MODE
              value: "distributed"
            - name: RAY_ADDRESS
              value: "ray-head:6379"
Best for: Cloud-native deployments, auto-scaling
5. Docker Container¶
Containerize your SAGE application:
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install SAGE
COPY . /app
RUN pip install -e .
# Run application
CMD ["python", "my_app.py"]
Build and run:
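A minimal sketch of the build and run commands (adjust the image tag and environment to your setup):

```bash
docker build -t sage:latest .
docker run --rm -e SAGE_EXECUTION_MODE=local sage:latest
```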
Best for: Reproducible deployments, CI/CD
Configuration¶
Environment Variables¶
# Execution mode
export SAGE_EXECUTION_MODE=local|ray|distributed
# API Keys
export OPENAI_API_KEY=sk-...
export JINA_API_KEY=jina_...
# Ray configuration
export RAY_ADDRESS=localhost:6379
export RAY_NUM_CPUS=8
export RAY_NUM_GPUS=1
# sageLLM stack
export SAGE_CHAT_BASE_URL=http://localhost:8901/v1
export SAGE_EMBEDDING_BASE_URL=http://localhost:8090/v1
export SAGE_UNIFIED_BASE_URL=http://localhost:8000/v1 # Gateway
export SAGE_CHAT_MODEL=Qwen/Qwen2.5-7B-Instruct
export SAGE_EMBEDDING_MODEL=BAAI/bge-m3
# Logging
export SAGE_LOG_LEVEL=INFO
export SAGE_LOG_DIR=./logs
# Performance
export SAGE_MAX_WORKERS=16
export SAGE_BATCH_SIZE=32
Configuration Files¶
Create a .env file:
# .env
SAGE_EXECUTION_MODE=distributed
OPENAI_API_KEY=sk-...
RAY_ADDRESS=ray-cluster:6379
SAGE_LOG_LEVEL=INFO
Load in your application:
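One common approach is the python-dotenv package; this is a sketch under that assumption (SAGE may also pick up the variables through its own configuration loading):

```python
import os

from dotenv import load_dotenv  # requires the python-dotenv package

load_dotenv()  # reads .env from the current working directory into os.environ
print(os.getenv("SAGE_EXECUTION_MODE"))  # -> "distributed"
```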
Production Considerations¶
1. Monitoring¶
Monitor SAGE applications:
from sage.kernel.api.local_environment import LocalStreamEnvironment
env = LocalStreamEnvironment(
    "production_app",
    config={"monitoring": {"enabled": True, "metrics_port": 9090, "log_level": "INFO"}},
)
2. Fault Tolerance¶
Enable checkpointing:
env = LocalStreamEnvironment(
    "fault_tolerant_app",
    config={
        "fault_tolerance": {
            "strategy": "checkpoint",
            "checkpoint_interval": 60.0,
            "checkpoint_dir": "/data/checkpoints",
        }
    },
)
3. Resource Management¶
Configure resources:
env = LocalStreamEnvironment(
    "resource_managed_app",
    config={
        "resources": {"max_workers": 16, "memory_limit": "32GB", "gpu_enabled": True}
    },
)
4. Security¶
Secure API keys and credentials:
# Use environment variables
import os
api_key = os.getenv("OPENAI_API_KEY")
# Or use secret management
from sage.common.config import SecretManager
secrets = SecretManager()
api_key = secrets.get("openai_api_key")
Scaling¶
Horizontal Scaling¶
Add more worker nodes:
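A sketch using the same Ray commands as the distributed setup above:

```bash
# On each new worker node, join the existing Ray cluster
ray start --address='head_node_ip:6379'
# Tasks scheduled by Ray afterwards can use the additional node
```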
Vertical Scaling¶
Increase resources per worker:
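A sketch using the resource-related environment variables from the configuration section (values are illustrative):

```bash
# Give each worker more CPU/GPU and raise the SAGE worker ceiling
export RAY_NUM_CPUS=16
export RAY_NUM_GPUS=2
export SAGE_MAX_WORKERS=32
python my_app.py
```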
Cloud Platforms¶
AWS¶
Deploy on AWS using ECS or EKS:
# AWS ECS task definition
{
  "family": "sage-app",
  "containerDefinitions": [
    {
      "name": "sage",
      "image": "sage:latest",
      "memory": 8192,
      "cpu": 4096,
      "environment": [
        {"name": "SAGE_EXECUTION_MODE", "value": "distributed"}
      ]
    }
  ]
}
Google Cloud Platform¶
Deploy on GKE:
gcloud container clusters create sage-cluster \
--num-nodes=3 \
--machine-type=n1-standard-4
kubectl apply -f sage-deployment.yaml
Azure¶
Deploy on AKS:
az aks create \
--resource-group sage-rg \
--name sage-cluster \
--node-count 3 \
--node-vm-size Standard_D4s_v3
kubectl apply -f sage-deployment.yaml
Performance Optimization¶
1. Batch Processing¶
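Larger batches amortize per-request overhead at the cost of memory. As a starting point (values are illustrative), tune the batch size exposed through the environment:

```bash
export SAGE_BATCH_SIZE=64   # the configuration example above uses 32
```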
2. Parallel Execution¶
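Run operators in parallel by switching to the Ray execution mode and raising the worker count (illustrative values, using the variables from the configuration section):

```bash
export SAGE_EXECUTION_MODE=ray
export SAGE_MAX_WORKERS=16
```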
3. GPU Acceleration¶
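Expose GPUs to Ray and, for the LLM service, spread a model across GPUs with tensor parallelism (a sketch based on the settings shown earlier):

```bash
export RAY_NUM_GPUS=1
sage llm engine start Qwen/Qwen2.5-7B-Instruct --tensor-parallel 2
```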
Troubleshooting¶
Common Service Stack Issues¶
- LLM port starts but cannot be reached (especially under WSL2): use `SagePorts.get_recommended_llm_port()` or `sage llm serve --port 8901`.
- Embedding requests return 404: confirm that `sage llm status` reports the service as running, and use the `/v1/embeddings` endpoint.
- Gateway returns 502: the Gateway cannot reach the downstream LLM; check that the `--llm-port` argument is correct.
- Slow model downloads: set `HF_ENDPOINT=https://hf-mirror.com` to use a mirror in mainland China.
Common Issues¶
Ray cluster not connecting:
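A few checks worth running first (a sketch; adapt the address to your cluster):

```bash
# Is the head node up and reachable?
ray status
# Does the client point at the right head node?
echo $RAY_ADDRESS
```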
Out of memory:
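Reducing batch size and worker count is usually the first lever (illustrative values):

```bash
export SAGE_BATCH_SIZE=8
export SAGE_MAX_WORKERS=4
```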
Slow performance:
# Enable profiling
config = {"profiling": {"enabled": True}}
# Check bottlenecks
env.get_profiler().print_report()