# Migration Guide: vLLM → sageLLM

**Date:** 2026-01-13 · **Status:** Active · **Applies to:** SAGE v0.3.0+
## Overview
SAGE is migrating from vLLM (via `VLLMGenerator`) to sageLLM (via `SageLLMGenerator`) as the default LLM inference engine. This migration provides:

- **Unified Backend Abstraction**: a single API for multiple hardware backends (CUDA, Ascend, CPU)
- **Simplified Configuration**: consistent parameters across all backends
- **Better Testing Support**: a built-in mock backend for unit tests
- **Hardware Portability**: seamless deployment across different accelerators (see the sketch below)
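For example, moving the same pipeline between accelerators is a one-parameter change (a minimal sketch using only the `SageLLMGenerator` API shown later in this guide):

```python
from sage.middleware.operators.llm import SageLLMGenerator

# Only backend_type changes between deployments; the pipeline code is identical.
backend = "cuda"  # "ascend" on Huawei NPUs, "mock" in CI

generator = SageLLMGenerator(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    backend_type=backend,
)
result = generator.execute("Hello, sageLLM")
```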
## Migration Timeline
| Phase | Status | Description |
|---|---|---|
| Phase 1 | ✅ Complete | `SageLLMGenerator` available, vLLM still the default |
| Phase 2 | 🚧 Current | sageLLM is the default, vLLM deprecated with warnings |
| Phase 3 | ⏳ v0.4.0 | `VLLMGenerator` removed, use sageLLM only |
## Quick Migration
### Before (vLLM)
```python
from sage.middleware.operators.llm import VLLMGenerator

generator = VLLMGenerator(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    base_url="http://localhost:8001/v1",
    temperature=0.7,
    max_tokens=2048,
)
result = generator.execute("写一首诗")  # "Write a poem"
```
### After (sageLLM) ✅ Recommended
```python
from sage.middleware.operators.llm import SageLLMGenerator

generator = SageLLMGenerator(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    backend_type="auto",  # or "cuda", "ascend", "mock"
    temperature=0.7,
    max_tokens=2048,
)
result = generator.execute("写一首诗")  # "Write a poem"
```
## Configuration Parameter Mapping
### Core Parameters
| vLLM (`VLLMGenerator`) | sageLLM (`SageLLMGenerator`) | Notes |
|---|---|---|
| `model_name` | `model_path` | HuggingFace model ID or local path |
| `base_url` | (removed) | sageLLM manages the engine internally |
| `api_key` | (removed) | Not needed for local inference |
| `temperature` | `temperature` | Same (default: 0.7) |
| `max_tokens` | `max_tokens` | Same (default: 2048) |
| `top_p` | `top_p` | Same (default: 0.95) |
| (N/A) | `top_k` | New: top-k sampling (default: 50) |
| (N/A) | `backend_type` | New: `"auto"` / `"cuda"` / `"ascend"` / `"mock"` |
| (N/A) | `device_map` | New: `"auto"` / `"cuda:0"` / `"cpu"` |
| (N/A) | `dtype` | New: `"auto"` / `"float16"` / `"bfloat16"` |
### New Parameters in sageLLM
| Parameter | Type | Default | Description |
|---|---|---|---|
| `backend_type` | str | `"auto"` | Engine backend selection |
| `model_path` | str | `""` | Model path or HuggingFace ID |
| `device_map` | str | `"auto"` | Device mapping strategy |
| `dtype` | str | `"auto"` | Data type for inference |
| `max_tokens` | int | `2048` | Maximum generation tokens |
| `temperature` | float | `0.7` | Sampling temperature |
| `top_p` | float | `0.95` | Nucleus sampling parameter |
| `top_k` | int | `50` | Top-k sampling parameter |
| `engine_id` | str | `""` | Custom engine identifier |
| `timeout` | float | `120.0` | Request timeout (seconds) |
| `default_options` | dict | `{}` | Default generation options |
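Putting these together, a construction that exercises the new parameters might look like this (a sketch; the `default_options` key shown is an illustrative assumption, not a documented option name):

```python
from sage.middleware.operators.llm import SageLLMGenerator

generator = SageLLMGenerator(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    backend_type="auto",
    device_map="auto",
    dtype="bfloat16",
    max_tokens=1024,
    temperature=0.2,
    top_p=0.9,
    top_k=40,
    engine_id="rag-primary",              # custom identifier for this engine
    timeout=60.0,                         # seconds per request
    default_options={"stop": ["\n\n"]},   # assumed option key, for illustration
)
```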
### `backend_type` Values
| Value | Description | Use Case |
|---|---|---|
| `"auto"` | Auto-detect best available backend | Recommended for production |
| `"mock"` | Mock backend (no GPU required) | Unit tests, CI/CD |
| `"cuda"` | NVIDIA CUDA backend | NVIDIA GPU deployment |
| `"ascend"` | Huawei Ascend NPU backend | Ascend hardware deployment |
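A common deployment pattern is to force the mock backend on CI runners while keeping auto-detection everywhere else (a sketch; the `CI` environment variable is a convention set by most CI systems):

```python
import os

from sage.middleware.operators.llm import SageLLMGenerator

# Use the mock backend on CI runners without accelerators,
# and auto-detection everywhere else.
backend_type = "mock" if os.environ.get("CI") else "auto"

generator = SageLLMGenerator(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    backend_type=backend_type,
)
```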
### `engine_type` Values (for Operators)

When configuring operators such as `PlanningOperator`, `TimingOperator`, or `ToolSelectionOperator`:
| Value | Generator | Status |
|---|---|---|
| `"sagellm"` | `SageLLMGenerator` | ✅ Default, recommended |
| `"vllm"` | `VLLMGenerator` | ⚠️ Deprecated |
| `"openai"` | `OpenAIGenerator` | ✅ For OpenAI-compatible APIs |
| `"hf"` | `HFGenerator` | ✅ For HuggingFace pipelines |
## Migration Examples
### 1. Basic Generator Migration
```python
# ❌ OLD (deprecated)
from sage.middleware.operators.llm import VLLMGenerator

generator = VLLMGenerator(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    base_url="http://localhost:8001/v1",
)

# ✅ NEW (recommended)
from sage.middleware.operators.llm import SageLLMGenerator

generator = SageLLMGenerator(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    backend_type="auto",
)
```
### 2. RAG Pipeline Migration
```python
# ❌ OLD (deprecated)
from sage.middleware.operators.rag import SageLLMRAGGenerator

rag = SageLLMRAGGenerator(engine_type="vllm")

# ✅ NEW (recommended)
rag = SageLLMRAGGenerator(
    engine_type="sagellm",
    backend_type="auto",
)
```
### 3. Agentic Operator Migration
```python
# ❌ OLD (deprecated)
from sage.middleware.operators.agentic import PlanningOperator

op = PlanningOperator(config={
    "engine_type": "vllm",
    "model_name": "Qwen/Qwen2.5-7B-Instruct",
    "base_url": "http://localhost:8001/v1",
})

# ✅ NEW (recommended)
op = PlanningOperator(config={
    "engine_type": "sagellm",
    "backend_type": "auto",
    "model_path": "Qwen/Qwen2.5-7B-Instruct",
})
```
### 4. Unit Testing with Mock Backend
```python
# ✅ Use the mock backend for tests (no GPU required)
from sage.middleware.operators.llm import SageLLMGenerator

generator = SageLLMGenerator(
    backend_type="mock",
    model_path="mock-model",
)

# Works without an actual model or GPU
result = generator.execute("Test prompt")
assert result is not None
```
### 5. CLI Commands Migration
```bash
# ❌ OLD (deprecated)
sage llm model list --engine vllm

# ✅ NEW (recommended)
sage llm model list --engine sagellm
sage llm model list  # sagellm is the default
```
## Deprecation Warnings

When using the deprecated `vllm` engine, you will see warnings like:
```text
DeprecationWarning: engine_type='vllm' is deprecated.
Use engine_type='sagellm' with backend_type='cuda' instead.
This will be removed in SAGE v0.4.0.
```
To suppress warnings during migration:
```python
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning, module="sage.middleware")
```
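To limit suppression to specific call sites instead of the whole process, the standard library's `catch_warnings` context manager restores the filter state on exit:

```python
import warnings

with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore", category=DeprecationWarning, module="sage.middleware"
    )
    legacy = build_legacy_pipeline()  # hypothetical not-yet-migrated code path
```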
## FAQ

### Q1: Do I need to run a separate vLLM server?
A: No. Unlike `VLLMGenerator`, which requires an external vLLM server, `SageLLMGenerator` manages the inference engine internally through the `EngineFactory`. Just specify the model path and backend type.
### Q2: How do I choose the right `backend_type`?

A: Use `"auto"` (the default) for most cases. It automatically selects the best available backend (sketched below):

- If an NVIDIA GPU is available → uses `cuda`
- If an Ascend NPU is available → uses `ascend`
- For testing without hardware → use `"mock"` explicitly
### Q3: Can I still use OpenAI-compatible APIs?

A: Yes. For OpenAI-compatible endpoints (including DashScope and local vLLM servers), use `OpenAIGenerator`:
```python
from sage.middleware.operators.rag import OpenAIGenerator

generator = OpenAIGenerator(
    model_name="qwen-turbo",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    api_key="sk-xxx",
)
```
### Q4: What if I need `VLLMGenerator` features not in sageLLM?

A: File an issue on GitHub. Most features should be available through `SageLLMGenerator` with appropriate configuration. For edge cases, `VLLMGenerator` will remain available (with deprecation warnings) until v0.4.0.
### Q5: How do I migrate tests that depend on vLLM?

A: Use the mock backend for unit tests:

```python
# Replace any engine_type="vllm" with:
generator = SageLLMGenerator(
    backend_type="mock",
    model_path="test-model",
)
```
This eliminates the need for actual GPU hardware in CI/CD pipelines.
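For example, as a pytest fixture (a sketch assuming pytest; the assertion mirrors Example 4 above):

```python
import pytest

from sage.middleware.operators.llm import SageLLMGenerator

@pytest.fixture
def generator():
    # Mock backend: no model download, no GPU required.
    return SageLLMGenerator(backend_type="mock", model_path="test-model")

def test_execute_returns_output(generator):
    result = generator.execute("Test prompt")
    assert result is not None
```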
### Q6: Is there a performance difference?
A: `SageLLMGenerator` with `backend_type="cuda"` uses the same underlying inference optimizations. For most workloads, performance should be equivalent or better due to improved memory management.
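To verify on your own workload, a rough wall-clock measurement is enough (a sketch; results depend on model, hardware, and prompt length):

```python
import time

from sage.middleware.operators.llm import SageLLMGenerator

generator = SageLLMGenerator(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    backend_type="cuda",
)

start = time.perf_counter()
generator.execute("Summarize the SAGE migration guide in one sentence.")
print(f"latency: {time.perf_counter() - start:.2f}s")
```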
### Q7: How do I handle environment variables?

A: The environment variables map as follows:
| Old | New | Description |
|---|---|---|
| `VLLM_MODEL` | `SAGELLM_MODEL_PATH` | Default model path |
| `VLLM_BASE_URL` | (removed) | Not needed |
| - | `SAGELLM_BACKEND_TYPE` | Default backend |
| - | `SAGELLM_DEVICE_MAP` | Default device map |
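The new variables can feed the constructor directly (a sketch; the fallback values shown are the documented defaults from this guide):

```python
import os

from sage.middleware.operators.llm import SageLLMGenerator

generator = SageLLMGenerator(
    model_path=os.environ.get("SAGELLM_MODEL_PATH", ""),
    backend_type=os.environ.get("SAGELLM_BACKEND_TYPE", "auto"),
    device_map=os.environ.get("SAGELLM_DEVICE_MAP", "auto"),
)
```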
## Troubleshooting

### Error: "No backend available"
Solution: Install the required backend or use mock mode:
```bash
# For CUDA
pip install sagellm-backend[cuda]
```

```python
# Or use the mock backend for testing
generator = SageLLMGenerator(backend_type="mock")
```
### Error: "Model not found"
Solution: Ensure the model is downloaded or use a valid HuggingFace ID:

```python
generator = SageLLMGenerator(
    model_path="Qwen/Qwen2.5-7B-Instruct",  # HuggingFace ID
    # or point to a local path instead:
    # model_path="/path/to/local/model",
)
```
### Error: "CUDA out of memory"
Solution: Adjust the device mapping or reduce the precision:

```python
generator = SageLLMGenerator(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    dtype="float16",    # use half precision
    device_map="auto",  # let the backend manage GPU memory
)
```
## Related Documentation
- ISAGELLM Migration - sage-llm-core/gateway to isagellm migration
- API Reference: SageLLMGenerator
- Control Plane Architecture
## Support
- GitHub Issues: intellistream/SAGE/issues
- Discussions: intellistream/SAGE/discussions