SLO-Aware KV Cache Management Work Plan

Gap Snapshot (2025-12-02)
sageLLM Control Plane (`packages/sage-llm-core/src/sage/llm/control_plane/`)

- No `memory_manager/` package yet; `ControlPlaneManager` only tracks queues and GPU reservations without per-request KV/embedding memory budgets. `RequestMetadata` and `ExecutionInstance` lack fields for predicted KV cache size, active embedding workspace, or the SLO slack needed for eviction heuristics.
- Missing predictors for output length and embedding load despite references in the research proposal; `load_predictor.py` only handles aggregate request counts.
- No admission control hook in `_scheduling_loop`; requests are scheduled solely by policy outputs without checking available memory.
- No quota allocation or per-instance budgeting to partition memory between LLM and embedding jobs; `gpu_manager.py` exposes total reservations but not workload-aware shares.
- Eviction relies entirely on backend (vLLM) defaults; there is no SLO-aware block manager wrapper or telemetry about evictions in `metrics_collector.py`.
- Monitoring endpoints (`metrics_collector`, `monitoring`) expose throughput/latency only; they do not publish KV cache pressure, predicted vs. actual memory, or preemption counts.
Benchmark Control Plane (`packages/sage-benchmark/src/sage/benchmark/benchmark_control_plane`)

- Mixed workloads exist, but the workload generators (`workload.py`, `hybrid_scheduler/workload.py`) cannot synthesize deterministic KV pressure (e.g., long decode bursts plus large embedding batches) or annotate expected memory demand.
- Metrics collectors lack GPU memory attribution per request type; `gpu_monitor.py` captures coarse device stats without correlating them against LLM vs. embedding queue states.
- No experiment tracks SLO attainment under memory stress; existing experiments (throughput, latency, mixed ratio) lack knobs for quota policies or admission controllers.
- CLI/runner plumbing has no notion of memory-focused policies (e.g., toggling eviction strategies) and cannot surface KV eviction counts from Control Plane telemetry.
- There is no automated regression harness to validate new memory manager modules or ensure prompts/forecasters stay calibrated.
Sequential Task Groups & Prompts

Groups must be completed in order; tasks within the same group can run in parallel. Task IDs are unique and are referenced by the prompts.
Group 1 – Telemetry & Predictors (establish data plumbing)

T1 – Memory Telemetry Surfaces

- Goal: Extend `RequestMetadata`, `ExecutionInstance`, and the monitoring paths so every request and instance reports predicted vs. actual KV/embedding memory, SLO slack, and eviction counters.
- Key Files: `control_plane/types.py`, `control_plane/monitoring.py`, `control_plane/metrics_collector.py`, `control_plane/executors/*`.
- Deliverable: Structured dataclasses plus serialization hooks that expose `kv_cache_tokens`, `embedding_workspace_mb`, `slo_slack_ms`, and `eviction_events`.
- Prompt (T1): Implement per-request and per-instance memory telemetry across the sageLLM Control Plane. Add explicit fields to `RequestMetadata` and `ExecutionInstance` for predicted/actual KV cache size, embedding workspace consumption, and SLO slack. Ensure `MetricsCollector` and the `monitoring` endpoints expose these metrics, and that executors populate them when requests start and finish. Update docstrings and tests accordingly.
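The telemetry fields T1 calls for could take roughly this shape; this is a minimal sketch, and everything beyond the four deliverable field names (the class name, the helper methods) is illustrative rather than the actual sageLLM API:

```python
from dataclasses import dataclass

# Hypothetical per-request telemetry record; only the four exported metric
# names come from the task deliverable, the rest is an assumption.
@dataclass
class RequestMemoryTelemetry:
    request_id: str
    kv_cache_tokens_predicted: int = 0
    kv_cache_tokens_actual: int = 0
    embedding_workspace_mb: float = 0.0
    slo_slack_ms: float = 0.0
    eviction_events: int = 0

    def prediction_error(self) -> int:
        """Signed gap between observed and predicted KV footprint."""
        return self.kv_cache_tokens_actual - self.kv_cache_tokens_predicted

    def to_metrics(self) -> dict:
        """Flatten into the key/value form a metrics collector can export."""
        return {
            "kv_cache_tokens": self.kv_cache_tokens_actual,
            "embedding_workspace_mb": self.embedding_workspace_mb,
            "slo_slack_ms": self.slo_slack_ms,
            "eviction_events": self.eviction_events,
        }
```

Keeping predicted and actual values side by side in one record makes calibration drift (T2's concern) directly observable from the telemetry stream.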
T2 – Predictive Signal Providers

- Goal: Add dedicated predictors for output length and embedding batch memory under `control_plane/predictors/` and expose them through a registry.
- Key Files: new `predictors/output_length_predictor.py` and `predictors/embedding_load_forecaster.py`, updates to `load_predictor.py` and the `ControlPlaneManager` constructor.
- Deliverable: Interfaces that return confidence intervals for KV size and embedding reservations, plus unit tests with synthetic traces.
- Prompt (T2): Create predictor modules that estimate (a) LLM output length / KV cache footprint from prompt features and (b) embedding batch memory from queue state. Place them under `control_plane/predictors/`, provide a pluggable registry, expose calibration hooks, and write unit tests covering constant/moving-average baselines and mock ML models. Wire the predictors into the Control Plane so future modules can request forecasts.
Group 2 – Core Memory Manager (admission, quotas, eviction)

T3 – Embedding-Aware Admission Controller

- Goal: Build `memory_manager/admission_controller.py` implementing Algorithm 1, invoking the predictors plus telemetry before enqueuing LLM jobs.
- Key Files: new directory `control_plane/memory_manager/`, integration in `ControlPlaneManager.submit_request`/`_scheduling_loop`.
- Deliverable: Configurable controller with defer/reject/preempt decisions, safety margins, and instrumentation hooks.
- Prompt (T3): Introduce `memory_manager/admission_controller.py`, which consumes predictor outputs and live telemetry to decide ADMIT/DEFER/REJECT/PREEMPT for incoming requests. Embed policy parameters (safety ratios, defer thresholds) in a dataclass, expose async hooks for the Control Plane, and log structured reasons for tracing. Cover edge cases (unknown output length, predictor failure) with fallbacks.
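The decision core of such a controller can be sketched as below. This is not Algorithm 1 from the proposal, just a plausible shape: the thresholds are placeholders, the PREEMPT branch is omitted for brevity, and the predictor-failure fallback is an assumption:

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    ADMIT = "admit"
    DEFER = "defer"
    REJECT = "reject"

# Policy parameters live in a dataclass, as the prompt requires; the
# default values are illustrative.
@dataclass
class AdmissionPolicy:
    safety_ratio: float = 0.9     # never commit more than 90% of free memory
    defer_threshold: float = 0.7  # defer once predicted usage passes 70%

def decide(policy: AdmissionPolicy, free_mb: float, predicted_mb) -> Decision:
    """Map a memory prediction onto an admission decision."""
    if predicted_mb is None:
        # Predictor failure: fall back to a conservative fixed reservation.
        predicted_mb = free_mb * 0.25
    if predicted_mb > free_mb * policy.safety_ratio:
        return Decision.REJECT
    if predicted_mb > free_mb * policy.defer_threshold:
        return Decision.DEFER
    return Decision.ADMIT
```

A DEFER band below the hard REJECT line gives the scheduler a chance to retry once quota frees up, instead of bouncing borderline requests outright.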
T4 – SLO-Driven Quota Allocator

- Goal: Implement Algorithm 2 inside `memory_manager/quota_allocator.py` to rebalance LLM vs. embedding budgets per instance and system-wide.
- Key Files: new allocator module, updates to `ControlPlaneManager`, `ExecutionInstance`, and `autoscaler.py`.
- Deliverable: Smooth quota adjustments with urgency scoring, and integration tests verifying stability.
- Prompt (T4): Add a quota allocator that periodically recomputes LLM and embedding memory shares using SLO urgency scores. Provide APIs to query and set per-instance budgets, ensure adjustments obey min/max limits, and persist recent history to damp oscillations. Wire it into the scheduling loop and autoscaler so load spikes trigger quota shifts without thrashing.
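One way to get the "smooth adjustments without thrashing" behavior is exponential smoothing toward an urgency-weighted target, clamped to min/max shares. This is a sketch under those assumptions, not Algorithm 2 itself; all constants are illustrative:

```python
# Minimal quota-rebalancing sketch: the LLM share moves a fraction `alpha`
# of the way toward the urgency-weighted target each period, then is
# clamped, so a single load spike cannot flip the whole budget at once.
def rebalance(total_mb: float, llm_urgency: float, emb_urgency: float,
              prev_llm_share: float, alpha: float = 0.3,
              min_share: float = 0.2, max_share: float = 0.8) -> float:
    """Return the new LLM memory budget in MB (embedding gets the rest)."""
    urgency_sum = llm_urgency + emb_urgency
    target = llm_urgency / urgency_sum if urgency_sum > 0 else 0.5
    # Smooth toward the target rather than jumping, then clamp to limits.
    share = (1 - alpha) * prev_llm_share + alpha * target
    share = min(max(share, min_share), max_share)
    return share * total_mb
```

The smoothing factor plays the role of the "recent history" the prompt asks to persist; an integration test for stability would assert the share converges monotonically under a constant urgency ratio.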
T5 – SLO-Aware KV Eviction & Block Manager Hook

- Goal: Wrap vLLM block management with an SLO-aware eviction policy (Algorithm 3) and expose eviction telemetry.
- Key Files: new `memory_manager/eviction_policy.py`, an adapter class (e.g., `SLOAwareBlockManager`), and updates under `service.py` or the executor layer.
- Deliverable: Eviction selection logic plus integration tests using mocked block allocations.
- Prompt (T5): Implement an SLO-aware KV eviction component that scores active requests by slack, priority, and recompute cost, selects victims greedily, and instructs the vLLM block manager wrapper to evict their blocks. Ensure the wrapper surfaces eviction events to telemetry, protects near-completion requests, and falls back cleanly if eviction cannot free enough memory.
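The greedy victim-selection step could look like the sketch below. The scoring weights, the 90% near-completion cutoff, and the record shape are all assumptions standing in for Algorithm 3; the real policy would tune these against measured recompute costs:

```python
# Illustrative victim selection for SLO-aware KV eviction: requests with
# more slack, lower priority, and cheaper recompute are evicted first.
def select_victims(requests, needed_blocks: int):
    """requests: dicts with slack_ms, priority, recompute_cost, blocks,
    progress (0..1). Returns (ids to evict, blocks freed), greedily."""
    def score(r):
        if r["progress"] > 0.9:          # near-completion protection
            return float("-inf")
        return r["slack_ms"] - 10.0 * r["priority"] - r["recompute_cost"]

    victims, freed = [], 0
    for r in sorted(requests, key=score, reverse=True):
        if freed >= needed_blocks or score(r) == float("-inf"):
            break
        victims.append(r["id"])
        freed += r["blocks"]
    return victims, freed
```

Returning `freed` alongside the victim list lets the block-manager wrapper detect the "cannot free enough memory" case and trigger its fallback path.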
Group 3 – Control Plane & Policy Integration

T6 – Control Plane Wiring & APIs

- Goal: Thread the new memory manager through `ControlPlaneManager`: lifecycle management, configuration, background loops, and REST/CLI toggles.
- Key Files: `control_plane/manager.py`, `control_plane_service.py`, config schemas, CLI glue.
- Deliverable: Configurable feature flags plus admin APIs to inspect memory state and override policies.
- Prompt (T6): Integrate the admission controller, quota allocator, and eviction policy into `ControlPlaneManager`. Add configuration options (env + JSON) to enable or disable each module, expose health/metrics endpoints for memory state, and ensure the manager starts and stops background tasks cleanly. Update the client/service layers so users can set memory-policy parameters via presets.
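The "env + JSON" toggle surface might look like the following sketch; the environment variable names and field names are hypothetical placeholders, not an existing schema:

```python
import json
import os
from dataclasses import dataclass

# Hypothetical feature-flag config for the three memory-manager modules;
# SAGE_MEMORY_CONFIG and SAGE_EVICTION_ENABLED are invented names.
@dataclass
class MemoryManagerConfig:
    admission_enabled: bool = True
    quota_enabled: bool = True
    eviction_enabled: bool = False

    @classmethod
    def from_env(cls, env=None) -> "MemoryManagerConfig":
        env = env if env is not None else os.environ
        cfg = cls()
        if "SAGE_MEMORY_CONFIG" in env:       # JSON blob sets the baseline
            cfg = cls(**json.loads(env["SAGE_MEMORY_CONFIG"]))
        if "SAGE_EVICTION_ENABLED" in env:    # single-flag override on top
            cfg.eviction_enabled = env["SAGE_EVICTION_ENABLED"] == "1"
        return cfg
```

Layering a single-flag override on top of a JSON baseline keeps operational toggles (e.g., disabling eviction during an incident) one env var away without rewriting the whole config.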
T7 – Policy, Router, and Autoscaler Updates

- Goal: Teach `HybridSchedulingPolicy`, the PD router, and the autoscaler to consume memory budgets and predicted conflicts.
- Key Files: `strategies/hybrid_policy.py`, `pd_routing.py`, `autoscaler.py`, `router.py`.
- Deliverable: Priority adjustments based on quotas, the ability to reserve memory for incoming embedding batches, and autoscaler signals tied to memory strain.
- Prompt (T7): Modify the scheduling and routing layers so embedding batches reserve quota before dispatch, mixed instances respect per-type budgets, and autoscaler decisions incorporate memory pressure (e.g., sustained quota deficits). Include tests showing queues drain when memory frees up, and document the new policy behaviors.
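The reserve-before-dispatch handshake could be a small ledger shared between the router and allocator, roughly as sketched here (class and method names are assumptions):

```python
# Toy quota ledger: the router calls try_reserve() before dispatching an
# embedding batch and only dispatches on success; release() runs when the
# batch completes. `pressure` is a candidate autoscaler signal.
class QuotaLedger:
    def __init__(self, budget_mb: float):
        self.budget_mb = budget_mb
        self.reserved_mb = 0.0

    def try_reserve(self, mb: float) -> bool:
        if self.reserved_mb + mb > self.budget_mb:
            return False          # caller should defer/requeue the batch
        self.reserved_mb += mb
        return True

    def release(self, mb: float) -> None:
        self.reserved_mb = max(0.0, self.reserved_mb - mb)

    @property
    def pressure(self) -> float:
        """Fraction of the budget currently in use."""
        return self.reserved_mb / self.budget_mb if self.budget_mb else 1.0
```

The "queues drain when memory frees up" test reduces to: a failed `try_reserve` followed by a `release` makes the retried reservation succeed.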
Group 4 – Benchmark & Evaluation Enhancements

T8 – Memory-Stress Workload Generator & Datasets

- Goal: Extend `HybridWorkloadGenerator` and the base workloads to synthesize configurable KV footprints, embedding bursts, and correlated arrival traces; add sample datasets under `data/model_context/`.
- Key Files: `benchmark_control_plane/workload.py`, `hybrid_scheduler/workload.py`, dataset README.
- Deliverable: YAML-driven workload profiles that specify prompt/response length distributions, embedding batch waves, and SLO envelopes.
- Prompt (T8): Enhance the workload generators so benchmark configs can dictate KV cache pressure (e.g., long decode tails) and embedding surges. Add dataset loaders that annotate expected memory per request, update the config schemas, and provide sample YAML profiles plus documentation.
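Expanding a parsed YAML profile into deterministic, memory-annotated requests might look like this sketch; the profile keys and the 0.5 MB-per-embedding constant are illustrative, not the actual schema:

```python
import random

# Toy expansion of a YAML-style workload profile dict into requests that
# carry their expected memory demand, with a seeded RNG so the same
# profile always yields the same KV pressure.
def expand_profile(profile: dict, seed: int = 0) -> list:
    rng = random.Random(seed)
    requests = []
    for _ in range(profile["num_llm_requests"]):
        out_len = rng.randint(*profile["decode_len_range"])
        requests.append({
            "type": "llm",
            "decode_tokens": out_len,
            # Annotate expected memory so collectors can compare later.
            "expected_kv_tokens": profile["prompt_tokens"] + out_len,
        })
    for _ in range(profile["embedding_waves"]):
        requests.append({
            "type": "embedding",
            "batch_size": profile["embedding_batch_size"],
            "expected_workspace_mb": profile["embedding_batch_size"] * 0.5,
        })
    return requests
```

Seeding per profile is what makes "deterministic KV pressure" reproducible across benchmark runs and machines.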
T9 – Memory-Centric Metrics & Telemetry Plumbed Through Benchmarks

- Goal: Capture Control Plane memory stats, GPU monitor data, and SLO outcomes in benchmark outputs.
- Key Files: `benchmark_control_plane/client.py`, `metrics.py`, `hybrid_scheduler/metrics.py`, `common/gpu_monitor.py`, visualization utilities.
- Deliverable: Reports highlighting SLO attainment vs. memory usage, eviction counts, and quota shifts, with charts for P99 TTFT/TPOT under stress.
- Prompt (T9): Extend the benchmark clients to request the new memory telemetry endpoints, tag each request with predicted vs. actual memory, and aggregate the results into the metrics collectors. Update GPU monitor summaries to align with Control Plane data, surface everything in the JSON reports, and add charts for quota utilization and SLO attainment.
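The aggregation step could reduce per-request records to report fields like these; the record shape and summary keys are assumptions about what the JSON report would carry:

```python
# Sketch: collapse per-request predicted-vs-actual memory records into
# the summary numbers a benchmark report would publish.
def summarize_memory(records) -> dict:
    """records: iterable of dicts with predicted_mb, actual_mb, slo_met."""
    records = list(records)
    n = len(records)
    if n == 0:
        return {}
    abs_err = sum(abs(r["actual_mb"] - r["predicted_mb"]) for r in records)
    return {
        "requests": n,
        "mean_abs_error_mb": abs_err / n,   # predictor calibration signal
        "slo_attainment": sum(1 for r in records if r["slo_met"]) / n,
        "peak_actual_mb": max(r["actual_mb"] for r in records),
    }
```

Reporting mean absolute prediction error alongside SLO attainment ties the Group 1 predictors directly to the benchmark outcomes they are supposed to improve.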
T10 – Dedicated Memory Benchmark Suite & CLI Hooks

- Goal: Create `benchmark_control_plane/memory_benchmark/` with workload generators, metric evaluators, CLI commands, and documentation that replicate the proposal’s experiments.
- Key Files: new package directory, `cli.py`, `runner.py`, docs.
- Deliverable: Commands like `sage-cp-bench run --mode memory`, plus comparison tooling.
- Prompt (T10): Add a memory-focused benchmark suite featuring scripts and tests that sweep workload ratios, quota configs, and eviction policies. Provide CLI commands to trigger these scenarios, emit CSV/JSON summaries for plotting, and ensure the results include the derived metrics (SLO attainment, throughput, fairness) required by the research plan.
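The sweep plumbing behind such a CLI might look like this; the flag names and sweep knobs are hypothetical, sketched with `argparse` rather than the real `sage-cp-bench` interface:

```python
import argparse
import itertools

# Hypothetical CLI shape for the memory benchmark sweep; --ratios and
# --eviction are invented flags standing in for the real knobs.
def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="sage-cp-bench")
    p.add_argument("--mode", choices=["memory"], required=True)
    p.add_argument("--ratios", default="0.5,0.7,0.9",
                   help="comma-separated LLM quota shares to sweep")
    p.add_argument("--eviction", default="slo,lru",
                   help="comma-separated eviction policies to sweep")
    return p

def scenarios(args) -> list:
    """Cartesian product of the swept knobs, one dict per benchmark run."""
    ratios = [float(r) for r in args.ratios.split(",")]
    policies = args.eviction.split(",")
    return [{"ratio": r, "eviction": e}
            for r, e in itertools.product(ratios, policies)]
```

Emitting one scenario dict per run makes it trivial to serialize each run's config into the CSV/JSON summaries alongside its derived metrics.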
Group 5 – Validation, CI, and Documentation

T11 – Regression Tests & CI Wiring

- Goal: Add unit/integration tests plus CI targets that cover predictors, admission decisions, quota convergence, and benchmark CLI smoke tests.
- Key Files: `packages/sage-common/tests`, `packages/sage-benchmark/tests`, `tools/pytest.ini`, GitHub workflows.
- Deliverable: Repeatable test matrix (CPU-friendly mocks) and optional nightly GPU job specs.
- Prompt (T11): Expand the test suite to cover the new memory manager components (unit tests with mocked predictors and GPU stats) and add integration tests that run the benchmark CLI in mock mode. Update the pytest config and CI workflows so these tests run in PRs (CPU-only) and optionally in nightly GPU pipelines.
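A CPU-only regression test with a mocked predictor, the kind this group would add, could look like the following sketch; `StubPredictor` and `admit_with_fallback` are placeholder names, not interfaces that exist yet:

```python
# Sketch of a GPU-free unit test: a stub predictor drives an admission
# helper, including the predictor-failure fallback path from T3.
class StubPredictor:
    def __init__(self, values):
        self._values = iter(values)

    def predict(self) -> float:
        return next(self._values)

def admit_with_fallback(predictor, free_mb: float,
                        safety_ratio: float = 0.9) -> bool:
    try:
        predicted = predictor.predict()
    except StopIteration:
        predicted = free_mb * 0.25   # conservative fallback on failure
    return predicted <= free_mb * safety_ratio

def test_admission_with_mocked_predictor():
    assert admit_with_fallback(StubPredictor([100.0]), 1000.0)
    assert not admit_with_fallback(StubPredictor([950.0]), 1000.0)
    # An exhausted predictor triggers the fallback instead of crashing.
    assert admit_with_fallback(StubPredictor([]), 1000.0)
```

Because everything is mocked, tests in this style run in the PR (CPU-only) lane; the nightly GPU lane would swap the stubs for live telemetry.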
T12 – Documentation & Research Artifacts

- Goal: Update the research proposal, developer docs, and tutorials with architecture diagrams, configuration guides, and benchmark instructions.
- Key Files: `docs/dev-notes/research_work/*.md`, `docs/dev-notes/l3-l6/*`, `docs-public/docs_src`.
- Deliverable: User-facing guide for enabling SLO-aware memory management, plus a reproducibility appendix.
- Prompt (T12): Refresh the documentation to describe the new memory management stack end-to-end. Update the research proposal with implementation details, add developer guides explaining the configuration knobs, and provide HOWTOs for running the memory benchmark suite. Include diagrams, config snippets, and troubleshooting tips.