Paper 2 (SAGE-Agent) – ICML Writing Prompt
Use the following structured brief to draft a full ICML-style paper (10 pages + references) and
provide complete LaTeX source plus .bib entries. The draft must follow the latest ICML formatting
guidelines (two-column layout, abstract ≤ 200 words, intro with numbered contributions).
0. Meta Instructions
- Conference target: ICML 2026 main track.
- Tone: Technical, precise, mix of theoretical insight + empirical rigor.
- Audience: Researchers in LLM agents, continual learning, streaming optimization.
- Deliverables:
  - main.tex compatible with icml2025.sty (use the \documentclass{article} + \usepackage{icml2025} pattern; a minimal skeleton is sketched below).
  - references.bib with ≥ 20 entries (include 2023-2025 works on streaming learning, agentic LLMs, reflexion, multi-agent coordination, RLHF/RLAIF, continual learning, coreset selection, tool-use benchmarks).
  - README snippet describing how to compile.
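For reference, a minimal main.tex skeleton following this pattern might look like the sketch below. The macro names (\icmltitlerunning, icmlauthorlist, \printAffiliationsAndNotice, etc.) follow recent ICML style files and should be checked against the official icml2025 example paper before use; all titles, authors, and affiliations are placeholders.

```latex
% Minimal skeleton only; verify macros against the official icml2025 template.
\documentclass{article}
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{icml2025}   % add the [accepted] option for the camera-ready version

\icmltitlerunning{SIAS: Self-Improving Agentic Systems}

\begin{document}

\twocolumn[
\icmltitle{SIAS: Self-Improving Agentic Systems with Reflective Memory and Streaming Adaptation}
\begin{icmlauthorlist}
\icmlauthor{Alice Zhang}{inst1}
\icmlauthor{Bo Chen}{inst2}
\end{icmlauthorlist}
\icmlaffiliation{inst1}{Placeholder University}
\icmlaffiliation{inst2}{Placeholder Institute}
\icmlcorrespondingauthor{Alice Zhang}{alice@example.org}
\icmlkeywords{LLM agents, continual learning, streaming optimization}
\vskip 0.3in
]
\printAffiliationsAndNotice{}

\begin{abstract}
Abstract text (at most 200 words).
\end{abstract}

\section{Introduction}
% Numbered contributions go here.

\bibliography{references}
\bibliographystyle{icml2025}

\end{document}
```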
1. Title & Authors
- Working title: SIAS: Self-Improving Agentic Systems with Reflective Memory and Streaming Adaptation
- Authors: Placeholder names (e.g., Alice Zhang, Bo Chen, etc.) across institutions in the US/CN/EU.
2. Motivation
- Existing tool-augmented LLM agents (e.g., ReAct, ToolLLM, Gorilla, AutoGPT) follow a static plan + best-of-one reasoning paradigm.
- They fail in long-horizon tasks because they lack persistence (no memory), lack adaptability (no re-planning when tools fail), and relearn from scratch when new tools/data arrive.
- Benchmark evidence from Paper 1 (SAGE-Bench) shows timing/planning/tool-selection accuracy plateaus even when models scale.
- Need an agent that learns continuously from experience while remaining sample-efficient.
3. Method Overview (Core Sections)
Describe four tightly coupled innovations:
- Reflective Memory Planner
  - Maintains a streaming execution memory of (query, plan, execution trace, reflection) tuples.
  - Uses SSIS (Streaming Sample Importance Scorer) to rank past episodes (see the memory sketch after this list).
  - Performs self-critique before dispatching plans; integrates prior failures to avoid repetition.
- Adaptive Execution & Dynamic Re-Planning
  - Pre-condition verifier + observation validator.
  - If a tool call deviates from expectations, triggers localized re-planning instead of restarting the entire workflow.
  - Supports multi-resolution planning (high-level vs. low-level steps), connecting to the runtime/orchestrator.py implementation.
- Collaborative Specialist Agents
  - Router assigns sub-tasks to specialized agents (Researcher, Coder, Analyst, Coordinator).
  - Agents share summaries through a streaming blackboard; only the Coordinator can commit final answers.
- Streaming Training Stack (Paper 2 focus)
  - Unified trainer covering batch, streaming, and continual regimes (ref: streaming_trainer.py, batch_trainer.py; see the trainer sketch below).
  - Incorporates coreset selectors: random, uncertainty, diversity, SSIS.
  - Includes continual-learning defenses (EWC-style regularizer + replay buffer) and RL fine-tuning (PPO/DPO) on execution feedback.
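To anchor the algorithm boxes requested below, here is a minimal Python sketch of the episode record and an SSIS-style ranking for the Reflective Memory Planner. The field names, the EpisodeMemory interface, and the scoring formula (a weighted mix of query similarity, a failure bonus, and recency decay) are illustrative assumptions, not the project's actual implementation.

```python
# Sketch of the reflective memory + SSIS-style ranking; names and weights are illustrative.
from dataclasses import dataclass, field
import time
import numpy as np


@dataclass
class Episode:
    query: str
    plan: list[str]
    trace: str             # raw execution trace (tool calls + observations)
    reflection: str        # self-critique written after execution
    success: bool
    embedding: np.ndarray  # embedding of the query (e.g., bge-m3)
    timestamp: float = field(default_factory=time.time)


class EpisodeMemory:
    """Streaming buffer of past episodes, ranked by an SSIS-like importance score."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.episodes: list[Episode] = []

    def add(self, ep: Episode) -> None:
        self.episodes.append(ep)
        if len(self.episodes) > self.capacity:
            self.episodes.pop(0)  # FIFO eviction; an importance-based policy is a natural refinement

    def top_k(self, query_emb: np.ndarray, k: int = 5) -> list[Episode]:
        scored = [(self._importance(e, query_emb), e) for e in self.episodes]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in scored[:k]]

    def _importance(self, ep: Episode, query_emb: np.ndarray) -> float:
        # Assumed SSIS-style score: similarity to the current query, a bonus for
        # failed episodes (so the planner can avoid repeating them), and recency decay.
        sim = float(query_emb @ ep.embedding /
                    (np.linalg.norm(query_emb) * np.linalg.norm(ep.embedding) + 1e-8))
        failure_bonus = 0.0 if ep.success else 0.3
        recency = np.exp(-(time.time() - ep.timestamp) / 3600.0)
        return 0.6 * sim + failure_bonus + 0.1 * recency
```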
Provide algorithm boxes (pseudo-code) for each component.
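Similarly, for the Streaming Training Stack, the sketch below compresses one streaming update into coreset selection, replay mixing, and an EWC-style penalty. The function signatures and the interfaces of coreset_selector, replay, and collate_fn are assumptions for illustration; they are not the actual streaming_trainer.py API.

```python
# One streaming update step: coreset selection + replay mixing + EWC penalty.
# Interfaces (coreset_selector, replay.sample, collate_fn) are assumed for illustration.
import torch
import torch.nn.functional as F


def ewc_penalty(model, fisher, anchor_params, lam=0.1):
    """Quadratic penalty keeping parameters near their values from earlier phases."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - anchor_params[name]) ** 2).sum()
    return lam * penalty


def streaming_step(model, optimizer, buffer, replay, coreset_selector, collate_fn,
                   fisher, anchor_params, coreset_size=256, replay_ratio=0.25):
    # 1) Rank buffered episodes (e.g., with SSIS) and keep the top-k coreset.
    coreset = coreset_selector(buffer, k=coreset_size)
    # 2) Mix in replayed examples from earlier phases to limit forgetting.
    batch = coreset + replay.sample(int(replay_ratio * coreset_size))

    inputs, targets = collate_fn(batch)  # tokenization/collation supplied by the caller
    logits = model(inputs)
    task_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    loss = task_loss + ewc_penalty(model, fisher, anchor_params)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```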
4. Experimental Setup
Use these subsections and concrete details:
- Benchmarks: SAGE-Bench (timing/planning/tool-selection), Streaming Tool-Arena (new dataset with 1,200 tools), Continual Tool Drop-In (phase I = 1,000 tools, phase II = +200 unseen tools).
- Baselines: ReAct, Reflexion, AutoGPT, ToolLLM, Gorilla, LATS, Voyager, and our ablations (w/o memory, w/o replan, w/o coreset, batch-only).
- Metrics: plan accuracy, tool success rate, adaptation lag (episodes to recover), catastrophic forgetting (Δ accuracy old tools), sample efficiency (% data to reach baseline), compute hours.
- Implementation: UnifiedInferenceClient + vLLM (Qwen2.5-7B/14B), embeddings (BAAI/bge-m3), training on 2×A100 80GB.
- Hyperparameters: streaming buffer size 1000, coreset size 256, RL fine-tuning 3 epochs with reward shaping from execution success.
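To make the setup reproducible at a glance, the hyperparameters above can be collected into a single config block. The layout and key names below are hypothetical; values mirror this section, and the replay ratio is explicitly marked as an assumed addition.

```python
# Hypothetical experiment config; key names are illustrative, values mirror Section 4.
SIAS_CONFIG = {
    "model": "Qwen2.5-7B",                # also run with Qwen2.5-14B
    "embedding_model": "BAAI/bge-m3",
    "hardware": "2x A100 80GB",
    "streaming_buffer_size": 1000,
    "coreset_size": 256,
    "coreset_selector": "ssis",           # one of: random, uncertainty, diversity, ssis
    "replay_ratio": 0.25,                 # assumed; not specified in the brief
    "rl_finetune": {
        "algorithm": "ppo",               # or "dpo"
        "epochs": 3,
        "reward": "execution_success",    # reward shaping from execution feedback
    },
}
```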
5. Key Results to Highlight
- Streaming vs Batch: SIAS reaches target accuracy using 35% fewer samples and 28% less GPU time.
- Coreset Ablation: SSIS yields +3.4% accuracy vs random, +2.1% vs uncertainty.
- Continual Learning: Forgetting reduced from −12.7% to −3.1%; adaptation to new tools in 15 episodes vs 48 for baselines.
- RL Fine-tuning: +4.6% end-to-end task success (especially for multi-hop reasoning tasks).
- Qualitative case study: planner catches hallucinated weather API result, re-plans via search + summarization.
6. Figures & Tables to Produce
- Fig 1: Architecture diagram of SIAS pipeline (memory, planner, verifier, specialist agents, streaming trainer).
- Fig 2: Sample efficiency curves (streaming vs batch).
- Fig 3: Coreset strategy comparison.
- Fig 4: Continual-learning forgetting curves.
- Fig 5: Ablation radar chart (memory, replan, collaboration, RL).
- Table 1: Benchmark comparison on SAGE-Bench (timing/planning/selection accuracy).
- Table 2: Streaming vs batch resource usage.
- Table 3: Continual-learning results (old/new tools).
- Table 4: RL fine-tuning improvements per task family.
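As a starting point for Table 1, a booktabs skeleton in the ICML two-column layout is sketched below; the column choice, the table* placement, and the row set are assumptions, and all values are placeholders to be replaced with real results (booktabs is assumed to be loaded in the preamble).

```latex
% Placeholder skeleton for Table 1 (SAGE-Bench comparison); all values are dummies.
\begin{table*}[t]
\caption{Timing, planning, and tool-selection accuracy on SAGE-Bench.}
\label{tab:sage-bench}
\centering
\begin{tabular}{lccc}
\toprule
Method & Timing (\%) & Planning (\%) & Tool selection (\%) \\
\midrule
ReAct       & -- & -- & -- \\
Reflexion   & -- & -- & -- \\
ToolLLM     & -- & -- & -- \\
SIAS (ours) & -- & -- & -- \\
\bottomrule
\end{tabular}
\end{table*}
```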
7. Discussion Prompts
- Analyze the trade-off between memory size and latency.
- Failure modes: when SSIS mis-ranks episodes, when collaboration overhead dominates.
- Ethical considerations: logging user/tool traces, privacy in memory.
- Future work: integrating symbolic verification, extending to multi-modal agents.
8. Appendix Must-Haves
- Detailed algorithm derivations for SSIS scoring.
- Proof sketch for convergence of streaming trainer under bounded drift.
- Extended implementation details (configs, hardware utilization, cost analysis).
- Additional ablation on memory warm-start length and RL reward shaping.
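For the convergence proof sketch, one common way to formalize bounded drift is via the path length of the per-round optima. The LaTeX template below states the assumption and an informal theorem with a placeholder bound B(T, P_T); the actual rate and constants are left to the derivation, and the assumption/theorem environments are assumed to be defined via amsthm.

```latex
% Template only; the bound B(T, P_T) is a placeholder for the actual analysis.
% Assumes \newtheorem{assumption}{Assumption} and \newtheorem{theorem}{Theorem}.
\begin{assumption}[Bounded drift]
The per-round optima $\theta_t^\star \in \arg\min_\theta \ell_t(\theta)$ have path length
$P_T = \sum_{t=2}^{T} \lVert \theta_t^\star - \theta_{t-1}^\star \rVert \le D$
for some constant $D$.
\end{assumption}

\begin{theorem}[Streaming trainer convergence, informal]
Under bounded drift and standard convexity, smoothness, and bounded-gradient assumptions,
the streaming trainer with step sizes $\eta_t$ satisfies
\[
  \frac{1}{T} \sum_{t=1}^{T} \bigl( \ell_t(\theta_t) - \ell_t(\theta_t^\star) \bigr)
  \;\le\; B(T, P_T),
\]
where $B(T, P_T) \to 0$ as $T \to \infty$ whenever $P_T$ grows sublinearly in $T$.
\end{theorem}
```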
Output format reminder: Provide main.tex, references.bib, and README.md (compile instructions). Make sure citations in LaTeX match BibTeX keys.