Paper 1 (SAGE-Bench) – ICML Writing Prompt
Use the following structured brief to draft a full ICML-style benchmark paper (10 pages +
references) and provide complete LaTeX source plus .bib entries. The draft must follow the
latest ICML formatting guidelines (two-column layout, abstract ≤ 200 words, intro with numbered
contributions).
0. Meta Instructions
- Conference target: ICML 2026 main track.
- Tone: scientific and data-rich; emphasize careful benchmarking and reproducibility.
- Audience: Researchers working on LLM agents, tool-use evaluation, program-of-thought planning, and responsible deployment.
- Deliverables:
  - `main.tex` compatible with `icml2025.sty` (standard `\documentclass{article}` + `\usepackage{icml2025}` scaffold).
  - `references.bib` with ≥25 entries (covering 2022–2025 benchmarks on tool use, ReAct/ToolLLM/Gorilla, ACE/ToolBench/API-Bank, evaluation methodology, LLM agent analysis, and robustness).
  - README snippet (compilation instructions + dependency list).
  - Figures saved as `fig/*.pdf` and referenced inside the LaTeX source; tables produced with `\begin{table}` blocks referencing `tables/*.tex` if needed. A minimal `main.tex` scaffold follows this list.
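A minimal `main.tex` scaffold consistent with the deliverables above; this is a sketch that assumes the official ICML 2025 template files (`icml2025.sty`, `icml2025.bst`) sit on the TeX path, with placeholder authors and affiliations:

```latex
% main.tex -- minimal scaffold (sketch); assumes the official ICML 2025
% template files (icml2025.sty, icml2025.bst) are next to this file.
\documentclass{article}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{icml2025}  % switch to \usepackage[accepted]{icml2025} for camera-ready

\icmltitlerunning{SAGE-Bench}

\begin{document}

\twocolumn[
\icmltitle{SAGE-Bench: Diagnosing Timing, Planning, and Tool Selection
           Failures in LLM Agents}
\begin{icmlauthorlist}
  \icmlauthor{Alice Zhang}{acad}
  \icmlauthor{David Romero}{ind}
  % ... remaining placeholder authors ...
\end{icmlauthorlist}
\icmlaffiliation{acad}{Placeholder University}
\icmlaffiliation{ind}{Placeholder Industry Lab}
\icmlcorrespondingauthor{Alice Zhang}{alice@example.org}
\vskip 0.3in
]
\printAffiliationsAndNotice{}

\begin{abstract}
% <= 200 words
\end{abstract}

\section{Introduction}
% numbered contributions go here

\bibliography{references}
\bibliographystyle{icml2025}

\end{document}
```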
1. Title & Authors
- Working title: "SAGE-Bench: Diagnosing Timing, Planning, and Tool Selection Failures in LLM Agents".
- Authors: placeholder team (e.g., Alice Zhang, David Romero, Jiaqi Wang, Priya Natarajan, Markus Vogel) spanning academia, industry, and an open-source lab.
2. Motivation & Problem Statement
Highlight why existing benchmarks (ToolBench, API-Bank, ACE, BFCL) fail to jointly test timing, planning, and tool selection under the same controlled protocol. Stress the following pain points:
- ReAct, ToolLLM, Gorilla, AutoGPT, Voyager, etc. report single-task metrics, avoiding cross-cutting failure modes.
- Current evaluations ignore timing ("should I call a tool?"), leading to over-triggering or missed opportunities.
- Task planning metrics are rarely standardized, making it difficult to compare hierarchical vs ReAct vs CoT planners.
- Tool selection datasets lack noise injection, semantic variation, and reliability perturbations.
- SAGE-Bench bridges these gaps with: (i) unified dataset of ~1k tasks with human-verified references, (ii) layered RQ design (RQ1 timing, RQ2 planning, RQ3 selection), (iii) scripted analysis suites (scaling, robustness, ablations, cross-dataset).
3. Benchmark Overview (Core Sections)
Structure the method section with four subsections (include diagrams / pseudo-code where helpful):
- Benchmark Construction
  - Describe data sourcing: curated tasks spanning enterprise productivity, developer tooling, reasoning, and retrieval.
  - Provide split counts (train/dev/test) and the labeling pipeline (human annotation + validator scripts).
  - Detail challenge definitions: timing messages, planning traces, selection candidate pools (≥20 tools per query), noise tool synthesis, reliability perturbations.
- Evaluation Harness (an illustrative pseudo-code sketch follows this list)
  - Introduce the adapter registry (selectors, planners, timing deciders) located in `packages/sage-benchmark/.../adapter_registry.py`.
  - Emphasize controlled variables: same base models, embeddings, temperature, tool catalogs, and latency budget.
  - Explain the reproducible CLI: `sage-bench paper1 run --section {5.2|5.3|5.4|5.5}` plus per-experiment subcommands.
- Baselines & Protocols
  - Timing: `rule_based`, `embedding`, `llm_based`, hybrid.
  - Planning: simple, hierarchical (HuggingGPT), `llm_based` CoT, ReAct, Tree-of-Thoughts.
  - Tool selection: keyword/BM25, dense embedding, hybrid fusion, Gorilla, DFSDT/ToolLLM.
  - Training (Section 5.5): standard SFT, LoRA/QLoRA/DoRA, FireAct trajectory tuning, AgentTuning multi-task, ToolLLM fine-tuning (Paper 2’s SIAS methods are not included).
  - Mention evaluation in both default and "skip LLM" modes.
- Analysis Suite
  - Error-breakdown taxonomies (timing FP/FN, planning steps missing/misordered, selection confusion types).
  - Scaling harness (tool counts [10, 25, 50, 100, 200, 500, 1000]; LLM sizes from Qwen2.5-0.5B to 14B).
  - Robustness stressors (semantic-variation paraphrases, instruction quality levels, tool failures and latency spikes).
  - Ablations (prompt variants, hybrid weighting, timing pipeline order).
  - Cross-dataset evaluation on ACE-Bench, ToolBench, API-Bank, BFCL (train on SAGE-Bench, zero-shot test elsewhere).
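As referenced under Evaluation Harness, a pseudo-code sketch of the harness loop that the paper could include. The adapter names (`TimingDecider`, `Planner`, `Selector`) are illustrative placeholders, not the actual registry API:

```latex
% Harness loop sketch using the standard algorithm/algorithmic packages.
% TimingDecider/Planner/Selector are placeholder names for registry adapters.
\begin{algorithm}[t]
  \caption{SAGE-Bench evaluation loop (illustrative)}
  \begin{algorithmic}[1]
    \FOR{each task $t$ in the evaluation split}
      \STATE $d \gets \textsc{TimingDecider}(t)$ \COMMENT{RQ1: call a tool at all?}
      \IF{$d$ indicates a tool call}
        \STATE $p \gets \textsc{Planner}(t)$ \COMMENT{RQ2: decompose into steps}
        \FORALL{steps $s$ in $p$}
          \STATE $\tau \gets \textsc{Selector}(s, \mathcal{C})$ \COMMENT{RQ3: pick from candidate pool $\mathcal{C}$}
          \STATE execute $\tau$ under the shared latency budget; log the outcome
        \ENDFOR
      \ENDIF
      \STATE score $t$ against human-verified references
    \ENDFOR
  \end{algorithmic}
\end{algorithm}
```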
4. Experimental Setup (Section 5.1 Template)
Require explicit subsections:
- Datasets & Challenges: Provide a table with counts per challenge + cross-dataset stats; mention candidate tool pool sizes and reliability annotations.
- Baselines: Summaries + citations for each method (ReAct, ToolLLM, Gorilla, Voyager, Reflexion, AutoGPT for reference, plus SFT/LoRA variants for training comparison).
- Metrics: list per challenge (Accuracy, Precision/Recall/F1, Plan Success Rate, Step Accuracy, Tool Coverage, Top-K Accuracy@5, MRR, Latency); for the analysis sections mention adaptation lag, forgetting, and robustness deltas. MRR and Top-K accuracy are defined after this list.
- Implementation Details: unify on Qwen2.5-7B/14B served via vLLM; embeddings = BAAI/bge-m3; hardware = 4×A100 80GB for the benchmark runs, with reproducible seeds; include the CLI command schedule.
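For the metrics bullet above, the paper can state the conventional ranking-metric definitions; a minimal LaTeX snippet (the notation is ours):

```latex
% Conventional definitions of MRR and Top-K accuracy over the query set Q.
\begin{equation}
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i},
\qquad
\mathrm{Acc@}K = \frac{1}{|Q|} \sum_{i=1}^{|Q|}
  \mathbf{1}\left[\mathrm{rank}_i \le K\right],
\end{equation}
```

where $\mathrm{rank}_i$ is the position of the first correct tool in the ranked candidate list for query $i$ and $|Q|$ is the number of queries.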
5. Main Results (Section 5.2 RQ1–RQ3)
Provide templates for describing each RQ:
- RQ1 Timing: Table summarizing accuracy/precision/recall/F1/latency; highlight hybrid surpassing 95% accuracy, analyze FP vs FN trade-offs.
- RQ2 Planning: Table + figure comparing success rate, step accuracy; observe ReAct vs ToT vs hierarchical differences, report average plan length.
- RQ3 Tool Selection: Top-K accuracy curves, MRR table, latency plot; show hybrid vs Gorilla vs DFSDT performance under noise.
- Include textual guidance for referencing the figure files `fig1_timing.pdf`, `fig2_planning.pdf`, `fig3_selection.pdf` and tables (`table_timing.tex`, etc.); a minimal include pattern follows.
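A sketch of the include pattern, assuming the `fig/` and `tables/` layout named in the deliverables:

```latex
% Figure placeholder: PDF assets live in fig/, table bodies in tables/.
\begin{figure}[t]
  \centering
  \includegraphics[width=\columnwidth]{fig/fig1_timing.pdf}
  \caption{Timing-decision accuracy, precision/recall, and latency across baselines.}
  \label{fig:timing}
\end{figure}

\begin{table}[t]
  \centering
  \caption{RQ1 timing metrics.}
  \label{tab:timing}
  \input{tables/table_timing.tex}
\end{table}
```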
6. Analysis & Cross-Dataset Sections
Outline what to emphasize in Sections 5.3–5.5:
- Error Analysis: Provide narrative on cascading failures and recommended mitigations.
- Scaling Analysis: Discuss saturation regimes, emergent gains when moving from 1.5B → 7B models, diminishing returns beyond 200 tools.
- Robustness: Report semantic variation sensitivity (<5% drop target) and instruction quality gaps; note reliability tolerance thresholds.
- Ablations: Quantify prompt impacts (+2.3% with CoT) and hybrid weight sweeps.
- Cross-Dataset Generalization (5.4): Table comparing SAGE-Bench-trained models on ACE, ToolBench, API-Bank, BFCL; emphasize zero-shot gap and how timing/planning improvements transfer to selection-only benchmarks.
- Training Comparison (5.5): Provide figure/table showing baseline SFT vs LoRA/QLoRA/DoRA vs FireAct vs AgentTuning vs ToolLLM; highlight fairness (same data size) and note that SIAS methods are reserved for Paper 2.
7. Key Findings to Highlight
Ensure the narrative surfaces these quantitative bullets:
- Hybrid timing reaches 95.8% accuracy with 30% fewer FP than LLM-only.
- Planning: ReAct improves success by +6.4% over hierarchical on long-horizon tasks; Tree-of-Thoughts yields highest step accuracy but +18% latency.
- Tool selection: Hybrid fusion beats Gorilla by +3.1% Top-5 under 500-tool setting; DFSDT best under small tool sets but scales poorly.
- Scaling: Accuracy plateaus beyond 7B models; tool count >200 drastically hurts lexical methods.
- Robustness: Semantic paraphrases cost ≤2% for embedding methods but ≥7% for keyword baseline.
- Cross-dataset: Training on SAGE-Bench transfers, improving ACE-Bench top-5 by +4.5% vs prior SOTA.
- Training comparison: LoRA/QLoRA retain 97% of full SFT accuracy with 38% compute savings; FireAct helps planning but hurts timing; AgentTuning boosts overall benchmark score by +2.8%.
8. Figures & Tables Checklist
Mandate at least these assets:
- Fig 1 Timing comparison (bar + latency inset).
- Fig 2 Planning success & plan length.
- Fig 3 Tool selection Top-5 accuracy vs tool count.
- Fig 4 Error cascade Sankey / stacked bars.
- Fig 5 Scaling curves (tool count & LLM size).
- Fig 6 Robustness heatmap (semantic × reliability).
- Fig 7 Ablation radar chart.
- Fig 8 Cross-dataset transfer plot.
- Table 1 Benchmark summary + statistics (a skeleton follows this checklist).
- Table 2 RQ1 metrics.
- Table 3 RQ2 metrics.
- Table 4 RQ3 metrics.
- Table 5 Robustness / scaling results.
- Table 6 Training method comparison (Section 5.5).
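For Table 1, an illustrative skeleton (requires `booktabs`; the rows mirror the three challenges and every count is a placeholder to be filled from the benchmark statistics):

```latex
% Illustrative skeleton for Table 1; all counts are placeholders.
\begin{table}[t]
  \centering
  \caption{SAGE-Bench summary statistics (placeholder values).}
  \label{tab:benchmark_summary}
  \begin{tabular}{lrrr}
    \toprule
    Challenge & Train & Dev & Test \\
    \midrule
    Timing    & -- & -- & -- \\
    Planning  & -- & -- & -- \\
    Selection & -- & -- & -- \\
    \bottomrule
  \end{tabular}
\end{table}
```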
9. Discussion Prompts
Provide bullet prompts for the discussion section:
- Benchmark takeaways: where agents still fail despite large models.
- Trade-offs between timing conservatism vs latency budgets.
- Limitations: curated dataset bias, reliance on Qwen family, remaining gap to closed-source LLMs.
- Ethical considerations: evaluating real tools safely, logging policies, aligning agent autonomy with compliance.
- Future work: integrate SIAS (Paper 2) improvements, expand to multi-modal instructions, introduce human-in-the-loop verification.
10. Appendix Requirements
- Full dataset curation pipeline, annotation UI screenshots.
- Additional per-dataset breakdown tables (ACE, ToolBench, API-Bank, BFCL).
- Extended error taxonomies + qualitative failure cases.
- Reproducibility checklist (seeds, versions, CLI commands, hardware cost analysis).
- License + release plan for SAGE-Bench artifacts.
Output format reminder: Provide `main.tex`, `references.bib`, `README.md`, and figure/table placeholders. All citations in LaTeX must match BibTeX keys (a sample entry follows); ensure `\icmltitlerunning{SAGE-Bench}` is set.
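A sample `references.bib` entry illustrating one plausible key convention for the ≥25 required references; the key name is our assumption, and bibliographic details should be verified against the published version:

```bibtex
@inproceedings{yao2023react,
  title     = {{ReAct}: Synergizing Reasoning and Acting in Language Models},
  author    = {Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and
               Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2023}
}
```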