
Paper 1 (SAGE-Bench) – ICML Writing Prompt

Use the following structured brief to draft a full ICML-style benchmark paper (10 pages + references) and provide complete LaTeX source plus .bib entries. The draft must follow the latest ICML formatting guidelines (two-column layout, abstract ≤ 200 words, intro with numbered contributions).

0. Meta Instructions

  • Conference target: ICML 2026 main track.
  • Tone: Scientific, data-rich, emphasize careful benchmarking and reproducibility.
  • Audience: Researchers working on LLM agents, tool-use evaluation, program-of-thought planning, and responsible deployment.
  • Deliverables:
    • main.tex compatible with icml2025.sty (standard \documentclass{article} + \usepackage{icml2025} scaffold).
    • references.bib with ≥25 entries covering 2022–2025 benchmarks on tool use, ReAct/ToolLLM/Gorilla, ACE/ToolBench/API-Bank, evaluation methodology, LLM agent analysis, and robustness.
    • README snippet (compilation instructions + dependency list).
    • Figures referenced as fig/*.pdf inside the LaTeX source; tables produced with \begin{table} blocks, pulling in tables/*.tex where needed.

1. Title & Authors

  • Working title: SAGE-Bench: Diagnosing Timing, Planning, and Tool Selection Failures in LLM Agents
  • Authors: Placeholder team (e.g., Alice Zhang, David Romero, Jiaqi Wang, Priya Natarajan, Markus Vogel) spanning academia + industry + open-source lab.

2. Motivation & Problem Statement

Highlight why existing benchmarks (ToolBench, API-Bank, ACE, BFCL) fail to jointly test timing, planning, and tool selection under the same controlled protocol. Stress the following pain points:

  • ReAct, ToolLLM, Gorilla, AutoGPT, Voyager, etc. report single-task metrics, avoiding cross-cutting failure modes.
  • Current evaluations ignore timing ("should I call a tool?"), leading to over-triggering or missed opportunities.
  • Task planning metrics are rarely standardized, making it difficult to compare hierarchical, ReAct, and CoT planners.
  • Tool selection datasets lack noise injection, semantic variation, and reliability perturbations.
  • SAGE-Bench bridges these gaps with: (i) unified dataset of ~1k tasks with human-verified references, (ii) layered RQ design (RQ1 timing, RQ2 planning, RQ3 selection), (iii) scripted analysis suites (scaling, robustness, ablations, cross-dataset).

3. Benchmark Overview (Core Sections)

Structure the method section with four subsections (include diagrams / pseudo-code where helpful):

  1. Benchmark Construction
    • Describe data sourcing: curated tasks spanning enterprise productivity, developer tooling, reasoning, and retrieval.
    • Provide split counts (train/dev/test) and the labeling pipeline (human annotation + validator scripts).
    • Detail challenge definitions: timing messages, planning traces, selection candidate pools (≥20 tools per query), noise tool synthesis, reliability perturbations.

  2. Evaluation Harness
    • Introduce the adapter registry (selectors, planners, timing deciders) located in packages/sage-benchmark/.../adapter_registry.py; a minimal registry sketch follows this list.
    • Emphasize controlled variables: same base models, embeddings, temperature, tool catalogs, latency budget.
    • Explain the reproducible CLI: sage-bench paper1 run --section {5.2|5.3|5.4|5.5} plus per-experiment subcommands.

  3. Baselines & Protocols
    • Timing: rule_based, embedding, llm_based, hybrid.
    • Planning: simple, hierarchical (HuggingGPT), llm_based CoT, ReAct, Tree-of-Thoughts.
    • Tool selection: keyword/BM25, dense embedding, hybrid fusion, Gorilla, DFSDT/ToolLLM; an illustrative fusion-scoring sketch follows this list.
    • Training (Section 5.5): standard SFT, LoRA/QLoRA/DoRA, FireAct trajectory tuning, AgentTuning multi-task, ToolLLM fine-tuning (Paper 2’s SIAS methods are not included).
    • Mention evaluation in both the default and "skip LLM" modes.

  4. Analysis Suite
    • Error breakdown taxonomies (timing FP/FN, planning missing/misordered steps, selection confusion types).
    • Scaling harness (tool counts [10, 25, 50, 100, 200, 500, 1000]; LLM sizes from Qwen2.5-0.5B → 14B).
    • Robustness stressors (semantic-variation paraphrases, instruction quality levels, tool failure & latency spikes).
    • Ablations (prompt variants, hybrid weighting, timing pipeline order).
    • Cross-dataset evaluation on ACE-Bench, ToolBench, API-Bank, BFCL (train on SAGE-Bench, zero-shot test elsewhere).
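For the Evaluation Harness subsection, a minimal registry sketch can anchor the prose. The real packages/sage-benchmark/.../adapter_registry.py is not reproduced here; every name below (_REGISTRY, register, create, the rule_based timing adapter) is an illustrative assumption about how such a registry might be organized.

```python
# Minimal sketch of an adapter registry for timing deciders, planners, and
# selectors. Illustrative only; the real adapter_registry.py may differ.
from typing import Callable, Dict

# One sub-registry per adapter family.
_REGISTRY: Dict[str, Dict[str, Callable]] = {
    "timing": {},
    "planner": {},
    "selector": {},
}

def register(kind: str, name: str) -> Callable:
    """Decorator that registers an adapter factory under (kind, name)."""
    def decorator(factory: Callable) -> Callable:
        _REGISTRY[kind][name] = factory
        return factory
    return decorator

def create(kind: str, name: str, **kwargs):
    """Instantiate a registered adapter, e.g. create('timing', 'rule_based')."""
    try:
        factory = _REGISTRY[kind][name]
    except KeyError as exc:
        raise ValueError(f"unknown {kind} adapter: {name!r}") from exc
    return factory(**kwargs)

@register("timing", "rule_based")
def make_rule_based_timing(keywords=("search", "fetch", "calculate")):
    """Trivial timing decider: call a tool iff a trigger keyword appears."""
    def decide(message: str) -> bool:
        text = message.lower()
        return any(k in text for k in keywords)
    return decide
```

A CLI entry point such as sage-bench paper1 run would then resolve baselines by name through the registry, which is what keeps the controlled variables (models, embeddings, catalogs) identical across methods.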
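For the Baselines & Protocols subsection, the hybrid fusion selector can be illustrated with a short scoring sketch. The actual baseline may fuse signals differently; the min-max normalization, the weighting parameter alpha, and the input score dictionaries are assumptions made for illustration.

```python
# Illustrative weighted fusion of lexical (BM25) and dense-embedding scores
# for tool selection. A sketch, not the benchmark's actual implementation.
from typing import Dict, List, Tuple

def _normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize scores to [0, 1] so the two score scales match."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {tool: 1.0 for tool in scores}
    return {tool: (s - lo) / (hi - lo) for tool, s in scores.items()}

def hybrid_rank(
    bm25_scores: Dict[str, float],
    dense_scores: Dict[str, float],
    alpha: float = 0.5,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Fuse scores as alpha * dense + (1 - alpha) * BM25 and return top-k tools."""
    bm25 = _normalize(bm25_scores)
    dense = _normalize(dense_scores)
    candidates = set(bm25) | set(dense)
    fused = {
        tool: alpha * dense.get(tool, 0.0) + (1 - alpha) * bm25.get(tool, 0.0)
        for tool in candidates
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Sweeping alpha is exactly the "hybrid weighting" ablation called for in the Analysis Suite.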

4. Experimental Setup (Section 5.1 Template)

Require explicit subsections:

  • Datasets & Challenges: Provide a table with counts per challenge + cross-dataset stats; mention candidate tool pool sizes and reliability annotations.
  • Baselines: Summaries + citations for each method (ReAct, ToolLLM, Gorilla, Voyager, Reflexion, AutoGPT for reference, plus SFT/LoRA variants for training comparison).
  • Metrics: list per challenge (Accuracy, Precision/Recall/F1, Plan Success Rate, Step Accuracy, Tool Coverage, Top-K Accuracy at K=5, MRR, Latency); for analysis, mention adaptation lag, forgetting, and robustness deltas. A reference-style implementation of the ranking metrics follows this list.
  • Implementation Details: unify on Qwen2.5-7B/14B via vLLM, embeddings = BAAI/bge-m3, hardware = 4×A100 80GB (benchmarks) + reproducible seeds, CLI command schedule.
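To pin down the ranking metrics named above, a short reference-style implementation helps keep baselines comparable. It assumes one gold tool per query and a ranked candidate list per query (best first); adapt it if the harness scores multi-tool references.

```python
# Top-K accuracy and MRR, assuming one gold tool per query and ranked
# candidate lists ordered best-first.
from typing import Sequence

def top_k_accuracy(gold: Sequence[str], ranked: Sequence[Sequence[str]], k: int = 5) -> float:
    """Fraction of queries whose gold tool appears among the top-k candidates."""
    hits = sum(1 for g, cands in zip(gold, ranked) if g in cands[:k])
    return hits / len(gold)

def mean_reciprocal_rank(gold: Sequence[str], ranked: Sequence[Sequence[str]]) -> float:
    """Average of 1/rank of the gold tool (contributes 0 if it is absent)."""
    total = 0.0
    for g, cands in zip(gold, ranked):
        for i, tool in enumerate(cands, start=1):
            if tool == g:
                total += 1.0 / i
                break
    return total / len(gold)
```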

5. Main Results (Section 5.2 RQ1–RQ3)

Provide templates for describing each RQ:

  • RQ1 Timing: Table summarizing accuracy/precision/recall/F1/latency; highlight hybrid surpassing 95% accuracy, analyze FP vs FN trade-offs.
  • RQ2 Planning: Table + figure comparing success rate, step accuracy; observe ReAct vs ToT vs hierarchical differences, report average plan length.
  • RQ3 Tool Selection: Top-K accuracy curves, MRR table, latency plot; show hybrid vs Gorilla vs DFSDT performance under noise. Include textual guidance for referencing figure files fig1_timing.pdf, fig2_planning.pdf, fig3_selection.pdf and tables table_timing.tex, etc.

6. Analysis & Cross-Dataset Sections

Outline what to emphasize in Sections 5.3–5.5:

  • Error Analysis: Provide narrative on cascading failures and recommended mitigations.
  • Scaling Analysis: Discuss saturation regimes, emergent gains when moving from 1.5B → 7B models, diminishing returns beyond 200 tools.
  • Robustness: Report semantic variation sensitivity (<5% drop target) and instruction quality gaps; note reliability tolerance thresholds (a delta-bookkeeping sketch follows this list).
  • Ablations: Quantify prompt impacts (+2.3% with CoT) and hybrid weight sweeps.
  • Cross-Dataset Generalization (5.4): Table comparing SAGE-Bench-trained models on ACE, ToolBench, API-Bank, BFCL; emphasize zero-shot gap and how timing/planning improvements transfer to selection-only benchmarks.
  • Training Comparison (5.5): Provide figure/table showing baseline SFT vs LoRA/QLoRA/DoRA vs FireAct vs AgentTuning vs ToolLLM; highlight fairness (same data size) and note that SIAS methods are reserved for Paper 2.
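For the robustness reporting, a small helper makes the "robustness delta" bookkeeping explicit: the drop from clean accuracy under each stressor. The function name and the example numbers are illustrative placeholders, not results.

```python
# Robustness delta = clean accuracy minus perturbed accuracy, per stressor.
from typing import Dict

def robustness_deltas(clean_acc: float, perturbed_acc: Dict[str, float]) -> Dict[str, float]:
    """Return the accuracy drop per stressor (positive values mean degradation)."""
    return {stressor: clean_acc - acc for stressor, acc in perturbed_acc.items()}

# Example with placeholder numbers:
deltas = robustness_deltas(
    clean_acc=0.91,
    perturbed_acc={"paraphrase": 0.89, "low_quality_instruction": 0.84, "tool_failure": 0.80},
)
# -> roughly {"paraphrase": 0.02, "low_quality_instruction": 0.07, "tool_failure": 0.11}
```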

7. Key Findings to Highlight

Ensure the narrative surfaces these quantitative bullets:

  • Hybrid timing reaches 95.8% accuracy with 30% fewer FP than LLM-only.
  • Planning: ReAct improves success by +6.4% over hierarchical on long-horizon tasks; Tree-of-Thoughts yields highest step accuracy but +18% latency.
  • Tool selection: Hybrid fusion beats Gorilla by +3.1% Top-5 under 500-tool setting; DFSDT best under small tool sets but scales poorly.
  • Scaling: Accuracy plateaus beyond 7B models; tool count >200 drastically hurts lexical methods.
  • Robustness: Semantic paraphrases cost ≤2% for embedding methods but ≥7% for keyword baseline.
  • Cross-dataset: Training on SAGE-Bench transfers, improving ACE-Bench top-5 by +4.5% vs prior SOTA.
  • Training comparison: LoRA/QLoRA retain 97% of full SFT accuracy with 38% compute savings; FireAct helps planning but hurts timing; AgentTuning boosts overall benchmark score by +2.8%.

8. Figures & Tables Checklist

Mandate at least these assets:

  • Fig 1 Timing comparison (bar + latency inset).
  • Fig 2 Planning success & plan length.
  • Fig 3 Tool selection Top-5 accuracy vs tool count (a plotting sketch for this asset follows the checklist).
  • Fig 4 Error cascade Sankey / stacked bars.
  • Fig 5 Scaling curves (tool count & LLM size).
  • Fig 6 Robustness heatmap (semantic × reliability).
  • Fig 7 Ablation radar chart.
  • Fig 8 Cross-dataset transfer plot.
  • Table 1 Benchmark summary + statistics.
  • Table 2 RQ1 metrics.
  • Table 3 RQ2 metrics.
  • Table 4 RQ3 metrics.
  • Table 5 Robustness / scaling results.
  • Table 6 Training method comparison (Section 5.5).
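Since every figure must land under fig/*.pdf, a plotting sketch for one asset (Fig 3) shows the intended convention. All numbers are placeholders; the real script should read the harness outputs instead.

```python
# Plotting sketch for Fig 3 (Top-5 accuracy vs. tool count), saved to
# fig/fig3_selection.pdf. Every value below is a placeholder, not a result.
import os
import matplotlib
matplotlib.use("Agg")  # headless rendering for batch/CI runs
import matplotlib.pyplot as plt

tool_counts = [10, 25, 50, 100, 200, 500, 1000]
top5 = {
    "hybrid":  [0.97, 0.95, 0.93, 0.90, 0.87, 0.83, 0.79],  # placeholder
    "gorilla": [0.96, 0.94, 0.91, 0.88, 0.84, 0.80, 0.75],  # placeholder
    "bm25":    [0.92, 0.88, 0.83, 0.77, 0.70, 0.61, 0.53],  # placeholder
}

os.makedirs("fig", exist_ok=True)
fig, ax = plt.subplots(figsize=(4.0, 2.8))
for name, ys in top5.items():
    ax.plot(tool_counts, ys, marker="o", label=name)
ax.set_xscale("log")
ax.set_xlabel("Candidate tool count")
ax.set_ylabel("Top-5 accuracy")
ax.legend(frameon=False)
fig.tight_layout()
fig.savefig("fig/fig3_selection.pdf")
```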

9. Discussion Prompts

Provide bullet prompts for the discussion section:

  • Benchmark takeaways: where agents still fail despite large models.
  • Trade-offs between timing conservatism vs latency budgets.
  • Limitations: curated dataset bias, reliance on Qwen family, remaining gap to closed-source LLMs.
  • Ethical considerations: evaluating real tools safely, logging policies, aligning agent autonomy with compliance.
  • Future work: integrate SIAS (Paper 2) improvements, expand to multi-modal instructions, introduce human-in-the-loop verification.

10. Appendix Requirements

  • Full dataset curation pipeline, annotation UI screenshots.
  • Additional per-dataset breakdown tables (ACE, ToolBench, API-Bank, BFCL).
  • Extended error taxonomies + qualitative failure cases.
  • Reproducibility checklist (seeds, versions, CLI commands, hardware cost analysis); a seed-pinning sketch follows this list.
  • License + release plan for SAGE-Bench artifacts.
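The reproducibility checklist can point to a concrete seed-pinning snippet like the one below. The torch calls assume PyTorch is installed (it is pulled in by vLLM); everything else is the standard library or NumPy.

```python
# Seed-pinning snippet for the reproducibility checklist.
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Pin the RNGs the harness touches so reruns start from the same state."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
```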

Output format reminder: Provide main.tex, references.bib, README.md, and figure/table placeholders. All citations in LaTeX must match BibTeX keys; ensure \icmltitlerunning{SAGE-Bench} is set.