
StreamBench

Towards Benchmarking Continuous Improvement of Language Agents

About StreamBench

StreamBench is the first benchmark designed to evaluate the continuous improvement capabilities of large language model (LLM) agents over time. In contrast to traditional benchmarks, which evaluate an LLM's capabilities statically at a single point in time, StreamBench evaluates the ability of LLM agents to learn and improve iteratively from an input-feedback sequence, reflecting real-world deployment scenarios. StreamBench currently covers the following tasks: text-to-SQL generation, Python programming, tool use, medical diagnosis, and question answering.

News

  • Sep. 26, 2024 StreamBench has been accepted to the NeurIPS 2024 Datasets and Benchmarks Track! See you in Vancouver!

How StreamBench Works

1. Input-Feedback Sequence: A schematic diagram showing the streaming setting of StreamBench, in which agents update their components (p, r, M, or θ) from an input-feedback sequence to achieve the highest final accuracy. Benchmark users design their own algorithms to update the components of their language agents, with the goal of maximizing accuracy over the entire sequence (a minimal sketch of this loop appears after this list).

2. Example Task: Performance curve on the medical diagnosis dataset (DDXPlus) in StreamBench. LLM agents gradually improve over the stream with our proposed streaming baselines.
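
To make the streaming setting concrete, below is a minimal Python sketch of the evaluation loop described above. The Agent class, its predict/update methods, and the toy data stream are illustrative assumptions rather than the official StreamBench harness; the update rule loosely mirrors the idea of keeping only feedback-verified examples in memory (as in the Self-StreamICL baseline).

# Minimal sketch of a StreamBench-style streaming loop (illustrative only).
# The Agent interface, dataset format, and update rule are assumptions for this example.
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy agent that keeps a memory M of past (input, output) pairs."""
    memory: list = field(default_factory=list)  # the component M updated over the stream

    def predict(self, x: str) -> str:
        # A real agent would call an LLM here, building its prompt (p) from
        # entries retrieved (r) out of self.memory. This stub just echoes the input.
        return f"answer({x})"

    def update(self, x: str, y_pred: str, feedback: bool) -> None:
        # Keep only examples whose output the feedback signal verified as correct
        # (roughly the idea behind Self-StreamICL).
        if feedback:
            self.memory.append((x, y_pred))


def run_stream(agent: Agent, stream: list[tuple[str, str]]) -> float:
    """Feed the input-feedback sequence to the agent and return final accuracy."""
    correct = 0
    for x, y_true in stream:
        y_pred = agent.predict(x)          # 1) agent answers the incoming input
        feedback = (y_pred == y_true)      # 2) environment returns binary feedback
        agent.update(x, y_pred, feedback)  # 3) agent updates its components
        correct += int(feedback)
    return correct / len(stream)           # metric: accuracy over the whole sequence


if __name__ == "__main__":
    toy_stream = [("q1", "answer(q1)"), ("q2", "some other answer")]
    print(run_stream(Agent(), toy_stream))  # 0.5 on this toy stream

The key design point is that evaluation is over the whole sequence: an agent that updates its components effectively earns higher accuracy on later inputs, which is exactly what the leaderboard numbers below measure.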

Submission

Please follow the Submission Guideline (work in progress).

Citation

@article{wu2024streambench,
  title={StreamBench: Towards Benchmarking Continuous Improvement of Language Agents},
  author={Wu, Cheng-Kuang and Tam, Zhi Rui and Lin, Chieh-Yen and Chen, Yun-Nung and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2406.08747},
  year={2024}
}

Leaderboard

Date          Method & LLMs                                                      Team                BIRD   DS1000  ToolBench  DDXPlus  HotpotQA
Oct 31, 2024  Self-StreamICL + gpt-4o                                            Appier AI Research  42.63  59.40   76.27      92.01    67.00
Oct 31, 2024  Self-StreamICL + gemini-1.5-flash                                  Appier AI Research  41.20  52.20   75.07      86.34    65.20
Oct 31, 2024  MAM-StreamICL (gpt-3.5-turbo + gemini-1.0-pro + claude-3-haiku)    Appier AI Research  36.68  43.10   75.87      83.50    55.20