Arena-Hard Pipeline: Live Data to Benchmarks

LMSYS has introduced Arena-Hard, a new benchmark designed to automatically evaluate Large Language Models (LLMs) on 500 real-world tasks. Using LLM-as-a-Judge, Arena-Hard reaches an 89.1% agreement rate with human preferences from the LMSYS Chatbot Arena. Key features of Arena-Hard include:

- Better separation of models than benchmarks such as MT-Bench and AlpacaEval.
- 89.1% agreement with human preferences from the Chatbot Arena.
- LLM-as-a-Judge evaluation using models such as OpenAI GPT-4 Turbo or Anthropic Claude 3 Opus.
- 500 high-quality prompts derived from real-world use cases.
- Mitigations for position, length, and self-judging biases.
- A cost of roughly $25 per model evaluation (with GPT-4 Turbo as the judge).
- Frequent refreshes of the prompt set to reduce the risk of overfitting and keep the benchmark reliable and relevant over time.

With Arena-Hard, LMSYS sets a new standard in LLM evaluation, providing a robust framework for assessing model performance across a diverse range of practical applications.
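
To make the pairwise LLM-as-a-Judge idea more concrete, here is a minimal sketch of a judging step: two candidate answers to the same prompt are compared twice, with the order swapped the second time to counter position bias. This is an illustration only, not the actual Arena-Hard pipeline code; it assumes the official `openai` Python client and an `OPENAI_API_KEY` in the environment, and the function names (`judge_once`, `judge_pair`) and prompt wording are hypothetical.

```python
# Illustrative sketch of a pairwise LLM-as-a-Judge comparison (not the real
# Arena-Hard code). Assumes the official `openai` Python client is installed
# and OPENAI_API_KEY is set; function names and prompts are hypothetical.
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM = (
    "You are an impartial judge. Compare the two assistant answers to the user "
    "prompt and reply with exactly one verdict: 'A', 'B', or 'tie'."
)

def judge_once(prompt: str, answer_a: str, answer_b: str, model: str = "gpt-4-turbo") -> str:
    """Ask the judge model for a single verdict on one ordering of the answers."""
    user_msg = (
        f"[User prompt]\n{prompt}\n\n"
        f"[Answer A]\n{answer_a}\n\n"
        f"[Answer B]\n{answer_b}"
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": user_msg},
        ],
    )
    return resp.choices[0].message.content.strip()

def judge_pair(prompt: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings to reduce position bias; disagreement counts as a tie."""
    first = judge_once(prompt, answer_a, answer_b)
    second = judge_once(prompt, answer_b, answer_a)       # swapped order
    second = {"A": "B", "B": "A"}.get(second, second)     # map verdict back to original labels
    return first if first == second else "tie"
```

In a full pipeline along these lines, a verdict like this would be collected for each of the 500 prompts against a fixed baseline model and then aggregated into a win rate or score for the candidate model.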