Prometheus 2: Advancing LLM Evaluation
Meet Prometheus 2, an open LLM from KAIST AI built to evaluate other LLMs, with judgments reported to be comparable to those of proprietary models such as OpenAI GPT-4 and Anthropic Claude 3. How does it get there? It starts with a new pairwise ranking dataset whose evaluation criteria go beyond the basics to cover more nuanced qualities. Then, with Mistral and Mixtral 8x7B as base models, two separate LLMs are trained: one on direct assessment data and the other on pairwise ranking data. The weights of the two models are combined by linear merging, with the mixing weight α chosen experimentally. The result is a single evaluator that performs well on both direct assessment and pairwise ranking benchmarks. Prometheus 2 can grade and score LLM responses, comes with a released dataset, and correlates strongly with GPT-4 Turbo's judgments. It is useful both for evaluating a model early in training and for ongoing assessment. Ready to explore? The model and dataset are available on Hugging Face.
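
For readers curious about what linear merging means in practice, here is a minimal sketch of the idea: each parameter of the merged model is a weighted average, α times the direct assessment weights plus (1 - α) times the pairwise ranking weights. The function name `linear_merge`, the toy parameter names, and the value α = 0.5 are illustrative assumptions, not the authors' code or settings, and plain Python lists stand in for real model tensors.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def linear_merge(direct: Weights, pairwise: Weights, alpha: float) -> Weights:
    """Return alpha * direct + (1 - alpha) * pairwise, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if direct.keys() != pairwise.keys():
        raise ValueError("both models must share the same parameter names")

    merged: Weights = {}
    for name, d_vals in direct.items():
        p_vals = pairwise[name]
        if len(d_vals) != len(p_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted average of the two models' values.
        merged[name] = [alpha * d + (1.0 - alpha) * p for d, p in zip(d_vals, p_vals)]
    return merged


# Made-up numbers for illustration only, not real model parameters.
direct_model = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
pairwise_model = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

# alpha = 0.5 is an arbitrary example value (an assumption, not the published setting).
merged_model = linear_merge(direct_model, pairwise_model, alpha=0.5)
print(merged_model)
```

Setting α to 1 recovers the direct assessment model and setting it to 0 recovers the pairwise ranking model, so trying several values of α amounts to searching along the line between the two sets of weights.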