The LLM Evaluation guidebook ⚖️

If you've ever wondered how to make sure an LLM performs well on your specific task, this guide is for you! It covers the different ways you can evaluate a model, guides on designing your own evaluations, and tips and tricks from practical experience.

Whether you work with production models, do research, or tinker as a hobbyist, I hope you'll find what you need; if not, open an issue (to suggest improvements or missing resources) and I'll complete the guide!

How to read this guide

Beginner user: If you don't know anything about evaluation, start with the Basics sections in each chapter before diving deeper. You'll also find explanations of important LLM topics in General knowledge: for example, how model inference works and what tokenization is.
Advanced user: The more practical sections are the Tips and Tricks ones and the Troubleshooting chapter. You'll also find interesting things in the Designing sections.

In the text, links prefixed by ⭐ are ones I really enjoyed and recommend reading.

Planned next articles

contents/Automated benchmarks/Metrics: Description of automatic metrics
contents/Introduction: Why do we need to do evaluation?
contents/Thinking about evaluation: What are the high-level things you always need to consider when building your task?
contents/Troubleshooting/Troubleshooting ranking: Why comparing models is hard

Resources

Links I like

Thanks

This guide has been heavily inspired by the ML Engineering Guidebook by Stas Bekman! Thanks for this cool resource!

Many thanks also to all the people who inspired this guide through discussions, either at events or online, notably but not limited to:

🤝 Luca Soldaini, Kyle Lo and Ian Magnusson (Allen AI), Max Bartolo (Cohere), Kai Wu (Meta), Swyx and Alessio Fanelli (Latent Space Podcast), Hailey Schoelkopf (EleutherAI), Martin Signoux (OpenAI), Moritz Hardt (Max Planck Institute), Ludwig Schmidt (Anthropic)
🔥 community users of the Open LLM Leaderboard and lighteval, who often raised very interesting points in discussions
🤗 people at Hugging Face, like Lewis Tunstall, Omar Sanseviero, Arthur Zucker, Hynek Kydlíček, Guilherme Penedo and Thom Wolf,
and of course my team ❤️ doing evaluation and leaderboards, Nathan Habib and Alina Lozovskaya.

Citation

@misc{fourrier2024evaluation,
  author = {Fourrier, Clémentine},
  title = {LLM Evaluation Guidebook},
  year = {2024},
  journal = {GitHub repository},
  url = {https://github.com/huggingface/evaluation-guidebook}
}