One of the biggest mistakes in AI evals is treating them as objective truth.

Benchmarks and leaderboards are a useful signal, but they don't transfer universally to every use case.

Think SATs or job interviews: directionally correct, but not a guarantee of on-the-job performance.

Real-world results depend on your context: the task, the environment, how the model is operated, and how well you work with it.

And just like with people, the more you actually work with a particular LLM, the more the results compound.