One of the biggest mistakes in AI evals is treating them as objective truth.
Benchmarks and leaderboards are a useful signal, but they don't generalize to every use case.
Think SATs or job interviews: directionally correct, but not a guarantee of on-the-job performance.
Real-world results depend on your context, environment, operations, and how the model fits into your workflow.
And just like with people, the more you actually work with a particular LLM, the more the results compound.