
MathyAIwithMike
Many AI leaderboards rely on LLMs to judge other LLMs, but new research shows this practice is statistically flawed and yields biased performance estimates. The 'naive accuracy' it reports is skewed by the judge's sensitivity and specificity, making strong models look weaker and weak models look stronger. To fix this, the paper adapts the Rogan-Gladen estimator from epidemiology, which corrects the observed accuracy for the judge's error rates. It also constructs confidence intervals that account for variance from both the test set and the human-labeled calibration set used to estimate those error rates. Adaptive allocation of human annotations further optimizes the labeling budget, focusing resources where the judge is 'noisier' to maximize precision. Simulations validate the framework, demonstrating unbiased estimates and reliable confidence intervals.
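The core correction is simple to state. A minimal sketch, assuming the judge's sensitivity and specificity have been estimated from a human-annotated calibration set (the function name and all numbers below are illustrative, not from the paper):

```python
def rogan_gladen(observed_rate, sensitivity, specificity):
    """Rogan-Gladen correction of an accuracy reported by an imperfect judge.

    observed_rate: fraction of answers the LLM judge labeled correct
    sensitivity:   P(judge says correct | answer truly correct)
    specificity:   P(judge says incorrect | answer truly incorrect)
    """
    denom = sensitivity + specificity - 1.0  # Youden's J; must be positive
    if denom <= 0:
        raise ValueError("judge is no better than random guessing")
    corrected = (observed_rate + specificity - 1.0) / denom
    # Sampling noise can push the raw estimate outside [0, 1]; clamp it.
    return min(1.0, max(0.0, corrected))

# Hypothetical example: the judge reports 70% accuracy, but it only has
# sensitivity 0.90 and specificity 0.80. The corrected accuracy is
# (0.70 + 0.80 - 1) / (0.90 + 0.80 - 1) = 0.5 / 0.7, i.e. about 0.714,
# so the naive number understated the model's true performance.
print(rogan_gladen(0.70, 0.90, 0.80))
```

Note how the direction of the bias depends on where the observed rate sits relative to the judge's error rates, which is why naive accuracy can make good models look worse or bad models look better.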