RSS Google AI Blog
Follow
Building better AI benchmarks: How many raters are enough?
Reproducibility in machine learning is crucial for building trust and enabling cumulative progress. However, human ground truth data introduces challenges due to inherent disagreement. Current AI benchmarking often overlooks this human variation, partly due to the high cost of collecting data from multiple raters. A study investigated the trade-off between rating many items with few raters versus rating fewer items with many raters. Historically, AI evaluation has favored the "forest" approach, using only a few raters per item, which is often insufficient for capturing nuanced human opinion. To address this, a simulator was developed to stress-test various scales of items and numbers of raters within a fixed budget. This simulation used diverse, real-world datasets involving subjective tasks like toxicity detection. The key findings challenge the standard practice of using only 3-5 raters per item, suggesting that more than 10 are often needed for reliable results. The optimal strategy depends on the metric: breadth (more items) is better for majority votes, while depth (more raters) is necessary for capturing opinion variation. Efficient reproducibility is achievable with a modest budget by correctly optimizing the ratings-per-item ratio for the chosen metric. This research moves away from a "single truth" paradigm, acknowledging that understanding human disagreement is as vital as agreement for building reliable AI.