RSS Google AI Blog
Follow
A scalable framework for evaluating health language models
Large language models (LLMs) can analyze complex health data to generate personalized responses. Evaluating these LLM responses is crucial for accuracy and safety, but current human-expert evaluation is costly and not scalable. This paper introduces a new framework for evaluating health LLMs using Adaptive Precise Boolean rubrics. These rubrics break down complex questions into granular, Yes/No criteria to improve consistency and efficiency. The framework was tested in metabolic health and demonstrated significantly higher inter-rater reliability than traditional Likert scales. Adaptive Precise Boolean rubrics also reduced evaluation time by over 50%. This method proved more sensitive to variations in response quality compared to Likert scales. Automating the rubric filtering process with a zero-shot classifier maintained similar evaluation improvements. The framework reliably detected quality drops in LLM responses when real participant data was altered. The proposed approach offers a scalable and streamlined method for LLM evaluation in specialized domains.