
Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models

Large vision-language models (LVLMs) tend to generate hallucinations: responses that are inconsistent with the visual input. Previous works such as POPE and AMBER built benchmarks around non-existent objects and extended yes-no questions to other types of hallucinations. These benchmarks, however, exhibit limitations in reliability and validity; closed-ended benchmarks in particular show clear shortcomings in test-retest reliability.

To address these limitations, the paper proposes a quality measurement framework for hallucination benchmarks, drawing inspiration from the reliability and validity principles of psychological testing. The framework is applied to six publicly available hallucination benchmarks, including MMHal and GAVIE, and the resulting analysis is used to construct a new high-quality hallucination benchmark for LVLMs. Overall, the framework offers a principled, comprehensive approach to evaluating hallucination benchmarks.
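To make the test-retest idea concrete, here is a minimal sketch of how one might quantify it: correlate the per-model scores a benchmark produces across two independent evaluation runs. The scores and the use of Pearson correlation here are illustrative assumptions, not the paper's actual metric or data.

```python
# Minimal sketch of test-retest reliability for a benchmark.
# Hypothetical example; the paper's exact formulation may differ.
from statistics import correlation  # Python 3.10+

# Hypothetical hallucination scores for five models from two
# independent runs of the same benchmark (e.g., resampled decoding).
run_1 = [0.72, 0.65, 0.58, 0.81, 0.49]
run_2 = [0.70, 0.69, 0.55, 0.78, 0.52]

# A high correlation means the benchmark ranks models consistently
# across runs (good test-retest reliability); closed-ended benchmarks
# whose answers flip between runs score poorly on this measure.
print(f"test-retest reliability (Pearson r): {correlation(run_1, run_2):.3f}")
```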