ConvApparel: Measuring and bridging the realism gap in user simulators

Modern conversational AI can handle complex tasks but struggles with long interactions, often forgetting earlier details or drifting into irrelevant responses. Testing improvements with live human users is expensive and hard to scale. LLM-powered user simulators offer a scalable alternative, but they often lack realism and show unrealistic levels of patience or knowledge. To address this realism gap, a new dataset called ConvApparel has been developed. It consists of human-AI conversations in the apparel shopping domain, collected with a dual-agent protocol in which participants interacted with either a helpful AI agent or an intentionally unhelpful one. ConvApparel includes detailed turn-by-turn annotations of user states such as satisfaction and frustration. To evaluate simulator fidelity, a three-pillar validation framework was created: population-level statistical alignment, a human-likeness score, and counterfactual validation, which assesses how simulators adapt to unexpected, out-of-distribution assistant behavior. Experiments showed that data-driven simulators, built with in-context learning (ICL) and supervised fine-tuning (SFT), improved on prompted ones, although a realism gap persists. The data-driven simulators also proved robust, realistically shifting their behavior when interacting with the frustrating "bad agent." Together, the ConvApparel dataset and framework provide tools to measure and bridge the realism gap in user simulators, which is crucial for developing reliable conversational AI.
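As a rough illustration of two of the three pillars, the sketch below compares simulated users against human users on a toy frustration annotation. The summary does not say which statistics the framework uses, so everything here is an assumption made for illustration: the 1-to-5 frustration scale, the choice of Jensen-Shannon divergence for population-level alignment, the mean-shift measure for counterfactual validation, the function names, and all of the ratings, which are made up.

```python
import math
from collections import Counter

SCALE = [1, 2, 3, 4, 5]  # hypothetical frustration scale, not from the source


def distribution(values, support):
    """Empirical probability of each value in `support`."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in support]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing; y > 0 wherever x > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


def counterfactual_shift(good_agent, bad_agent):
    """Change in mean frustration when moving from the good to the bad agent."""
    return sum(bad_agent) / len(bad_agent) - sum(good_agent) / len(good_agent)


# Made-up per-conversation frustration ratings (illustrative only).
human_good, human_bad = [1, 1, 2, 1, 2, 1], [3, 4, 5, 4, 3, 5]
sim_good, sim_bad = [1, 2, 1, 1, 2, 2], [2, 2, 3, 2, 3, 2]

# Population-level alignment: how far is the simulator's distribution
# from the human one under the same (bad) agent?
alignment_gap = js_divergence(
    distribution(human_bad, SCALE), distribution(sim_bad, SCALE)
)

# Counterfactual validation: does the simulator shift as much as humans do
# when the assistant becomes unhelpful?
human_shift = counterfactual_shift(human_good, human_bad)
sim_shift = counterfactual_shift(sim_good, sim_bad)

print(f"alignment gap (JS, bits): {alignment_gap:.3f}")
print(f"human shift: {human_shift:.2f}, simulator shift: {sim_shift:.2f}")
```

In this toy setup, a simulator whose shift is much smaller than the human shift would be showing the kind of unrealistic patience the summary describes. A real evaluation would use the dataset's actual annotations and whatever metrics the framework defines.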

https://research.google/blog/convapparel-measuring-and-bridging-the-realism-gap-in-user-simulators/

RSS Hunter • Apr 8