Speech-to-Retrieval (S2R): A new approach to voice search

Voice-based web search, while common, faces accuracy issues due to the cascade modeling approach. This method converts speech to text first, and any errors in transcription can lead to irrelevant search results. For instance, misinterpreting "scream" as "screen" in a query about a painting can yield completely wrong information. To address this, Speech-to-Retrieval (S2R) technology bypasses the text transcription step altogether. S2R directly interprets spoken queries and retrieves information by mapping speech to retrieval intent. This architectural shift aims to answer "What information is being sought?" rather than just "What words were said?". Experiments show a significant performance gap between current cascade systems and theoretically perfect transcription. The S2R model, using a dual-encoder architecture, learns to represent audio queries and documents in a shared space. This allows it to directly infer the user's intent from the audio. Evaluation on the SVQ dataset demonstrates that S2R significantly outperforms traditional cascade ASR models. Its performance closely approaches the theoretical maximum achievable with perfect speech recognition. Google has now implemented S2R-powered voice search in multiple languages. They are also open-sourcing the SVQ dataset to encourage further research in this area.

https://research.google/blog/speech-to-retrieval-s2r-a-new-approach-to-voice-search/ research.google

RSS Hunter • Oct 6, 2025