The article describes building a local RAG benchmark to avoid relying on expensive APIs and external servers during RAG system development. It introduces a setup that uses Ollama and Ray for local inference while mirroring the OpenAI API, so existing benchmarks work without modification. The architecture centers on a RAGOpenAICompatibleModel class, which lets different local models be swapped in simply by changing a model key.

Ray is employed to process the HTML data in parallel, improving throughput and reducing resource usage. HTML cleaning, including tag removal and text splitting, further trims the context window and improves answer accuracy.

The article presents results from testing several models with this framework, showing how each performs in a RAG scenario. The analysis highlights the Qwen model's failure, attributing it to the model's chain-of-thought style of answering, which led to hallucinations. The author provides a GitHub repository with Docker configurations and scripts for easy deployment and experimentation with the local benchmark, and concludes that this approach enables cost-effective and secure RAG development and testing. Finally, the author plans to compare the local CRAG benchmark against metrics such as RAGAS in future research.
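The article names the RAGOpenAICompatibleModel class but the summary does not show its internals, so the following is only a minimal sketch of what such a wrapper could look like: it mirrors the OpenAI chat-completions request shape while pointing at a local Ollama server, and swapping local models is just a key change. The endpoint URL, method names, and payload layout here are assumptions, not the article's actual code.

```python
import json
import urllib.request


class RAGOpenAICompatibleModel:
    """Sketch of a wrapper that mirrors the OpenAI chat-completions API
    but targets a local Ollama server (class name from the article;
    internals are assumptions for illustration)."""

    def __init__(self, model_key: str, base_url: str = "http://localhost:11434/v1"):
        self.model_key = model_key  # e.g. "llama3.1" or "qwen2.5"
        self.base_url = base_url

    def build_request(self, question: str, context: str) -> dict:
        # Same JSON shape the OpenAI API expects, so existing benchmark
        # code only needs the base URL and key changed.
        return {
            "model": self.model_key,
            "messages": [
                {"role": "system", "content": f"Answer using only this context:\n{context}"},
                {"role": "user", "content": question},
            ],
        }

    def generate(self, question: str, context: str) -> str:
        # Requires a running Ollama instance on base_url.
        payload = json.dumps(self.build_request(question, context)).encode()
        req = urllib.request.Request(
            f"{self.base_url}/chat/completions",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]


# Switching the benchmarked model is a one-line change:
llama = RAGOpenAICompatibleModel("llama3.1")
qwen = RAGOpenAICompatibleModel("qwen2.5")
```

Because the request format is OpenAI-compatible, the same benchmark harness can also be pointed back at a hosted API by changing only `base_url`.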
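The parallel HTML processing with Ray could look roughly like the sketch below: each page is cleaned in its own remote task and Ray fans the work out across local worker processes. The cleaning logic here is a deliberately trivial placeholder; the article's actual cleaning pipeline is not shown in the summary.

```python
import ray

# Start (or reuse) a local Ray runtime.
ray.init(ignore_reinit_error=True)


@ray.remote
def clean_page(html: str) -> str:
    # Placeholder cleaner; per the article, each worker would strip
    # tags and split text before it reaches the model's context window.
    return html.replace("<p>", "").replace("</p>", "")


pages = ["<p>doc one</p>", "<p>doc two</p>", "<p>doc three</p>"]

# Submit one task per page and gather the results in order.
cleaned = ray.get([clean_page.remote(p) for p in pages])
```

For large corpora this pattern keeps a single machine's cores busy without any external service, which is the cost and privacy point the article makes.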
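The tag-removal and text-splitting step mentioned above can be sketched with the standard library alone. This is an assumption about the approach, not the article's code: a small `html.parser` subclass drops markup (and skips `script`/`style` bodies entirely), and a naive fixed-size splitter stands in for whatever chunking strategy the article actually uses.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only visible text, skipping script/style contents,
    so markup never wastes context-window tokens."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def clean_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


def split_text(text: str, chunk_size: int = 500) -> list:
    # Naive fixed-size chunking for illustration; real pipelines
    # usually split on sentence or paragraph boundaries.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Cleaning before splitting matters: removing boilerplate markup first means each chunk carries more retrievable content per token, which is the accuracy gain the article attributes to this step.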
dev.to
