DEV Community
I built a Rust entropy monitor to route LLM inference — here's what the benchmark showed
Frontier LLM inference is expensive, which is what prompted Buddy System, a tiered inference architecture that tries to get as much as possible out of a local model before falling back to a paid cloud call.

A Rust EntropyMonitor tracks per-token uncertainty while a 4B model generates locally on Apple Silicon via MLX. When the local model shows high entropy at a clause boundary, a sign of genuine uncertainty, spaCy NER picks out the relevant named entities or noun chunks, and a sentence-transformers retriever finds the passage chunks that bear on them. The cloud model, Sonnet, then receives a targeted query made up of the uncertain fact and the grounding document. Cloud calls are asynchronous, so local generation is never blocked. Classical tools handle deterministic tasks such as math and units at zero cost.

In the benchmark, Buddy System reached 71.4% accuracy at minimal cost, against 70.7% accuracy at $0.00 for local-only, a difference of 0.7 percentage points. The advisor pattern, surprisingly, underperformed on specific datasets such as SQuAD v2 and HotpotQA. The explanation offered is that the advisor receives the answer without the source document, so it falls back on parametric memory instead of grounding. Buddy System passes the document context to the review tier, and that is where its advantage comes from: the reviewing model needs the context to judge the answer accurately.
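To make the entropy trigger concrete, here is a minimal sketch of the idea in Python. It is not the Rust EntropyMonitor from the post. The class layout, the 1.5-bit threshold, the punctuation set used as "clause boundaries", and the choice to average entropy over the clause are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence

# Assumption: a clause boundary is approximated by a punctuation token.
CLAUSE_BOUNDARIES = {".", ",", ";", ":", "?", "!"}


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of one next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


@dataclass
class EntropyMonitor:
    """Hypothetical monitor: flags a clause whose mean token entropy is high."""

    threshold_bits: float = 1.5  # assumed value, not taken from the post
    _window: List[float] = field(default_factory=list)

    def observe(self, token: str, probs: Sequence[float]) -> bool:
        """Record one generated token; return True if the clause should escalate."""
        self._window.append(token_entropy(probs))
        if token.strip() not in CLAUSE_BOUNDARIES:
            return False  # only decide at clause boundaries
        mean_entropy = sum(self._window) / len(self._window)
        self._window.clear()  # start a fresh clause
        return mean_entropy > self.threshold_bits
```

In a pipeline like the one described, a `True` return would be the point at which entity extraction and retrieval kick in and an asynchronous cloud query is queued, while local generation carries on.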