VentureBeat
Together AI's ATLAS adaptive speculator delivers 400% inference speedup by learning from workloads in real-time
Enterprises deploying AI are running into performance limits because static speculators cannot adapt to evolving workloads. Speculators work alongside large language models to draft multiple tokens in advance, which improves inference speed and reduces cost. Together AI has introduced ATLAS, a system that adds adaptive learning to inference optimization and promises up to 400% faster inference. Static speculators are trained on fixed datasets, so they lose accuracy as AI usage patterns change, and inference speed degrades with them. ATLAS uses a dual-speculator architecture: a stable static model paired with a lightweight adaptive model that learns from live traffic. A confidence-aware controller selects which speculator to use and adjusts the speculation lookahead dynamically. This adaptive approach offers performance comparable to specialized hardware such as custom chips, with high token generation rates. The gains come from better use of compute capacity: ATLAS trades otherwise idle processing for reduced memory access. It behaves like an intelligent caching layer that learns patterns rather than storing exact responses. Use cases include reinforcement learning training and enterprise AI applications whose workloads shift over time. ATLAS is available now on Together AI's platform at no additional cost, a sign of a broader industry shift toward continuously learning inference systems.
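To make the controller idea concrete, here is a minimal Python sketch of how a confidence-aware selector between a static and an adaptive speculator might look. It is not Together AI's implementation: the class names, the use of a moving-average acceptance rate as the "confidence" signal, the switch margin, and the lookahead bounds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class SpeculatorStats:
    """Hypothetical running acceptance rate for one speculator."""
    acceptance: float = 0.5  # assumed neutral starting confidence
    decay: float = 0.9       # assumed smoothing factor for the moving average

    def update(self, drafted: int, accepted: int) -> None:
        # Fold the latest verification result into an exponential moving average.
        if drafted <= 0:
            return
        rate = accepted / drafted
        self.acceptance = self.decay * self.acceptance + (1 - self.decay) * rate


class ConfidenceController:
    """Illustrative controller: picks a speculator and a lookahead length."""

    def __init__(self, min_lookahead: int = 1, max_lookahead: int = 8,
                 switch_margin: float = 0.05) -> None:
        self.min_lookahead = min_lookahead
        self.max_lookahead = max_lookahead
        self.switch_margin = switch_margin
        self.stats: Dict[str, SpeculatorStats] = {
            "static": SpeculatorStats(),
            "adaptive": SpeculatorStats(),
        }

    def record(self, name: str, drafted: int, accepted: int) -> None:
        # Called after the target model verifies a batch of drafted tokens.
        self.stats[name].update(drafted, accepted)

    def choose(self) -> Tuple[str, int]:
        static = self.stats["static"].acceptance
        adaptive = self.stats["adaptive"].acceptance
        # Prefer the stable static speculator unless the adaptive one is
        # clearly ahead; the margin avoids flapping between the two.
        name = "adaptive" if adaptive > static + self.switch_margin else "static"
        # Scale lookahead with confidence: draft more tokens when drafts
        # are usually accepted, fewer when they are often rejected.
        span = self.max_lookahead - self.min_lookahead
        lookahead = self.min_lookahead + int(self.stats[name].acceptance * span)
        return name, lookahead


if __name__ == "__main__":
    controller = ConfidenceController()
    print(controller.choose())
    # Simulate live traffic on which the adaptive speculator drafts well.
    for _ in range(20):
        controller.record("adaptive", drafted=8, accepted=8)
    print(controller.choose())
```

In this sketch the static speculator is the default, and the adaptive one takes over only once its observed acceptance rate is clearly higher, which mirrors the article's description of a stable model backed by a lightweight learner. A real system would presumably use richer signals than a single moving average.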