DEV Community
12 million tokens, linear cost: Subquadratic's bet against the attention tax
Dense attention in large language models scales quadratically with context length, which limits practical context sizes and pushes developers toward workarounds such as Retrieval-Augmented Generation (RAG). Subquadratic, a new startup, has introduced a model built on Subquadratic Selective Attention (SSA), which the company claims scales linearly in both compute and memory with respect to context length. According to the company, this enables a 12-million-token context window, available now through its API. Rather than scoring every token against every other token, SSA selects which token interactions to compute based on content. The company's own benchmarks report considerable improvements over existing models on MRCR v2, needle-in-a-haystack, and SWE-Bench.

If the linear-scaling claim holds up, it would change the economics of RAG and agent-based systems, which are currently constrained by the cost of long contexts. Caution is advised, however: the field has a history of ambitious claims, and Subquadratic's model is smaller than the largest ones. Teams heavily invested in RAG or long-context tasks should test the API on their own workloads. Subquadratic also offers a coding agent and plans to release a model with a 50M-token context window later. The technical capabilities look promising, but a healthy dose of skepticism is still warranted.
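To make the quadratic-versus-linear distinction concrete, here is a back-of-the-envelope sketch in Python. It is not Subquadratic's implementation and does not describe how SSA picks tokens. It only counts the token pairs that would be scored under dense attention and under a hypothetical scheme where each token attends to at most `K` others. The function names and the value of `K` are assumptions chosen for illustration.

```python
def dense_pairs(n: int) -> int:
    """Token pairs scored by dense attention: every token against every token."""
    return n * n


def selective_pairs(n: int, k: int) -> int:
    """Token pairs scored if each token attends to at most k others.

    k is a hypothetical per-token budget, not a published SSA parameter.
    """
    return n * min(k, n)


K = 4096  # assumed budget, for illustration only

for n in (128_000, 1_000_000, 12_000_000):
    dense = dense_pairs(n)
    selective = selective_pairs(n, K)
    # Once n exceeds K, the ratio is n / K, so the gap widens with context length.
    print(
        f"{n:>12,} tokens: dense={dense:.2e} "
        f"selective={selective:.2e} ratio={dense / selective:,.1f}x"
    )
```

Under this assumption the dense count grows with the square of the context length while the selective count grows in proportion to it, which is why the gap matters most at multi-million-token contexts. Whether real-world cost follows the same curve depends on details the company would need to demonstrate.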