Beyond Two Towers: Re-architecting the Serving Stack for Next-Gen Ads Lightweight Ranking Models…
The authors set out to move their ad serving system beyond the Two-Tower model so that it could use more complex neural networks, which require a GPU-based inference stage. The main challenge was adding this stage to an already highly optimized serving funnel without increasing latency. To remove the feature-fetching bottleneck, they bundled high-value candidate features directly within the model and served the remaining features from a high-performance key-value store. Business logic such as filtering and sorting was moved into the model for efficiency, which minimized data transfer. GPU optimizations, including multi-stream CUDA and kernel fusion, reduced latency significantly. The authors also re-architected the retrieval data flow so that essential metadata is returned first and the rest is fetched later, and they introduced parallel paths for feature expansion to cut latency further. Finally, the switch from local to global ranking caused an unexpected shift in metrics, which required careful analysis and tuning to maintain performance. Overall, the transition is a substantial re-architecture effort aimed at increasing recommendation quality.
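
To make the idea of moving filtering and sorting into the model more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `filter_and_top_k`, the `eligible` mask, and the sample values are all hypothetical, and a production version would presumably run as tensor operations on the GPU rather than as Python lists. The sketch only illustrates the contract: the stage takes scores for every candidate and hands back a short list of winning indices, so the full score vector never has to leave the inference stage.

```python
import heapq
from typing import List, Sequence


def filter_and_top_k(scores: Sequence[float],
                     eligible: Sequence[bool],
                     k: int) -> List[int]:
    """Return indices of the k highest-scoring eligible candidates, best first.

    Hypothetical helper for illustration; not an API from the original system.
    """
    if len(scores) != len(eligible):
        raise ValueError("scores and eligible must have the same length")
    # Filtering: keep only candidates that pass the (assumed) business rule.
    survivors = [i for i, ok in enumerate(eligible) if ok]
    # Sorting: pick the k best survivors by score; ties go to the lower index.
    return heapq.nlargest(k, survivors, key=lambda i: (scores[i], -i))


# Made-up scores for five candidates; candidate 2 is filtered out.
scores = [0.91, 0.15, 0.78, 0.66, 0.83]
eligible = [True, True, False, True, True]
top = filter_and_top_k(scores, eligible, k=2)
print(top)  # [0, 4]
```

Returning only `k` indices instead of one score per candidate is the point of the design: the payload crossing the boundary between the inference stage and the rest of the serving funnel shrinks from the size of the candidate set to the size of the final slate.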