Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Background and Platforms

ML models are routinely used to serve requests in interactive applications such as real-time video analytics [13, 49], recommendation engines [50], and speech assistants [12]. To manage such workloads, applications employ serving platforms such as ONNX Runtime [5], TensorFlow-Serving [39], PyTorch Serve [9], and Triton Inference Server [4], among others. These platforms accept model(s) from applications, often in graph exchange formats. Common SLOs are on the order of 10-100 milliseconds, e.g., for live video analytics.

hackernoon.com

2024-10-02