ML models are routinely used to service requests in interactive applications such as real-time video analytics 13, 49, recommendation engines 50, and speech assistants 12. To manage such workloads, applications employ serving platforms such as ONNX Runtime 5, TensorFlow-Serving 39, PyTorch Serve 9, and Triton Inference Server 4, among others. These platforms ingest model(s) from applications, often in graph exchange formats. Common SLOs are in the range of 10-100 milliseconds, e.g., for live video analytics.