Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Early-Exit Models

Early-exit models [53, 57] present an alternative way to address this tension by rethinking the granularity of inference. The key premise is that certain ‘easy’ inputs may not require the full predictive power of a model to generate an accurate result. In such cases, the model execution that is skipped can yield proportional reductions in both per-request latencies and compute footprints.
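
To make the premise concrete, the following is a minimal sketch of confidence-based early exiting, not the mechanism of any particular system. It assumes a model expressed as a sequence of layers, each followed by a hypothetical lightweight exit head, and a hypothetical softmax-confidence threshold; the names `early_exit_predict`, `layers`, `heads`, and the toy layer and head functions are all illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    x: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    exit_heads: Sequence[Callable[[Vector], Vector]],
    threshold: float,
) -> Tuple[int, int]:
    """Return (predicted_label, layers_executed).

    After each layer, an exit head produces logits. If the top softmax
    probability reaches `threshold`, the remaining layers are skipped.
    The last layer always produces a result.
    """
    if not layers or len(layers) != len(exit_heads):
        raise ValueError("need one exit head per layer, and at least one layer")
    h = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        h = layer(h)
        probs = softmax(head(h))
        confidence = max(probs)
        if confidence >= threshold or depth == len(layers):
            return probs.index(confidence), depth
    raise AssertionError("unreachable")


# Toy stand-ins (assumptions): each layer doubles the features, so the
# class margin grows with depth; each exit head reads features as logits.
def double(h: Vector) -> Vector:
    return [2.0 * v for v in h]


def identity(h: Vector) -> Vector:
    return list(h)


layers = [double, double, double]
heads = [identity, identity, identity]

easy = early_exit_predict([3.0, 0.0], layers, heads, threshold=0.9)
hard = early_exit_predict([0.2, 0.0], layers, heads, threshold=0.9)
print(easy, hard)
```

In this toy setup the input with a large class margin clears the threshold after the first layer, while the ambiguous input runs all three layers. The layers skipped for the first input are the source of the latency and compute savings described above.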