Apparate: Early-Exit Models for ML Latency and Throughput Optimization - Challenges

Despite numerous early-exit (EE) proposals from the ML community, multiple issues complicate their use in practice, leading to low adoption rates. Although exiting can enable certain inputs to eschew downstream model computations, exit ramps impose two new overheads on model serving. For instance, DeeBERT inflates overall memory requirements by 6.56% compared to BERT-base.
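
To make the trade-off concrete, the following is a minimal sketch of a cost model for serving with exit ramps. It is not Apparate's or DeeBERT's implementation: the names `Ramp`, `serve_cost`, and `memory_overhead_pct`, and all layer costs and parameter counts, are hypothetical and chosen only for illustration. The sketch assumes that every input reaching a ramp pays that ramp's cost, whether or not it exits there.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Ramp:
    after_layer: int  # index of the layer this ramp is attached after
    cost: float       # hypothetical ramp compute cost (arbitrary units)


def serve_cost(layer_costs: Sequence[float],
               ramps: Sequence[Ramp],
               exit_layer: Optional[int]) -> float:
    """Total compute cost for one input.

    exit_layer is the layer index whose ramp is confident enough to
    release the input, or None if the input never exits early.
    """
    ramp_at = {r.after_layer: r for r in ramps}
    total = 0.0
    for i, cost in enumerate(layer_costs):
        total += cost
        ramp = ramp_at.get(i)
        if ramp is not None:
            # Assumption: the ramp runs for every input that reaches it.
            total += ramp.cost
            if exit_layer == i:
                return total  # downstream layers are skipped
    return total


def memory_overhead_pct(base_params: int, ramp_params: int) -> float:
    """Extra memory from ramp parameters, as a percentage of the base model."""
    return 100.0 * ramp_params / base_params


# Hypothetical 12-layer model with ramps after layers 3 and 7.
layer_costs = [1.0] * 12
ramps = [Ramp(after_layer=3, cost=0.25), Ramp(after_layer=7, cost=0.25)]

baseline = serve_cost(layer_costs, [], None)       # no ramps at all
easy = serve_cost(layer_costs, ramps, 3)           # exits at the first ramp
hard = serve_cost(layer_costs, ramps, None)        # never exits
print(baseline, easy, hard)
```

Under these made-up numbers, an input that exits at the first ramp costs far less than the baseline, while an input that never exits costs more than the baseline because it pays for both ramps without benefiting from either.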