Model cascading (Machine Learning; MLOps)
noun phrase
Definition: A multistage inference setup in which inputs are processed by a sequence of models, typically beginning with a smaller or cheaper model and escalating only selected cases to a larger or more accurate model, in order to balance cost, latency, and quality. Recent ML-systems literature characterizes such systems as model cascades built around routing or deferral rules that decide which requests are handled locally and which are passed to a stronger model [Rabanser et al. 2025].
Example in context: “In contrast, model cascades leverage a deferral rule for selecting the most suitable model to process a given request …” [Rabanser et al. 2025].
Synonyms: model cascade; cascaded models
Related terms: model router; cascade classifier; staged inference; deferral rule; speculative decoding
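Illustration: the deferral rule described in the definition can be sketched as a confidence threshold on a cheap first-stage model. This is a minimal, hypothetical example; the model functions and the 0.8 threshold are illustrative placeholders, not drawn from the cited work.

```python
def small_model(x):
    # Cheap first-stage model: returns (prediction, confidence).
    # Toy keyword heuristic standing in for a real classifier.
    pred = "positive" if "good" in x else "negative"
    conf = 0.9 if ("good" in x or "bad" in x) else 0.4
    return pred, conf

def large_model(x):
    # Expensive second-stage model, consulted only on deferred inputs.
    return "positive" if "good" in x else "negative"

def cascade(x, threshold=0.8):
    # Deferral rule: accept the small model's answer when its
    # confidence clears the threshold; otherwise escalate.
    pred, conf = small_model(x)
    if conf >= threshold:
        return pred, "small"          # handled locally
    return large_model(x), "large"    # deferred to the stronger model

print(cascade("a good movie"))   # confident -> answered by small model
print(cascade("an okay movie"))  # uncertain -> deferred to large model
```

In practice the confidence signal might be a softmax margin, an entropy score, or a learned router, but the control flow — answer locally when confident, escalate otherwise — is the same.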