One More Reason to Prefer Simple Models

When building models for classification and regression, a question often arises: do you go with a simple model that’s easy to understand but not the most accurate, or with an impressively complex model that is much more accurate?

The needs of the situation often force one choice over the other. If explainability is important, the simple model may win. If a black box is fine and accuracy is all that matters, the complex model may be chosen. If the accuracies are roughly the same, Occam’s Razor points to the simpler model.

I recently came across a different reason for preferring the simpler model.

In “Classifier Technology and the Illusion of Progress”, David Hand argues that the accuracy advantage of the complex model may not persist for long (note that he refers to the data used to train and validate the model as the “design distribution”):

The performance difference between two classifiers may be irrelevant in the context of the differences arising between the design and future distributions … more sophisticated classifiers, which almost by definition model small idiosyncrasies of the distribution underlying the design set, will be more susceptible to wasting effort in this way: the grosser features of the distributions (modeled by simpler methods) are more likely to persist than the smaller features (modeled by the more elaborate methods).

The apparent superiority of the more sophisticated tree classifier over the very simple linear discriminant classifier is seen to fade when we take into account the fact that the classifiers must necessarily be applied in the future to distributions which are likely to have changed from those which produced the design set … the simple linear classifier captures most of the separation between the classes, the additional distributional subtleties captured by the tree method become less and less relevant when the distributions drift. Only the major aspects are still likely to hold.

Data scientists are often cautioned that future data may be different from the data used for training the model. This advice isn’t new.

What I found interesting was the notion that, even when the data changes in the future, its major features are likely to hold up for longer or change more slowly than its minor features. And this, in turn, favors simpler models since they tend to use the major features.
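Hand’s point can be sketched with a toy simulation. Everything here is my own construction for illustration, not his experiment: I assume a “design” distribution where one major feature carries most of the class separation and a minor feature carries a weak signal, and a “future” distribution where the minor feature’s signal flips while the major one persists. A logistic regression stands in for the simple model and an unpruned decision tree for the complex one.

```python
# Toy sketch (my construction, not Hand's experiment): the simple linear
# model leans mostly on the major feature, while the deep tree also exploits
# the minor one, so drift in the minor feature tends to hurt the tree more.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def sample(n, minor_signal):
    """Two Gaussian classes: a strong signal on feature 0 (the 'major'
    feature) and a weak signal on feature 1 (the 'minor' feature)."""
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, 5))
    X[:, 0] += 2.0 * y            # major feature: persists under drift
    X[:, 1] += minor_signal * y   # minor feature: changes in the future
    return X, y

X_design, y_design = sample(4000, minor_signal=0.5)   # design distribution
X_today, y_today   = sample(4000, minor_signal=0.5)   # same distribution
X_future, y_future = sample(4000, minor_signal=-0.5)  # minor signal flipped

simple = LogisticRegression().fit(X_design, y_design)
complex_tree = DecisionTreeClassifier(random_state=0).fit(X_design, y_design)

for name, model in [("linear", simple), ("tree", complex_tree)]:
    drop = model.score(X_today, y_today) - model.score(X_future, y_future)
    print(f"{name}: accuracy drop under drift = {drop:.3f}")
```

In runs of this sketch the tree typically loses more accuracy under drift than the linear model does, echoing Hand’s argument; the exact figures depend entirely on the synthetic assumptions above.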