SPP Week 1: GS, Explain or Predict

This week we will be looking at (Shmueli 2010).

Overview

Shmueli spends the first part of the paper drawing out the differences between explanatory and predictive modelling. She also outlines what she sees as a third category, descriptive modelling, but does not address it further since she does not see it as used for theory building or for utility. The middle section of the paper outlines the practical differences that arise at each step of the modelling process depending on whether one is explaining or predicting. The last part of the paper describes the errors that can follow from confusing the two types of task. Throughout, she argues that statistics as a field ignores the role of prediction. In the eleven years since the publication of this paper we believe that this has changed, with the two being more muddled than ever, tilting towards an outright focus on prediction (at least from our personal experience). To this end we will try to draw out the important points for the practising statistician: the distinction between explanation and prediction at the theoretical level, at the operational level, and the consequences of not following through with this.

The Difference

The difference between explanation and prediction is summed up by Shmueli:

Why should there be a difference between explaining and predicting? The answer lies in the fact that measurable data are not accurate representations of their underlying constructs.

In this sense, explanation is concerned with the analysis or scrutiny of theoretical causal structures, which must be realized as concrete observables. In contrast, prediction works directly with the concrete observables. More concretely, explanation begins with a hypothesis that some construct \(\mathcal{X}\) causes some construct \(\mathcal{Y}\) through some process \(\mathcal{F}\). \(\mathcal{X}\), \(\mathcal{Y}\) and \(\mathcal{F}\) must then be operationalised as concrete \(x\), \(y\) and \(f\). Because \(\mathcal{F}\) underspecifies any particular concrete model, a model fitting process is required; the goal is to find an \(f\) faithful to our \(\mathcal{F}\), so that we can test or analyse the causal hypothesis.

In contrast, prediction begins with prespecified concrete variables \(x\) and \(y\), and the goal is to produce some concrete \(f\) that predicts \(y\) from \(x\).
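
Schematically (this notation is ours, not Shmueli's), prediction solves

\[
\hat{f}_{\text{pred}} = \arg\min_{f} \; \mathbb{E}\big[L\big(y, f(x)\big)\big]
\]

for some loss \(L\), with no requirement that \(\hat{f}_{\text{pred}}\) resemble \(\mathcal{F}\), whereas explanation seeks an \(f\) whose form and estimated parameters are faithful to \(\mathcal{F}\), so that tests on \(f\) carry back to the hypothesis about \(\mathcal{X}\) and \(\mathcal{Y}\).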

Shmueli then draws out four dimensions along which these two notions of the modelling process differ:

  1. causation-association (as previously discussed)
  2. theory-data (interpretability of \(f\))
  3. retrospective-prospective (availability of data)
  4. bias-variance

Dwelling on this last point, we observe that it is framed in terms of prediction error. Using the bias-variance decomposition, we can see that this error is the sum of the irreducible variance in the data, the squared bias of the model, and the model's sampling variance.
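
Written out (this is essentially the decomposition Shmueli gives, with the expectation taken over new data at a fixed \(x\)):

\[
\mathrm{EPE} = \mathbb{E}\big[(Y - \hat{f}(x))^2\big]
= \underbrace{\mathrm{Var}(Y)}_{\text{irreducible error}}
+ \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{squared bias}}
+ \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{sampling variance}}
\]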

Explanatory modelling is concerned with assessing the role of the causal theory: to this end we do not want any bias, as we wish to operationalise the theory as correctly as possible, taking into account all the variance that is truly in the world.

In contrast, with predictive modelling we are solely concerned with minimising the prediction error, and so we may trade off some misspecification of the model if we can reduce the sampling variance. This has important consequences for inferring anything causal from predictive models and for performing prediction with explanatory models.
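
A minimal sketch of this trade-off (our illustration with scikit-learn, not from the paper): with few observations and many nearly collinear predictors, ridge regression, which deliberately biases its coefficients towards zero, often beats unbiased OLS on out-of-sample error, while its shrunken coefficients would be misleading for assessing the causal theory.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulate a small sample with highly collinear predictors.
n, p = 120, 40
latent = rng.normal(size=(n, 1))
X = latent + 0.1 * rng.normal(size=(n, p))  # columns are near-copies of one factor
beta = np.concatenate([np.ones(5), np.zeros(p - 5)])
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# OLS is unbiased under the true model but has huge sampling variance here.
ols = LinearRegression().fit(X_tr, y_tr)
# Ridge shrinks coefficients: biased, but with far lower variance.
ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)

print("OLS   test MSE:", mean_squared_error(y_te, ols.predict(X_te)))
print("Ridge test MSE:", mean_squared_error(y_te, ridge.predict(X_te)))
# Ridge typically wins on prediction error, yet its coefficient estimates
# are systematically shrunk -- useless for testing the causal theory.
```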

Differences in practice

We refrain from the specifics here as they can be found in the paper itself, but differences arise in:

  1. The study design & data collection stage: hierarchical sampling (group size vs. number of groups); sample size considerations (bias vs. reducing sampling variance); cleanliness vs. dirtiness of data (nomological machines for causal explanation vs. in-situ data for prediction); the data collection instrument, which needs to reliably match the theory in explanation and to be of high quality and availability for prediction. Experimental design also differs, for instance in the linearity of your factorial decomposition for interpretability.

  2. Data preparation: unglamorous but important points about retaining missing values in prediction when values may also be missing in-situ; furthermore, many missing-data strategies used in prediction are irrelevant to explanation.

  3. Data partitioning: prediction needs hold-out and test sets because we care about out-of-sample performance. Because power is a concern in explanatory modelling, the loss of power from partitioning is often undesirable there.

  4. EDA (exploratory data analysis): uninteresting, but worth going through if we ever want to write about the process.

  5. Variable choice: don't drop important predictors in explanatory models. Multicollinearity is undesirable at all levels in explanation because it inflates the sampling error of the coefficient estimates, which doesn't matter for prediction. Variables must be available at prediction time if performing prediction.

  6. Choice of methods: algorithmic, bias-introducing methods are allowed in prediction (e.g. ridge regression); coincident indicators can't be used in predictive modelling without a lag or a secondary model.

  7. Validation: just totally different techniques; we need to read more on Hausman and explanatory validation.

  8. Evaluation: generalisation error vs. \(R^2\) or better causal metrics.

  9. Model selection: keep statistically insignificant covariates if they are important to the causal theory in explanation, while dropping theoretically important predictors can help in predictive modelling. AIC for prediction, BIC for goodness of fit in the frequentist setting (see the sketch after this list); minimum message length and minimum expected KL divergence for inference vs. predictive ability in the Bayesian model selection case.
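
On the AIC/BIC point, here is a toy sketch (ours, using statsmodels; not from the paper) of how the two criteria can disagree: AIC's lighter per-parameter penalty tends to retain a weak-but-real predictor that helps prediction, while BIC's heavier \(\ln n\) penalty tends to drop it in favour of the more parsimonious model.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 * x1 + 0.15 * x2 + rng.normal(size=n)  # x2 weak but real, x3 pure noise

candidates = {
    "x1 only":    np.column_stack([x1]),
    "x1, x2":     np.column_stack([x1, x2]),
    "x1, x2, x3": np.column_stack([x1, x2, x3]),
}
for name, X in candidates.items():
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    print(f"{name:12s} AIC={fit.aic:8.2f}  BIC={fit.bic:8.2f}")
# AIC penalises each extra parameter by 2, BIC by ln(n) (~5.3 here), so BIC
# demands stronger evidence before admitting the weak predictor x2.
```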

The paper finishes with two case studies and then discusses cases where \(R^2\) has been used to justify the claim that a method produces good predictions.
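
A toy version of that pitfall (our sketch, not one of the paper's case studies): in-sample \(R^2\) rises mechanically as junk covariates are added, even as out-of-sample error deteriorates, so a high \(R^2\) alone is no evidence of predictive power.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(2)
n_train, n_test = 50, 1000
x = rng.normal(size=(n_train + n_test, 1))
y = x[:, 0] + rng.normal(size=n_train + n_test)

for n_noise in (0, 20, 40):
    # Pad the single real predictor with pure-noise covariates.
    noise = rng.normal(size=(n_train + n_test, n_noise))
    X = np.hstack([x, noise])
    Xtr, ytr = X[:n_train], y[:n_train]
    Xte, yte = X[n_train:], y[n_train:]
    fit = LinearRegression().fit(Xtr, ytr)
    print(f"{n_noise:2d} noise cols: "
          f"in-sample R2={r2_score(ytr, fit.predict(Xtr)):.2f}  "
          f"test MSE={mean_squared_error(yte, fit.predict(Xte)):.2f}")
# In-sample R2 climbs towards 1 as junk covariates are added, while the
# out-of-sample MSE worsens: goodness of fit is not predictive power.
```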

Campbell and Thompson (2005) discuss how to look for predictively accurate explanatory variables, which is probably worth looking into.

Bibliography

Shmueli, Galit. 2010. "To Explain or to Predict?" Statistical Science 25 (3): 289–310. The Institute of Mathematical Statistics. https://doi.org/10.1214/10-STS330.