It is the 50th anniversary of some of the most important findings used in modern statistics and data-science

Featured

It is now 50 years since the invention or discovery of three of the most important techniques used in modern statistics and data-science, namely, Cox regression (1972), the Dirichlet process (1973), and the AIC (Akaike information criterion, 1974).

Published

December 21, 2023

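David Cox’s regression model for right-censored data has had an enormous impact on survival analysis. Survival analysis typically considers data-sets in which time-to-death is recorded as the main outcome. Such data-sets typically also contain subjects who survived past the follow-up period, so that their time-to-death is known only to exceed the time at which they were last observed (i.e., right-censoring). Simply discarding these subjects, or treating their last observed time as a time of death, would bias the results. The Cox regression model solves this problem by comparing each subject who dies with all those still at risk at that time, so that censored subjects contribute information for as long as they were observed. It also facilitates computation and interpretation of results, and hence has become very widely used in medical statistics and beyond, e.g., to test the predictive ability of medical biomarkers.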
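To make the treatment of censoring concrete, here is a minimal sketch of the Cox log partial likelihood for a single covariate. It assumes no tied event times, and the toy data and the names `times`, `events` and `biomarker` are hypothetical illustrations rather than anything from the 1972 paper.

```python
import math


def cox_partial_log_likelihood(beta, times, events, x):
    """Log partial likelihood for one covariate, assuming no tied event times."""
    total = 0.0
    for i, (t_i, died) in enumerate(zip(times, events)):
        if not died:
            # Censored subjects add no term of their own; they only
            # appear in the risk sets of earlier deaths.
            continue
        # Risk set: everyone still under observation at time t_i.
        risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
        log_denominator = math.log(sum(math.exp(beta * x[j]) for j in risk_set))
        total += beta * x[i] - log_denominator
    return total


# Hypothetical toy data: follow-up time, death indicator (0 = censored), biomarker.
times = [2.0, 3.0, 5.0, 7.0, 8.0]
events = [1, 0, 1, 1, 0]
biomarker = [1.2, 0.3, -0.2, 0.8, -1.0]

ll_no_effect = cox_partial_log_likelihood(0.0, times, events, biomarker)
ll_positive_effect = cox_partial_log_likelihood(0.5, times, events, biomarker)
print(ll_no_effect, ll_positive_effect)
```

In practice the coefficient would be found by maximising this function numerically, and an established survival-analysis package would normally be used instead of hand-written code.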

The Dirichlet process (1973) forms the basis for, amongst other things, Bayesian non-parametrics (BNP). Statisticians and data-scientists are used to thinking about distributions of data values. But BNP stretches this definition of a distribution, to a distribution of distributions. Since a continuous distribution cannot in general be described by a finite set of numbers, we think of such a model as having an infinite number of parameters, or alternatively as being non-parametric. The Dirichlet process models a distribution of distributions that is centered on a ‘base’ distribution (like a Gaussian), analogously to how a Gaussian models a distribution of data values centered on the mean. One particular success of this method is in data clustering, where the number of clusters is inferred from the data rather than fixed in advance.
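One way to see what a ‘distribution of distributions’ looks like is to draw a single random distribution from a Dirichlet process. The sketch below uses the stick-breaking construction, a later representation usually credited to Sethuraman rather than the formulation in the 1973 paper, truncated to a finite number of atoms. The concentration parameter `alpha`, the standard Gaussian base distribution and the truncation level are all illustrative assumptions.

```python
import random


def sample_dp_stick_breaking(alpha, base_sampler, n_atoms, rng):
    """Draw a truncated random distribution from DP(alpha, base).

    Returns the atom weights, the atom locations, and the probability
    mass left unassigned by the truncation.
    """
    weights, atoms = [], []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(n_atoms):
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick
        weights.append(remaining * fraction)
        atoms.append(base_sampler(rng))  # location drawn from the base distribution
        remaining *= 1.0 - fraction
    return weights, atoms, remaining


rng = random.Random(2023)
weights, atoms, leftover = sample_dp_stick_breaking(
    alpha=2.0,
    base_sampler=lambda r: r.gauss(0.0, 1.0),  # assumed Gaussian base
    n_atoms=50,
    rng=rng,
)
print(len(weights), sum(weights) + leftover)
```

Each call produces a different discrete distribution whose atoms are scattered according to the base distribution. A small `alpha` concentrates most of the weight on a few atoms, which is what makes the construction useful for clustering.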

The AIC (1974) addresses the problem of overfitting in statistical modelling. Overfitting occurs when a model has too much flexibility or complexity, so that after it is fitted to data it follows the data too closely, even reproducing random variation that is not of interest. The consequence of overfitting is that the model does not generalise well to new data-sets. The AIC addresses this problem by introducing a trade-off between how well the model fits the data, measured by its maximised likelihood, and the complexity of the model, measured by its number of estimated parameters. As George Box said almost 50 years ago: ‘all models are wrong, but some models are useful’. The AIC encapsulates this trade-off by acknowledging that the model may need to deviate slightly from the observed data to maximise its utility in generalising to new data.
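The trade-off can be written as AIC = 2k − 2 ln L, where k is the number of estimated parameters and L is the maximised likelihood, with lower values preferred. The sketch below compares a constant-mean model with a straight-line model under Gaussian errors. The data are hypothetical, and the helper names are illustrative only.

```python
import math


def gaussian_log_likelihood(residuals):
    """Maximised Gaussian log-likelihood, using the MLE of the error variance."""
    n = len(residuals)
    variance = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi * variance) + 1.0)


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Hypothetical data that roughly follow a straight line.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9, 12.2, 13.8]
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Model 1: constant mean (parameters: mean, variance).
residuals_constant = [yi - y_mean for yi in y]

# Model 2: least-squares line (parameters: intercept, slope, variance).
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean
residuals_line = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

aic_constant = aic(gaussian_log_likelihood(residuals_constant), n_params=2)
aic_line = aic(gaussian_log_likelihood(residuals_line), n_params=3)
print(aic_constant, aic_line)
```

Here the straight-line model pays a penalty for its extra parameter but fits so much better that its AIC is lower. Adding further parameters that only chase the noise would improve the fit slightly while increasing the penalty.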

References

Akaike, H. (1974). “A new look at the statistical model identification”. IEEE Transactions on Automatic Control. 19 (6): 716–723. Bibcode:1974ITAC...19..716A. doi:10.1109/TAC.1974.1100705.

Box, George E. P. (1976). “Science and statistics”. Journal of the American Statistical Association. 71 (356): 791–799. doi:10.1080/01621459.1976.10480949.

Cox, David R. (1972). “Regression Models and Life-Tables”. Journal of the Royal Statistical Society, Series B. 34 (2): 187–220.

Ferguson, Thomas (1973). “Bayesian analysis of some nonparametric problems”. Annals of Statistics. 1 (2): 209–230. doi:10.1214/aos/1176342360.