About Best Subset Selection

The Best Subset tool for exploring regression models carries out calculations for all models with or without each of the regression terms that you have specified in the regression model dialog.  SpaceStat then orders the suite of models in a table according to the selection criterion that you choose (the table also lists values of the other criteria available for that form of regression).  The options for selection criteria for each model form (linear, Poisson, and logistic) are activated in the regression settings dialog when you select the Best subset option.  Because the number of models to compare grows rapidly as you increase the number of potential predictor variables, this method may become very slow with 10 or more variables.  Recall too that collinearity is a major concern for exploratory model building, which is an even better reason than analysis speed to examine a correlation matrix and reduce the number of variables you include.
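To make that growth concrete, here is a minimal Python sketch of the exhaustive enumeration that best subset selection performs (the predictor names are hypothetical; this illustrates the combinatorics only, not SpaceStat's implementation):

    from itertools import combinations

    def all_subsets(predictors):
        """Yield every non-empty subset of the candidate predictors."""
        for size in range(1, len(predictors) + 1):
            yield from combinations(predictors, size)

    predictors = ["elevation", "rainfall", "population", "income"]  # hypothetical
    print(len(list(all_subsets(predictors))))  # 2**4 - 1 = 15 candidate models

With 10 predictors the count is already 2**10 - 1 = 1,023 models, which is why run time grows so quickly.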

The selection criteria for linear models are:

  1. Largest R-squared.  This option selects the model with the largest reduction in residual sum of squares, and will therefore favor the model with the largest number of terms.

  2. Largest adjusted R-squared.  This option compares the mean square errors with and without the regression terms in the model and penalizes models with too many terms.

  3. Smallest AIC (Akaike information criterion).  The AIC is another tool that trades off model fit and model complexity.  For linear regression, this is the residual sum of squares (RSS) penalized by twice the number of regression parameters (k).  The general formula is AIC = 2k - 2 ln(L); assuming model errors are normally and independently distributed, this can be expressed as AIC = 2k + n ln(RSS/n), where n is the number of locations (see the sketch after this list).

  4. Smallest Mallows' C(p).  This measure is similar to AIC and penalizes models for having more terms.  Mallows' C(p) equals the residual sum of squares of the model with p regression terms, divided by the error variance of the full model, plus twice the number of regression degrees of freedom, minus the total number of observations.  For the full model, C(p) is equal to the number of regression degrees of freedom, so C(p) values close to the one for the full model indicate good candidate models.
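As a concrete illustration of the four linear criteria, here is a minimal Python sketch (assuming NumPy, a design matrix X that includes an intercept column, and the RSS and parameter count of the full model as inputs; the function name is hypothetical and this is not SpaceStat's code):

    import numpy as np

    def linear_criteria(X, y, rss_full, k_full):
        """Ordinary least squares fit of y on X, returning the four
        selection criteria described above for this candidate subset."""
        n, k = X.shape
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ beta) ** 2))
        tss = float(np.sum((y - y.mean()) ** 2))
        r2 = 1.0 - rss / tss
        adj_r2 = 1.0 - (rss / (n - k)) / (tss / (n - 1))
        aic = 2 * k + n * np.log(rss / n)      # AIC = 2k + n ln(RSS/n)
        sigma2_full = rss_full / (n - k_full)  # error variance of the full model
        cp = rss / sigma2_full + 2 * k - n     # Mallows' C(p)
        return r2, adj_r2, aic, cp

Note that for the full model itself (rss = rss_full, k = k_full) the C(p) line reduces to k_full, matching the statement above.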

The selection criteria for logistic models are:

  1. Largest Cox and Snell R-squared.  This measure is similar in concept to an R-squared; it describes the strength of the model at predicting 0 or 1 values.

  2. Largest score chi-squared statistic.  The score chi-square is an approximation to the log-likelihood difference between two models and is used in some statistical programs because it is quicker to calculate than the exact log-likelihood difference.  It also increases with increasing model fit, so the best model chosen using the score chi-squared and the Cox and Snell R-squared is likely to be the same.

  3. Smallest SIC (Schwarz Information Criterion, also known as the BIC, for Bayesian Information Criterion).  The SIC is an increasing function of the residual sum of squares and of the number of parameters to be estimated (k), so it punishes models for having more parameters.  The formula is SIC = -2 ln(L) + k ln(n), where n is the number of locations and L is the maximized value of the likelihood function for the model (see the sketch after this list).

  4. Smallest AIC (see above).
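Since the logistic criteria above are all functions of maximized log-likelihoods, here is a minimal Python sketch (the function name and arguments are hypothetical; the score chi-square is omitted because it requires the score vector and information matrix of the null fit):

    import numpy as np

    def logistic_criteria(loglik_model, loglik_null, n, k):
        """Selection criteria for a fitted logistic model, computed from
        its maximized log-likelihood and that of the intercept-only
        (null) model; n is the number of locations, k the number of
        regression parameters."""
        # Cox and Snell R-squared: 1 - (L_null / L_model)**(2/n)
        cox_snell = 1.0 - np.exp((2.0 / n) * (loglik_null - loglik_model))
        aic = 2 * k - 2 * loglik_model           # AIC = 2k - 2 ln(L)
        sic = k * np.log(n) - 2 * loglik_model   # SIC = k ln(n) - 2 ln(L)
        return cox_snell, aic, sic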

The selection criteria for Poisson models include criteria 2-4 for logistic models (above), and:

  1.  Largest log-likelihood.  Poisson regression does not have a measure equivalent to R-squared, so SpaceStat rates the model with the largest log-likelihood as "best".  For this final model, SpaceStat's output presents the deviance, the difference between the log-likelihood of the subset model and that of a "perfectly fitted" (saturated) model, as is typically presented for Poisson regression (see the sketch below).
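A minimal Python sketch of the deviance calculation, assuming NumPy/SciPy and fitted Poisson means mu from the regression (not SpaceStat's code):

    import numpy as np
    from scipy.special import gammaln

    def poisson_loglik_and_deviance(y, mu):
        """Log-likelihood of observed counts y under fitted Poisson means
        mu, and the deviance relative to the saturated ("perfectly
        fitted") model, in which each fitted mean equals its count."""
        loglik = float(np.sum(y * np.log(mu) - mu - gammaln(y + 1)))
        # Saturated log-likelihood: set mu = y (a count of 0 contributes 0)
        with np.errstate(divide="ignore", invalid="ignore"):
            sat = np.where(y > 0, y * np.log(y) - y, 0.0) - gammaln(y + 1)
        deviance = 2.0 * (float(np.sum(sat)) - loglik)
        return loglik, deviance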
