About Model Selection Tools

Methods such as best subset and stepwise regression build and compare suites of regression models, usually through an automatic procedure. The user specifies the response (dependent) variable and a suite of potential predictor (independent) variables, and the method repeatedly applies a set of selection rules to output a "best" set of predictors. These data mining techniques are included in SpaceStat as tools for exploratory model-building, but note that many statisticians criticize the validity of the output (e.g., p-values, R-squared values) and of inferences drawn from these approaches (see the discussion of common criticisms). We view these tools as having a place within the data exploration toolbox: in the early stages of modeling relationships within your geographic data, you can use best subset or stepwise regression to compare a variety of models, and then trace how the fit of those models changes as your variable values change over time.

Options for automatic model selection in SpaceStat

We have divided the regression model selection tools into three groups: (1) "best subset" approaches, (2) forward stepwise or forward selection, and (3) backward stepwise or backward removal. When you run an aspatial regression model, you can either use the "Full model" or choose one of these tools from a dialog box on the regression settings page. Choosing "best subset" will activate another window in which you choose among methods for model selection, while choosing forward or backward stepwise regression will activate two windows where you enter the p-values that SpaceStat uses to decide whether terms should be added to or dropped from the model. For "stepwise" methods this process is iterative: a term that has entered the model can later be dropped, and a dropped term can later re-enter. You can also reproduce what other statistical tools call forward selection or backward removal. Choose forward or backward stepwise from the pull-down menu, then specify either an extremely permissive "P to stay" (P = 1), so that no term is dropped once it has entered (forward selection), or an extremely restrictive "P to enter" (P = 0), so that no term re-enters once it has been dropped (backward removal). See forward selection and backward removal for details.
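The interaction between "P to enter" and "P to stay" can be sketched in a few lines of Python. This is only an illustration of the general stepwise idea, not SpaceStat's implementation: the function `stepwise`, the helper `toy_p`, the variable names, and all p-values are hypothetical, and a real tool would refit the regression at each step to obtain the p-values.

```python
def stepwise(candidates, p_value, p_enter, p_stay, start=(), max_steps=100):
    """Generic stepwise loop (illustrative sketch, not SpaceStat's code).

    p_value(term, model) is assumed to return the p-value of `term`
    when the model contains exactly the terms in `model`.
    """
    model = list(start)
    for _ in range(max_steps):  # guard against add/drop cycles
        changed = False
        # Entry step: add the most significant excluded term, if it qualifies.
        excluded = [t for t in candidates if t not in model]
        if excluded:
            best = min(excluded, key=lambda t: p_value(t, model + [t]))
            if p_value(best, model + [best]) < p_enter:
                model.append(best)
                changed = True
        # Removal step: drop the least significant included term, if it fails.
        if model:
            worst = max(model, key=lambda t: p_value(t, model))
            if p_value(worst, model) > p_stay:
                model.remove(worst)
                changed = True
        if not changed:
            break
    return model


def toy_p(term, model):
    """Made-up p-values: education looks redundant once income is present."""
    if term == "education":
        return 0.40 if "income" in model else 0.001
    if term == "income":
        return 0.01
    return 0.30  # age


terms = ["education", "income", "age"]

# Forward stepwise: terms may enter and later be dropped.
forward_stepwise = stepwise(terms, toy_p, p_enter=0.05, p_stay=0.10)

# Forward selection: P to stay = 1, so nothing is dropped after entering.
forward_selection = stepwise(terms, toy_p, p_enter=0.05, p_stay=1.0)

# Backward removal: start from all terms, P to enter = 0, so nothing re-enters.
backward_removal = stepwise(terms, toy_p, p_enter=0.0, p_stay=0.10, start=terms)

print(forward_stepwise, forward_selection, backward_removal)
```

With these made-up p-values, forward stepwise drops education after income enters, whereas forward selection keeps both, because a p-value can never exceed a "P to stay" of 1.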

When using model selection tools, you can run any type of model (linear, logistic, or Poisson) and include any variables associated with the source geography (numeric or categorical). As with aspatial regression using the "full model" option, you can also create squared or interaction terms and choose the type of coding used for categorical variables. Note that although the groups/levels within a categorical variable have individual parameter estimates, each categorical variable is added to or dropped from a model as one unit (i.e., categories are not dropped individually). Finally, you can still select the zero-intercept option with best subset/stepwise regression for models using sets of numeric variables for which you know the dependent variable will be zero when the independent variables are all zero.

Model selection measures and tools

The tools for model selection rank models using several of the measures described in the pages on implementation of aspatial regression, such as R-squared and adjusted R-squared for linear regression, Cox and Snell R-squared for logistic regression, and log-likelihood for Poisson regression. Output for all of these tools also includes Mallows' C(p), a subsetting criterion developed by Colin Mallows. Because C(p) penalizes each additional term, it can help guard against subset models that carry redundant, colinear predictors (see below). Mallows' C(p) (shown simply as C(p) in output tables) is computed in three parts: take the residual sum of squares for a model with P regression terms (a subset of the K terms in the full model) and divide it by the error variance of the full model, add twice the number of regression degrees of freedom, and subtract the total number of observations. For the full model (with K terms), C(p) is equal to the number of regression degrees of freedom. Models with values of C(p) close to P are considered good candidates. Other measures that are included only in output from the best subset option are described on the best subset page.
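As a rough illustration of the C(p) calculation described above, the following Python sketch computes it for an ordinary least squares model using only the standard library. It rests on assumptions that are not taken from SpaceStat: every model includes an intercept, the intercept is counted among the P terms, and the error variance of the full model is estimated as its residual sum of squares divided by (N - K). The function names and the data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(r[j] * r[k] for r in design) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * y[i] for i, r in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)  # normal equations; fine for a small example
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def mallows_cp(subset_cols, full_cols, y):
    """C(p) = RSS_subset / s2_full + 2P - N, counting the intercept in P and K."""
    n = len(y)
    k = len(full_cols) + 1    # terms in the full model (assumed to include intercept)
    p = len(subset_cols) + 1  # terms in the subset model
    s2_full = rss(full_cols, y) / (n - k)
    return rss(subset_cols, y) / s2_full + 2 * p - n


# Made-up data: three candidate predictors and a response.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [5, 3, 8, 1, 9, 2, 7, 4]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

full = [x1, x2, x3]
print("full model:", mallows_cp(full, full, y))
print("x1 only:   ", mallows_cp([x1], full, y))
```

Under these assumptions the full model's C(p) works out to K by construction, since its residual sum of squares divided by its own error variance is N - K.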

A general rule for exploring models with stepwise and best subset tools: avoid colinearity

Although SpaceStat can handle fairly large numbers of datasets when running the model selection tools, it is a good idea to minimize colinearity (correlation among predictor variables). This principle applies to all regression models, but colinearity is particularly likely to occur and cause problems when you are working with large numbers of variables. Although colinearity does not lead to major problems with prediction, it can be a real problem if you want to evaluate the importance of individual predictors (the goal of selection tools), because high correlation among variables makes it hard to obtain reliable estimates of each variable's coefficient.

One way to quickly evaluate correlation among a large number of datasets is to run a "full model" regression and then examine the correlation matrix that this procedure produces. If you find that two variables are highly correlated with each other, it is probably best to include only one of them in the group you input for a stepwise or best subset run. To see whether one performs better than the other, you can run the selection process twice, substituting one variable for the other, and compare the results.
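The same screening idea can be sketched outside SpaceStat. The Python example below computes pairwise Pearson correlations among candidate predictors and lists the pairs whose absolute correlation meets a cutoff. The variable names, the data, and the 0.8 cutoff are all hypothetical choices for illustration; what counts as "highly correlated" depends on your data and goals.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(variables, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(variables.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Made-up predictors: income and home_value move together, pct_rural does not.
predictors = {
    "income": [30, 42, 55, 61, 75, 88],
    "home_value": [110, 150, 190, 220, 260, 310],
    "pct_rural": [40, 12, 55, 20, 48, 15],
}

flagged = highly_correlated_pairs(predictors)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b}: r = {r:.3f}")
```

Here only the income and home_value pair is flagged, so you would enter just one of the two in a selection run, or run the selection twice with each in turn.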

Follow the links below to see more information on each of the tool types:

Best subset

Forward stepwise and forward selection

Backward stepwise and backward removal