To follow along with the analyses in this blog, download and install a trial version of SpaceStat **here**.

An earlier blog defined the small numbers problem and illustrated that rates calculated with small denominators (e.g., small at-risk populations) have high variance, producing unstable rate estimates that represent the true underlying risk poorly. A later blog showed how spatial time series may be used to identify truly elevated rates in local areas with small population sizes. This addressed an important question: How can one distinguish between high rate estimates due to small numbers and high rate estimates attributable to high underlying risk? In practice, spatial epidemiology and disease surveillance must almost always deal with the small numbers problem. We want to be able to make inferences in small areas such as neighborhoods, estimate disease rates, identify disease clusters, build and evaluate models, and undertake health disparities analyses, all while taking into account differences in population sizes. This blog demonstrates diagnostics for the small numbers problem in SpaceStat.
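To make the link between denominator size and instability concrete, here is a minimal Python sketch. It is not SpaceStat code. It assumes a simple binomial model for the number of deaths, and the underlying rate of 60 per 100,000 and the function name `rate_standard_error` are hypothetical choices made only for illustration.

```python
import math


def rate_standard_error(rate_per_100k, population):
    """Standard error (per 100,000) of an observed rate, assuming a binomial model."""
    p = rate_per_100k / 100_000  # per-person probability of the event
    return math.sqrt(p * (1 - p) / population) * 100_000


# Hypothetical underlying risk, chosen only for illustration.
TRUE_RATE = 60.0

for population in (1_000, 10_000, 100_000, 1_000_000):
    se = rate_standard_error(TRUE_RATE, population)
    print(f"population {population:>9,}: standard error = {se:7.1f} per 100,000")
```

Under these assumptions the standard error shrinks with the square root of the population, so an area of 1,000 people has a standard error ten times that of an area of 100,000 people with the same underlying risk. At a population of 1,000 the standard error is larger than the rate itself, which is why a single high rate in a small area says little about true risk.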

As one of the first steps in a geohealth analysis, I often start off with four techniques for assessing the small numbers problem:

(1) the plot of rate vs population,

(2) evaluating persistence of rates through time using the timeplot,

(3) the box plot, and

(4) the variogram cloud.

The first two of these were demonstrated in the earlier blogs mentioned above. This blog describes how to use the box plot to evaluate statistical outliers. Using the variogram cloud to identify locations with a high impact on measures of spatial correlation and variance is the topic of a later blog.

**The box plot**

Recall from your basic statistics that the box plot provides a rapid visual summary of the distributional properties of the data. We’ll explore this using data describing lung cancer mortality through time in the United States. The data come from the National Cancer Institute, and we’ll be working with lung cancer mortality in white males in State Economic Areas (SEAs) from 1970 to 1995. The rates are age-adjusted and reported per 100,000 population. The data are recorded in 5-year periods.
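As a refresher on what a box plot summarizes, the sketch below computes the quartiles and flags outliers in plain Python. It assumes the conventional Tukey rule (points more than 1.5 interquartile ranges beyond the quartiles are outliers) and the "inclusive" quartile method from Python's `statistics` module; SpaceStat's own conventions may differ. The rates and the function name `box_plot_summary` are hypothetical and are not taken from the lung cancer data.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Return quartiles, Tukey fences, and outliers for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the height of the box
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical age-adjusted rates per 100,000 for ten areas.
rates = [52, 55, 58, 60, 61, 63, 65, 67, 70, 120]
summary = box_plot_summary(rates)
print(summary)
```

With these made-up values, the rate of 120 lies above the upper fence and is the only value flagged. Whether such a value reflects high underlying risk or simply a small population is exactly the question the diagnostics in this blog are meant to help answer.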

To follow along, start SpaceStat and load the project: **SmallNumbersBlog3 — spatial outlier diagnostics.**