To follow along with the analyses in this blog, download and install a trial version of SpaceStat **here**.

An earlier blog defined the small numbers problem and illustrated that rates calculated with small denominators (e.g. small at-risk populations) have high variance and result in unstable rate estimates that are poor representations of true, underlying risk. A later blog showed how spatial time series may be used to identify true elevated rates in local areas with small population sizes. This addressed an important question: How can one distinguish between high rate estimates due to small numbers and high rate estimates attributable to high underlying risk? In practice, spatial epidemiology and disease surveillance must almost always deal with the small numbers problem. We want to be able to make inferences in small areas, such as neighborhoods, estimate disease rates, identify disease clusters, build and evaluate models, and undertake health disparities analyses, all while taking into account differences in population sizes. This blog demonstrates diagnostics for the small numbers problem in SpaceStat.

As one of the first steps in a geohealth analysis, I often start off with four techniques for assessing the small numbers problem:

- the plot of rate vs population,
- evaluating persistence of rates through time using the timeplot,
- the box plot, and
- the variogram cloud.

The first two of these were demonstrated in the earlier blogs mentioned above. This blog describes use of the boxplot to evaluate statistical outliers. Use of the variogram to identify locations with high impact on measures of spatial correlation and variance is the topic of a later blog.

**The box plot**

Recall from your basic statistics that the boxplot provides a rapid visual summary of the distributional properties of the data. We’ll explore this using data describing lung cancer mortality through time in the United States. The data come from the National Cancer Institute, and we’ll be working with lung cancer mortality in white males in State Economic Areas (SEAs) from 1970 to 1995. The rates are age-adjusted and reported per 100,000 population. The data are recorded in 5-year periods.

To follow along, start SpaceStat and load the project: **SmallNumbersBlog3—spatial outlier diagnostics.**

When you load the project you should see something like this (above). Create a boxplot by selecting the boxplot tool from the tool menu.

Then enter the geography (“SEArates-male”) and dataset name (“WLUNG”) to create a boxplot.

As described in the SpaceStat help, the boxplot provides a quick visual summary of the distributional properties of a variable. The box plot displays the relationship between key measures that describe your dataset, including the median, the quartiles, maximum and minimum values, and a range of values identified as “outliers”. The median is the black line that bisects the “box” that gives the box plot its name. The first (lower) and third (upper) quartiles, which are the medians of the lower and upper halves of the data, form the lower and upper bounds of the box. The data for individual locations are shown as grey bars inside the box and grey points outside the box. The “whiskers” on the box, the horizontal bars at the top and bottom, are placed 1.5 × the interquartile range beyond the quartiles (see the illustration on the image below). Points outside the whiskers are considered outliers. The vertical line on which the box and whiskers sit is the range of the dataset over all times included in your SpaceStat project.
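For readers who want to check the arithmetic behind the box plot outside SpaceStat, here is a minimal Python sketch. It assumes the quartiles are the medians of the lower and upper halves of the data, as described above, and that the outlier cutoffs sit 1.5 × the interquartile range beyond the quartiles; SpaceStat's exact conventions may differ. The function name `boxplot_summary` and the rates in the example are made up for illustration and are not taken from the WLUNG dataset.

```python
from statistics import median


def boxplot_summary(values, k=1.5):
    """Summarize values the way a box plot does (hypothetical helper).

    Quartiles are the medians of the lower and upper halves of the data;
    for an odd number of values the overall median is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    low_fence = q1 - k * iqr    # assumed position of the lower whisker
    high_fence = q3 + k * iqr   # assumed position of the upper whisker
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "low_fence": low_fence,
        "high_fence": high_fence,
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Invented rates per 100,000, for illustration only.
rates = [22.0, 48.0, 52.0, 55.0, 58.0, 60.0, 62.0, 65.0, 68.0, 72.0, 112.0]
summary = boxplot_summary(rates)
print(summary)
```

With these made-up values the box runs from 52 to 68, the cutoffs fall at 28 and 92, and the two extreme values (22 and 112) are flagged as outliers.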

In addition to the box plot, I like to inspect the distribution of the data, and how that distribution changes through time, using a histogram. Create a histogram by selecting the histogram tool.

Then enter the geography (“SEArates-male”) and dataset name (“WLUNG”) to create a histogram.

These are time-dependent data, and we would like to see how statistical outliers on the boxplot, the distribution on the histogram, and the map patterns change through time. To accomplish this, let’s synchronize the boxplot, histogram and map so they animate together. Select “Window” from the main menu, and “time synchronize all views”. You now can click on the animation bar on either the map or the box plot to explore how values change through place and time.

Rearrange the map and statistical graphics to look something like this, and click the play button to animate.

Through time, you’ll see the map becomes redder, the center of the distribution on the histogram increases (shifts right), and the distance between the whiskers on the boxplot grows larger. These changes indicate that both the average lung cancer mortality and its variance are increasing through time. But what about the outliers?

Remember that the small numbers problem means the variance of disease incidence and mortality rates is larger in areas with small populations. One would thus expect to find statistical outliers in locations with small populations. Is this observed for these data? To answer this, create the scatter plot of “WLUNG” on the y-axis and “Sqrt White Population” on the x-axis, and move the time slider to 1/1/1970. Now brush select (move your cursor over and click on) the most extreme high value on the boxplot. This observation is outside the whisker and is a statistical high outlier.

Notice the selected observation is highlighted on the map and the other graphical views. Zoom in on the map to see where this outlier is. Then right-click on the map and choose “Inspect this location” to learn that the high outlier in white lung mortality rate at 1/1/1970 is Phoenix City, Alabama, with 111.967 deaths per 100,000. The mean rate in the U.S. was 60.148 deaths per 100,000 (the mean value on the box plot and histogram), so this is almost double the mean death rate! But is this high rate attributable to the small numbers problem, or to a persistent underlying cause?
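To get a rough sense of how much a rate can wander by chance alone, here is a small Python sketch. It makes a simplifying assumption that is not part of the SpaceStat analysis: deaths in an area are treated as binomial draws with a fixed true risk p in a population of size n, so the standard error of the observed rate is sqrt(p(1 − p)/n). The real rates in this project are age-adjusted and recorded in 5-year periods, so treat the numbers only as an order-of-magnitude guide. The function name and the population sizes are hypothetical.

```python
from math import sqrt


def rate_standard_error(true_rate_per_100k, population):
    """Standard error of an observed rate per 100,000 under a binomial
    model (an assumption made for this sketch, not SpaceStat's method)."""
    p = true_rate_per_100k / 100_000
    return 100_000 * sqrt(p * (1 - p) / population)


# Same assumed true rate (60 per 100,000), three hypothetical population sizes.
for population in (10_000, 100_000, 1_000_000):
    se = rate_standard_error(60.0, population)
    print(f"population {population:>9,}: SE = {se:.1f} per 100,000")
```

Under this assumption, an area of 10,000 people has a standard error of roughly 24 per 100,000 around a true rate of 60, while an area of a million people has a standard error of about 2.4. The spread shrinks with the square root of the population, which is why extreme rates are far more likely to turn up by chance in the smallest areas.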

Looking at the scatter plot of WLUNG vs. Sqrt White Population, we see that the population in Phoenix City, Alabama is quite small relative to other SEAs, and that it falls in the part of the plot that has high variance. Remember from the second blog in this series that we can inspect the persistence of a high or low rate to evaluate whether it might be attributable to the small numbers problem. If a high or low rate is attributable to small numbers, we would expect that value to jump around a good deal through time (i.e. it is a statistical aberration), but if it is due to a true underlying cause it should continue to be high (or low) through time. Does the high rate in Phoenix City persist through time? Animate the graphics. You’ll find that the Phoenix City rate is not statistically unusual in any of the other time periods, leading us to conclude that the extraordinarily high rate in early 1970 likely is a statistical aberration due to the small numbers problem.

In contrast, select the lowest value on the box plot at 1/1/1970, and then animate the views. You’ll see this rate, found in Logan, Utah, is always a low statistical outlier. As described in blog 2 in this series, this likely is due to the area’s large proportion of Mormons, who do not smoke.
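The persistence check we just did by eye can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fences use the same 1.5 × interquartile range rule as the box plot, with quartiles taken as medians of the lower and upper halves of the data, and all of the rates below are invented. The area names `transient` and `persistent_low` are hypothetical stand-ins for the two kinds of behavior, not data for any real SEA.

```python
from statistics import median


def tukey_fences(values, k=1.5):
    """Return (low, high) outlier cutoffs; needs at least two values."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])    # median of the lower half
    q3 = median(data[-half:])   # median of the upper half
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_history(rates_by_period, area, k=1.5):
    """Label one area as 'high', 'low' or '-' in each time period."""
    history = []
    for rates in rates_by_period:
        low, high = tukey_fences(rates.values(), k)
        rate = rates[area]
        if rate > high:
            history.append("high")
        elif rate < low:
            history.append("low")
        else:
            history.append("-")
    return history


# Invented background rates per 100,000, reused in every period.
background = [50.0, 54.0, 57.0, 60.0, 63.0, 66.0, 70.0]


def make_period(transient, persistent_low):
    """Build one period of made-up rates keyed by hypothetical area name."""
    rates = {f"bg{i}": r for i, r in enumerate(background)}
    rates["transient"] = transient
    rates["persistent_low"] = persistent_low
    return rates


periods = [make_period(112.0, 20.0), make_period(64.0, 22.0), make_period(58.0, 21.0)]

print(outlier_history(periods, "transient"))
print(outlier_history(periods, "persistent_low"))
```

In this toy example the `transient` area is a high outlier only in the first period, the pattern we would expect from small numbers, while `persistent_low` is flagged in every period, which points toward a real underlying factor.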

We’ve now seen how to use the boxplot, frequency histogram, map and statistical brushing to identify outliers. We’ve also learned how to use persistence through time to assess whether a statistical outlier is attributable to the small numbers problem or to an underlying risk factor or protective factor.