The small numbers problem

Using persistence in spatial time series as a diagnostic for extreme rates in small areas.

In my last blog on the small numbers problem, we found that rates calculated with small denominators (e.g. small at-risk populations) have high variance and we thus have little confidence in the rate estimate. I presented a simple visual diagnostic process in SpaceStat for determining whether one should be concerned about the small numbers problem. Now let’s suppose you observe a high rate in an area with a small population. What might cause the high rate? We now know the high rate might be attributable to the high variance because the area has a small population. But we cannot preclude the possibility that the underlying risk in that area is actually high. This raises an important question. How might we distinguish between high rate estimates due to small numbers and high rate estimates attributable to high underlying risk?

To make this more concrete, consider a simple example. Suppose a Superfund site has been emitting compounds known to cause childhood leukemia into the groundwater, and that a small community adjacent to the site relies on groundwater as its drinking water source. The rate of childhood leukemia in that town is quite high – 2-3 times the state average – but the town is small, so the variance of the town’s rate is high and the rate is not statistically different from the state average.
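To put rough numbers on this example, here is a minimal sketch using an exact Poisson test. Every figure in it (the state rate, the child-years at risk, the case counts) is hypothetical and chosen only for illustration, and the function name `poisson_upper_tail` is my own. It compares a small town and a much larger city that both show three times the state rate.

```python
import math

def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    # Work on the log scale so large counts do not overflow.
    below = sum(
        math.exp(-expected + k * math.log(expected) - math.lgamma(k + 1))
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)

# Hypothetical numbers, chosen only for illustration.
state_rate = 5 / 100_000          # cases per child-year
town_child_years = 20_000         # e.g. 2,000 children followed for 10 years
city_child_years = 2_000_000      # a population 100 times larger

town_expected = state_rate * town_child_years   # about 1 case expected
city_expected = state_rate * city_child_years   # about 100 cases expected

# Both places show three times the state rate.
p_town = poisson_upper_tail(3, town_expected)
p_city = poisson_upper_tail(300, city_expected)

print(f"town: 3 cases vs {town_expected:.0f} expected, p = {p_town:.3f}")
print(f"city: 300 cases vs {city_expected:.0f} expected, p = {p_city:.2e}")
```

Under these assumptions the town's p-value is about 0.08, above the conventional 0.05 threshold, while the same threefold excess in the city is overwhelmingly significant. The rate ratio is identical; only the size of the population differs.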

There are several additional issues to consider here, including exposure routes and mechanisms, whether biomarkers of exposure exist and are measurable, whether the suspected exposure is biologically plausible, and whether the suspected compound (if there is one) is actually present in household water supplies. But as a first step we are often asked to address this question: Is there actually an excess of disease in the town? And this brings us back to the small numbers problem – how can we detect true, elevated risk when the population impacted is small?

Rate stabilization approaches (e.g. empirical Bayes) at first blush appear to be one possibility, but these involve something that has been called “shrinkage towards the mean”. Here one assigns a weight to an area’s observed rate based on the size of the population at risk: when the population is small the weight is small, and when the population is large the weight is large. One then stabilizes the rate by “borrowing” the rate estimate from surrounding areas (a local smoothing) or from a larger reporting area (e.g. the state average, a global smoothing). Going back to our small town, one would thus give less weight to the observed leukemia rate in that town (because the population is small) and derive a new rate estimate that consists primarily of the state average (shrinkage towards the mean). But wait a moment. What if the true rate in the small town is actually high? This smoothing procedure would obscure that signal, and we might incorrectly conclude that childhood leukemia isn’t elevated. What can be done?
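The weighting described above can be sketched with a simple gamma-Poisson (global) smoother. This is one common form of rate stabilization, not necessarily the one SpaceStat implements, and the prior strength and all counts below are hypothetical values picked for illustration.

```python
def gamma_poisson_smooth(cases, person_years, prior_rate, prior_strength):
    """Shrink an observed rate towards prior_rate.

    prior_strength is expressed in person-years: the larger it is,
    the more the global rate dominates small areas.
    """
    weight = person_years / (person_years + prior_strength)
    raw_rate = cases / person_years
    smoothed = weight * raw_rate + (1 - weight) * prior_rate
    return weight, smoothed

# Hypothetical values, for illustration only.
state_rate = 5 / 100_000      # global rate used as the prior mean
prior_strength = 200_000      # assumed prior weight, in person-years

# A small town and a large city, both observed at three times the state rate.
town_weight, town_smoothed = gamma_poisson_smooth(3, 20_000, state_rate, prior_strength)
city_weight, city_smoothed = gamma_poisson_smooth(300, 2_000_000, state_rate, prior_strength)

for name, w, s in [("town", town_weight, town_smoothed),
                   ("city", city_weight, city_smoothed)]:
    print(f"{name}: weight on observed rate = {w:.2f}, "
          f"smoothed rate = {s * 100_000:.1f} per 100,000")
```

With these assumed values the town's observed rate of 15 per 100,000 is pulled down to about 5.9, barely above the state rate of 5, while the city's stays near 14.1. If the town's true rate really were 15, the smoothed map would hide it.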

This brings us to the treatment of spatial time series as experimental systems. The underlying concept is very simple. When we observe a disease rate in a small area through time we can use repeated observations – the time series in that small area – to garner additional information on whether the underlying rate is elevated or not. When the rate is truly elevated we expect the observed rate to be high and to remain high through time, even though the variance about that rate may be large due to the small numbers problem. This illustrates the concept of persistence in spatial time series, and how we can use persistence to garner additional information regarding whether or not an observed rate is truly unusual.
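A back-of-the-envelope calculation shows why persistence is informative. The sketch below assumes case counts are Poisson and independent from one period to the next, with one case expected per period under the state rate; the six periods and the tripled risk are arbitrary illustrative choices, not values from any data set.

```python
import math

def poisson_tail(threshold, mean):
    """Return P(X >= threshold) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean**k / math.factorial(k)
                for k in range(threshold))
    return 1.0 - below

# Hypothetical setup: 1 case expected per period at the state rate.
expected_per_period = 1.0
n_periods = 6
threshold = 2   # smallest count that exceeds the expectation of 1

# Chance of an above-average count in a single period...
p_null_once = poisson_tail(threshold, expected_per_period)       # risk equals state rate
p_high_once = poisson_tail(threshold, 3 * expected_per_period)   # risk is tripled

# ...and in every one of the periods, assuming independence.
p_null_all = p_null_once ** n_periods
p_high_all = p_high_once ** n_periods

print(f"single period:   null {p_null_once:.3f}, tripled risk {p_high_once:.3f}")
print(f"all {n_periods} periods:   null {p_null_all:.5f}, tripled risk {p_high_all:.3f}")
```

Under these assumptions a single above-average period is unremarkable (it happens about 26% of the time by chance), but six in a row would occur with probability of roughly 0.0003 if the town's risk matched the state's, compared with about 0.26 if the risk were tripled.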

Notice that information on persistence in small areas can be lost when the data are smoothed first, since repeated, elevated rates in areas with small populations will each “shrink to the mean”. I therefore prefer to inspect the spatial time series initially using the raw data themselves, and then make decisions about rate stabilization later.

To try this out download the SpaceStat project for this blog (“SmallNumbersBlog2”). You can download and install a trial version of SpaceStat here.

Figure: Map and scatterplot of National Cancer Institute lung cancer mortality in white males in State Economic Areas (SEAs) from 1970 to 1995.

When you load the project you should see something like this (above). The data come from the National Cancer Institute and we’ll be working with lung cancer mortality in white males in State Economic Areas (SEAs) from 1970 to 1995. The rates are age-adjusted and reported per 100,000 population. The data are recorded in five-year periods. I like to use the following protocol to inspect spatial time series for persistence that may indicate the existence of an underlying rate that is truly unusual.

Animate the maps and scatterplot by clicking on the play button on the animation toolbar. This gives you some sense of whether the distribution of rates as a group is changing through time. You will notice on the scatterplot that the points tend to move to the right, reflecting population growth from 1970 through 1995. You might also notice that the map turns redder as time progresses, suggesting that lung cancer mortality in white males is increasing.
These data form a spatial time series, since they are recorded in five-year periods, and we can inspect the time series using the time plot tool. I chose the variable “WLUNG”.

Figure: Dialog to create a time plot in SpaceStat.

Inspection of the time plot confirms our observation that lung cancer mortality overall is increasing through time. Each line on the time plot is the lung cancer mortality rate in a given SEA. You can click the lines in the time plot to identify the SEA on the map to which they correspond.

Do you think the two SEAs with very low rates in 1970 had low rates that persisted through time, or did those rates bounce around a lot, perhaps due to the small numbers problem? To answer this question I brush-select the two lowest rates on the scatterplot at 1970 (below).

Not only do those two SEAs have the lowest rates in 1970, but those rates are persistently low over the next 25 years, all the way to 1995.
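If you prefer to run the same check outside SpaceStat, a minimal sketch might look like the following. The area labels and rates are invented for illustration and are not the NCI values; the function name `persistence_of_lowest` is likewise my own. It takes the areas with the lowest rates in the first period and counts how many periods each one stays among the lowest.

```python
def persistence_of_lowest(rates, k=2):
    """For the k lowest-rate areas in the first period, count the
    number of periods in which each remains among the k lowest."""
    n_periods = len(next(iter(rates.values())))

    def lowest(t):
        # Names of the k areas with the smallest rate in period t.
        return set(sorted(rates, key=lambda area: rates[area][t])[:k])

    starters = lowest(0)
    return {area: sum(area in lowest(t) for t in range(n_periods))
            for area in sorted(starters)}

# Invented rates per 100,000 for six periods (NOT the NCI data).
rates = {
    "A": [20, 22, 25, 27, 28, 30],   # low and stays low
    "B": [22, 25, 26, 29, 31, 33],   # low and stays low
    "C": [45, 52, 58, 63, 70, 74],
    "D": [50, 49, 61, 66, 68, 79],
    "E": [40, 55, 57, 69, 72, 71],
    "F": [21, 60, 35, 70, 48, 80],   # low at first, then bounces around
}

result = persistence_of_lowest(rates, k=3)
print(result)   # {'A': 6, 'B': 6, 'F': 3}
```

Areas A and B stay among the three lowest in all six periods, the pattern of persistence described above. Area F starts low but is among the lowest in only half of the periods, which is what we would expect if its initial low rate owed more to small numbers than to low underlying risk.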

What might explain this persistence of low rates? To address this question I inspect the map, and right click on the selected areas to see where they are and who might live there (below). The SEAs under consideration are Logan and Provo, Utah, which have large Mormon populations whose members generally do not smoke. Based on this, do you think the low mortality rates in these two SEAs might most reasonably be attributed to behavioral factors or to the small numbers problem?

I hope this blog has illustrated how we can treat spatial time series as repeated experiments to evaluate whether extreme rates (whether high or low) persist through time in local populations. The health analyst now has at their disposal highly sophisticated tools for rate stabilization that typically involve local smoothing of rates and/or “shrinkage to the mean”. These include popular techniques such as kernel smoothing and Bayesian rate stabilization. Be aware, however, that these methods can “smooth out” truly high (or low) rates. Inspecting the raw spatial time series for persistence before smoothing avoids this problem.

Read the next installment in this series: Small numbers problem part 3.

The small numbers problem–Part 2
