The small numbers problem

The ability to quickly create maps of health outcomes such as cancer incidence and mortality for counties, census areas, and even ZIP codes is now available through websites and data portals. (See, for example, Atlasplus, State Cancer Profiles, Cardiovascular Disease, and many others.)

Often these maps appear with measures of the “uncertainty” in the underlying rate estimates; an example of such an estimate is the colon cancer incidence in Washtenaw County, Michigan. Sometimes, though, all we see are the maps themselves, and the eye is immediately and quite naturally drawn to areas with extremely high rates. If you live there, you are concerned. If you are a public health professional, you might decide to investigate the high rate, or to put in place a health intervention that expands screening and access to care in that local area. But is the rate (e.g. colon cancer incidence in a given county) truly high, or might it be explained by the “small numbers problem”? What is the small numbers problem, and why should you care about it?

Rates are calculated from a numerator, such as the number of incident colon cancer cases, and a denominator, which here would be the population at risk for colon cancer (e.g. adults). The rate is the numerator divided by the denominator, and this is where the small numbers problem arises. The variance in the rate depends critically on the size of the denominator: when the denominator is small, variance in the rate is high; when the denominator is large, variance in the rate is low. Hence an apparently high rate might be due entirely or in part to the small numbers problem, and the true, underlying risk might be entirely unremarkable.
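To make the dependence on the denominator concrete, here is a minimal sketch in Python. It assumes a simple binomial model for the case count, which this post does not specify, and the risk and population values are hypothetical. Under that assumption, the standard error of the rate is the square root of risk × (1 − risk) / population.

```python
import math


def rate_standard_error(risk: float, population: int) -> float:
    """Standard error of an observed rate, assuming binomial case counts.

    The binomial model is an illustrative assumption, not something
    taken from the data discussed in this post.
    """
    return math.sqrt(risk * (1.0 - risk) / population)


# Hypothetical underlying risk: 50 cases per 100,000 people.
risk = 50 / 100_000

# Same underlying risk, two very different denominators.
se_small = rate_standard_error(risk, 2_000)
se_large = rate_standard_error(risk, 2_000_000)

for label, se in (("small area", se_small), ("large area", se_large)):
    print(f"{label}: standard error = {100_000 * se:.1f} per 100,000")
```

With these made-up numbers, the standard error in the small area is about as large as the rate itself, while in the large area it is a small fraction of the rate. Multiplying the population by 1,000 shrinks the standard error by a factor of the square root of 1,000, or roughly 32.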

So how can we quickly evaluate whether we need to worry about the small numbers problem? I like to use a simple protocol in SpaceStat™.

Steps

Create a map of the rate.
Create a scatterplot of the population at risk (on the x-axis) and the rate (on the y-axis).
Inspect the scatterplot for the “greater than” signature (i.e. “>”): the spread of rates is wide at small population sizes and narrows as population size increases.
Brush select on the scatterplot to see where the areas with high rates and low population sizes appear on the map. These are the places with apparently high rates that are unstable due to the small numbers problem. You cannot have much confidence that these rates really are high!
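For readers without SpaceStat™, the selection in the last step can be approximated in a few lines of Python. This is only a sketch: the `Area` class, the `flag_unstable` function, the area names, the counts, and both thresholds are hypothetical and chosen for illustration, and a fixed cutoff stands in for an interactive brush selection.

```python
from dataclasses import dataclass


@dataclass
class Area:
    """One geographic unit with its case count and population at risk."""
    name: str
    cases: int
    population: int

    @property
    def rate_per_100k(self) -> float:
        return 100_000 * self.cases / self.population


def flag_unstable(areas, min_rate_per_100k, max_population):
    """Return names of areas with a high rate AND a small population.

    This mimics brushing one corner of the scatterplot; both
    thresholds are arbitrary, illustrative choices.
    """
    return [
        a.name
        for a in areas
        if a.rate_per_100k >= min_rate_per_100k
        and a.population <= max_population
    ]


# Made-up data: two small areas and two large ones.
areas = [
    Area("A", cases=3, population=2_000),          # 150 per 100,000
    Area("B", cases=1, population=2_500),          # 40 per 100,000
    Area("C", cases=900, population=1_500_000),    # 60 per 100,000
    Area("D", cases=1_100, population=800_000),    # 137.5 per 100,000
]

flagged = flag_unstable(areas, min_rate_per_100k=100, max_population=10_000)
print(flagged)
```

Areas A and D both have high rates, but only A is flagged, because D's rate rests on a large denominator. In area A, a single additional case would move the rate by 50 per 100,000.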

Let’s see how this works. As an example, I used SpaceStat™ to explore the small numbers problem for lung cancer mortality in white males in state economic areas (SEAs) in the 1970s. I transformed the white male population size using the square root, since the population distribution has a long right tail, as is typical of study areas that include both rural and urban places. I then plotted lung cancer mortality as a function of the square root of population size (Figure 1, left). Do you see the “greater than” signature (i.e. “>”) that indicates variance in lung cancer mortality is much higher at smaller population sizes? I then brush selected on the scatterplot to identify areas with a high lung cancer mortality rate and low population size; these are circumscribed with heavy gold borders on the map. Do you have much confidence that the areas outlined in gold have a true excess of lung cancer mortality, or do you think it might be explained by the small numbers problem? Clearly, the variance in the rate is high in these areas, and the high rate we observe might be due entirely to small numbers.
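The effect of the square-root transform on a long right tail can be seen with a small Python sketch. The population values below are hypothetical, not the SEA data, and the ratio of the maximum to the median is just one rough way to summarize how far the tail stretches.

```python
import math
import statistics

# Hypothetical population sizes: mostly rural areas plus one large city.
populations = [2_000, 5_000, 12_000, 40_000, 1_500_000]


def max_to_median(values):
    """Rough measure of how far the right tail extends past the middle."""
    return max(values) / statistics.median(values)


raw_ratio = max_to_median(populations)
sqrt_ratio = max_to_median([math.sqrt(p) for p in populations])

print(f"raw scale:         max is {raw_ratio:.1f} times the median")
print(f"square-root scale: max is {sqrt_ratio:.1f} times the median")
```

On the raw scale the largest area sits 125 times farther out than the median, which would squeeze all the small areas against the axis of a scatterplot. After the square-root transform that ratio drops to about 11, so the small areas, where the spread in rates is of most interest, are much easier to see.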

Another post in this series (Small Numbers – Part 2) will describe what to do in SpaceStat™ to take the small numbers problem into account in order to accurately find areas of high risk.

