Using persistence in spatial time series as a diagnostic for extreme rates in small areas.
In my last blog on the small numbers problem (//www.biomedware.com/blog/2010/small_numbers_problem1/) we found that rates calculated with small denominators (e.g. small at-risk populations) have high variance and we thus have little confidence in the rate estimate. I presented a simple visual diagnostic process in SpaceStat for determining whether one should be concerned about the small numbers problem. Now let’s suppose you observe a high rate in an area with a small population. What might cause the high rate? We now know the high rate might be attributable to the high variance because the area has a small population. But we cannot preclude the possibility that the underlying risk in that area is actually high. This raises an important question. How might we distinguish between high rate estimates due to small numbers and high rate estimates attributable to high underlying risk?
To make this more concrete consider a simple example. Suppose a Superfund site has been emitting known carcinogens for childhood leukemia into the ground water, and that a small community adjacent to the site relies on groundwater as its drinking water source. The rate for childhood leukemia in that town is quite high – 2-3 times the state average – but the town is small. Because of this the variance in the town’s childhood leukemia rate is high, and the rate is not statistically different from the state average.
There are several additional issues to consider here, including exposure routes and mechanisms, whether biomarkers of exposure exist and are measurable, whether the suspected exposure is biologically plausible, and whether the suspected compound (if there is one) is actually present in household water supplies. But as a first step we often are asked to address this question: Is there actually an excess of disease in the town? And this brings us back to the small numbers problem – how can we detect true, elevated risk when the population impacted is small?
Rate stabilization approaches (e.g. empirical Bayes) at first blush appear to be one possibility, but these involve something that has been called “shrinkage towards the mean”. Here one assigns a weight to an area’s observed rate based on the size of the population at risk; when the population is small the weight is small, and when the population is large the corresponding weight is large. One then stabilizes the rate by “borrowing” the rate estimate from surrounding areas (a local smoothing) or from a larger reporting area (e.g. the state average, a global smoothing). Going back to our small town, one thus would give less weight to the observed leukemia rate in that town (because the population is small), and derive a new rate estimate that is composed primarily of the state average (shrinkage towards the mean). But wait a moment. What if the true rate in the small town is actually high? This smoothing procedure would actually obscure that signal, and we might come to the incorrect conclusion that childhood leukemia isn’t elevated. What can be done?
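To see how much shrinkage can hide, here is a minimal Python sketch of global smoothing. It uses one simple method-of-moments form of empirical Bayes and assumes Poisson-like counts. The function name `eb_global_smooth`, the case counts and the populations are all made up for illustration, and this is not necessarily how SpaceStat implements rate stabilization.

```python
def eb_global_smooth(cases, pops):
    """Shrink each area's raw rate toward the pooled rate (global smoothing)."""
    total_pop = sum(pops)
    overall = sum(cases) / total_pop          # pooled rate across all areas
    raw = [c / n for c, n in zip(cases, pops)]
    mean_pop = total_pop / len(pops)
    # Population-weighted spread of the raw rates around the pooled rate
    spread = sum(n * (r - overall) ** 2 for r, n in zip(raw, pops)) / total_pop
    # Estimated variance of the true underlying rates, floored at zero
    between = max(spread - overall / mean_pop, 0.0)
    smoothed = []
    for r, n in zip(raw, pops):
        # Small populations get small weights, so their rates move furthest
        weight = between / (between + overall / n) if between > 0 else 0.0
        smoothed.append(overall + weight * (r - overall))
    return raw, smoothed, overall


# Hypothetical data: one small town followed by three large areas
cases = [6, 100, 240, 180]
pops = [1_000, 50_000, 80_000, 120_000]

raw, smoothed, overall = eb_global_smooth(cases, pops)
for n, r, s in zip(pops, raw, smoothed):
    print(f"pop {n:>7}: raw {r * 1e5:6.1f}  smoothed {s * 1e5:6.1f}  per 100,000")
print(f"pooled rate: {overall * 1e5:.1f} per 100,000")
```

In this made-up example the small town's raw rate is nearly three times the pooled rate, yet its smoothed rate lands much closer to the pooled rate, while the large areas barely move. If the town's risk really were elevated, the smoothed value would understate it.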
This brings us to the treatment of spatial time series as experimental systems. The underlying concept is very simple. When we observe a disease rate in a small area through time we can use repeated observations – the time series in that small area – to garner additional information on whether the underlying rate is elevated or not. When the rate is truly elevated we will expect the observed rate to be high and to remain high through time, even though the variance about that rate may be large due to the small numbers problem. This illustrates the concept of persistence in spatial time series, and how we can use persistence to garner additional information regarding whether or not an observed rate is truly unusual.
Notice information on persistence in small areas can be lost when the data are first smoothed since repeated, elevated rates in areas with small populations will each “shrink to the mean”. I therefore prefer to inspect the spatial time series initially using the raw data themselves, and then make decisions about rate stabilization later.
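One rough way to put a number on persistence is to count how many periods an area's raw rate sits above a reference rate, and then ask how surprising that count would be by chance. The sketch below uses a simple sign test, which is my own illustration rather than a SpaceStat feature. It assumes that, with no true excess, each period is independent and equally likely to fall above or below the reference. The two towns and all rates are hypothetical.

```python
from math import comb


def periods_above(series, reference):
    """Count the periods in which the area's rate exceeds the reference rate."""
    return sum(1 for rate, ref in zip(series, reference) if rate > ref)


def sign_test_p(k, n):
    """P(at least k of n periods above the reference) if each is a fair coin flip."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


# Hypothetical rates per 100,000 for six five-year periods
reference = [50, 52, 55, 57, 60, 62]   # e.g. a state-wide rate
town_a = [95, 60, 110, 72, 88, 101]    # noisy, but always above the reference
town_b = [95, 30, 61, 44, 70, 39]      # same first value, then bounces around

for name, series in [("town A", town_a), ("town B", town_b)]:
    k = periods_above(series, reference)
    p = sign_test_p(k, len(series))
    print(f"{name}: above the reference in {k} of {len(series)} periods, p = {p:.3f}")
```

Both towns start with the same alarming rate, but only town A stays above the reference in every period. The same idea works for persistently low rates by counting periods below the reference instead.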
To try this out download the SpaceStat project for this blog (“SmallNumbersBlog2”). You can download and install a trial version of SpaceStat here (//www.biomedware.com/?module=Page&sID=spacestat).
When you load the project you should see something like this (above). The data come from the National Cancer Institute and we’ll be working with lung cancer mortality in white males in State Economic Areas (SEA’s) from 1970 to 1995. The rates are age-adjusted and reported per 100,000 population. The data are recorded in 5 year periods. I like to use the following protocol to inspect spatial time series for persistence that may indicate the existence of an underlying rate that is truly unusual.
1. Animate the maps and scatterplot by clicking on the play button on the animation toolbar. This gives you some sense of whether the distribution of rates as a group is changing through time. You notice on the scatterplot that the points tend to move to the right, reflecting population growth from 1970 through 1995. You also might notice the map turns redder as time progresses, suggesting that lung cancer mortality in white males is increasing.
2. These data form a spatial time series since they are recorded in five year periods, and we can inspect the time series using the time plot tool. I chose the variable “WLUNG”.
Inspection of the time plot confirms our observation that lung cancer mortality overall is increasing through time. Each line on the time plot is the lung cancer mortality rate in a given SEA. You can click the lines in the time plot to identify the SEA on the map to which they correspond.
3. Do you think the two SEA’s with very low rates in 1970 had low rates that persisted through time, or did those rates bounce around a lot perhaps due to the small numbers problem? To answer this question I brush select the two lowest rates on the scatterplot at 1970 (below).
Not only do those two SEA’s have the lowest rates in 1970, but those rates are persistently low over the next 25 years, all the way to 1995.
4. What might explain this persistence of low rates? To address this question I inspect the map, and right click on the selected areas to see where they are and who might live there (below). The SEA’s under consideration are Logan and Provo, Utah, which have large Mormon populations that do not smoke. Based on this, do you think the low mortality rates in these two SEA’s might most reasonably be attributed to behavioral factors or to the small numbers problem?
This blog I hope has illustrated how we can treat spatial time series as repeated experiments to evaluate whether extreme rates (whether high or low) persist through time in local populations. The health analyst now has at their disposal highly sophisticated tools for rate stabilization that typically involve local smoothing of rates and/or “shrinkage to the mean”. This includes popular techniques such as kernel smoothing and Bayesian rate stabilization. Be aware, however, that these methods can “smooth out” truly high (or low) rates. Evaluating persistence in spatial time series avoids this problem.
The ability to quickly create maps of health outcomes such as cancer incidence and mortality in counties, census areas and even Zip codes is now available through websites and data portals. (See for example Atlasplus, State Cancer Profiles, and Cardiovascular Disease, and many others.)
Often these maps appear with measures of the “uncertainty” in the underlying rate estimates (for example, the colon cancer incidence in Washtenaw County, Michigan). Sometimes all we see are the maps themselves, and the eye is immediately and quite naturally drawn to areas with extremely high rates. If you live there you are concerned; if you are a public health professional you might decide to evaluate the high rate, or put in place a health intervention to expand screening and access to care in that local area. But is the rate (e.g. colon cancer incidence in a given county) truly high, or might it be explained by the “small numbers problem”? What is the small numbers problem, and why should you care about it?
Rates are calculated from a numerator, such as the number of incident colon cancer cases, and a denominator, which here would be the population at risk (e.g. adults) for colon cancer. The rate is calculated by dividing the numerator by the denominator, and here is where the small numbers problem arises. It turns out that the variance in the rate depends critically on the size of the denominator. When the denominator is small, variance in the rate is high; when the denominator is large, variance in the rate is small. Hence the appearance of an apparently large rate might be due entirely or in part to the small numbers problem, and the true, underlying risk might be entirely unremarkable.
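A quick calculation makes the dependence on the denominator concrete. The sketch below assumes the case count behaves like a Poisson count and uses a normal approximation for the interval, which is crude when there are only a handful of cases. The function name `rate_with_interval` and the numbers are hypothetical.

```python
from math import sqrt


def rate_with_interval(cases, population, per=100_000, z=1.96):
    """Rate per `per` people with a rough 95% interval (normal approx. to Poisson)."""
    rate = cases / population * per
    # The standard error of the count is about sqrt(cases), so the
    # standard error of the rate shrinks as the population grows.
    half_width = z * sqrt(cases) / population * per
    return rate, max(rate - half_width, 0.0), rate + half_width


# Same observed rate, very different populations (made-up numbers)
small = rate_with_interval(5, 10_000)
large = rate_with_interval(500, 1_000_000)

for label, (rate, low, high) in [("small area", small), ("large area", large)]:
    print(f"{label}: {rate:.1f} per 100,000 (roughly {low:.1f} to {high:.1f})")
```

Both areas report the same rate, but the interval for the small area is about ten times wider, because the population is one hundred times smaller and the width scales with one over the square root of the population.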
So how can we quickly evaluate whether we need to worry about the small numbers problem? I like to use a simple protocol in SpaceStat™.
Step 1. Create a map of the rate.
Step 2. Create a scatterplot of the rate (on the y-axis) against the population at risk (on the x-axis).
Step 3. Inspect the scatterplot for the “Greater Than” signature (i.e. “>”), in which variance in the rate is larger at small population sizes.
Step 4. Brush select on the scatterplot to see where the areas with high rates and low population sizes appear on the map. These are the places with apparent high rates that are unstable due to the small numbers problem. You don’t have much confidence that these rates really are high!
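Outside of SpaceStat, the brush selection in Step 4 can be approximated with a simple filter. This is only a sketch: the cutoffs (top quarter of rates, bottom half of populations), the function name `flag_small_high` and the eight areas are assumptions chosen for illustration, and in practice you would pick the cutoffs by looking at the scatterplot.

```python
def flag_small_high(areas, rate_quantile=0.75, pop_quantile=0.5):
    """Names of areas with a rate at or above the rate cutoff and a
    population at or below the population cutoff."""
    rates = sorted(a["rate"] for a in areas)
    pops = sorted(a["pop"] for a in areas)
    # Crude quantiles: take the value at the truncated index position
    rate_cut = rates[int(rate_quantile * (len(rates) - 1))]
    pop_cut = pops[int(pop_quantile * (len(pops) - 1))]
    return [a["name"] for a in areas
            if a["rate"] >= rate_cut and a["pop"] <= pop_cut]


# Hypothetical areas: rate per 100,000 and population at risk
areas = [
    {"name": "A", "rate": 92, "pop": 4_000},
    {"name": "B", "rate": 35, "pop": 6_000},
    {"name": "C", "rate": 58, "pop": 250_000},
    {"name": "D", "rate": 61, "pop": 900_000},
    {"name": "E", "rate": 55, "pop": 120_000},
    {"name": "F", "rate": 88, "pop": 5_000},
    {"name": "G", "rate": 63, "pop": 400_000},
    {"name": "H", "rate": 20, "pop": 3_000},
]

flagged = flag_small_high(areas)
print("High rate, small population:", flagged)
```

Notice that the small areas in this made-up data hold both the highest and the lowest rates, which is the “>” shape from Step 3. The flagged areas are the ones whose high rates deserve the least confidence.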
Let’s see how this works. As an example, I used SpaceStat™ to explore the small numbers problem for lung cancer mortality in white males in state economic areas (SEA’s) in the 1970’s. I transformed the white male population size using the square root since the population distribution has a long right tail, as is typical of study areas that include rural and urban places. I then plotted lung cancer mortality as a function of the square root of population size (Figure 1, left). Do you see the “greater than” signature (i.e. “>”) that indicates variance in lung cancer mortality is much higher at smaller population sizes? I then brush selected on the scatterplot to identify areas with a high lung cancer mortality rate and low population size; these are now circumscribed with heavy gold borders on the map. Do you have much confidence that the areas outlined in gold have a true excess of lung cancer mortality, or do you think it might be explained by the small numbers problem? Clearly, the variance in the rate is high in these areas, and the high rate we observe might be due entirely to small numbers.
A future post will describe what to do in SpaceStat™ to take the small numbers problem into account in order to accurately find areas of high risk.