BioMeanings

Seeing the Patterns in Space and Time

Part 3: Spatial Autocorrelation and Clusters of Health Events

Posted by Geoffrey Jacquez, Ph.D.

Part 3 Neutral models

This is the third in a series on spatial autocorrelation and clusters of health events. The first part presented a framework for analyzing disease clusters that builds on the principles of strong inference. Strong inference involves enumerating all of the possible explanations of a disease cluster. Some may be causal (such as an environmental hazard or exposure linked aetiologically to the health outcome); some may not (such as geographic variation in the proportion of cases reported). Each of them may result in spatial autocorrelation in health events, the topic of the second part in this series. These sources of spatial autocorrelation may produce clusters, but they may also mask clusters that are attributable to another cause (such as an environmental exposure). How may one account for these multiple causes of a cluster so that the causes of interest to the investigator can be identified? This can be thought of as controlling for “nuisance” sources of spatial autocorrelation in the analysis. As an example, how might we account for background variation, that is, the spatial autocorrelation found in disease rates when a cluster process is absent? This of course is what we want to use as the “null hypothesis” for a statistical test for clustering, since random spatial patterns are rarely, if ever, found in the real world.

There are three approaches.

(1) Standardize the disease rate. For example, rates often are age-adjusted, and this removes spatial autocorrelation due to geographic variation in the age of the underlying population. This may be somewhat involved when multiple causes of “nuisance” spatial autocorrelation are present, and is not possible unless all of the sources of background variation have been identified.

(2) Formal mathematical models (e.g. regression, Bayesian, etc.) may be constructed with the nuisance sources of spatial structure on the right-hand side of the equation (as predictors). One then analyzes the regression residuals (e.g. using the local Moran test in SpaceStat) to identify clusters of excess risk. Again, this requires detailed knowledge of and data on the predictors.

(3) Neutral models may be used to account for background variation in disease rates, as described in this blog. Neutral models do not require that all of the sources of background variability be identified. They can be incorporated into inferential tests for clustering, and they support the cluster analysis framework of strong inference.
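
Approach (1) can be made concrete with a small sketch of direct age standardization. The two areas, the three age groups, the case counts and the standard population weights below are all hypothetical values chosen for illustration; they do not come from any study cited here.

```python
def crude_rate(cases, population):
    """Total cases divided by total population at risk."""
    return sum(cases) / sum(population)


def age_adjusted_rate(cases, population, standard_weights):
    """Direct standardization: weight each age-specific rate by the share
    of that age group in a common standard population."""
    total_weight = sum(standard_weights)
    return sum(
        (c / n) * (w / total_weight)
        for c, n, w in zip(cases, population, standard_weights)
    )


# Hypothetical data: three age groups (young, middle, old) in two areas.
# Both areas have identical age-specific rates (1, 5 and 20 per 1,000),
# but area B has a much older population than area A.
standard_weights = [0.40, 0.40, 0.20]  # assumed standard population shares
area_a = {"cases": [5, 15, 20], "population": [5000, 3000, 1000]}
area_b = {"cases": [2, 15, 80], "population": [2000, 3000, 4000]}

for name, area in (("A", area_a), ("B", area_b)):
    crude = crude_rate(area["cases"], area["population"])
    adjusted = age_adjusted_rate(
        area["cases"], area["population"], standard_weights
    )
    print(
        f"Area {name}: crude {1000 * crude:.2f}, "
        f"age-adjusted {1000 * adjusted:.2f} per 1,000"
    )
```

In this toy example the crude rates differ between the two areas only because of their age structure. Once both are weighted to the same standard population the difference disappears, which is the sense in which age adjustment removes one nuisance source of geographic variation.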

Role of neutral models

The previous blog provided an overview of some of the sources of spatial autocorrelation in health events. When exploring disease patterns and clusters, many of these sources of geographic variation may not be of direct interest. For example, we often may wish to account for spatial heterogeneity in population density when searching for the signature of causative exposures underlying clusters of disease. Here the idea is to search for clusters of health events above and beyond those attributable to geographic variation in population density. This concept can apply to any source of geographic pattern that may not be of direct interest, for example those members of the set of possible explanations described earlier under “Strong inference” and enumerated in “Sources of spatial autocorrelation in health events”.

When clustering health events one then incorporates geographic variability in covariates and other factors not considered to be of interest into the null hypothesis. Mechanistically, this usually is accomplished using approximate randomization that incorporates the observed patterns of variation in those factors not of direct interest into the null spatial model (Waller and Jacquez 1995). Models that accomplish this have been referred to as “neutral” rather than “null” models, to capture the idea that they account for more than just “complete spatial randomness”. Neutral models thus correspond to plausible system states that can be used as a reasonable null hypothesis (e.g. “background variation”) in disease cluster tests. The problem then is to identify spatial patterns not incorporated into the neutral model, enabling, for example, the detection of clusters above and beyond background or regional variation in the risk of developing disease.
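
The randomization idea can be sketched in a few lines. This is a minimal illustration and not the procedure of the cited papers: the test statistic (global Moran's I on rates), the way the null is simulated (the observed total number of cases is reallocated to areas in proportion to population size times an assumed background risk), and all of the data values are assumptions made for the example.

```python
import random


def morans_i(values, neighbors):
    """Global Moran's I with binary adjacency weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0
    num = sum(dev[i] * dev[j] for i in range(n) for j in neighbors[i])
    w_sum = sum(len(nb) for nb in neighbors)
    return (n / w_sum) * (num / denom)


def neutral_model_test(cases, population, background_risk, neighbors,
                       n_sim=999, seed=0):
    """Monte Carlo test of Moran's I against a neutral model in which cases
    fall in each area in proportion to population x background risk."""
    rng = random.Random(seed)
    n = len(cases)
    observed = morans_i([c / p for c, p in zip(cases, population)], neighbors)
    weights = [p * r for p, r in zip(population, background_risk)]
    total_cases = sum(cases)
    at_least_as_large = 0
    for _ in range(n_sim):
        simulated = [0] * n
        # Reallocate every case to an area under the neutral model.
        for area in rng.choices(range(n), weights=weights, k=total_cases):
            simulated[area] += 1
        sim_rates = [c / p for c, p in zip(simulated, population)]
        if morans_i(sim_rates, neighbors) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_sim + 1)
    return observed, p_value


# Hypothetical data: ten areas along a transect; each area neighbors the
# areas immediately before and after it.
population = [500, 800, 1200, 3000, 5000, 4000, 2500, 900, 600, 400]
background_risk = [0.010, 0.010, 0.011, 0.011, 0.012,
                   0.012, 0.013, 0.013, 0.014, 0.014]
cases = [5, 9, 14, 36, 60, 50, 30, 25, 18, 12]
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)]

observed, p_value = neutral_model_test(
    cases, population, background_risk, neighbors
)
print(f"Moran's I = {observed:.3f}, neutral-model p-value = {p_value:.3f}")
```

Because the simulated maps already carry the population heterogeneity and the background risk gradient, a small p-value here would point to spatial pattern beyond those two factors rather than to a departure from complete spatial randomness.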

A typology of neutral models that account for factors often encountered in analyses of health events defines neutral models of types I-VI (Goovaerts and Jacquez 2004). These neutral models are realistic in that they account for the spatial autocorrelation, non-uniform risk, and spatially heterogeneous population sizes that may be present in the absence of the cluster process. Model I is Complete Spatial Randomness (CSR), which is still widely used in health analysis even though it usually does not correspond to any plausible state of the system being studied. Model II reproduces the spatial autocorrelation that may be present in the observed data. Model III incorporates non-uniform variability in the underlying risk that may be attributable to risk factors and covariates that are not of direct interest. Models IV through VI account for the impact of population size and variability on the stability of observed rates, and are used to address the small numbers problem.

Neutral models thus play a critical role in scientific inference in disease pattern analysis, since they allow one to systematically incorporate different sources of geographic variation, including spatial autocorrelation, into the hypotheses being evaluated. In the framework of strong inference, one conducts a series of statistical analyses, systematically evaluating each of the hypotheses in the set of alternative explanations for the observed spatial patterns of health events. Each hypothesis is incorporated into the neutral model of a given spatial cluster test. If the test is significant, that hypothesis is rejected; if it is not significant, that hypothesis is retained in the set of plausible explanations for the observed spatial pattern.

As noted earlier, an alternative mechanism when knowledge of the system is sufficient is to construct more formal, detailed models using regression, geostatistical, and other modeling approaches.  The variability captured by the model is then attributable to the predictor variables, and clustering may then be applied to the regression residuals to quantify spatial pattern not captured by the model itself.  SpaceStat provides both cluster analysis and modeling techniques that are applied automatically to time-dynamic data, enabling more accurate inferences for health events.
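
As a rough sketch of the residual-based approach, the example below fits a one-predictor least-squares regression and then computes a local Moran statistic on the residuals. It is not SpaceStat's implementation and it omits the significance test; the covariate, the rates, the transect layout and the row-standardized weights are all assumptions made for illustration.

```python
def ols_residuals(x, y):
    """Residuals from a simple least-squares fit of y on one predictor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def local_moran(values, neighbors):
    """Local Moran's I for each area, using row-standardized weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n
    result = []
    for i in range(n):
        lag = sum(z[j] for j in neighbors[i]) / len(neighbors[i])
        result.append(z[i] * lag / m2)
    return result


# Hypothetical data: eight areas along a transect. The rate follows the
# covariate linearly, except for an excess in areas 3 and 4.
covariate = [10, 12, 14, 16, 18, 20, 22, 24]
rate = [7, 8, 9, 13, 14, 12, 13, 14]
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)]

residuals = ols_residuals(covariate, rate)
local_i = local_moran(residuals, neighbors)

# Candidate clusters of excess risk: a positive residual surrounded by
# similar residuals (positive local Moran's I).
high_high = [i for i in range(8) if residuals[i] > 0 and local_i[i] > 0]
print("Areas with excess not explained by the covariate:", high_high)
```

With these made-up numbers the flagged areas are 3 and 4 (zero-based), the two places where the rate exceeds what the covariate predicts. In practice one would also assess the significance of each local statistic, for example by randomization.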

References

Goovaerts, P. and G. M. Jacquez (2004). “Accounting for regional background and population size in the detection of spatial clusters and outliers using geostatistical filtering and spatial neutral models: the case of lung cancer in Long Island, New York.” Int J Health Geogr 3(1): 14.

Waller, L. A. and G. M. Jacquez (1995). “Disease models implicit in statistical tests of disease clustering.” Epidemiology 6(6): 584-590.

Part 2: Spatial Autocorrelation and Clusters of Health Events

Posted by Geoffrey Jacquez, Ph.D.

Part 2 Sources of Spatial Autocorrelation

Summary: This blog presents several of the sources of spatial autocorrelation in health event data. Many of these could plausibly lead to clusters of health events; others (such as interpolation autocorrelation) may act to “smooth out” true clusters. How can one identify the cause(s) of a cluster of health events from among these competing explanations? Part of the solution is the use of neutral models, and that is the topic of my next blog.

This is the second of a two-part blog on spatial autocorrelation and clusters of health events. The first part presented a framework for analyzing disease clusters that builds on the principles of strong inference. Strong inference involves enumerating all of the possible explanations of a disease cluster. Some may be causal (such as an environmental hazard or exposure linked aetiologically to the health outcome); some may not (such as geographic variation in the proportion of cases reported). Each of them may result in spatial autocorrelation in health events, that is, spatial patterns in which some areas are higher and some are lower, arranged in a manner that is non-random. One possible arrangement of high values is a cluster: a geographically localized area of elevated risk for the health event. Hence many of the sources of spatial autocorrelation in health events are potential causes of clusters, or may act to “smooth out” and obscure true clusters.

This raises a very important question. What are the sources of spatial autocorrelation in health events? These may need to be included in the set of explanatory hypotheses for an observed pattern, and include spatial autocorrelation in underlying risk factors, covariates, reporting, diagnosis, health care policies, physician behaviors, and interpolation autocorrelation, among others, as summarized below. This is by no means an exhaustive list, but includes factors that likely should be considered in many spatial analyses of health events.

Multifactorial causes of disease: It is important to recognize that many health outcomes may be caused by several different disease processes, and that a given exposure mechanism may result in different disease outcomes. For example, risk factors for myocardial infarction include genetic predispositions, diet, body weight, exercise habits, medication compliance, and access to care, among others. And specific exposures, such as smoking, are associated with elevated risk for a host of health outcomes, including bladder, throat, and lung cancers; asthma, pneumonia, and emphysema, among others. Spatial autocorrelation in health events may arise whenever the host of factors underlying disease expression are themselves spatially structured. Genetic predispositions for disease may be inherited, giving rise to spatial autocorrelation in disease risk whenever family members cohabitate and tend to live near one another; ambient air pollutant concentrations tend to be highly spatially autocorrelated, and so on.

Comorbidity and competing causes of death: It is unusual for a chronic disease to be the sole disease process occurring in a patient, especially as the age of the subject increases. This makes sense when one considers the multifactorial nature of most diseases. At the population level smoking will increase the risk of both lung and bladder cancers; at the individual level a smoker may have comorbid conditions such as emphysema and lung cancer. The expression of infection processes is often mediated by immune response and the health status of the individual. Hence risk of infection increases as the physical condition of the individual declines. Individuals and populations are thus both subject to competing causes of death. Prior to the advent of antibiotics in the 1940s, respiratory and childhood infections were major sources of mortality in most developed countries. As antibiotics became widely available the major source of mortality became chronic diseases such as heart conditions and cancer. These were “unmasked” once respiratory and childhood infections were removed as a competing cause of death. Spatial autocorrelation in health events thus may arise when there is underlying geographic variation in comorbid conditions and/or risks for competing causes of death.

Geographic variation in exposure and behaviors that mediate exposure: Health events associated with environmental exposures are mediated by exposure routes including eating, drinking, breathing, dermal exposures and ionizing radiation.  When considering health outcomes associated with exposure to specific risk factors, such as arsenic, one needs to consider relevant exposure routes and mechanisms, such as consumption of foods and beverages containing biologically active forms of arsenic. Exposure-mediating behaviors are often modifiable risk factors, since what one smokes, drinks and eats are to a certain extent individual choices that can be changed. When evaluating spatial patterns in health outcomes associated with environmental exposures one needs to consider both environmental concentrations as well as the exposure routes whereby the compound under consideration enters the body.   Both of these (environmental concentrations and exposure routes and mechanisms) may themselves be spatially structured.  For example, the amount of water people drink varies with age, decreasing as one gets older, with occupation (farm workers requiring more water than office workers), with altitude and other factors.

Socio-economic and demographic factors: One definition of “covariate” is a variable that has an effect (e.g. is associated with the outcome) that is not of direct interest.  When modeling health events such as disease incidence, socio-economic and demographic factors such as age may be considered as covariates, since age (for example) does not of itself cause disease.  Yet these are of considerable importance when evaluating spatial disease patterns, since the risk of most health outcomes including cancer, heart disease, and infections, is typically associated with socio-economic status, sex, race and age.  One thus may need to account for spatial patterns in covariates when assessing the significance of a clustering of health events.  Rather than asking “are the health events clustered?” one instead may ask “Is there significant clustering of health events above and beyond spatial patterns in covariates?”  Neutral models (described later) have been developed to address this question.

Genetics: Micro-evolutionary processes such as selection, isolation-by-distance, and migration give rise to spatial autocorrelation in genetic structure and genetic variance in geographically distributed populations. While we often think of human populations as being very well mixed, interbreeding freely over large geographic distances, this is often not the case. Population genetic structure in North American, European and Asian populations has been demonstrated to be spatially autocorrelated, and associated with language and dialect (Sokal, Jacquez et al. 1993). This makes sense when one considers that children speak the language of their parents and family, and that family members tend to live in geographic proximity to one another, even though some may travel far from their homes. Familial clusters are often observed for many cancers, both because of behavioral factors mediating common exposures such as secondhand smoke and diet, and because of within-family genetic similarity in oncogenes and tumor suppressor genes. For example, one hypothesis for explaining the excess of breast cancer incidence on Long Island is the higher incidence of mutations of BRCA genes in local populations thought to be descended from European populations, where these BRCA mutations are more frequent. BRCA1 and BRCA2 are tumor suppressor genes, and mutations of these genes have been linked to breast and ovarian cancers.

For infectious diseases the pathogen, whether a virus or a bacterium, undergoes a population bottleneck whenever infection is transmitted to a new susceptible person. Only a small number of pathogens (e.g. several thousand) may be required for infection to take hold, and together these may have a genetic composition that is quite different from the overall pathogen population. A mutation that occurs during a bottleneck can become fixed in the pathogen population infecting that host (person). When this mutation is associated with changes in infection transmission or severity of the infection, it can have important consequences for the spread of infection, as well as for morbidity, mortality and resistance to treatment. Such mutations can give rise to new pathogen strains, and the occurrence of these strains may be observed as outbreaks of the new strain, initially occurring in localized populations. This has been documented for diverse infectious diseases including cholera, tuberculosis, HIV and influenza, among others.

Perhaps one of the best known instances of interactions between infection and genetics is selective pressures for the sickle cell trait that foster resistance to malaria infection (Livingstone 1958).  In sickle cell disease the red blood cells are misshapen, leading to circulatory problems, and early death of red blood cells, resulting in anemia.  The disease has a genetic basis, with alleles that code for the sickle cell trait and for abnormal hemoglobin resulting in different forms of the disease of varying severity.  But when one sickle cell allele is present it confers some resistance to malaria infection.  This confers a substantial selective pressure in populations residing in malarial regions.  The sickle cell trait, and sickle cell anemia, thus vary geographically with higher penetration of the sickle cell gene in populations residing where malaria is endemic.

Vector-borne diseases and parasites often have complex life histories, involving infection transmission and amplification among humans and one or more host organisms.  Well-known examples include malaria, Lyme disease, and West Nile virus, among others.  Here, spatial structure in the genetics of the pathogen can arise due to the interactions between population bottlenecks and mutations, as noted above for infectious diseases.  The genetics of the host species can also influence the origin and spread of different pathogen strains.

Environment/vector-pathogen ecology: Environmental patchiness in habitats suitable for vector and host organism survival is an important determinant of where and when vector-borne and parasitic infections occur. In the northeastern and mid-western United States, the white-tailed deer (Odocoileus virginianus) is an important host species for Lyme disease, which is transmitted by a bite from infected blacklegged ticks. Infection transmission events can only occur where both infected ticks and susceptible people are present. Blacklegged tick habitat includes wooded, brushy areas that provide food and cover for intermediate host species such as white-footed mice and white-tailed deer. But infection transmission to humans only occurs when people are in areas where infected ticks are present and feeding. Thus the occurrence of Lyme disease is highly associated with geographic overlap of human activity spaces with habitat suitable for both intermediate hosts and the tick itself. Infection transmission is highly structured temporally as well, occurring in those months when the tick is searching for blood meals in the spring and fall.

Heterogeneity in population density, rate stability, and the small numbers problem: Health events that occur in small areas may be expressed as a rate, such as an incidence or mortality rate. Rates are calculated from a numerator, such as the number of incident lung cancer cases in white males, and a denominator, such as the population at risk (e.g. white males) for lung cancer. The rate is calculated by dividing the numerator by the denominator, and this is where the “small numbers problem” arises. The variance in the rate depends critically on the size of the denominator: when the denominator is small, variance in the rate is high; when the denominator is large, variance in the rate is small. Hence an apparently large rate might be due entirely or in part to the small numbers problem (i.e. a small denominator with a resulting large variance in the rate estimate), and the true, underlying risk might be entirely unremarkable. A simple protocol for evaluating whether the small numbers problem is having an impact on estimated rates is as follows. First, create a map of the rate and a scatterplot of the rate (on the y-axis) against the population at risk (on the x-axis). Next, inspect the scatterplot for the “Greater Than” signature (i.e. “>”), such that variance in the rate is larger at small population sizes (Figure 1). Finally, brush select on the scatterplot to see where the areas with high rates and low population sizes appear on the map. These are the places with apparent high rates that may be unstable due to the small numbers problem.

Figure 1. Simple diagnostic for the small numbers problem.

A plot of the lung cancer mortality rate for white males (y-axis) versus the square root of the white male population (x-axis) demonstrates the “>” signature, with higher variance in the rate at small population sizes.  Brush selection on the scatterplot (the 3 large red circles in dashed rectangular box) locates the areas with high mortality rates that may be unstable since they have small denominators.  Calculated in BioMedware SpaceStat software.

Variability in rates due to the small numbers problem, if not corrected for, can give rise to artifactual spatial structure in the estimated rates. For example, the three brush-selected areas with high rates in Figure 1 are high spatial outliers. When clustering rates it therefore is important to use statistical techniques that either stabilize the rates by constructing local populations with similar denominator sizes, or that account for denominator size when assessing statistical significance.
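
The dependence of rate variance on the denominator can be illustrated with a short simulation. It assumes a simple binomial model in which every person has the same true risk, so that the standard error of a rate is sqrt(p(1 - p) / n); the risk of 1%, the population sizes and the number of simulated areas are arbitrary choices for the example.

```python
import math
import random
import statistics


def rate_standard_error(true_risk, population):
    """Standard error of a rate under a binomial model."""
    return math.sqrt(true_risk * (1 - true_risk) / population)


def simulate_rates(true_risk, population, n_areas, rng):
    """Simulate observed rates for areas that all share the same true risk."""
    rates = []
    for _ in range(n_areas):
        cases = sum(1 for _ in range(population) if rng.random() < true_risk)
        rates.append(cases / population)
    return rates


rng = random.Random(42)
true_risk = 0.01  # identical underlying risk everywhere (assumed)

for population in (100, 1000, 10000):
    rates = simulate_rates(true_risk, population, 100, rng)
    print(
        f"population {population:>6}: "
        f"min rate {min(rates):.4f}, max rate {max(rates):.4f}, "
        f"observed SD {statistics.pstdev(rates):.4f}, "
        f"theoretical SE {rate_standard_error(true_risk, population):.4f}"
    )
```

Even though the true risk is the same in every simulated area, the areas with the smallest denominators show the widest spread of observed rates. Plotted with population on the x-axis, this spread is what produces the “>” signature.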

Interpolation autocorrelation: Smoothing rates in an attempt to adjust for rate instability can introduce spatial autocorrelation due to interpolation.  Smoothing introduces nuisance autocorrelation whenever the kernels used to accomplish the smoothing overlap.  Examples include inverse distance smoothing, empirical Bayesian smoothing, and others.  Here, the spatial scale of the autocorrelation introduced by smoothing will depend on the kernel size.  When assessing clusters it may be inappropriate to cluster rates after first smoothing them, since the smoothing step can introduce the appearance of artifactual local similarity in rates that is attributable to interpolation rather than to underlying disease processes.  One thus may wish to use smoothing when displaying maps of the rates, but employ techniques that explicitly account for denominator size to evaluate clustering.
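
A one-dimensional sketch shows how smoothing alone can create autocorrelation. The input is independent random noise along a transect, so any autocorrelation in the output is introduced by the overlapping windows. The moving-average smoother is a simple stand-in for the kernel smoothers named above, and the window sizes are arbitrary.

```python
import random


def moving_average(values, half_width):
    """Smooth with a window of up to 2 * half_width + 1 adjacent areas."""
    n = len(values)
    smoothed = []
    for i in range(n):
        window = values[max(0, i - half_width):min(n, i + half_width + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def lag1_autocorrelation(values):
    """Correlation between each value and its immediate neighbor."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    den = sum(d * d for d in dev)
    return num / den


rng = random.Random(1)
# Independent noise: no spatial structure in the underlying values.
raw = [rng.gauss(0, 1) for _ in range(2000)]

print(f"raw values:          {lag1_autocorrelation(raw):.2f}")
for half_width in (1, 3):
    smoothed = moving_average(raw, half_width)
    print(
        f"window of {2 * half_width + 1} areas:   "
        f"{lag1_autocorrelation(smoothed):.2f}"
    )
```

Neighboring smoothed values share most of the same inputs, so they resemble one another even though the raw values do not, and the effect grows with the window size. This is why clustering smoothed rates can detect structure that the smoother itself put there.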

Access to screening, care and treatment: Access to health care and screening facilities can give rise to spatial autocorrelation in health events since both screening and treatment influence health outcomes. For example, several studies have demonstrated that access to breast cancer screening facilities is significantly associated with geographic differences in stage at diagnosis, with late-stage cancers more frequent in populations distant from breast cancer screening facilities (Meliker, Goovaerts et al. 2009). Poorer populations are particularly impacted by access to screening, since availability of transport and travel times may pose barriers to seeking health screening. Another example involves the use of mosquito nets, malaria incidence, and distance to the clinics that distribute the nets (Enayati and Hemingway 2010). In agrarian rural areas of Malawi with poor roads, a distance of 10 kilometers to the nearest clinic where mosquito nets are distributed may involve a full-day round trip. Not surprisingly, studies have demonstrated that households nearer to clinics have higher mosquito net usage rates than households that are distant. A useful intervention then is to distribute the mosquito nets directly to the households.

Neighborhood/contextual effects: Neighborhood and related contextual effects can have negative impacts on human health status that exceed the impacts of covariates such as socio-economic status and access to care that themselves may vary dramatically from one neighborhood to another (Spielman and Yoo 2009). Hypotheses suggest that perception of personal safety and quality of the neighborhood living environment can result in chronic stress that leads to reduced immune function and increased disease susceptibility, elevated blood pressure, and heart disease. One mechanism is the interaction between chronic stress, elevated cortisol and immune system status, such that chronic stressors are associated with suppression of both cellular and humoral measures of immune system function (Segerstrom and Miller 2004). Neighborhoods thus may be associated with spatial autocorrelation in health effects through direct effects such as socio-economic determinants (e.g. income and health insurance), environmental factors (such as air quality), and contextual effects that impact stress and immune function (Li and Chuang 2009).

Differences in response to health care policy: Policies related to health care, treatment, drug development and deployment, and others, can have substantial impacts on health outcomes that may differ from one geographic area to another. In the United States the states often have a fair amount of flexibility in how they implement national policies. For example, the Centers for Disease Control and Prevention (CDC) is required to conduct the Behavioral Risk Factor Surveillance System (BRFSS), an on-going telephone health survey system that has tracked health conditions and risk behaviors in the United States annually since 1984. Data are collected monthly by all 50 states, the District of Columbia, Puerto Rico, the U.S. Virgin Islands, and Guam. A core portion of the health survey questions comes from the CDC, but states can supplement the survey with their own optional modules, and the BRFSS variables may thus vary from one state to another.

In addition, health policies can have differential impacts on physician behaviors that are not immediately apparent when the policies are drafted. For example, a recent study explored geographic variation in use of physician-administered chemotherapeutic agents under Medicare Part B, in response to a major reform of Medicare’s reimbursement system under a new health policy act (Jacobson, Earle et al. 2011). Physician prescription behavior in response to the payment change varied from state to state. Some states increased treatments with certain chemotherapeutic agents by 4%, and a few actually reduced treatment rates. The state-to-state differences are statistically significant, with the null hypothesis that the change in chemotherapy treatment was the same across states rejected at the alpha < 0.001 level.

Healthy worker and geographic attractors: The “healthy worker effect” describes the reduced disease risk observed among employed individuals in many industries, and that cuts across different diseases.  This can give the false appearance of no differences in risk between workers employed in a given industry when compared to the larger population, even though substantial occupational risks may be present (Fornalski and Dobrzyński 2010).  Workers tend to follow employment opportunities, and the establishment of large manufacturing facilities can attract a cohort of healthy workers resulting in an apparent deficit of disease risk in neighborhoods where these workers reside.  A related phenomenon is that of the “geographic attractor” that arises after health conditions are diagnosed.  Here, individuals decide to move nearer hospitals, clinics and treatment centers to ease health care access.  When they die, the place of death is recorded as their last known residence, leading to an apparent excess of disease near treatment facilities.

Outbreaks/spread of infection: Infectious diseases transmitted through the air, through sexual contact, fomite transmission, by drinking water contaminated with pathogens, and through other means, often require infected and susceptible individuals to be in close proximity to one another.  This is true for pathogens with limited life-spans outside the human body (e.g. influenza viruses), but is less true for those with a dormant phase that can survive outside the body for extended periods (such as anthrax spores).  For highly infectious pathogens transmitted from person-to-person we may observe an initial outbreak from an index case (the first case to appear in a local population), that is followed by a spatial “wave” of infection that moves outward from the location of the index case.  This may be followed by an endemic phase characterized by the maintenance of lower levels of infection in the population characterized by local outbreaks, or by the infection spreading rapidly and dying out.  Geographic pattern in the spread of infection is mediated by complex interactions between the probability of infection transmission, the contacts between infected and susceptible individuals, the life history of infection including the duration and timing of the infective stage, mobility of infected and susceptible individuals, timing of the rise and waning of immunity, the virulence of infection, as well as other factors (Sattenspiel and Lloyd 2010).

Immunity: When considering spatial autocorrelation in the spread of infectious diseases the geography of immunity can be an important consideration (Gao and Hua 2010). Issues include the waning of immunity, herd immunity, vaccination behaviors, and vaccine availability and distribution (Funk, Salathé et al. 2010). When pathogens enter the body the immune system develops antibodies to fight the infection. Immune response is said to wane as the concentrations of antibodies specific to that pathogen decrease over time. When immunity has waned sufficiently, the person may then become infected once again. This process can result in the appearance of clusters where members of a local population are infected, become immune, and then experience a resurgence of infection as immunity wanes, resulting in space-time patterns in infection.

Vaccination confers immunity without having to undergo a full-blown infection. Herd immunity is the protection from infection that arises when a sufficiently large proportion of the population has been vaccinated. Infection transmission halts when enough individuals are vaccinated and immune, conferring protection even to those who have not been vaccinated. Vaccination itself often follows geographic distribution and adoption patterns. Hence the vaccine distribution strategy can impact the timing of when immunity is conferred by vaccination and thus the geographic spread of infection. Vaccination itself can have intriguing side effects in terms of disease ecology. The eradication of smallpox is one of the great public health triumphs of our time, in which the global distribution and administration of the smallpox vaccine eradicated the disease (Alasdair M) in natural populations. Immunity to smallpox confers partial immunity to a related infection, monkeypox. Once smallpox was eradicated the smallpox vaccination program stopped, and outbreaks of monkeypox infections are now increasing (Rimoin, Mulembakani et al. 2010).

Geographic variation in positional error: It is well known that errors in case ascertainment and incomplete reporting can complicate the detection of disease clusters (Kingsley, Schmeichel et al. 2007). Positional error can also impact cluster detection, in at least two ways (Jacquez 2012). First, geographic confounding arises when geographic variation in risk factors is associated with geographic variability in positional error. The potential for this is larger than one might expect, as positional error in geocoded places of residence is larger in rural areas, a gradient similar to that of certain environmental risk factors and socio-economic and demographic variates. Second, positional error decreases the power to detect true clusters. Hence our ability to detect clusters from place of residence data will vary geographically when gradients in positional error are present.

Migration/Latency: Both chronic and infectious diseases have a latency between the exposures that lead to the onset of disease and its diagnosis. For cancers this latency can be a decade or more; for infectious diseases such as influenza it may be days. Because humans are mobile, the geographic pattern of where individuals were when they were exposed may differ dramatically from where they are when they are diagnosed. Consider the example of breast cancer. A re-analysis of a case control study of breast cancer in Marin County, California, mapped incident cases and controls from 1997 to 1999 as they were enrolled in the study (Jacquez, Barlow et al. 2011). Breast cancer is a complex disease thought to have long latencies on the order of decades, although a small proportion of cases do appear in childhood and adolescence. The geographic pattern of where women lived over their life course differs dramatically from where they lived when they were diagnosed (Figure 2). For many health outcomes, geographic patterns in cases at time of diagnosis may differ dramatically from those observed at disease onset.

Figure 2. Locations of places of residence of breast cancer cases (circles) and controls (plus symbols).

Places of residence over the life course, in the US (top) and California (lower left), may differ dramatically from places of residence at time of diagnosis in Marin County (lower right). Source: Jacquez, Barlow et al. 2011.

References

Alasdair M, G. “The history of smallpox.” Clinics in Dermatology 24(3): 152-157.

Enayati, A. and J. Hemingway (2010). “Malaria Management: Past, Present, and Future.” Annual Review of Entomology 55(1): 569-591.

Fornalski, K. W. and L. Dobrzyński (2010). “The healthy worker effect and nuclear industry workers.” Dose-Response 8(2): 125-147.

Funk, S., M. Salathé, et al. (2010). “Modelling the influence of human behaviour on the spread of infectious diseases: a review.” Journal of The Royal Society Interface 7(50): 1247-1256.

Gao, K. and D.-y. Hua (2010). “Effects of immunity on global oscillations in epidemic spreading in small-world networks.” Physics Procedia 3(5): 1801-1809.

Jacobson, M., C. C. Earle, et al. (2011). “Geographic Variation in Physicians’ Responses to a Reimbursement Change.” New England Journal of Medicine 365(22): 2049-2052.

Alasdair M, G. “The history of smallpox.” Clinics in Dermatology 24(3): 152-157.

Enayati, A. and J. Hemingway (2010). “Malaria Management: Past, Present, and Future.” Annual Review of Entomology 55(1): 569-591.

Fornalski, K. W. and L. Dobrzyński ( 2010). “The healthy worker effect and nuclear industry workers.” Dose-Response 8(2): 125 – 147.

Funk, S., M. Salathé, et al. (2010). “Modelling the influence of human behaviour on the spread of infectious diseases: a review.” Journal of The Royal Society Interface 7(50): 1247-1256.

Gao, K. and D.-y. Hua (2010). “Effects of immunity on global oscillations in epidemic spreading in small-world networks.” Physics Procedia 3(5): 1801-1809.

Jacobson, M., C. C. Earle, et al. (2011). “Geographic Variation in Physicians’ Responses to a Reimbursement Change.” New England Journal of Medicine 365(22): 2049-2052.

Jacquez, G. M. (2012). “A research agenda: Does geocoding positional error matter in health GIS studies?” Spatial and Spatio-temporal Epidemiology (In Press).

Jacquez, G. M., J. Barlow, et al. (2011). Residential mobility and breast cancer in Marin County, California. 4th International Cartographic Association Workshop on Geospatial Analysis and Modeling. Simon Fraser University, Burnaby, Canada.

Kingsley, B. S., K. L. Schmeichel, et al. (2007). “An update on cancer cluster activities at the Centers for Disease Control and Prevention.” Environmental Health Perspectives 115(1): 165-171.

Li, Y.-S. and Y.-C. Chuang (2009). “Neighborhood Effects on an Individual’s Health Using Neighborhood Measurements Developed by Factor Analysis and Cluster Analysis.” Journal of Urban Health 86(1): 5-18.

Livingstone, F. B. (1958). “Anthropological Implications of Sickle Cell Gene Distribution in West Africa.” American Anthropologist 60(3): 533-562.

Meliker, J. R., P. Goovaerts, et al. (2009). “Breast and prostate cancer survival in Michigan: can geographic analyses assist in understanding racial disparities?” Cancer 115(10): 2212-2221.

Rimoin, A. W., P. M. Mulembakani, et al. (2010). “Major increase in human monkeypox incidence 30 years after smallpox vaccination campaigns cease in the Democratic Republic of Congo.” Proceedings of the National Academy of Sciences 107(37): 16262-16267.

Sattenspiel, L. and A. Lloyd (2010). The geographic spread of infectious diseases: models and applications, Princeton University Press.

Segerstrom, S. C. and G. E. Miller (2004). “Psychological Stress and the Human Immune System: A Meta-Analytic Study of 30 Years of Inquiry.” Psychological Bulletin 130(4): 601-630.

Sokal, R. R., G. M. Jacquez, et al. (1993). “Genetic relationships of European populations reflect their ethnohistorical affinities.” Am J Phys Anthropol 91(1): 55-70.

Spielman, S. E. and E.-h. Yoo (2009). “The spatial dimensions of neighborhood effects.” Social Science & Medicine 68(6): 1098-1105.

Part 1: Spatial Autocorrelation and Clusters of Health Events

Posted on by Geoffrey Jacquez, Ph.D.

Part 1 Strong Inference

The Centers for Disease Control as well as state and local health agencies use information on clusters of health events to respond to cluster allegations brought forward by a concerned public; identify impacted local populations (where are communities with excess childhood leukemia?); guide interventions (where are clusters of late stage diagnosis of breast cancer – we may want a screening facility there); and for program evaluation (are clusters of excess colorectal cancer disappearing in response to my intervention?).

Clusters of health events are also used to increase our understanding of disease aetiology, and thus play a role in spatial epidemiology and medical geography.  We almost always wish to increase our understanding of the causes underlying clusters of health events.  Why are they there, and what has caused them?

To address this question we require a reasonable and sound framework for analyzing clusters of health events, and an understanding of the causes that plausibly might explain disease clusters.  That is the motivation for this 2-part blog.  Part 1 proposes an analytical framework, called Strong Inference, to guide cluster investigation.  Part 2 then enumerates the possible sources of spatial autocorrelation in health events, those factors that could give rise to health event clustering.  This blog contains excerpts from an essay I wrote for the Springer Handbook of Regional Science.

Scientific inference from patterns of health events

Health event clusters may loosely be defined as statistically significant excesses of health events in space, in time, or in space-time.  There also is space-time interaction, as when nearby health events occur at about the same time.  Cluster existence, location and timing can inform decisions regarding different questions, such as:

  1. Is an observed pattern of health events statistically unusual? (Is apparent clustering real?)
  2. Where are populations with elevated disease rates? (Where are local excesses found?)
  3. Are areas with elevated health events found in proximity to geographic features thought to be associated with disease causality? (Is there focused clustering about pollutant sources?)
  4. Is the observed spatial pattern of health events consistent with certain hypothesized disease processes, and not consistent with others (what is the underlying cause)?
  5. Are there reasonable new hypotheses that might explain the observed disease patterns (what is the best explanation for the cluster)?

Several of these questions can be addressed using an inferential process where plausible generating processes for an observed pattern are considered and then excluded.  This can be done in a haphazard fashion, but it usually is best to systematically enumerate the set of plausible hypotheses that might give rise to an observed pattern of health events, and to then exclude members of this set by conducting a series of experiments that may include statistical tests and models for evaluating space-time disease patterns.  This inferential framework seeks to accomplish a mapping of health event patterns to the spatial processes that might give rise to them, and is called Strong inference.

Strong inference for health events

In 1964, Platt coined the term “Strong inference” (Platt 1964) to describe a useful construct for systematically evaluating explanatory hypotheses that plausibly might explain observed patterns in a data set.  It involves first, enumeration of the explanatory hypotheses that might give rise to the pattern; second, formulation of falsifiable predictions that can be used to systematically test each of these hypotheses; third, undertaking the tests of predictions; and fourth, winnowing out the hypotheses whose corresponding predictions are found to be false.  The remaining hypotheses then must include, or together explain, the observed data patterns.  The initial set of explanatory hypotheses may be expanded as the experiments are conducted.  What is key is that the predictions framed for each hypothesis be falsifiable (for example, testable using a statistic for spatial clustering), and that the set of explanatory hypotheses be properly framed.
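As a rough illustration of what a falsifiable prediction might look like in practice, the sketch below computes global Moran's I and a permutation p-value for a set of area rates. It is a minimal example under stated assumptions, not the procedure used in SpaceStat: the function names, the chain-shaped neighbor structure and the ten rates are all hypothetical, and the null model is simple random reshuffling of rates among areas, which is the most basic null one could choose.

```python
import random

def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * (cross / sum(d * d for d in dev))

def permutation_test(values, weights, n_perm=999, seed=0):
    """One-sided p-value: how often does reshuffling match or beat the observed I?"""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_perm + 1)

def chain_weights(n):
    """Hypothetical geography: areas in a row, each adjacent to its neighbors."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0
    return w

# Hypothetical rates forming a smooth gradient along the row of areas.
rates = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
observed, p_value = permutation_test(rates, chain_weights(len(rates)))
print(f"Moran's I = {observed:.3f}, permutation p = {p_value:.3f}")
```

The prediction "rates are arranged at random with respect to location" is rejected here because the reshuffled patterns almost never show as much autocorrelation as the gradient does. A different hypothesis would call for a different statistic or a different null model.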

Sources of spatial autocorrelation in health events

This raises a very important question.  What are the sources of spatial autocorrelation in health events?  These may need to be included in the set of explanatory hypotheses for an observed pattern, and include spatial autocorrelation in underlying risk factors, covariates, reporting, diagnosis, health care policies, physician behaviors, and interpolation autocorrelation, among others.  This is by no means an exhaustive list, but includes factors that likely should be considered in many spatial analyses of health events.  The sources of spatial autocorrelation in health events are the topic of blog 2 in this series.

Platt, J. (1964). “Strong inference.” Science 146: 347-353.

Directions Magazine: Geocoding Comes to the Forefront

Posted on by Geoffrey Jacquez, Ph.D.

A report from the First International Geospatial Geocoding Conference (IGGC) by Daniel W. Goldberg and Geoffrey M. Jacquez provides an overview of the intensive 2-day information sharing event attended by geocoding users, developers, scientists and researchers.

The First International Geospatial Geocoding Conference (Redlands, California – Dec 6-7, 2011)

Posted on by Geoffrey Jacquez, Ph.D.

By Geoffrey M. Jacquez, PhD (BioMedware) and Daniel Goldberg, PhD (University of Southern California)

This blog provides a quick take on the First International Geospatial Geocoding Conference. I am pleased to write this blog with Dan Goldberg, conference organizer, with each of us providing impressions and lessons learned.

Conference Summary: The 1st International Geospatial Geocoding Conference was held Dec 6-7, 2011, on the campus of Esri in Redlands, California. The organizing committee included representatives from industry and academe. The meeting was funded by a conference grant from the Centers for Disease Control and Prevention (CDC) National Center for Environmental Health (NCEH) and The Agency for Toxic Substances and Disease Registry (ATSDR) under their Public Health Conference Support Program and was sponsored by Esri, Navteq, and the University of Southern California. With Dr. Dan Goldberg of the University of Southern California as the lead organizer, the conference brought together nearly 200 attendees from around the world. Two main themes were pursued: advances in geocoding technology and practice, and geocoding in health. These resulted in two special issues in the journals Transactions in GIS (technology and practice, John Wilson Editor-in-Chief), and Spatial and Spatio-temporal Epidemiology (geocoding in health, Andrew Lawson Editor-in-Chief). The presentations will be posted on the conference website; keep checking if they are not up when you read this blog.

Geoffrey Jacquez: I attended to learn more about geocoding techniques and emerging technologies; and to assess how these might impact geohealth, my area of specialization. Why worry? E-health, Health 2.0 and related initiatives are advancing rapidly (check out the recent mHealth Summit keynote address by Secretary of Health and Human Services Kathleen Sebelius for a summary of recent innovations), and electronic health records are rapidly being adopted. In 2011, 34% of physicians are using electronic health records, with 52% reporting they intend to adopt them soon. The e-health era is clearly upon us. An estimated 85% or more of e-health records contain georeferencing of some kind, typically addresses of patients, clinics, physicians, laboratories and pharmacies. Geocoding converts addresses into geographic coordinates, and these coordinates then are used to calculate disease rates in areas (e.g. counties); to site health screening facilities (e.g. to be sure mammography screening clinics are near populations that will benefit from screening); to identify disease clusters, and for a host of other purposes. Does the process of geocoding impact the decisions made from e-health records? This question I hoped to answer by attending the conference.

The conference opened with a keynote presentation by Don Cooke, considered by many to be one of the fathers of geocoding. Don gave a tour and history of geocoding, from its first uses at the US Census, through DIME and TIGER files, and on to commercialization of the technologies through GDT and other companies. He spoke of the continuing need to focus the technologies in order to accurately encode geographic coordinates, and in passing used the term “baloney filter”. I hadn’t heard this before, and for me it captured the idea of separating the gold from the dross, of determining early on whether something matters, or should be filtered out as “baloney”. And that of course is what we need to do when assessing whether geocoding accuracy makes a difference in e-health: Assess whether geocoding accuracy impacts health policy decisions, or can be filtered out as something we don’t need to worry about.

Over the course of the next 2 days, I attended talks concerned with assessing accuracy and positional error in geocoded coordinates, addressing topics such as error magnitude, sources of geocoding error, propagation of error into geohealth analysis, and the assessment of how geocoding error alters analysis results. By the end of the meeting I was convinced that geocoding errors indeed may alter analysis results, and may be large enough to qualitatively change resulting health policy decisions. Examples cited at the conference include decreased power to detect true disease clusters when geocoding error is present, errors introduced into accessibility metrics, and the underestimation of odds ratios used to assess health-environment relationships in epidemiological studies, among others. Further examples may be found in the special issue on geocoding in epidemiology. But our present understanding of how geocoding error affects health policy decisions is meager, and a research agenda is needed to better understand the size of the problem, whether poor decisions have been made in the past, and to assure appropriate decisions are made in the future. From my perspective, such a research agenda might address five needs: 1) A lack of standardized, open-access geocoding resources for use in health research; 2) A lack of geocoding validation datasets that will allow the evaluation of alternative geocoding engines and procedures; 3) A lack of spatially explicit geocoding positional error models; 4) A lack of resources for assessing the sensitivity of spatial analysis results to geocoding positional error; 5) A lack of demonstration studies that illustrate the sensitivity of health policy decisions to geocoding positional error. See my paper that is appearing in the special issue of Spatial and Spatio-Temporal Epidemiology, “A research agenda: Does geocoding positional error matter in health GIS studies?”, for details.

Dan Goldberg:

With the assistance of Geoff and my fellow conference organizing committee members, we had hoped to use the opportunity of the first international conference on geocoding to bring together folks from every aspect of the geocoding spectrum to discuss consistent challenges and emerging opportunities for geocoding research and practice. What occurred exceeded our expectations on every level. The keynotes from Don Cooke and Mark Greninger were both entertaining and extremely informative for a variety of reasons. First off, it was extremely interesting to find out that Don edited the proceedings of a conference on Geocoding hosted by URISA way back in the early days of geocoding; so, this current conference was by no means the first conference on geocoding. Even though I like to consider myself somewhat of an expert on the history of geocoding, Don’s recounting of the many key points and players in the history of the geocoding research, development, and application assured me yet again that there are many people who must be thanked for the processes and tools we take for granted every day – can you even imagine a world without TIGER?

By all accounts, Don’s talk was one of, if not THE, highlight of the conference. However, not to be outdone, Mark’s keynote was extremely motivating and full of exemplar cases of when and how geocoding can go wrong and what impacts these can have on the delivery of critical services. In the context of the eighth largest state in the country (LA County is by many metrics larger than 42 of the 50 states) Mark highlighted the consequences to public safety, government service delivery, and effective representation that can occur should problems arise at any of the many levels of the complex processes of geocoding and address management.

As the program committee had hoped, the parallel paper presentations and lightning talks were quite successful in representing the broad spectrum of geocoding concerns present at the meeting, ranging from the impacts on health analysis of incomplete or incorrect geocoding to new applications of geocoding to emerging data types such as online Twitter feeds. A wide variety of presenters from academia, industry, and government, representing diverse geographic locations including Brazil, the United Kingdom, Australia, Canada, and the United States, presented their latest work in the production, analysis, representation, visualization, and utilization of geocoded data. Many of the talks focused on the repercussions of poor geocoding in data analysis, while others introduced new methods for improving the quality of geocoded data as well as techniques for enabling useful analysis in the ever-present case that geocoding does not work perfectly for every record in a dataset.

Q and A was lively after every presentation I attended, evidence that the topics discussed were relevant, timely, and familiar to many of those in attendance. Attendees shared personal experience, opinions, and in many cases alternative approaches, all of which proved fruitful for making strides toward shared scientific, applied, and technical goals.

The breakout sessions followed a similar two-track approach; I was in attendance for the Address Standardization and Volunteered Geographic Information sessions. Although quite different in theme, the result of each of these was similar: the identification of challenges and opportunities for research, development, and application of the respective technologies. Discussions were lively and included the views of participants from every career stage and background (professional and geographic), all of which were boiled down to a series of action items for the research and development communities to take on in order to tackle the most pressing challenges in each domain.

The final panel session included leaders from many diverse fields including health (David Stinchcomb), Census (Ama Danso), HUD (Jon Sperling), local government (Mark Greninger), data and service providers (Dan Gibbons), and industry (Don Cooke). Led by Christophe Charpentier, the panel discussed the past, present, and future of geocoding technology with respect to what challenges remain to be solved to allow geocoding to continue to advance to serve the needs of researchers, scientists, policy makers, the business community, and local, state, and federal governments and agencies.  Geocoding more than just address data (for example, relative location descriptions such as the location of an injured person in a national park) was a consistent theme, as was the need to understand and represent uncertainty in geocoded data.

All in all, this conference was a success by many accounts. Who would have thought 200+ people from around the globe would have come together for a conference on geocoding? Many participants reported that they developed new collaborative relationships with people they had wanted to meet for years, while others took home new ideas for research and development, and still others came away with new techniques for identifying and dealing with problem address data. The most prominent question asked by the majority of attendees however was – when will the Second International Geospatial Geocoding Conference take place?

The small numbers problem part 3: Diagnostics for the small numbers problem

Posted on by Geoffrey Jacquez, Ph.D.

To follow along with the analyses in this blog download and install a trial version of SpaceStat here.

An earlier blog defined the small numbers problem and illustrated that rates calculated with small denominators (e.g. small at-risk populations) have high variance and result in unstable rate estimates that are poor representations of true, underlying risk.  A later blog showed how spatial time series may be used to identify true elevated rates in local areas with small population sizes.   This addressed an important question: How can one distinguish between high rate estimates due to small numbers and high rate estimates attributable to high underlying risk?  In practice, spatial epidemiology and disease surveillance must almost always deal with the small numbers problem.  We want to be able to make inferences in small areas, such as neighborhoods, estimate disease rates, identify disease clusters, build and evaluate models, and undertake health disparities analyses, all while taking into account differences in population sizes.  This blog demonstrates diagnostics for the small numbers problem in SpaceStat.
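To make the small numbers problem concrete, here is a minimal simulation sketch. It assumes two hypothetical areas that share exactly the same underlying risk and differ only in population size; the risk, the populations, the number of periods and the function names are illustrative choices, not values from the lung cancer data.

```python
import math
import random
import statistics

RATE_BASE = 100_000  # rates are reported per 100,000 population

def rate_standard_error(true_risk, population):
    """Binomial standard error of an observed rate, per RATE_BASE people."""
    return RATE_BASE * math.sqrt(true_risk * (1.0 - true_risk) / population)

def simulate_rates(true_risk, population, n_periods, rng):
    """Observed rates for one area: each person is an independent Bernoulli trial."""
    rates = []
    for _ in range(n_periods):
        cases = sum(1 for _ in range(population) if rng.random() < true_risk)
        rates.append(RATE_BASE * cases / population)
    return rates

rng = random.Random(42)
TRUE_RISK = 0.0006  # hypothetical: 60 deaths per 100,000 in BOTH areas

small_area = simulate_rates(TRUE_RISK, 2_000, 40, rng)
large_area = simulate_rates(TRUE_RISK, 50_000, 40, rng)
sd_small = statistics.stdev(small_area)
sd_large = statistics.stdev(large_area)

print(f"expected SE, small area: {rate_standard_error(TRUE_RISK, 2_000):.1f}")
print(f"expected SE, large area: {rate_standard_error(TRUE_RISK, 50_000):.1f}")
print(f"simulated SD, small area: {sd_small:.1f}")
print(f"simulated SD, large area: {sd_large:.1f}")
```

Because the standard error scales with one over the square root of the population, the area with 25 times fewer people is expected to have rates about 5 times more variable, even though nothing about its true risk is different.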

As one of the first steps in a geohealth analysis, I often start off with four techniques for assessing the small numbers problem:

(1)  the plot of rate vs population,
(2)  evaluating persistence of rates through time using the timeplot,
(3)  the box plot, and
(4)  the variogram cloud

The first two of these were demonstrated in the earlier blogs mentioned above. This blog describes use of the boxplot to evaluate statistical outliers.  Use of the variogram to identify locations with high impact on measures of spatial correlation and variance is the topic of a later blog.

The box plot

Recall from your basic statistics that the boxplot provides a rapid visual summary of the distributional properties of the data.  We’ll explore this using data describing lung cancer mortality through time in the United States.  The data come from the National Cancer Institute and we’ll be working with lung cancer mortality in white males in State Economic Areas (SEAs) from 1970 to 1995.  The rates are age-adjusted and reported per 100,000 population.  The data are recorded in 5 year periods.

To follow along, start SpaceStat and load the project: SmallNumbersBlog3 — spatial outlier diagnostics.

Map: Lung Cancer Mortality

When you load the project you should see something like this (above).  Create a boxplot by selecting the boxplot tool from the tool menu.

Then enter the geography (“SEArates-male”) and dataset name (“WLUNG”) to create a boxplot.

As described in the SpaceStat help, the boxplot provides a quick visual summary of the distributional properties of a variable.  The box plot displays the relationship between key measures that describe your dataset, including the median, the quartiles, maximum and minimum values, and a range of values identified as “outliers”.  The median is the black line that bisects the “box” that gives the box plot its name. The first (lower) and third (upper) quartiles, which are the medians of the lower and upper halves of the data, form the lower and upper bounds of the box. The data for individual locations are shown as grey bars inside the box and grey points outside the box. The “whiskers” on the box, the horizontal bars at the top and bottom, are placed at 1.5 times the interquartile range (see the illustration on the image below). Points outside the whiskers are considered outliers. The vertical line on which the box and whiskers sit is the range of the dataset over all times included in your SpaceStat project.
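The quantities described above can be computed by hand. The sketch below is a minimal illustration, assuming quartiles taken as the medians of the lower and upper halves of the sorted data and whiskers placed 1.5 times the interquartile range beyond the box; the function name and the ten rates are hypothetical, and SpaceStat's own implementation may differ in details.

```python
import statistics

def box_plot_summary(values, whisker=1.5):
    """Median, quartiles, whisker positions and outliers for a list of values."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    upper_half = data[half + n % 2:]  # skip the middle value when n is odd
    q1 = statistics.median(lower_half)
    q3 = statistics.median(upper_half)
    iqr = q3 - q1
    low_whisker = q1 - whisker * iqr
    high_whisker = q3 + whisker * iqr
    return {
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
        "low_whisker": low_whisker,
        "high_whisker": high_whisker,
        "outliers": [v for v in data if v < low_whisker or v > high_whisker],
    }

# Hypothetical mortality rates per 100,000 for ten areas in one time period.
rates = [42.0, 55.1, 57.3, 58.8, 60.2, 61.0, 62.5, 64.9, 66.3, 112.0]
summary = box_plot_summary(rates)
print(summary["q1"], summary["q3"], summary["outliers"])
```

With these made-up values the box runs from 57.3 to 64.9, so both the lowest rate (42.0) and the highest rate (112.0) fall outside the whiskers and would be drawn as outliers.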

In addition to the box plot, I like to inspect the distribution of the data, and how that distribution changes through time, using a histogram. Create a histogram by selecting the histogram tool.

Then enter the geography (“SEArates-male”) and dataset name (“WLUNG”) to create a histogram.

These are time-dependent data, and we would like to see how statistical outliers on the boxplot, the distribution on the histogram, and the map patterns change through time.  To accomplish this, let’s synchronize the boxplot, histogram and map so they animate together.  Select “Window” from the main menu, and “time synchronize all views”.  You now can click on the animation bar on either the map or the box plot to explore how values change through place and time.   Rearrange the map and statistical graphics to look something like this, and click to animate.

Lung cancer mortality - white males

Through time, you’ll see the map becomes redder, the center of the distribution of the histogram increases (shifts right), and the distance between the whiskers on the boxplot grows larger.  These indicate both the average lung cancer mortality and its variance are increasing through time.  But what about the outliers?

Remember the small numbers problem means that variance of disease incidence and mortality rates is larger in areas with small populations.  One would thus expect there to be statistical outliers in locations with small populations.  Is this observed for these data?  To answer this create the scatter plot of “WLUNG” on the y-axis and “Sqrt White Population” on the x-axis, and move the time slider to be 1/1/1970.  Now brush select (move your cursor over and click on) the most extreme high value on the boxplot.  This observation is outside the whisker and is a statistical high outlier.

Lung cancer mortality - white males

Notice the selected observation is highlighted on the map and the other graphical views.  Zoom in on the map to see where this outlier is.  Then right click on the map and “Inspect this location” to learn the high outlier in white lung mortality rate at 1/1/1970 is Phoenix City, Alabama, with 111.967 deaths per 100,000.  The mean rate in the U.S. was 60.148 deaths per 100,000 (mean value on the box plot and histogram), so this is almost double the mean death rate!  But is this high rate attributable to the small numbers problem, or to a persistent underlying cause?

Looking at the scatter plot of WLUNG vs. Sqrt White Population, we see the population in Phoenix City, Alabama is quite small relative to other SEAs, and in that part of the plot that has high variance.  Remember from the second blog in this series that we can inspect the persistence of a high or low rate to evaluate whether it might be attributable to the small numbers problem.  If a high or low rate is attributable to small numbers we would expect that value to jump around a good deal through time (i.e. it is a statistical aberration), but if it is due to a true underlying cause it should continue to be high (or low) through time.  Does the high rate in Phoenix City persist through time?  Animate the graphics.  You’ll find that the Phoenix City rate is not statistically unusual in any of the other time periods, leading us to conclude the extraordinarily high rate in early 1970 likely is a statistical aberration due to the small numbers problem.
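The persistence check can also be sketched outside SpaceStat. The example below flags, period by period, whether one area falls outside the box plot whiskers computed across all areas. Everything in it is hypothetical: the area names, the six periods of made-up rates and the helper functions are assumptions chosen to mimic one area with a single spike and another that is consistently low.

```python
import statistics

def fences(values, whisker=1.5):
    """Whisker positions: 1.5 x IQR beyond the quartiles (medians of each half)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[half + len(data) % 2:])
    spread = whisker * (q3 - q1)
    return q1 - spread, q3 + spread

def outlier_flags(rates_by_area, area):
    """For each time period, is `area` an outlier relative to all areas?"""
    flags = []
    for t in range(len(rates_by_area[area])):
        low, high = fences([series[t] for series in rates_by_area.values()])
        value = rates_by_area[area][t]
        flags.append(value < low or value > high)
    return flags

# Hypothetical data: six 5-year periods, background rates rising through time.
baselines = [60, 64, 68, 72, 76, 80]
rates_by_area = {
    f"background_{k}": [b + offset for b in baselines]
    for k, offset in enumerate([-6, -4, -2, 0, 2, 4, 6])
}
rates_by_area["one_off_spike"] = [112, 62, 70, 71, 78, 79]       # high once only
rates_by_area["persistent_low"] = [b - 30 for b in baselines]    # always low

spike_flags = outlier_flags(rates_by_area, "one_off_spike")
low_flags = outlier_flags(rates_by_area, "persistent_low")
print("one_off_spike: ", spike_flags)
print("persistent_low:", low_flags)
```

An area flagged in only one period behaves like the statistical aberration described above, while an area flagged in every period is more consistent with a true underlying risk or protective factor.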

In contrast, select the lowest value on the box plot in 1/1/1970, and then animate the views.  You’ll see this rate, found in Logan, Utah, is always a low statistical outlier.  As described in blog 2 in this series, this likely is due to the large proportion of Mormons who do not smoke.

We’ve now seen how to use the boxplot, frequency histogram, map and statistical brushing to identify outliers.  We’ve also learned how to use persistence through time to assess whether a statistical outlier is attributable to the small numbers problem or to an underlying risk factor or protective factor.

Dr. Pierre Goovaerts at Colloque Environnement-Santé video (en français)

Posted on by Susan Hinton

WebTV video of the plenary address given at the Colloque Environnement-Santé.

BioMedware: November 2011 Newsletter

Posted on by Susan Hinton

Announcing SpaceStat 3.5 with Geostatistical Discretization and Deconvolution

Our new SpaceStat release expands support for kriging methods by adding discretization and deconvolution, providing the ability to perform kriging from one geography to another.  We’ve also introduced a new graph type, the 2D histogram, which lets you explore how objects are distributed with respect to two variables.

Interview with Pierre Goovaerts (en français)

Posted on by Susan Hinton

An Interview with Pierre Goovaerts, Chief Scientist at BioMedware, on the occasion of his plenary address “The Role of GIS, Geostatistics and Cancer Atlases in Medical Geography & Environmental Epidemiology” delivered during the symposium “Environment and Health”.

Report from the NASA Public Health Program Review

Posted on by Susan Maxwell, Ph.D.

The 2011 NASA Public Health Program Review was held September 14-16 in Santa Fe, New Mexico. This is an annual event where Principal Investigators of projects funded by the NASA Public Health Program are invited to present the status and results of their work. NASA’s Public Health Program focuses on advancing the realization of societal and economic benefits from NASA Earth Science in the areas of infectious disease, emergency preparedness and response, and environmental health (e.g., air quality). The goal of the NASA Public Health Program is to help determine how weather, climate, and other key environmental factors correlate with health, with the overall goal of improving our nation’s health and safety.

Twenty-seven projects were presented at the meeting covering a broad range of health applications: infectious disease (avian influenza, global influenza, malaria, meningitis, zoonotic hemorrhagic fever), water quality (cyanobacterial blooms, microbial contamination, pathogen and nutrient concentrations), air quality (asthma), environmental health (fire, urban heat, dust), and emergency preparedness and response (ocean search and rescue). Slides from the presentations will be posted soon. Presentations given at the 2009 and 2010 meetings are also available.

I presented on the Internet-based Heat Evaluation and Assessment Tool (I-HEAT) – a BioMedware project funded by the NASA Public Health Program to provide health professionals with an advanced geospatial web-based system for preparing and responding to emergency heat events, developing mitigation strategies, and educating the public.

Internet-based Heat Evaluation and Assessment Tool (I-HEAT)

This system will couple demographic and environmental data obtained from Landsat satellite imagery with browser-based software to model and map heat-related health risks at the neighborhood level.
