A curious case of cherry-picking data for the greater good
Alexey Goldin
A large number of websites (for example, see this) recently furiously discussed a finding by Janette Sherman and Joseph Mangano claiming that radioactive fallout from Fukushima resulted in a statistically significant increase in child mortality. As in my daily job I rely very heavily on being able to learn some facts around the world around from very noisy data using statistical methods, I got curious. I was especially interested to learn what "statistically significant" means.
I am not a statistician but I play one at work. If a real statistician will analyze these data in a more careful way I'd be be grateful, but I hope my analysis is rigorous enough. It is overkill for debunking, but I hope it can be used as an introduction to R for some curious people who might be interested to check the next outrageous claims for themselves and demonstrate R capabilities for analysis.
The data are available from the Center for Disease Control, but not in a very convenient format. Great thanks to the Capacity Factor blog for compiling the data for 8 cities used in the Janette Sherman and Joseph Mangano analysis. The csv file childMort.csv is included in attached zip file. To repeat my analysis you do not need anything beyond opensource statistical package R and some standard libraries (lubridate, xts, plyr, sandwich,lmtest), installable by using the install.packages function.
Using functions defined in the included file mort.R one can read data like this:
> childM <- get.data("childMort.csv")
The average mortality for selected 8 cities for 10 weeks after Fukushima:
> dp10 <- mean(childM$Deaths["2011-03-20::2011-05-28"])
> dp10
12.5
And before:
> dm4 <- mean(childM$Deaths["2011-02-26::2011-03-19"])
> dm4
9.25
The plot of mortality vs time:

Indeed, if you squint enough you'll see that post-Fukushima mortality is higher. It looks grim. However it is not obvious why the authors of the study use 10 weeks after March 19 but only 4 before. There is a lot of data available before and using more data always allows you to estimate averages with better precision. If we use all 12 weeks of 2011 before March 19, surely we can get a better estimate of average mortality then using only 4 weeks. Let's try:
> dm12 <- mean(childM$Deaths["2011-01-01::2011-03-19"])
> dm12
12.25
12.25 is much closer to
post-Fukushima average weekly value of 12.5.
Let's not stop here and use all of the data available, starting from
2007:
> dm2007 <- mean(childM$Deaths["2007-01-01::2011-03-19"])
> dm2007
13.7899543378995
It definitely looks like average post-Fukushima mortality is lower. You could say "But it is just a statistical fluke!" and would be (partially) right.

Partially, because there is a clear downward trend in child mortality which Fukushima (visually) does not seem to influence very much. But we will return to this trend a bit later.
Let's see if there is more evidence for conclusion-based evidence drawing. If you loaded my code you should have dataframes "cities" and "cities.cdc" defined in your workspace. Here are the cities used in Sherman and Mangano analysis.
> library(maps)
> map(database = "state", regions = c("CA", "OR", "WA", "ID"))
> points(cities.cdc$long, cities.cdc$lat, pch = 16, col = "blue")
> points(cities$long, cities$lat, pch = 16, col = "red")

Blue dots show the cities available in CDC database but not included in the study. Red dots are the cities in the study. Usually every scientist is trying to use as much data as available. There might be a reason for omitting cities south of San Jose, but what are the reasons for omitting Spokane, WA and Tacoma, WA, but including Boise, ID? What is the reason for this?
I downloaded 2011 data for both cities (if you load my code you'll have variables spokane and tacoma). The averages after Fukushima and before:
> mean((spokane + tacoma)["2011-03-20::2011-05-28"])
1.5
> mean((spokane + tacoma)["2011-02-26::2011-03-19"])
1.66666666666667
And it looks like the average death rate is decreasing for these two cities. This might be a good reason to not include them.
At this point it looks quite likely that Sherman and Mangano picked up their indicator (ratio of average death rate for 10 weeks later after the event to average rate for 4 weeks before the event) to be able to claim relationship between child mortality and Fukushima event that may not be there. But let's investigate how good their indicator is. We suspect that they artificially constructed this indicator to spike on March 19. Would it spike on other dates?
The following function averages mortality on next 10 weeks, previous 4 weeks and returns the value. When its value reaches 1.35, average 10 week mortality exceeded previous 4 weeks mortality by 35%.
> tr.indicator <- function(lr, nback = 4, nforward = 10) {
> lr4 <- rollapply(lr$Deaths, nback, mean, align = "right",
> na.pad = TRUE)
> lr10 <- lag(rollapply(lr$Deaths, nforward, mean, align = "left",
> na.pad = TRUE), 1)
> as.xts(lr10/lr4)
> }
> rind <- tr.indicator(childM)
> plot(rind, main = "Sherman-Mangano indicator")
> abline(h = 1.35)

The horizontal line is exactly where value of indicator is 1.35, or where the average number of deaths for the next 10 weeks is higher then average number of deaths for the previous 4 weeks. You see that it spikes at March 19, 2011 but it also spikes (and sometimes significantly higher) on numerous other occasions.
What about Jan 2011? And what about late 2009? Where infants dying in large numbers then? Could something be happening that we missed? Were there numerous unannounced reactors melting down without anyone noticing? Why Sherman and Mangano were silent on Jan 2011? Or maybe Jan 2011 (just like March 19) was just a statistical fluke?
Let's try another approach. We take our data and resample them, by taking them in random order. Any time dependent pattern will be scrambled, and whatever remains after applying the same 10 week - 4 week indicator is just an artifact of random distribution.
> set.seed(1)
> s1 <- sample(as.numeric(childM$Deaths), nrow(childM), replace = TRUE)
> sxts1 <- xts(s1, order.by = index(childM))
> names(sxts1) <- "Deaths"
> rind1 <- tr.indicator(sxts1)
> plot(rind1, main = "Sherman-Mangano indicator for random sample")
> abline(h = 1.35)

It is obvious that this indicator very often sets off alarms when there is nothing to predict at all, the data we fed to it were random by design.
Let's try to estimate the probability of seeing this indicator exceeding some value of 1.35 on any particular day. We can do it with the following simple method. Let's generate a thousand random sequences which are 14 weeks long with numbers drawn from observed mortality data, calculate our indicator and count how often we exceed value of 1.35:
> mk.boot14 <- function(cm) {
> d1 <- sample(as.numeric(cm$Deaths), 14, replace = TRUE)
> d14 <- as.xts(d1, order.by = index(cm)[1:14])
> names(d14) <- "Deaths"
> d14
> }
> set.seed(2)
> N <- 1000
> res <- rep(NA, N)
> for (i in 1:N) {
> d14 <- mk.boot14(childM)
> res[i] <- coredata(tr.indicator(d14)[4, 1])
> }
> plot(density(res), main = "Sherman-Mangano indicator density distribution")
> abline(v = 1.35)
> s <- sum(res >= 1.35)/N
> s
0.062

A total 0.062 of all samples in a completely random sequence have next 10 weeks average 35% higher then previous 4 weeks average. This is not considered significant even when researcher is not hand picking data subset (cities) which to analyze and not picking his indicator to peak at predetermined date which is probably what was happening (consciously or unconsciously) in this Sherman-Mangano "study".
In fact with significance of 0.062 by selecting from 16 independent samples one is almost guaranteed to find a sample which will demonstrate 35% higher death rate in next 10 weeks compared to previous 4.
Can we take into account the general very fortuitous downward trend in infant mortality? We could modify our bootstrap procedure to take it into account but for educational purposes we will take another approach, described in this great tutorial. The data follows Poisson distribution, therefore we are using Poisson regression to estimate trend and impact of Fukushima factor.
We put our data in dataframe, add a Fukushima factor equal to 1 after March 19, 2011 and 0 before and add a column for linear trend.
> childM$F <- 0
> childM$F["2011-03-20::2011-06-12"] <- 1
> dt <- data.frame(childM)
> ind <- index(childM)
> trend <- (julian(ind) - julian(ind[1]))/365.25
> dt$trend <- trend
> fit <- glm(formula = Deaths ~ trend + F, family = poisson, data = dt)
> library(lmtest)
> library(sandwich)
Here is the fit summary via coeftest:
> ct <- coeftest(fit, vcov = sandwich)
Mortality data fit results
| | Estimate | Std. Error | z value | Pr(> |z|) |
| (Intercept) | 2.82 | 0.04 | 78.51 | 0.00 |
| trend | -0.09 | 0.02 | -5.90 | 0.00 |
| F | 0.08 | 0.10 | 0.81 | 0.42 |
We see that while trend (fortunately) has a very high significance (better then 1e-8), significance of the F (Fukushima) term is 42 %. This means that without anything happening at all on March 19 we would have this value for coefficient F or higher with a probability of 42 % due to simply random effects. This is because we used not only 4 weeks before the event, but all available data to estimate background death rate. Intercept is a natural log of average death rate.
To estimate average mortality for weeks following Fukushima disaster using all available data and taking into account downward trend we can fit the model using all before-Fukushima data and use this model to predict averages for post-Fukushima. We use argument "weights" to limit our model to the period when F=0 (before Fukushima) and do not use F term in this model:
> w <- 1 - dt$F
> fit.0 <- glm(formula = Deaths ~ trend, family = poisson, data = dt,
> weights = w)
> pr <- predict(fit.0, data = dt, type = "response")
> mean(pr[dt$F > 0])
11.1069765487832
The number above is the expected average weekly infant mortality assuming linear trend is correct. Of course, linear trend can not last forever as it would mean mortality would become negative. But it may be a reasonable estimate for a short term
( if you play with the data try to fit quadratic model as well, using additional term I(trend^2). )
So the expected average daily mortality using all data from 2007 is, unfortunately, higher then which we were led to believe. Post-Fukushima excess is much less significant, which is demonstrated in the significance table above.
To plot original mortality data together with fits:
> plot(childM$Deaths)
> prd <- xts(predict(fit, type = "response"), order.by = index(childM))
> lines(prd, col = "red")
> prd.0 <- xts(predict(fit.0, type = "response", data = dt), order.by = index(childM))
> lines(prd.0, col = "blue")

The model with Fukushima factor is plotted with red color, without it -- with blue. They are very close, overlapping everywhere but after Fukushima. You can see that there is just not enough data to see the difference.
At this point it is worthwhile to question either the scientific integrity or statistical competence of Sherman and Mangano. They might be decent people and believe in what they say, but allow themselves to say "small lies" in a service of "Greater Truth". This never ends up well. Because they are likely to kill some unstable people with their small lies.
Now, if someone could find me the data used to "prove" that there are leukemia clusters next to nuclear plants....
The promised attached zip with the data and code is here.