Statistics may be defined as “a body of methods for making wise decisions in the face of uncertainty.”

~W.A. Wallis

Statistics are crucial to the skeptical position on many topics, even though explaining why can be frustrating. They provide us with a very powerful toolbox for separating what we wish to see, or what “common sense” tells us, from what is actually true. But reality is ugly and messy, and anecdotes are simpler to understand. On Skeptic North alone, we’ve mentioned statistics in relation to hockey, H1N1, blood donation, public transit, and homeopathy, to name a few.

But even when we understand that our data are better than someone else’s, it can be hard to explain why. So, here’s a refresher on some basic statistical methods.

Part 1 – The Cookie War (or What’s a Confidence Interval?)

Say you’re an aspiring skeptic in small-town Alberta. The latest argument in the long-standing feud between the town’s two bakers has erupted over whose cookies have the most chocolate chips. You decide that someone should put an end to the matter.

The Method

The first thing you need is a way to measure the chocolate chip density (CCD) per cookie. You could just count the chips, but what if one baker makes larger cookies? For this example, let’s say that the chosen method is number of chocolate chips per 50g cookie. And, to avoid the claim that one of the bakers just had a bad day, let’s say that you collect two cookies per day from each baker for a month.

Many types of data in life, especially when dealing with large populations, can be assumed to follow the shape of a bell curve (a normal distribution). The height of everyone in the population is an example of this. There’s the average height, a range on either side that we would still consider “normal,” and there are fewer very tall or very short people the further you get from the average. If you count the number of people at each height and graph it, you get a bell-shaped curve.

This is probably a reasonable approximation for cookies. There will be an average CCD, but some will have more chocolate chips and some will have fewer. Very rarely, you’ll see a cookie that has no chocolate chips, or is almost all chocolate chips.

The Average

This one is pretty simple to explain to people, and it’s useful. For this study*, Baker A has an average CCD of 7.35 and Baker B has an average CCD of 7.48. If you were to hear this study on the radio, you’d likely hear that Baker B has more chocolaty cookies. But that isn’t the whole story, and may not be true.

The Standard Deviation and Confidence Interval

In newspapers, you’ll usually find a statement about a poll that says, “This poll is accurate to within 3%, 19 times out of 20.” This is very important from a skeptical perspective. In a bell curve, there’s a value called the standard deviation. It measures how spread out the numbers are around the average; together with the sample size, it determines the margin of error you can quote.
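As a minimal sketch, Python’s standard library can compute both quantities. The chip counts below are invented for illustration and are not the article’s data:

```python
import statistics

# Hypothetical chip counts from six 50 g cookies (illustration only)
chips = [6, 7, 7, 8, 9, 7]

print(statistics.mean(chips))   # the average, about 7.33
print(statistics.stdev(chips))  # the sample standard deviation, about 1.03
```

A small standard deviation means most cookies are close to the average; a large one means the counts are all over the place.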

Obviously, we haven’t sampled the entire population (e.g., every cookie made by each baker). If we had, we wouldn’t need statistics to figure out who has the most chocolate chips. However, if you take the average and 2 standard deviations on either side of it, you have a 95% confidence interval. This means that if we repeat the study again and again, 95% of the studies will have confidence intervals that contain whatever the real average is.

To make life easier, I’ll just tell you that the standard deviation for Baker A is 1.05 and for Baker B it’s 1.24. We can say that the 95% confidence interval for Baker A’s cookies is between 5.25 and 9.46, and Baker B’s is between 4.99 and 9.97. Another way that it’s more commonly stated would be, “Baker A’s cookies have a chocolate chip density of 7.35. This is considered accurate plus or minus 2.10 (two standard deviations), 19 times out of 20.”
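Using the article’s convention of mean plus or minus two standard deviations, a short sketch reproduces these intervals. (The endpoints differ from the quoted ones in the last digit, presumably because the article worked from unrounded data.)

```python
def interval(mean, sd, z=2.0):
    """Mean plus or minus z standard deviations."""
    return mean - z * sd, mean + z * sd

print(interval(7.35, 1.05))  # Baker A: about (5.25, 9.45)
print(interval(7.48, 1.24))  # Baker B: about (5.00, 9.96)
```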

This “19 times out of 20” is important. Focusing on just Baker A for a moment, if you repeat this experiment a total of 20 times, about 19 of those results should have confidence intervals that contain the true average (and therefore overlap with each other). The odd man out would likely be close, but not overlapping. The experiment we just did could very well be the one that’s wrong. If you took those 20 sets of results, you could perform a meta-analysis and come up with an even smaller confidence interval than any individual experiment could.
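The “19 times out of 20” claim can be checked by simulation. This sketch uses the textbook confidence interval for the sample mean (plus or minus 1.96 standard errors), which is narrower than a two-standard-deviation spread of individual cookies; the true mean and standard deviation are borrowed from Baker A’s figures, and the sample size of 60 assumes two cookies a day for a month:

```python
import random
import statistics

random.seed(42)

TRUE_MEAN, TRUE_SD = 7.35, 1.05   # Baker A's figures from the article
N_COOKIES, N_STUDIES = 60, 1000   # two cookies a day for a month, repeated

hits = 0
for _ in range(N_STUDIES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N_COOKIES)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / N_COOKIES ** 0.5  # standard error of the mean
    lo, hi = mean - 1.96 * se, mean + 1.96 * se
    if lo <= TRUE_MEAN <= hi:
        hits += 1

print(f"{hits} of {N_STUDIES} intervals contain the true mean")
```

Run it and roughly 950 of the 1,000 intervals should contain the true mean; any individual study has about a 1-in-20 chance of being the odd man out.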

The Results

From this cookie experiment, Baker A and Baker B are tied. Baker B’s confidence interval completely contains Baker A’s, and so you cannot (honestly) say that one has more chocolate chips than the other. However, we can say that if you want more consistent cookies, Baker A is the one to go to.

Of course, this doesn’t settle the feud, so you’ll want to repeat the experiment, because all good science should be repeatable.

The Take Home Message

Think about this the next time that you see a poll for the popularity of political parties. If two of the parties have confidence intervals that overlap, the poll is reporting a tie. If you see a study for some form of new medical treatment that was tested against a placebo, and the confidence intervals for the placebo and the medicine overlap, then this particular study has failed to show that the medicine has any effect.

* No, I didn’t just make these numbers up. They’re derived from ages in the February 2007 Major League Baseball roster. Because, while I find baseball boring, they do keep good statistics.

There is some confusion here, stemming from the question of whether you are conducting a one sample test or a two sample test. To cut the following long story short, you should not be looking at whether the two confidence intervals overlap. That is wrong!

OK – the long story:

In a one sample test, you want to know whether a given parameter (for example, the CCD for Baker A) can be concluded to be different from a known fixed value. For example, suppose that the International Brotherhood of Cookie Bakers (IBCB) insists that only bakers whose CCD is demonstrably above 5 may qualify for membership. Then, since the 95% CI for Baker A’s CCD is completely above 5, Baker A may proudly advertise as a MIBCB. Baker B may not, since his data cannot rule out a CCD of 5.

Formally, we would be conducting two separate single-sample, one-tailed tests of the null hypothesis that CCD is less than or equal to 5. We would reject the null hypothesis for Baker A, but not for Baker B.

However, if we wish to compare Baker A’s cookies with Baker B’s directly, we need to perform a two sample test. We do this by constructing a confidence interval for the difference between Baker A’s CCD and Baker B’s CCD. If this CI contains 0, then we will not be able to reject the null hypothesis that the two bakers have the same CCD.

Since we are assuming normality (and independence), the difference should also be distributed normally. Our point estimate of the mean of this distribution is 7.35 - 7.48 = -0.13. The standard deviation is the square root of the sum of the variances of the two distributions: sqrt(1.05^2 + 1.24^2), which is about 1.62. Our 95% confidence interval for the difference is therefore -0.13 plus or minus 1.96*1.62 = (-3.31, 3.05). Since this interval contains 0, we cannot conclude that either baker has a higher CCD than the other.
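As a quick check of this arithmetic, here is a sketch in Python using the means and standard deviations quoted in the article:

```python
import math

# Means and standard deviations quoted in the article
mean_a, sd_a = 7.35, 1.05
mean_b, sd_b = 7.48, 1.24

# For independent quantities, the variances of the difference add
diff = mean_a - mean_b                   # point estimate of the difference
sd_diff = math.sqrt(sd_a**2 + sd_b**2)   # standard deviation of the difference

# 95% confidence interval for the difference
lo, hi = diff - 1.96 * sd_diff, diff + 1.96 * sd_diff
print(round(sd_diff, 2))           # about 1.62
print(round(lo, 2), round(hi, 2))  # about -3.31 and 3.05
```

Since the interval straddles 0, the two-sample test agrees with the article’s conclusion of no detectable difference. (Note that a test comparing the *mean* CCDs would further divide each variance by its sample size; like the article, this sketch works with the spread of individual cookies.)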

The point is that whether the two confidence intervals overlap is irrelevant. That method is too conservative: if you use it, you will sometimes conclude that there is no difference when in fact there is a statistically significant difference!

(This doesn’t necessarily apply when the two samples are not independent. In general, when calculating the variance (or standard deviation) of the difference, you need to consider the covariance of the two distributions. For example, in a poll of two political parties (with no undecideds), the correlation between the two estimates is -1: the noise in the estimate of the first party’s popularity is equal and opposite to the noise in the estimate of the second party’s support. In such a case, it is OK to ask whether the two confidence intervals overlap. But this situation doesn’t arise very often.)