Statistics are powerful tools. Well-implemented, they tease out underlying patterns from the noise of raw data and improve our understanding. But those who use statistics must take care to avoid misstatements. Unfortunately, statistics can also be used to deliberately distort relationships, declaring patterns where none exist. In my AP Statistics classes, I hope my students learn to extract meaning from well-designed studies, and to spot instances of Benjamin Disraeli’s “three kinds of lies: lies, damned lies, and statistics.”
This post explores part of a study published August 12, 2015, exposing what I believe to be examples of four critical ways statistics are misunderstood and misused:
- Not recognizing the distorting power of outliers in means, standard deviations, and in the case of the study below, regressions,
- Distorting graphs to create the impression of patterns different from what actually exists,
- Cherry-picking data to show only favorable results, and
- Misunderstanding the p-value in inferential studies.
I was searching online for examples of research I could use with my AP Statistics classes when I found on the page of a math teacher organization a link to an article entitled, “Cardiorespiratory fitness linked to thinner gray matter and better math skills in kids.” Following the URL trail, I found a description of the referenced article in an August, 2015 summary article by Science Daily and the actual research posted on August 12, 2015 by the journal, PLOS ONE.
As a middle and high school teacher, I’ve read multiple studies connecting physical fitness to brain health. I was sure I had hit paydirt with an article offering multiple, valuable lessons for my students! I read the claims of the Science Daily research summary correlating the physical fitness of 9- and 10-year-old children to performance on a test of arithmetic. It was careful not to declare cause-and-effect, but did say
The team found differences in math skills and cortical brain structure between the higher-fit and lower-fit children. In particular, thinner gray matter corresponded to better math performance in the higher-fit kids. No significant fitness-associated differences in reading or spelling aptitude were detected. (source)
The researchers described plausible connections for the aerobic fitness of children and the thickness of cortical gray matter for each participating child. The study went astray when they attempted to connect their findings to the academic performance of the participants.
Independent t-tests were employed to compare WRAT-3 scores in higher fit and lower fit children. Pearson correlations were also conducted to determine associations between cortical thickness and academic achievement. The alpha level for all tests was set at p < .05. (source)
All of the remaining images, quotes, and data in this post are pulled directly from the primary article on PLOS ONE. The URLs are provided above, and bibliographic references are at the end.
To address questions raised by the study, I had to access the original data and recreate the researchers’ analyses. Thankfully, PLOS ONE is an open-access journal, and I was able to download the research data. In case you want to review the data yourself or use it with your classes, here is the original SPSS file which I converted into Excel and TI-Nspire CAS formats.
BEWARE OUTLIERS and MISLEADING SCALES:
My suspicions were piqued when I saw the following two graphs–the only scatterplots offered in their research publication.
Scatterplot 1: Attempt to connect Anterior Frontal Gray Matter thickness with WRAT-3 Arithmetic performance
The right side of the top scatterplot looked like an uncorrelated cloud of data with one data point on the far left seeming to pull the left side of the linear regression upwards, creating a more negative slope. Because the study reported only two statistically significant correlations between the WRAT tests and cortical thickness in two areas of the brain, I was now concerned that the single extreme data point may have distorted results.
My initial scatterplot (below) confirmed the published graph, but fit to the entire window, the data now looked even less correlated.
In this scale, the farthest left data point (WRAT Arithmetic score = 66, Anterior Frontal thickness = 3.9) looked much more like an outlier. I confirmed that the point exceeded 1.5IQRs below the lower quartile, as indicated visually in a boxplot of the WRAT-Arithmetic scores.
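The 1.5 IQR check described above can be sketched in a few lines. This is only an illustration: the `scores` list is made up for the example and is NOT the study's WRAT-3 data, the helper name `tukey_fences` is my own, and different quartile conventions (here Python's "inclusive" method) can shift the fences slightly.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up scores for illustration only; NOT the study's WRAT-3 data.
scores = [66, 95, 98, 100, 102, 104, 105, 107,
          108, 110, 112, 115, 118, 120, 125]

lower, upper = tukey_fences(scores)
outliers = [s for s in scores if s < lower or s > upper]
print(f"fences: ({lower}, {upper}); flagged: {outliers}")
```

Running the same function on the WRAT-Arithmetic column of the downloaded data is one way to repeat the boxplot check numerically.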
Also note from my rescaled scatterplot that the Anterior Frontal measure (y-coordinate) was higher than any of the next five ordered pairs to its right. Its horizontal outlier location, coupled with its notably higher vertical component, suggested that the single point could have significant influence on any regression on the data. There was sufficient evidence for me to investigate the study results excluding the (66, 3.9) data point.
The original linear regression on the 48 (WRAT Arithmetic, AF thickness) data points had a slope of -0.007817. Excluding (66, 3.9), the revised linear regression on the remaining 47 points, shown in the new scatterplot above, has a slope of -0.007460. The two fits are close, but the revised slope is 4.6% smaller in magnitude than the published result. With the two published results reported significant at p=0.04, the influence of the outlier (66, 3.9) has a reasonable possibility of changing the study results.
Scatterplot 2: Attempt to connect Superior Frontal Gray Matter thickness with WRAT-3 Arithmetic performance
The tightly compressed scale of the second published scatterplot made me deeply suspicious that the (WRAT Arithmetic, Superior Frontal thickness) data was being vertically compressed to create the illusion of a linear relationship where one possibly did not exist.
Rescaling the graphing window (below) made the data appear notably less linear than the publication implied. Also, the data point corresponding to the WRAT-Arithmetic score of 66 appeared to exert the same outlier influence as in the first data set. It was still an outlier, but now its vertical component was higher than the next eight data points to its right, with some of them notably lower. Again, there was sufficient evidence to investigate results excluding the outlier data point.
The linear regression on the original 48 (WRAT Arithmetic, SF thickness) data points (above) had a slope of -0.002767. Excluding the outlier, the new scatterplot (below) had a revised regression slope of -0.002391. This time, the revised slope was 13.6% smaller in magnitude relative to the original slope. With the published significance also at p=0.04, omitting the outlier was almost certain to change the published results.
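The two percentage changes quoted for the regressions can be checked from the slopes reported in the significance tests further down this post. The sketch below is just that arithmetic; the function name `relative_shrink` is my own label, not something from the study.

```python
def relative_shrink(original, revised):
    """Percent by which |revised| is smaller than |original|."""
    return (abs(original) - abs(revised)) / abs(original) * 100


# Slopes as reported in this post, with and without the (66, ...) point.
af_change = relative_shrink(-0.007817, -0.007460)  # Anterior Frontal
sf_change = relative_shrink(-0.002767, -0.002391)  # Superior Frontal

print(f"AF slope shrinks by {af_change:.1f}%")
print(f"SF slope shrinks by {sf_change:.1f}%")
```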
THE OUTLIER BROKE THE STUDY
The findings above strongly suggest the published study results are not as reliable as reported. It is time to rerun the significance tests.
For the first data set, (WRAT Arithmetic, AF thickness), run a t-test on the regression slope with and without the outlier.
- INCLUDING OUTLIER: For all 48 samples, the researchers reported a slope of -0.007817 and p=0.04. This was reported as a significant result.
- EXCLUDING OUTLIER: For the remaining 47 samples, the slope is -0.007460, r=-0.252, and p=0.087. The r confirms the visual impression that the data was less linear and, most importantly, the correlation is no longer significant at p<0.05.
For the second data set–(WRAT Arithmetic, SF thickness):
- INCLUDING OUTLIER: For all 48 samples, the researchers reported a slope of -0.002767, r=-0.291, and p=0.04. This was reported as a significant result.
- EXCLUDING OUTLIER: For the remaining 47 samples, the slope is -0.002391, r=-0.229, and p=0.121. This revision is even less linear and, most importantly, the correlation is no longer significant for any standard significance level.
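For readers who want to check the outlier-excluded p-values without the raw data, here is a minimal sketch. It assumes the usual simple-regression result that the t statistic for the slope equals r·sqrt(n−2)/sqrt(1−r²) with n−2 degrees of freedom, and it integrates the t density numerically with Simpson's rule rather than calling a statistics library. The function names are my own, and because r is rounded to three decimals the results only approximate the values above.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * (area under the density on [0, |t|])."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


def slope_test(r, n):
    """t statistic and two-sided p-value for a simple-regression slope."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r * r)
    return t, two_sided_p(t, df)


# r values with the outlier excluded (n = 47), as reported in this post.
t_af, p_af = slope_test(-0.252, 47)
t_sf, p_sf = slope_test(-0.229, 47)
print(f"AF: t = {t_af:.3f}, p = {p_af:.3f}")
print(f"SF: t = {t_sf:.3f}, p = {p_sf:.3f}")
```

Both p-values should land close to the 0.087 and 0.121 listed above, and above the 0.05 threshold the study used.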
In brief, the researchers’ arguable decision to include the single, clear outlier data point was the source of any significant results at all. Whatever correlation exists between gray matter thickness and WRAT-Arithmetic as measured by this study is tenuous, at best, and almost certainly not significant.
THE DANGERS OF CHERRY-PICKING RESULTS:
So, let’s set aside the entire questionable decision to keep an outlier in the data set to achieve significant findings. There is still a subtle, potential problem with this study’s result that actually impacts many published studies.
The researchers understandably were seeking connections between the thickness of a brain’s gray matter and the academic performance of that brain as measured by various WRAT instruments. They computed t-tests of linear regression slopes between thickness measures at nine different locations in the brain against three WRAT test measures for a total of 27 separate t-tests. The next table shows the correlation coefficient and p-value from each test.
This approach is common, with researchers reporting only the tests found to be significant. But in doing so, the researchers may have overlooked a fundamental property of the confidence intervals that underlie p-values. Testing at the typical significance level of 0.05 corresponds to a 95% confidence interval, and one interpretation of that interval is that, under the conditions of the assumed null hypothesis, results that fall in the most extreme 5% of outcomes will NOT be considered as resulting from the null hypothesis, even though they are.
In other words, even under the typical conditions for which the null hypothesis is true, 5% of correct results would be deemed different enough to be statistically significant–a Type I Error. Within this study, if the 27 tests are treated as independent, this defines a binomial probability situation with 27 trials for which the probability of any one trial producing a significant result, even though the null hypothesis is correct, is p=0.05.
The binomial probability of finding exactly 2 significant results at p=0.05 over 27 trials is 0.243, and the probability of producing 2 or more significant results when the null hypothesis is true is 39.4%.
That means there is a 39.4% probability in any study running 27 tests at the p<0.05 significance level that at least 2 of those tests would report a result that would INCORRECTLY be interpreted as contradicting the null hypothesis. And if more than 27 conditions are tested, the probability of a Type I Error is even higher.
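The two binomial figures can be reproduced with a short calculation. This sketch assumes, as the binomial model itself does, that the 27 tests are independent and that the null hypothesis holds in every one; the variable names are my own.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


n_tests = 27   # 9 brain regions x 3 WRAT measures
alpha = 0.05   # chance of a false positive on any single test

exactly_two = binom_pmf(2, n_tests, alpha)
at_least_two = 1 - binom_pmf(0, n_tests, alpha) - binom_pmf(1, n_tests, alpha)

print(f"P(exactly 2 false positives)  = {exactly_two:.3f}")
print(f"P(at least 2 false positives) = {at_least_two:.3f}")
```

Changing `n_tests` shows how quickly the chance of spurious "significant" results grows as more comparisons are run.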
Whenever you have a large number of inference trials, there is an increasingly large probability that at least some of the “significant” trials are actually just random, undetected occurrences of the null hypothesis.
It just happens.
THE ELUSIVE MEANING OF A p-VALUE:
For more on the difficulty of understanding p-values, check out this nice recent article on FiveThirtyEight Science–Not Even Scientists Can Easily Explain P-Values.
Personally, I’m a little disappointed that this study didn’t find significant results. There are many recent studies showing the connection between physical activity and brain health, but this study didn’t achieve its goal of finding a biological source to explain the correlation.
It is the responsibility of researchers to know their studies and their resulting data sets. Not finding significant results is not a problem. But I do expect research to disclaim when its significant results hang entirely on a choice to retain an outlier in its data set.
Chaddock-Heyman L, Erickson KI, Kienzler C, King M, Pontifex MB, Raine LB, et al. (2015) The Role of Aerobic Fitness in Cortical Thickness and Mathematics Achievement in Preadolescent Children. PLoS ONE 10(8): e0134115. doi:10.1371/journal.pone.0134115
University of Illinois at Urbana-Champaign. “Cardiorespiratory fitness linked to thinner gray matter and better math skills in kids.” ScienceDaily. http://www.sciencedaily.com/releases/2015/08/150812151229.htm (accessed December 8, 2015).