Tag Archives: statistics

Stats Exploration Yields Deeper Understanding

or “A lesson I wouldn’t have learned without technology”

Last November, some of my AP Statistics students were solving a problem involving a normal distribution with an unknown mean.  Leveraging the TI-Nspire CAS calculators we use for all computations, they crafted a logical command that should have worked.  Their unexpected result initially left us scratching our heads.  After some conversations with the great folks at TI, we realized that what at first seemed like a perfectly reasonable command with a single answer in fact had two solutions.  And it took until the end of this week for another student to finally identify and resolve the mysterious results.  This ‘blog post recounts our journey from a questionable normal probability result to a rich approach to confidence intervals.

THE INITIAL PROBLEM

I had assigned an AP Statistics free response question about a manufacturing process that could be manipulated to control the mean distance its golf balls would travel.  We were told that the process created balls whose distances were normally distributed with a mean of 288 yards and a standard deviation of 2.8 yards.  The first part asked students to find the probability of balls traveling more than the allowable 291.2 yards.  This was straightforward: find the area under a normal curve with a mean of 288 and a standard deviation of 2.8 from 291.2 to infinity.  The Nspire (CAS and non-CAS) syntax for this is:

golf1

[Post publishing note: See Dennis’ comment below for a small correction for the non-CAS Nspires.  I forgot that those machines don’t accept “infinity” as a bound.]
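For readers following along without an Nspire handy, here is a rough Python equivalent of that command (a sketch of my own, not part of the original solution).

from scipy.stats import norm

# P(distance > 291.2) for a normal distribution with mean 288 and sd 2.8
print(norm.sf(291.2, loc=288, scale=2.8))   # ≈ 0.127, the 12.7% quoted below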

As 12.7% of the golf balls traveling too far is obviously an unacceptably high percentage, the next part asked for the mean distance needed so that 99% of the balls traveled allowable distances.  That’s when things got interesting.

A “LOGICAL” RESPONSE RESULTS IN A MYSTERY

Their initial thought was that even though they didn’t know the mean, they now knew the output of their normCdf command.  Since the balls couldn’t travel a negative distance and zero was many standard deviations from the unknown mean, the following equation, with x representing the unknown mean, should define the scenario nicely.

golf2

Because this was an equation with a single unknown, we could now use our CAS calculators to solve for the missing parameter.

golf3

Something was wrong.  How could the mean distance possibly be just 6.5 yards?  The Nspires are great, reliable machines.  What happened?

I had encountered unexpected answers like this before when a solve command was applied to a normal cdf with two finite bounds.  While I couldn’t see why this should make a difference, I asked them to try an infinite lower bound and also to try computing the area on the other side of 291.2.  Both of these provided the expected solution.

golf4

The caution symbol on the last line should have been a warning, but I honestly didn’t see it at the time.  I was happy to see the expected solution, but quite frustrated that infinite bounds seemed to be required.  Beyond three standard deviations from the mean of any normal distribution, almost no area exists, so how could extending the lower bound from 0 to negative infinity make any difference in the solution when 0 was already \frac{291.2}{2.8}=104 standard deviations away from 291.2?  I couldn’t make sense of it.

My initial assumption was that something was wrong with the programming in the Nspire, so I emailed some colleagues I knew within CAS development at TI.

GRAPHS REVEAL A HIDDEN SOLUTION

They reminded me that statistical computations in the Nspire CAS are resolved through numeric algorithms, an understandable approach given how the normal and other probability distribution functions are defined.  The downside is that numeric solvers may miss, or be incapable of finding, hard-to-locate or multiple solutions.  Their suggestion was to employ a graph whenever we got stuck.  This, too, made sense, because graphing a function forces the machine to evaluate the unknown variable at many values over a predefined domain.

It was also a good reminder for my students that solving any single-variable algebraic equation can be recast as finding the intersection of a system of two equations.  Going back to the initially troublesome input, I rewrote normCdf(0,291.2,x,2.8)=0.99 as the system

y=normCdf(0,291.2,x,2.8)
y=0.99

and “the point” of intersection of that system would be the solution we sought.  Notice my emphasis, indicating my still-lingering assumption about the problem.  Graphing both equations shone a clear light on my persistent misunderstanding.

golf5

I was stunned to see two intersection solutions on the screen.  Asking the Nspire for the points of intersection revealed BOTH ANSWERS my students and I had found earlier.

golf6

If both solutions were correct, then there really were two different normal pdfs that could solve the finite bounded problem.  Graphing these two pdfs finally explained what was happening.

By equating the normCdf result to 0.99 with FINITE bounds, I never specified on which end the additional 0.01 existed, left or right.  This graph showed the 0.01 could have been at either end: one solution has a mean near the expected 284 yards and the other a mean near the unexpected 6.5 yards.  The graph below shows both normal curves, with the 6.5 solution placing the additional 0.01 on the left and the 284 solution placing it on the right.

golf7
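If you would rather see both solutions numerically than graphically, here is a Python sketch of the same system (my own addition; the bracketing intervals were chosen by eye from the graph above).

from scipy.stats import norm
from scipy.optimize import brentq

# Area between 0 and 291.2 under a normal curve with unknown mean m and sd 2.8, minus 0.99
f = lambda m: (norm.cdf(291.2, m, 2.8) - norm.cdf(0, m, 2.8)) - 0.99

# Two sign changes, so two roots: one near 6.5 and one near 284.7
print(brentq(f, 1, 20))      # ≈ 6.51, with the extra 0.01 below 0
print(brentq(f, 270, 291))   # ≈ 284.69, with the extra 0.01 above 291.2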

The CAS wasn’t wrong in the beginning.  I was.  And as has happened several times before, the machine didn’t rely on the same sometimes errant assumptions I did.  My students had made a very reasonable assumption that the area under the normal pdf for the golf balls should start at 0 (no negative distances) and inadvertently stumbled into a much richer problem.

A TEMPORARY FIX

The reason the infinity-bounded solutions didn’t give the unexpected second solution is that no area can lie beyond an infinite bound, so the unspecified extra 0.01 could only fall at the finite end of the interval.

To avoid unexpected multiple solutions, I resolved to tell my students to use infinite bounds whenever solving for an unknown parameter.  It was a little dissatisfying to not be able to use my students’ “intuitive” lower bound of 0 for this problem, but at least they wouldn’t have to deal with unexpected, counterintuitive results.

Surprisingly, the permanent solution arrived weeks later when another student shared his fix for a similar problem when computing confidence interval bounds.

A PERMANENT FIX FROM AN UNEXPECTED SOURCE

I really don’t like the way almost all statistics textbooks provide complicated formulas for computing confidence intervals using standardized z- and t-distribution critical scores.  Ultimately, a 95% confidence interval is nothing more than the bounds of the middle 95% of a probability distribution whose mean and standard deviation are defined by a sample from the overall population.  Where the problem above solved for an unknown mean, computing a confidence interval on a CAS follows essentially the same reasoning to determine missing endpoints.

My theme in every math class I teach is to memorize as little as you can, and use what you know as widely as possible.  Applying this to AP Statistics, I never reveal the existence of confidence interval commands on calculators until we’re 1-2 weeks past their initial introduction.  This allows me to develop a solid understanding of confidence intervals using a variation on calculator commands they already know.

For example, assume you need a 95% confidence interval for the percentage of votes Bernie Sanders is likely to receive in Monday’s Iowa Caucus.  The CNN-ORC poll released January 21 showed Sanders leading Clinton 51% to 43% among 280 likely Democratic caucus-goers.  (Read the article for a glimpse at the much more complicated reality behind this statistic.)  In this sample, the proportion supporting Sanders is approximately normally distributed with sample proportion p=0.51 and standard deviation \sqrt{(.51)(.49)/280}=0.0299.  The 95% confidence interval is then defined by the bounds containing the middle 95% of this normal distribution.

Using the earlier lesson, one student suggested finding the bounds on his CAS by focusing on the tails.

golf8

giving a confidence interval of (0.45, 0.57) for Sanders for Monday’s caucus, according to the method of the CNN-ORC poll from mid-January.  Using a CAS keeps my students focused on what a confidence interval actually means without burying them in the underlying computations.
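For anyone checking the arithmetic away from the calculator, here is a minimal Python version of the same tails-focused idea (a sketch of my own, not the student’s actual Nspire syntax).

from math import sqrt
from scipy.stats import norm

p_hat = 0.51
se = sqrt(p_hat * (1 - p_hat) / 280)                   # ≈ 0.0299

# Endpoints cutting off 2.5% in each tail of the sampling distribution
print(norm.ppf([0.025, 0.975], loc=p_hat, scale=se))   # ≈ [0.45, 0.57]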

That’s nice, but what if you needed a confidence interval for a sample mean?  Unfortunately, the t-distribution on the Nspire is completely standardized, so confidence intervals need to be built from critical t-values.  Like on a normal distribution, a 95% confidence interval is defined by the bounds containing the middle 95% of the data.  One student reasonably suggested the following for a 95% confidence interval with 23 degrees of freedom.  I really liked the explicit syntax definition of the confidence interval.

golf9

Alas, the CAS returned the input.  It couldn’t find the answer in that form.  Cognizant of the lessons learned above, I suggested reframing the query with an infinite bound.

golf10

That gave the proper endpoint, but I was again dissatisfied with the need to alter the input, even though I knew why.

That’s when another of my students spoke up to say that he got the solution to work with the initial commands by including a domain restriction.

golf11

Of course!  When more than one solution is possible, restrict the bounds to the solution range you want.  Then you can use the commands that make sense.
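Away from the Nspire, the same domain-restriction idea can be mimicked with a numeric root finder.  Here is a rough Python sketch (my own; the 23 degrees of freedom come from the example above, and the bracket 0 < x < 10 plays the role of the student’s restriction).

from scipy.stats import t
from scipy.optimize import brentq

df = 23
# Middle 95% of a t-distribution: solve P(-x < T < x) = 0.95, restricted to x > 0
crit = brentq(lambda x: (t.cdf(x, df) - t.cdf(-x, df)) - 0.95, 0.001, 10)
print(crit)                 # ≈ 2.0687
print(t.ppf(0.975, df))     # the one-sided, infinite-bound version gives the same value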

FIXING THE INITIAL APPROACH

That small fix finally gave me the solution to the earlier syntax issue with the golf ball problem.  There were two solutions to the initial problem, so if we restricted the domain of the unknown mean, my students could use their intuitive approach and still get the answer they needed.

If a mean of 288 yards and a standard deviation of 2.8 yards resulted in 12.7% of the area above 291.2, then it wouldn’t take much of a left shift in the mean to leave just 1% of the area above 291.2. Surely that unknown mean would be no lower than 3 standard deviations below the current 288, somewhere above 280 yards.  Adding that single restriction to my students’ original syntax solved their problem.

golf13

Perfection!

CONCLUSION

By encouraging a deep understanding of both the underlying statistical content AND of their CAS tool, students are increasingly able to find creative solutions using flexible methods and expressions intuitive to them.  And shouldn’t intellectual strength, creativity, and flexibility be the goals of every learning experience?

 

PowerBall Redux

Donate to a charity instead.  Let me explain.
The majority of responses to my PowerBall description/warnings yesterday have been, “If you don’t play, you can’t win.”  Unfortunately, I know many, many people are buying many lottery tickets, way more than they should.
 
OK.  For almost everyone, there’s little harm in spending $2 on a ticket for the entertainment, but don’t expect to win, and don’t buy multiple tickets unless you can afford to do without every dollar you spend. I worry about those who are “investing” tens or hundreds of dollars on any lottery.
Two of my school colleagues captured the idea of a lottery yesterday with their analogies,
Steve:  Suppose you go to the beach and grab a handful of sand and bring it back to your house.  And you do that every single day. Then your odds of winning the powerball are still slightly worse than picking out one particular grain of sand from all the sand you accumulated over an entire year.
Or more simply put from the perspective of a lottery official, 
Patrick:  Here’s our idea.  You guys all throw your money in a big pile.  Then, after we take some of it, we’ll give the pile to just one of you.
WHY YOU SHOULDN’T BUY MULTIPLE TICKETS:
For perspective, a football field is 120 yards long, or about 703.6 dollar bills long using the logic of my last post. Rounding up, that would buy you 352 PowerBall tickets. That means investing $704 would buy you a single football-field length of chances out of 10.5 coast-to-coast traverses of the entire United States.  There’s going to be an incredibly large number of disappointed people tomorrow.
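A quick check of that football-field arithmetic (my own sketch):

bill_width = 6.14                       # inches, as in my last post
field_in_bills = 120 * 36 / bill_width  # a 120-yard field measured in dollar bills
print(field_in_bills)                   # ≈ 703.6
print(field_in_bills / 2)               # ≈ 351.8, so about 352 two-dollar tickets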
MORAL:  Even an incredibly large multiple of a less-than-microscopic chance is still a less-than-microscopic chance.
BETTER IDEA: Assume you have the resources and are willing to part with tens or hundreds of dollars for no likelihood of tangible personal gain.  Using the $704 football example, buy 2 tickets and donate the other $700 to charity. You’ll do much more good.

PowerBall Math

Given the record size and mania surrounding the current PowerBall Lottery, I thought some of you might be interested in bringing that game into perspective.  This could be an interesting application with some teachers and students.

It certainly is entertaining for many to dream about what you would do if you happened to be lucky enough to win an astronomical lottery.  And lottery vendors are quick to note that your dreams can’t come true if you don’t play.  Nice advertising.  I’ll let the numbers speak to the veracity of the Lottery’s encouragement.

PowerBall is played by picking any 5 different numbers between 1 & 69, and then one PowerBall number between 1 & 26.  So there are nCr(69,5)*26=292,201,338 outcomes for this game.  Unfortunately, humans have a particularly difficult time understanding extremely large numbers, so I offer an analogy to bring it a little into perspective.

  • The horizontal width of the United States is generally reported to be 2680 miles, and a U.S. dollar bill is 6.14 inches wide.  That means the U.S. is approximately 27,655,505 dollar bills wide.
  • If I have 292,201,338 dollar bills (one for every possible PowerBall outcome), I could make a line of dollar bills placed end-to-end from the U.S. East Coast all the way to the West Coast, back to the East, back to the West, and so forth, passing back and forth between the two coasts just over 10.5 times.
  • Now imagine that exactly one of those dollar bills was replaced with a replica dollar bill made from gold colored paper.
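For anyone who wants to verify the counts behind this analogy, here is a short Python check (a sketch of my own, using the same 2680-mile width and 6.14-inch bill width from above).

from math import comb

outcomes = comb(69, 5) * 26                 # 292,201,338 possible PowerBall tickets
print(outcomes)

us_width_inches = 2680 * 5280 * 12          # 2680 miles expressed in inches
bills_across = us_width_inches / 6.14       # one dollar bill is 6.14 inches wide
print(bills_across)                         # ≈ 27.7 million bills coast to coast

print(outcomes / bills_across)              # ≈ 10.5 coast-to-coast passes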

 

Your chances of winning the PowerBall lottery are the same as randomly selecting that single gold note from all of those dollar bills laid end-to-end and crossing the entire breadth of the United States 10.5 times. 

Dreaming is fun, but how likely is this particular dream to become real?

Play the lottery if doing so is entertaining to you, but like going to the movie theater, don’t expect to get any money back in return.

Mistakes are Good

Confession #1:  My answers on my last post were WRONG.

I briefly thought about taking that post down, but discarded that idea when I thought about the reality that almost all published mathematics is polished, cleaned, and optimized.  Many students struggle with mathematics under the misconception that their first attempts at any topic should be as polished as what they read in published sources.

While not precisely from the same perspective, Dan Teague recently wrote an excellent, short piece of advice to new teachers on NCTM’s ‘blog entitled Demonstrating Competence by Making Mistakes.  I argue Dan’s advice actually applies to all teachers, so in the spirit of showing how to stick with a problem and not just walking away saying “I was wrong”, I’m going to keep my original post up, add an advisory note at the start about the error, and show below how I corrected my error.

Confession #2:  My approach was a much longer and far less elegant solution than the identical approaches offered by a comment by “P” on my last post and the solution offered on FiveThirtyEight.  Rather than just accepting the alternative solution, as too many students are wont to do, I acknowledged the more efficient approach of others before proceeding to find a way to get the answer through my initial idea.

I’ll also admit that I didn’t immediately see the simple approach to the answer and rushed my post in the time I had available to get it up before the answer went live on FiveThirtyEight.

GENERAL STRATEGY and GOALS:

1-Use a PDF:  The original FiveThirtyEight post asked for the expected time before the siblings simultaneously finished their tasks.  I interpreted this as an expected value, and I knew how to compute the expected value of a random variable from its pdf.  All I needed were the potential wait times, t, and their corresponding probabilities.  My approach was solid, but a few of my computations were off.

2-Use Self-Similarity:  I don’t see many people employing the self-similarity tactic I used in my initial solution.  Resolving my initial solution would allow me to continue using what I consider a pretty elegant strategy for handling cumbersome infinite sums.

A CORRECTED SOLUTION:

Stage 1:  My table for the distribution of initial choices was correct, as were my conclusions about the probability and expected time if they chose the same initial app.

App1

My first mistake was in my calculation of the expected time if they did not choose the same initial app.  The 20 numbers in blue above represent that sample space.  Notice that there are 8 cases where the longer of the two choices was a 5-minute app, 6 cases where the longer choice was a 4-minute app, 4 cases where it was a 3-minute app, and 2 cases where it was a 2-minute app.  So the expected length of time spent on the longer app, if the same app was not chosen by both, is

E(Round1) = \frac{1}{20}*(8*5+6*4+4*3+2*2)=4 minutes,

a notably longer time than I initially reported.

For the initial app choice, there is a \frac{1}{5} chance they choose the same app for an average time of 3 minutes, and a \frac{4}{5} chance they choose different apps for an average time of 4 minutes.

Stage 2:  My biggest error was a rushed assumption that all of the entries I gave in the Round 2 table were equally likely.  That is clearly false as you can see from Table 1 above.  There are only two instances of a time difference of 4, while there are eight instances of a time difference of 1.  A correct solution using my approach needs to account for these varied probabilities.  Here is a revised version of Table 2 with these probabilities included.

App4

Conveniently–as I had noted without full realization in my last post–the revised Table 2 still shows the distribution for the 2nd and all future potential rounds until the siblings finally align, including the probabilities.  This proved to be a critical feature of the problem.

Another oversight was not fully recognizing which events would contribute to increasing the time before parity.  The yellow highlighted cells in Table 2 are those for which the next app choice was longer than the current time difference, and any of these would increase the length of a trial.

I was initially correct in concluding there was a \frac{1}{5} probability of the second app choice achieving a simultaneous finish and that this would not result in any additional total time.  I missed the fact that the six non-highlighted values also did not result in additional time and that there was a \frac{1}{5} chance of this happening.

That leaves a \frac{3}{5} chance of the trial time extending by selecting one of the highlighted events.  If that happens, the expected time the trial would continue is

\displaystyle \frac{4*4+(4+3)*3+(4+3+2)*2+(4+3+2+1)*1}{4+(4+3)+(4+3+2)+(4+3+2+1)}=\frac{13}{6} minutes.

Iterating:  So now I recognized there were 3 potential outcomes at Stage 2–a \frac{1}{5} chance of matching and ending, a \frac{1}{5} chance of not matching but not adding time, and a \frac{3}{5} chance of not matching and adding an average \frac{13}{6} minutes.  Conveniently, the last two possibilities still combined to recreate perfectly the outcomes and probabilities of the original Stage 2, creating a self-similar, pseudo-fractal situation.  Here’s the revised flowchart for time.

App5

Invoking the similarity, if there were T minutes remaining after arriving at Stage 2, then there was a \frac{1}{5} chance of adding 0 minutes, a \frac{1}{5} chance of remaining at T minutes, and a \frac{3}{5} chance of adding \frac{13}{6} minutes–that is being at T+\frac{13}{6} minutes.  Equating all of this allows me to solve for T.

T=\frac{1}{5}*0+\frac{1}{5}*T+\frac{3}{5}*\left( T+\frac{13}{6} \right) \longrightarrow T=6.5 minutes

Time Solution:  As noted above, at the start, there was a \frac{1}{5} chance of immediately matching with an average 3 minutes, and there was a \frac{4}{5} chance of not matching while using an average 4 minutes.  I just showed that from this latter stage, one would expect to need to use an additional mean 6.5 minutes for the siblings to end simultaneously, for a mean total of 10.5 minutes.  That means the overall expected time spent is

Total Expected Time =\frac{1}{5}*3 + \frac{4}{5}*10.5 = 9 minutes.

Number of Rounds Solution:  My initial computation of the number of rounds was actually correct–despite the comment from “P” in my last post–but I think the explanation could have been clearer.  I’ll try again.

App6

One round is obviously required for the first choice, and in the \frac{4}{5} chance the siblings don’t match, let N be the average number of rounds remaining.  In Stage 2, there’s a \frac{1}{5} chance the trial will end with the next choice, and a \frac{4}{5} chance there will still be N rounds remaining.  This second situation is correct because both the no time added and time added possibilities combine to reset Table 2 with a combined probability of \frac{4}{5}.  As before, I invoke self-similarity to find N.

N = \frac{1}{5}*1 + \frac{4}{5}*(1+N) \longrightarrow N=5

Therefore, the expected number of rounds is \frac{1}{5}*1 + \frac{4}{5}*5 = 4.2 rounds.

It would be cool if someone could confirm this prediction by simulation.
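In that spirit, here is a rough Python simulation of the process as I understand it (a sketch of my own; counting the initial simultaneous pick as one round and each later single pick as another round is my assumption about what a “round” means).  The mean finish time should land near the 9 minutes computed above.

import random

def one_trial():
    # Both siblings start a random 1-5 minute app at the same time.
    t_a, t_b = random.randint(1, 5), random.randint(1, 5)
    rounds = 1
    # Whoever finishes first immediately starts another random app,
    # until both finish at exactly the same moment.
    while t_a != t_b:
        rounds += 1
        if t_a < t_b:
            t_a += random.randint(1, 5)
        else:
            t_b += random.randint(1, 5)
    return t_a, rounds        # t_a == t_b is the simultaneous finish time

trials = 200_000
results = [one_trial() for _ in range(trials)]
print(sum(t for t, _ in results) / trials)   # mean finish time, should be near 9 minutes
print(sum(r for _, r in results) / trials)   # mean rounds under my counting convention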

CONCLUSION:

I corrected my work and found the exact solution proposed by others and simulated by Steve!   Even better, I have shown that my approach works: while notably less elegant, one can solve this expected value problem by invoking the definition of expected value.

Best of all, I learned from a mistake and didn’t give up on a problem.  Now that’s the real lesson I hope all of my students get.

Happy New Year, everyone!

How One Data Point Destroyed a Study

Statistics are powerful tools.  Well implemented, they tease out underlying patterns from the noise of raw data and improve our understanding.  But their users must take care to avoid misstatements.  Unfortunately, statistics can also be used to deliberately distort relationships, declaring patterns where none exist.  In my AP Statistics classes, I hope my students learn to extract meaning from well-designed studies, and to spot instances of Benjamin Disraeli’s “three kinds of lies:  lies, damned lies, and statistics.”

This post explores part of a study published August 12, 2015, exposing what I believe to be examples of four critical ways statistics are misunderstood and misused:

  • Not recognizing the distorting power of outliers on means, standard deviations, and, in the case of the study below, regressions,
  • Distorting graphs to create the impression of patterns different from what actually exists,
  • Cherry-picking data to show only favorable results, and
  • Misunderstanding the p-value in inferential studies.

THE STUDY:

I was searching online for examples of research I could use with my AP Statistics classes when I found, on the page of a math teacher organization, a link to an article entitled “Cardiorespiratory fitness linked to thinner gray matter and better math skills in kids.”  Following the URL trail, I found a description of the referenced article in an August 2015 summary by Science Daily and the actual research, published August 12, 2015, in the journal PLOS ONE.

As a middle and high school teacher, I’ve read multiple studies connecting physical fitness to brain health.  I was sure I had hit paydirt with an article offering multiple, valuable lessons for my students!  I read the claims of the Science Daily research summary correlating the physical fitness of 9- and 10-year-old children to performance on a test of arithmetic.  It was careful not to declare cause-and-effect,  but did say

The team found differences in math skills and cortical brain structure between the higher-fit and lower-fit children. In particular, thinner gray matter corresponded to better math performance in the higher-fit kids. No significant fitness-associated differences in reading or spelling aptitude were detected. (source)

The researchers described plausible connections between the aerobic fitness of children and the thickness of cortical gray matter for each participating child.  The study went astray when they attempted to connect their findings to the academic performance of the participants.

Independent t-tests were employed to compare WRAT-3 scores in higher fit and lower fit children. Pearson correlations were also conducted to determine associations between cortical thickness and academic achievement. The alpha level for all tests was set at p < .05. (source)

All of the remaining images, quotes, and data in this post are pulled directly from the primary article on PLOS ONE.  The URLs are provided above, and bibliographic references are at the end.

To address questions raised by the study, I had to access the original data and recreate the researchers’ analyses.  Thankfully, PLOS ONE is an open-access journal, and I was able to download the research data.  In case you want to review the data yourself or use it with your classes, here is the original SPSS file which I converted into Excel and TI-Nspire CAS formats.

BEWARE OUTLIERS and MISLEADING SCALES:

My suspicions were piqued when I saw the following two graphs–the only scatterplots offered in their research publication.

fitness1

Scatterplot 1:  Attempt to connect Anterior Frontal Gray Matter thickness with WRAT-3 Arithmetic performance

Most of the top scatterplot looked like an uncorrelated cloud of data, with one data point on the far left seeming to pull the left side of the linear regression upward, creating a more negative slope.  Because the study reported only two statistically significant correlations between the WRAT tests and cortical thickness in two areas of the brain, I was now concerned that the single extreme data point may have distorted the results.

My initial scatterplot (below) confirmed the published graph, but fit to the entire window, the data now looked even less correlated.

fitness3

At this scale, the leftmost data point (WRAT Arithmetic score = 66, Anterior Frontal thickness = 3.9) looked much more like an outlier.  I confirmed that the point fell more than 1.5 IQRs below the lower quartile, as indicated visually in a boxplot of the WRAT-Arithmetic scores.

fitness7

Also note from my rescaled scatterplot that this point’s Anterior Frontal measure (y-coordinate) was higher than that of any of the next five ordered pairs to its right.  Its horizontal outlier location, coupled with its notably higher vertical component, suggested that the single point could have significant influence on any regression on the data.  There was sufficient evidence for me to investigate the study results excluding the (66, 3.9) data point.

fitness4

The original linear regression on the 48 (WRAT Arithmetic, AF thickness) data was AF=-0.007817(WRAT_A)+4.350.  Excluding (66, 3.9), the new scatterplot above shows the revised linear regression on the remaining 47 points:  AF=-0.007460(WRAT_A)+4.308.  This and the original equation are close, but the revised slope is 4.6% smaller in magnitude relative to the published result. With the two published results reported significant at p=0.04, the influence of the outlier (66, 3.9) has a reasonable possibility of changing the study results.

Scatterplot 2:  Attempt to connect Superior Frontal Gray Matter thickness with WRAT-3 Arithmetic performance

The tightly compressed scale of the second published scatterplot made me deeply suspicious that the (WRAT Arithmetic, Superior Frontal thickness) data was being vertically compressed to create the illusion of a linear relationship where one possibly did not exist.

Rescaling the graphing window (below) made the data appear notably less linear than the publication implied.  Also, the data point corresponding to the WRAT-Arithmetic score of 66 appeared to exert the same outlier influence as in the first data set.  It was still an outlier, but now its vertical component was higher than those of the next eight data points to its right, with some of them notably lower.  Again, there was sufficient evidence to investigate results excluding the outlier data point.

fitness2

The linear regression on the original 48 (WRAT Arithmetic, SF thickness) data points was SF=-0.002767(WRAT_A)+4.113 (above).  Excluding the outlier, the new scatterplot (below) had the revised linear regression SF=-0.002391(WRAT_A)+4.069.  This time, the revised slope was 13.6% smaller in magnitude relative to the original slope.  With the published significance also at p=0.04, omitting the outlier was almost certain to change the published results.

fitness5

THE OUTLIER BROKE THE STUDY

The findings above strongly suggest the published study results are not as reliable as reported.  It is time to rerun the significance tests.

For the first data set, (WRAT Arithmetic, AF thickness), I ran the t-test on the regression slope with and without the outlier.

  • INCLUDING OUTLIER:  For all 48 samples, the researchers reported a slope of -0.007817, r=-0.292, and p=0.04.  This was reported as a significant result.
  • EXCLUDING OUTLIER:  For the remaining 47 samples, the slope is -0.007460, r=-0.252, and p=0.087.  The r confirms the visual impression that the data was less linear and, most importantly, the correlation is no longer significant at \alpha <0.05.

For the second data set–(WRAT Arithmetic, SF thickness):

  • INCLUDING OUTLIER:  For all 48 samples, the researchers reported a slope of -0.002767, r=-0.291, and p=0.04.  This was reported as a significant result.
  • EXCLUDING OUTLIER:  For the remaining 47 samples, the slope is -0.002391, r=-0.229, and p=0.121.  This revision is even less linear and, most importantly, the correlation is no longer significant for any standard significance level.

In brief, the researchers’ arguable decision to include the single, clear outlier data point was the source of any significant results at all.  Whatever correlation exists between gray matter thickness and WRAT-Arithmetic as measured by this study is tenuous, at best, and almost certainly not significant.
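If you download the data linked above and want to reproduce this comparison outside an Nspire, a sketch like the following works (my own code; wrat_arith and af_thickness stand in for the 48-value columns from the study’s file and are not names from the original data set).

import numpy as np
from scipy import stats

def slope_r_p(x, y, drop=None):
    """Regress y on x with scipy.stats.linregress, optionally dropping one (x, y) pair first."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    if drop is not None:
        keep = ~(np.isclose(x, drop[0]) & np.isclose(y, drop[1]))
        x, y = x[keep], y[keep]
    fit = stats.linregress(x, y)
    return fit.slope, fit.rvalue, fit.pvalue

# Usage sketch, once the two columns are loaded from the study’s data file:
# print(slope_r_p(wrat_arith, af_thickness))              # all 48 points
# print(slope_r_p(wrat_arith, af_thickness, (66, 3.9)))   # without the outlier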

THE DANGERS OF CHERRY-PICKING RESULTS:

So, let’s set aside the entire questionable decision to keep an outlier in the data set to achieve significant findings.  There is still a subtle, potential problem with this study’s result that actually impacts many published studies.

The researchers understandably were seeking connections between the thickness of a brain’s gray matter and the academic performance of that brain as measured by various WRAT instruments.  They computed independent t-tests of linear regression slopes between thickness measures at nine different locations in the brain against three WRAT test measures for a total of 27 separate t-tests.  The next table shows the correlation coefficient and p-value from each test.

fitness6

This approach is common, with researchers reporting out only the tests found to be significant.  But in doing so, the researchers may have overlooked a fundamental property of the confidence intervals that underlie p-values.  A critical value of p=0.05 corresponds to a 95% confidence level, and one interpretation of that level is that, under the conditions of the assumed null hypothesis, results falling in the most extreme 5% of outcomes will NOT be considered as resulting from the null hypothesis, even though they are.

In other words, even when the null hypothesis is true, 5% of results would be deemed different enough to be declared statistically significant, a Type I Error.  Within this study, this defines a binomial probability situation with 27 trials, each having probability p=0.05 of producing a significant result even though the null hypothesis is correct.

The binomial probability of finding exactly 2 significant results at p=0.05 over 27 trials is 0.243, and the probability of producing 2 or more significant results when the null hypothesis is true is 39.4%.

fitness8

That means there is a 39.4% probability that any study testing 27 conditions at a p<0.05 critical value would report at least 2 results that would INCORRECTLY be interpreted as contradicting the null hypothesis.  And if more than 27 conditions are tested, the probability of a Type I Error is even higher.
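Those binomial figures are easy to check with a couple of lines (a sketch of my own):

from scipy.stats import binom

n, p = 27, 0.05            # 27 tests, each with a 5% chance of a Type I Error
print(binom.pmf(2, n, p))  # P(exactly 2 "significant" results) ≈ 0.243
print(binom.sf(1, n, p))   # P(2 or more)                       ≈ 0.394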

Whenever you have a large number of inference trials, there is an increasingly large probability that at least some of the “significant” trials are actually just random, undetected occurrences of the null hypothesis.

It just happens.

THE ELUSIVE MEANING OF A p-VALUE:

For more on the difficulty of understanding p-values, check out this nice recent article on FiveThirtyEight Science–Not Even Scientists Can Easily Explain P-Values. 

CONCLUSION:

Personally, I’m a little disappointed that this study didn’t find significant results.  There are many recent studies showing the connection between physical activity and brain health, but this study didn’t achieve its goal of finding a biological source to explain the correlation.

It is the responsibility of researchers to know their studies and their resulting data sets.  Not finding significant results is not a problem.  But I do expect research to disclose when its significant results hang entirely on a choice to retain an outlier in its data set.

REFERENCES:

Chaddock-Heyman L, Erickson KI, Kienzler C, King M, Pontifex MB, Raine LB, et al. (2015) The Role of Aerobic Fitness in Cortical Thickness and Mathematics Achievement in Preadolescent Children. PLoS ONE 10(8): e0134115. doi:10.1371/journal.pone.0134115

University of Illinois at Urbana-Champaign. “Cardiorespiratory fitness linked to thinner gray matter and better math skills in kids.” ScienceDaily. http://www.sciencedaily.com/releases/2015/08/150812151229.htm (accessed December 8, 2015).

 

 

Recentering Normal Curves, revisited

I wrote here about using a CAS to determine the new mean of a recentered normal curve from an AP Statistics exam question from the last decade.  My initial post shared my ideas on using CAS technology to determine the new center.  After hearing some of my students’ attempts to solve the problem, I believe they took a simpler, more intuitive approach than I had proposed.

REVISITING:

In the first part of the problem, solvers found the mean and standard deviation of the wait time for one train: \mu = 30 and \sigma = \sqrt{500}, respectively.  Then, students computed the wait-time probability in question to be 0.910144.

normCAS1

The final part of the question asked how long that train would have to be delayed to reduce that probability to 0.01.  Here’s where my solution diverged from my students’ approach.  Being comfortable with transformations, I thought of the solution as the original time less some unknown delay, an equation easily solved on our CAS.

normCAS4

STUDENT VARIATION:

Instead of thinking of the delay–the explicit goal of the AP question–my students  sought the new starting time.  Now that I’ve thought more about it, knowing the new time when the train will leave does seem like a more natural question and avoids the more awkward expression I used for the center.

The setup is the same, but now the new unknown variable, the center of the translated normal curve, is newtime.  Using their CAS solve command, they found

newnorm

It was a little different to think about negative time, but they found the difference between the new center (-52.0187 minutes) and the original (30 minutes) to be 82.0187 minutes, the same solution I discovered using transformations.
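For anyone recreating this away from the Nspire, here is a Python sketch of the students’ approach (my own code; I am assuming the probability in question is that of the normal variable exceeding 0, which is consistent with the 0.910144 computed from \mu = 30 and \sigma = \sqrt{500}).

from math import sqrt
from scipy.stats import norm
from scipy.optimize import brentq

sigma = sqrt(500)

print(norm.sf(0, loc=30, scale=sigma))    # ≈ 0.910144 with the original center of 30

# Solve for the new center that drops that probability to 0.01
newtime = brentq(lambda m: norm.sf(0, loc=m, scale=sigma) - 0.01, -200, 200)
print(newtime)        # ≈ -52.02
print(30 - newtime)   # ≈ 82.02 minutes, the delay found via transformations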

CONCLUSION:

This is nothing revolutionary, but my students’ thought processes were cleaner than mine.  And fresh thinking is always worth celebrating.

Confidence Intervals via graphs and CAS

Confidence intervals (CIs) are a challenging topic for many students, a task made more challenging, in my opinion, because many (most?) statistics texts approach CIs via z-scores.  While repeatedly calculating CI endpoints from standard deviations explains the underlying mathematical structure, it relies on an (admittedly simple) algebraic technique that predates classroom technology currently available for students on the AP Statistics exam.

Many (most?) statistics packages now include automatic CI commands.  Unfortunately for students just learning what a CI means, automatic commands can become computational “black boxes.”  Both CAS and graphing techniques offer a strong middle ground–enough foundation to reinforce what CIs mean with enough automation to avoid unnecessary symbol manipulation time.

In most cases, this is accomplished by understanding a normal cumulative distribution function (cdf) as a function, not just as an electronic substitute for normal probability tables.  In this post, I share three approaches to determining CIs on a TI-Nspire CAS, each with two variations.

SAMPLE PROBLEM:

In 2010, the mean ACT mathematics score for all tests was 21.0 with standard deviation 5.3.  Determine a 90% confidence interval for the math ACT score of an individual chosen at random from all 2010 ACT test takers.

METHOD 1a — THE STANDARD APPROACH:

A 90% CI excludes the extreme 5% on each end of the normal distribution.  Using an inverse normal command gives the z-scores at the corresponding 5% and 95% locations on the normal cdf.

normCAS6

Of course, utilizing symmetry would have required only one command.  To find the actual boundary points of the CI, standardize the endpoints, x, and equate that to the two versions of the z-scores.

\displaystyle \frac{x-21.0}{5.3} = \pm 1.64485

Solving these rational equations for x gives x=12.28 and x=29.72, or CI = \left[ 12.28,29.72 \right] .

Most statistics software lets users avoid this computation with optional parameters for the mean and standard deviation of non-standard normal curves.  One of my students last year used this in the next variation.

METHOD 1b — INTRODUCING LISTS:

After using lists as shortcuts on our TI-Nspires last year for evaluating functions at several points simultaneously, one of my students creatively applied them to the inverse normal command, entering the separate 0.05 and 0.95 cdf probabilities as a single list.  I particularly like how the output of this approach looks exactly like a CI.

CI4
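Both variations translate almost directly into other statistics environments.  Here is a quick Python equivalent (a sketch of my own, not Nspire syntax) of the standardize-and-unstandardize computation from METHOD 1a and the list shortcut from METHOD 1b.

from scipy.stats import norm

z = norm.ppf(0.95)                       # ≈ 1.64485, the z-score cutting off the upper 5%
print(21.0 - z * 5.3, 21.0 + z * 5.3)    # ≈ 12.28, 29.72

# METHOD 1b analogue: hand the inverse normal both probabilities at once
print(norm.ppf([0.05, 0.95], loc=21.0, scale=5.3))   # ≈ [12.28, 29.72]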

METHOD 2a — CAS:

The endpoints of a CI are just endpoints of an interval on a normal cdf, so why not avoid the algebra and additional inverse normal command and determine the endpoints via CAS commands?  My students know the solve command from previous math classes, so after learning the normal cdf command, there are very few situations for them to even use the inverse.

CI1

This approach keeps my students connected to the normal cdf, and solving for the bounds quickly recovers the previous CI endpoints.
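For readers without a CAS, the same idea works with any numeric root finder standing in for the solve command.  Here is a rough Python sketch (my own; the search brackets of six standard deviations on each side are arbitrary).

from scipy.stats import norm
from scipy.optimize import brentq

mu, s = 21.0, 5.3
lower = brentq(lambda x: norm.cdf(x, mu, s) - 0.05, mu - 6 * s, mu)
upper = brentq(lambda x: norm.cdf(x, mu, s) - 0.95, mu, mu + 6 * s)
print(lower, upper)    # ≈ 12.28, 29.72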

METHOD 2b (Alas, not yet) — CAS and LISTS:

Currently, the numerical techniques the TI-Nspire family uses to solve equations with statistics commands don’t work well with lists in all situations.  Curiously, the Nspire can’t handle the solve+lists equivalent of the inverse normal+lists approach in METHOD 1b.

CI5

But, I’ve also learned that problems not easily solved in an Nspire CAS calculator window typically crack pretty easily when translated to their graphical equivalents.

METHOD 3a — GRAPHING:

This approach should work for any CAS or non-CAS graphing calculator or software with statistics commands.

Remember the “f” in cdf.  A cumulative distribution function is a function, and graphing calculators/software treat it as such.  Replacing the normCdf upper bound with an x for standard graphing syntax lets one graph the normal cdf (below).

Also remember that any algebraic equation can be solved graphically by independently graphing each side of the equation and treating the resulting pair of equations as a system of equations.  In this case, graphing y=0.05 and y=0.95 and finding the points of intersection gives the values of the CI.

CI2
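The same picture is easy to mock up outside the Nspire.  Here is a matplotlib sketch (my own) graphing the cdf together with the two horizontal lines whose intersections mark the CI endpoints.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(5, 37, 400)
plt.plot(x, norm.cdf(x, 21.0, 5.3), label="y = normCdf(-inf, x, 21, 5.3)")
plt.axhline(0.05, linestyle="--", label="y = 0.05")
plt.axhline(0.95, linestyle="--", label="y = 0.95")
plt.legend()
plt.show()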

METHOD 3b — GRAPHING and LISTS:

SIDENOTE:  While lists didn’t work with the CAS in the case of METHOD 2b, the next screen shows the syntax to graph both ends of the CI using lists with a single endpoint equation.

CI3

The lists obviously weren’t necessary here, but the ability to use lists is a very convenient feature on the TI-Nspire that I’ve leveraged countless times to represent families of functions.  In my opinion, using them in METHOD 3b plays on that same idea: the endpoints you seek are different aspects of the same family, the CI.

CONCLUSION:

There are many ways for students in their first statistics courses to use what they already know to determine the endpoints of a confidence interval.  Keeping students’ attention focused on new ways to use old information solidifies both old and new content.  Eliminating unnecessary computations that aren’t the point of introductory statistics anyway is an added bonus.

Happy learning everyone…