Part 4: Beware of Data Snooping

Correlation Does Not Imply Causation, So What Does?

This video is part of the Correlation Does Not Imply Causation, So What Does series, presented by Jim Colton, Lead Statistical Consultant at GraphPad.

Transcript:

All right. Also I want you to be careful of what's called data snooping. I gave the example earlier with baseball where I looked at a lot of correlations, and there is a danger in that. Here I've got 10 randomly simulated columns. This is the Prism correlation matrix again, and dark blue means that there's a strong positive correlation. I have a 0.77 correlation, which is very strong, close to 1, and I also have a negative 0.74 below it, also very strong in the negative direction.

And I didn't keep simulating until I got this. This is just the first simulation; this is what I got with 10 columns of data and a sample size of 10 in each. You can see how you'd get really excited: hey, in this data set, there are some good correlations. Maybe we should explore these further. And if you look at the P values, the P value for that one correlation is 0.009, so that would look highly significant.
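The simulation described here can be sketched in a few lines of Python. This is only an illustration, not Prism's procedure: the standard-normal columns, the seed, and the function names `pearson_r` and `strongest_correlation` are assumptions made for the example. It generates 10 unrelated columns of 10 values each, computes the correlation for every pair of columns, and reports the strongest one.

```python
import itertools
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strongest_correlation(n_columns=10, n_rows=10, seed=None):
    """Simulate independent columns and return (number of pairs, strongest pair).

    Assumption: each column is drawn from a standard normal distribution,
    so any correlation between columns is due to chance alone.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_columns)]
    results = [
        (i, j, pearson_r(columns[i], columns[j]))
        for i, j in itertools.combinations(range(n_columns), 2)
    ]
    best = max(results, key=lambda item: abs(item[2]))
    return len(results), best


n_pairs, (i, j, r) = strongest_correlation(seed=1)
print(f"{n_pairs} pairs examined; strongest is columns {i} and {j} with r = {r:.2f}")
```

Running this with different seeds gives a feel for how large the strongest of the 45 pairwise correlations tends to be when the columns are truly unrelated and the sample size is only 10.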

There's a statistical relationship between, I don't know, column one and three, or whatever it was. However, we're looking at 45 P values in this example, and you can't just look at 45 P values or 45 numbers or whatever you're looking at, pick out the best or worst one, and only talk about that without accounting for the fact that you looked at that many. If you adjust for looking at that many P values, which Prism has a tool for, you'll find that there's really nothing surprising about a P value of 0.009. This is called data snooping: picking that one off and only reporting that number without accounting or adjusting for everything else you looked at. Just want to point that out.
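To see why 0.009 is unremarkable here, a rough calculation helps. The sketch below is an assumption-laden illustration: the transcript does not say which adjustment Prism applies, so it uses the Bonferroni correction as one common choice, and the "at least one" calculation treats the 45 tests as independent, which is only approximately true for pairwise correlations. The function names are hypothetical.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted P value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


def chance_of_at_least_one(p_threshold, n_tests):
    """Probability that at least one of n_tests null P values falls at or
    below p_threshold. Assumption: the tests are independent."""
    return 1.0 - (1.0 - p_threshold) ** n_tests


# 10 columns give 10 * 9 / 2 = 45 distinct pairs, hence 45 P values.
n_tests = 10 * 9 // 2

adjusted = bonferroni(0.009, n_tests)
chance = chance_of_at_least_one(0.009, n_tests)

print(f"Bonferroni-adjusted P value: {adjusted:.3f}")
print(f"Chance of at least one P <= 0.009 among {n_tests} tests: {chance:.2f}")
```

Under these assumptions the adjusted P value is 45 × 0.009 = 0.405, and the chance of seeing at least one P value that small among 45 null tests is about one in three, so the "highly significant" result is roughly what chance alone would be expected to produce.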
