Part 5: How to Establish Cause and Effect

Correlation Does Not Imply Causation, So What Does?

This video is part of the Correlation Does Not Imply Causation, So What Does series, presented by Jim Colton, Lead Statistical Consultant at GraphPad.

Transcript:

Now, I want to get to probably the most important part of this talk: how to establish cause and effect. I have three areas to talk about here, going from least effective to most effective. Okay?

So, the first is including covariates. Let's go back to palm size. You remember the males and females, the different palm sizes and longevity? I've got hand size on the X axis and longevity (age) on the Y axis. If I put male/female in as a covariate in the analysis, there's no relationship here.

So, when you account for other variables, or control for them, as a lot of people say, you can do a much better job of figuring out what's a true relationship and what's not.

And in this case we really just eliminated hand size as an important variable: the trend we thought we saw on the previous page was explained away by including male/female in the model.

So, covariates do a decent job of attempting to control for these lurking or hidden variables. You remember I talked about that at the beginning.

Multiple regression is the tool you'll use most of the time to control for those lurking variables, and Prism 8 does have multiple regression. The challenge is, we can't always identify all the ... in fact, we very often can't identify all the variables that we need to include. And even if we could, we can't measure all of them.

It's very hard to account for everything and measure everything in one analysis. So, an observational dataset that just has a bunch of decent covariates probably isn't good enough, but at least it's an approach that attempts to help you figure out what's cause and effect.
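To make the covariate idea concrete, here is a minimal sketch in Python using statsmodels rather than Prism; the data are simulated and the variable names are purely illustrative. Fitting longevity on hand size alone shows an apparent trend, and adding sex as a covariate makes it disappear.

```python
# Minimal sketch (simulated data): controlling for a lurking variable with
# multiple regression. Hand size only "predicts" longevity through sex.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, n)                         # 0 = female, 1 = male
hand_size = 17 + 2.5 * sex + rng.normal(0, 1, n)    # males tend to have larger hands
longevity = 81 - 5 * sex + rng.normal(0, 6, n)      # females tend to live longer

df = pd.DataFrame({"hand_size": hand_size, "sex": sex, "longevity": longevity})

# Without the covariate, hand size looks like it matters...
print(smf.ols("longevity ~ hand_size", data=df).fit().summary().tables[1])

# ...but with sex in the model, the hand-size effect collapses toward zero.
print(smf.ols("longevity ~ hand_size + sex", data=df).fit().summary().tables[1])
```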

The second area is pure science. So, certain things that you can map out. Here we've got smoking causes mutations in the DNA, which leads to lung cancer. And so, scientifically you can show that it's likely, or logical that smoking causes lung cancer.

The challenge here is that it's difficult, from the science alone, to say this will cause cancer in actual people, because we all have different DNA and our bodies are complicated. You really need to collect some data to understand how strong that cause-and-effect relationship is, if it exists.

Science is nice, but the gold standard is randomization, so I want to talk about this. Clinical trials are all based on this concept of randomization. You have your patients and you randomly assign them to treatment and control, with the hope that all these extraneous variables, these lurking or hidden variables, get equally balanced in the two groups.
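As a rough sketch of what randomization is doing, here is a small Python example with simulated patients and made-up numbers: shuffle the patient list, assign half to treatment and half to control, and check that a hidden variable such as age comes out roughly balanced.

```python
# Minimal sketch (simulated data): random assignment tends to balance lurking
# variables, such as age, across treatment and control on average.
import numpy as np

rng = np.random.default_rng(42)
n_patients = 100
age = rng.normal(55, 12, n_patients)     # a hidden variable we hope to balance

order = rng.permutation(n_patients)      # randomize the patient order
treatment = order[: n_patients // 2]     # first half -> treatment
control = order[n_patients // 2 :]       # second half -> control

print("mean age, treatment:", round(age[treatment].mean(), 1))
print("mean age, control:  ", round(age[control].mean(), 1))
# Any single randomization can be somewhat unbalanced by chance, but there is
# no systematic tilt toward either group.
```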

So, this is the gold standard, if you can run it. The problem here, and I think a lot of you know this, is that a lot of times you can't ethically run these experiments. You can't tell people, "Smoke a pack of cigarettes for 10 years. You're in that group. You got assigned to that group. We'll give you the cigarettes for free." It's still not going to work.

You're not going to get people; you can't force people to participate in many experiments. That's the challenge, but if you can run a randomized trial, that's ideal.

Here, I'll go back to that: estrogen and heart attacks. If you remember, it was initially thought that estrogen, or hormone replacement therapy, in women reduced the risk of heart attacks, but after running some randomized trials they found out that in fact the opposite was true.

Yeah, that was the study I showed earlier. So, the randomized trial really saved the day here and reversed the incorrect conclusion that people had reached based on just a correlation.

One question people often ask about randomization is, "What if it's unbalanced? What if I randomize, but I get more females than males in one group, or more older people than younger people?"

Well, that is going to happen. It's essentially guaranteed that your randomization won't be perfectly balanced; it's impossible to balance everything perfectly. But you don't know what to worry about, so to speak, when you randomize, so you just trust the randomization and you set your level of significance.

So, when you do a test and you get a P value, that P value is accounting for the fact that when you randomize, sometimes things will be unbalanced. So, when you get something significant, it's significant with that accounted for. Here I've got a significance level of 1%, that is, 0.5% in each tail. If you're really worried about the randomization being unbalanced and hurting you, well, you can be a little bit more aggressive in setting your cutoff, or your significance level.

You could make it 0.01 instead of 0.05, or something even lower than 0.01, and that way the risk that a really unbalanced randomization will hurt you is lower. You'll only reject when it's highly significant.
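As a small illustration of that cutoff choice, here is a sketch in Python with SciPy on simulated data (the numbers are made up): the same two-sample comparison judged against a conventional 0.05 level and a stricter 0.01 level.

```python
# Minimal sketch (simulated data): one comparison judged at two cutoffs.
# A stricter alpha means you only reject when the result is highly significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
treatment = rng.normal(0.4, 1.0, 50)     # hypothetical treatment outcomes
control = rng.normal(0.0, 1.0, 50)       # hypothetical control outcomes

t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"P value = {p_value:.4f}")
print("reject at alpha = 0.05:", p_value < 0.05)
print("reject at alpha = 0.01:", p_value < 0.01)
```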

You can still add covariates when you randomize. Most people don't when they've randomized, but statistically speaking, and I don't know if there are rules at the FDA about this, you can still add covariates. You can put age, or gender, or something in the model, even though it will hopefully be roughly balanced by the randomization.
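For what that could look like, here is a minimal sketch in Python with statsmodels on simulated data (the variable names are illustrative), where the randomized treatment indicator and a covariate such as age sit in the same model.

```python
# Minimal sketch (simulated data): a randomized treatment indicator plus a
# covariate (age) in one regression model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 120
treated = rng.permutation(np.repeat([0, 1], n // 2))    # 1:1 randomization
age = rng.normal(60, 10, n)
outcome = 2.0 * treated - 0.05 * age + rng.normal(0, 1, n)

df = pd.DataFrame({"outcome": outcome, "treated": treated, "age": age})

# Randomization should keep age roughly balanced across groups, but including
# it as a covariate can still tighten the estimate of the treatment effect.
print(smf.ols("outcome ~ treated + age", data=df).fit().params)
```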

Also, set your inclusion/exclusion rules appropriately, and before you collect the data, not after. You don't want to say afterwards, "Hey, we should have excluded 75 and over, because there are a couple in the treatment group that didn't do that well. Let's get those out." No. You probably already know you can't do that. So, set your exclusion rules so that you don't have really extreme patients in your study who might happen to fall in one group or the other and really skew the result.
