So let me demonstrate how you can use these negative controls to evaluate a specific study design.
So going back to that paper that I showed you earlier, studying isotretinoin and inflammatory bowel disease. The odds ratio that they reported
in that paper is shown here: an odds ratio of 4.36. And I'm plotting it in this graph, and I'm going to come back to this graph later on, so I need to explain it very carefully. On the x-axis is the odds ratio, the effect size, and on the y-axis is the standard error, which is basically the width of the confidence interval. And so the higher you are, the wider the confidence interval, and usually that means you have less data. More data, you go down; less data, you go up. A nice thing about plotting it like that is that you can draw this dashed line.
And that dashed line represents the boundary where p equals 0.05, and so everything below the dashed line is statistically significant, with a p-value smaller than 0.05. Of course, the odds ratio that this paper reported is statistically significant, because it's below the dashed line. Now we wanted to evaluate this design. But, of course, we didn't have their analysis code. We didn't have their data. So we tried to replicate it as best we could. We also have a U.S. health insurance database that we can use. And so I went through the paper line by line and implemented a design that I thought was exactly what they did. And I came up with this odds ratio. Our database was bigger than theirs.
So that's why the standard error is smaller. But as you can see, it is actually pretty spot-on, and our confidence interval is exactly within theirs. So I think I did a pretty good job of replicating that. The reason why I did that was, of course, that I wanted to add negative controls. So we came up with roughly 50 negative controls. These are drugs where anybody who looked at them would pretty readily agree that, yeah, that can't cause inflammatory bowel disease. And we ran the same design on the same data for those fifty negative controls.
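The dashed line in the plot described above can be sketched in a few lines. This is only an illustration, assuming the standard error is on the log odds ratio scale and that the p-value comes from a normal approximation; the function names and the example standard errors are hypothetical, and only the 4.36 comes from the talk.

```python
import math

Z_CRIT = 1.959964  # two-sided 5% critical value of the standard normal


def two_sided_p(odds_ratio, se_log_or):
    """Two-sided p-value for H0: odds ratio = 1 (normal approximation)."""
    z = math.log(odds_ratio) / se_log_or
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))


def boundary_se(odds_ratio):
    """Standard error at which an odds ratio sits exactly on p = 0.05."""
    return abs(math.log(odds_ratio)) / Z_CRIT


# 4.36 is the odds ratio quoted in the talk; the standard errors are made up.
for se in (0.3, 0.6, 0.9):
    below_line = se < boundary_se(4.36)
    print(f"SE={se}: p={two_sided_p(4.36, se):.4f}, below dashed line: {below_line}")
```

A point is below the dashed line exactly when its standard error is smaller than `boundary_se` for its odds ratio, which is the same as saying its p-value is under 0.05.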
So each one of those blue dots is an odds ratio estimate for one of these negative controls, and remember, we all believe that the true odds ratio should be one. So why is this method, or this specific design, saying that there are lots of odds ratios greater than one? And actually, almost all of them are below the dashed line, meaning almost all of them are statistically significant. Things like confounding. We're comparing cases to controls: people who have inflammatory bowel disease to people who don't have inflammatory bowel disease. Well, there are lots of differences between those people, not just the fact that they have the disease or not.
And so this method just does a bad job of adjusting for that component.
So that's pretty damning. Especially as you can see that the estimate that we had for isotretinoin is well within this cloud of negative controls. So if you think about a p-value, it tells you … can I reject the null hypothesis? Well, in this case, it's in the middle of things where the null hypothesis is true. So I actually can't. And so we came up with a way of formalizing that, computing a calibrated p-value which takes into account the information we have learned from the negative controls. And we can actually draw this orange area, which is where p is smaller than 0.05 after empirical calibration.
So you see, as you would expect, that this estimate that we had is no longer statistically significant after empirical calibration. Probably a lot of you are already thinking now, "hey, this guy works for pharma, right? He's making a bad effect go away." So just to be clear, I'm not saying that isotretinoin doesn't cause inflammatory bowel disease. Actually, we don't know. All I'm saying is that this specific design, as used on this data, cannot tell you one way or the other. So we can't reject the null, but that doesn't mean that the null is true. It's just that we can't learn from this specific study.
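One simple way to picture the calibrated p-value mentioned above is to fit an "empirical null" distribution to the negative control estimates and test the estimate of interest against that instead of against zero. The sketch below is an assumption, not necessarily the exact method from the talk: it uses a crude method-of-moments fit rather than a full likelihood fit, and every number in it except the 4.36 is made up for illustration.

```python
import math
from statistics import mean, variance


def fit_empirical_null(log_ors, ses):
    """Estimate mean and SD of the systematic error from negative controls.

    Method-of-moments simplification: the spread of the observed log odds
    ratios is the systematic spread plus the average sampling variance.
    """
    mu = mean(log_ors)
    sampling_var = mean([se ** 2 for se in ses])
    sigma2 = max(0.0, variance(log_ors) - sampling_var)
    return mu, math.sqrt(sigma2)


def calibrated_p(log_or, se, mu, sigma):
    """Two-sided p-value against the empirical null N(mu, sigma^2 + se^2)."""
    z = (log_or - mu) / math.sqrt(sigma ** 2 + se ** 2)
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical negative control estimates (log odds ratios and standard errors).
nc_log_ors = [0.9, 1.2, 1.5, 0.7, 1.1, 1.8, 1.3, 1.0]
nc_ses = [0.2] * len(nc_log_ors)

mu, sigma = fit_empirical_null(nc_log_ors, nc_ses)

# Estimate of interest: odds ratio from the talk, hypothetical standard error.
log_or, se = math.log(4.36), 0.15
p_traditional = calibrated_p(log_or, se, 0.0, 0.0)  # null centered on zero
p_calibrated = calibrated_p(log_or, se, mu, sigma)
print(p_traditional, p_calibrated)
```

With negative controls that are themselves shifted well away from an odds ratio of one, an estimate sitting inside that cloud gets a small traditional p-value but a large calibrated one, which is the behavior described in the talk.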
I wrote a paper about this process already a while back. So this is about one study, but there is actually another problem that I want to highlight. If we think about how a study happens: well, we start with an idea for a study. We then perform the study, and I just showed you that that can be problematic, and how bad that is. But then we submit the paper for publication and hopefully get it published. And we end up, hopefully, in PubMed.
But how well does that whole process work? That’s a little bit more tricky. We can’t really use a good gold standard there. But we can just look at the output, like what ends up in PubMed?
So what I'm showing in this plot is this. I went through PubMed and extracted all the papers on observational studies using a database, like the Taiwanese database, but of course also CPRD from the UK, or other observational databases. And from those PubMed abstracts, I automatically extracted all of the effect size estimates that were reported, including the p-values and confidence intervals. So every one of these dots is one of those estimates that we extracted from the literature. Of course, this is about all sorts of different questions, different exposures, different outcomes, and all sorts of different designs that were used on the data.
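To give a feel for what automatic extraction from abstracts could look like, here is a minimal sketch. The regular expression, the function name, and the example sentence are all hypothetical and far simpler than anything a real extraction would need; the standard error is backed out of the 95% confidence interval on the log scale so the estimate can be placed on the same kind of plot.

```python
import math
import re

# Matches strings such as "OR 2.10; 95% CI 1.40-3.15" (illustrative pattern only).
PATTERN = re.compile(
    r"\b(?P<kind>OR|HR|RR)\b\s*[=:,]?\s*(?P<est>\d+(?:\.\d+)?)"
    r"\s*[\(\[;,]\s*95%\s*CI\s*[=:,]?\s*(?P<lo>\d+(?:\.\d+)?)"
    r"\s*(?:-|–|to|,)\s*(?P<hi>\d+(?:\.\d+)?)"
)

Z_CRIT = 1.959964  # two-sided 5% critical value of the standard normal


def extract_estimates(text):
    """Return effect estimates found in text, with SE derived from the CI."""
    results = []
    for m in PATTERN.finditer(text):
        est, lo, hi = (float(m.group(g)) for g in ("est", "lo", "hi"))
        if not (0 < lo <= est <= hi):
            continue  # skip implausible matches
        se = (math.log(hi) - math.log(lo)) / (2 * Z_CRIT)
        results.append({"kind": m.group("kind"), "estimate": est,
                        "ci": (lo, hi), "se_log": se})
    return results


# Made-up sentence in the style of an abstract.
sample = "Exposure X was associated with outcome Y (OR 2.10; 95% CI 1.40-3.15)."
found = extract_estimates(sample)
print(found)
```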
But despite the fact that this is really a lot of different things in one plot, we can see a very clear pattern, I would argue. The pattern pretty obviously is that there's this big gap in the middle. We do not see many articles that talk about an effect size estimate that's not statistically significant. Now this could be because researchers are just really good at picking a research question, and they only pick the ones that they know beforehand will be statistically significant. But then still I would argue, well, I would like to know about all those things that are not statistically significant. This might be important to me as well, right?
I mean, I want to know when something is safe. But there's also this very suspicious sharp boundary at that 0.05. And that actually tells me, you know, it can't just be that they're good at picking questions. There's also a lot of publication bias,
meaning that people just only publish things that are statistically significant. Maybe because reviewers reject everything else. But also something called p-hacking, where you do a study, you find the results are not statistically significant, and you then go back and basically rethink the design of your study until you actually get a statistically significant result. And both of those are actually bad. Because of the multiple testing that you're doing but not actually reporting, you get a lot of false positives. And so in summary, looking at how well things work, I would say the performance of this observational research at an individual study level is pretty bad.
Because of the bias that we have, due to confounding, selection bias, and measurement error. But as a whole, it's even worse, because of publication bias and p-hacking.
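The point about unreported multiple testing can be made concrete with a small calculation. This is a rough illustration only: it assumes every analysis is an independent test of a true null at the 0.05 level, which repeated redesigns of the same study are not, so the numbers show the direction of the effect rather than its exact size. The function name is hypothetical.

```python
def chance_of_false_positive(n_analyses, alpha=0.05):
    """Probability that at least one of n independent null tests is significant."""
    return 1 - (1 - alpha) ** n_analyses


# How quickly the false positive risk grows with the number of attempts.
for n in (1, 5, 10, 20):
    print(n, round(chance_of_false_positive(n), 3))
```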