3.13

## University of California, Berkeley

There really is a feeling among social science researchers, or traditionally there was a feeling, that figures were kind of superfluous. There was kind of this sense of like, “Well, yeah, let’s put together some simple summary figure.” Kind of something trivial, or something simple. Not among all researchers, but there is this sense floating around in a lot of social science research. Like, you know, the serious people look at tables and the less serious people look at figures. Or something like that. Or, can only understand the figure. So, I think that’s an idea that one should push back against. Precisely because graphics and figures can illustrate patterns you can’t get in a table.

In fact, sometimes they’re the only way to understand certain relationships. So here are four different data sets. And these are just small numbered data points. You see the “x” variable. You see the “y” variable. It turns out these four data sets all have the same mean for the right-hand-side variable. All have the same mean for the dependent variable. They all basically have very similar statistical properties in terms of the best regression fit.

Really, all these sorts of diagnostics from a regression look exactly the same. Obviously, they’re constructed to be exactly the same for these four data sets. The correlation between the variables is the same etc. You could look at this. And you know, I can look at this “x” and “y” data till I’m blue in the face and I’m not gonna really get much of a sense of what this data looks like. But if I were to look at these and say, “Hey! All of these have the same properties. All these moments of the distribution and the fits are exactly the same.” It’s kinda like the same data, right? Maybe. Right? That would be our intuition statistically.

Well, it turns out these are the four data sets. 1, 2, 3 and 4. One of them looks like “real data.” You know? Not a very good fit, but there’s maybe a linear relationship. Some of these are totally “rigged” data. With really strange properties, where it’s not even clear what the right fit is. Or what the right model is that would generate this kind of data. But they have the same basic statistical properties. Now this is just an example where plotting your data, just the scatter plot, is gonna tell you immediately what the right sorts of specifications would be and tell you a lot about the data.

You may be able to identify outliers, you might be able to do a lot of things just by plotting the data. You know, even though people are better today than 5 or 10 years ago at visualizing and plotting their data, people very often don’t do this in an effective way. They start analyzing the data and running regressions before they’ve done some of the basic diagnostics on the data that they’re working with. This is an example of where it could really fail. This is already 40 years old and people are still tweeting about it. So it’s gotta be pretty good. There’re some really good real-world examples. And this was emphasized also in the Tufte book if people remember.

Dr. John Snow, extremely famous physician, British physician. No relation to the “Game of Thrones” character. At the time that he was a physician, there were regular outbreaks of cholera and other infectious diseases in London and other big cities. And, physicians were worse than useless, in the sense that they had a whole bunch of theories that were totally wrong. And led people to be much sicker than they would have been if there hadn’t been physicians. That’s probably true until maybe 1910. People have tried to look at the date at which the median physician actually improved people’s health, rather than made them sick. So they’ve had maybe 100 years that they’ve been on the positive side of the ledger.

But he was one of the good ones and trying to figure out what was going on. So at that point, people, these brilliant physicians, thought that cholera was spread by bad air, and miasma and all these other things with no basis in science. And they didn’t understand that a lot of these infectious diseases were waterborne. So, what did John Snow do? There was an outbreak of cholera in a very populous neighborhood of London. He decided to plot the data on a map. So, he plotted the data. He got the data and all these dots are where people died of cholera, and then he plotted various other things on here. Various locations and water pumps.

He just decided to plot stuff. And he found this completely striking pattern, which is, the cholera cases were clustered around this one water pump on Broad Street. Right around here. The people who lived near the other water pumps weren’t dying of cholera. It turns out he went and talked to people here. Because there were some people dying over here. A number of the people in these outlying areas who died were people who either went to school or worked in this area and drank from that pump. Other people were people who just liked the taste of the water from that pump and would go there from farther away. So, he created... He basically did modern epidemiological research.

Or something approaching modern research, and identified this incredibly strong correlation between drinking from this water source and getting sick from cholera. Of course, nobody believed him, that it was water-related. It took decades for others to buy it. But it turned out that this water pump had been dug at a very shallow depth right near an old cesspit. Like an old latrine.

And, the cesspit had sort of been covered up and sort of forgotten about. When they dug the well they didn’t go deep enough to avoid it. And basically, it was leaking fecal matter into the water source. And that’s what was killing people. Today it seems obvious that that was the source. But he discovered this relationship. Within a couple decades, this insight had caused major cities around the world to completely redo how they structure their water system and their sewage systems. So this was research that not only launched a major intellectual field, but this research saved... You know, if this research speeded up the adoption of good hygiene practices and sewage practices, it might have saved millions of lives.

This is maybe some of the most influential research, you know, we’ve seen. So Tufte writes, “Of course, the link might have been made statistically.” Maybe he could have gotten data on the distance from different water pumps or locations and run a regression. Sort of figured out there was some sort of correlation, with some good luck. But here, it really feels like the graphical analysis sheds light on the data much more effectively than statistics. Because, he may never have even realized the link with the water pumps if he hadn’t, sort of, visualized lots of data. He probably plotted schools and churches and all this other stuff and said, “Wait a minute, looks like this water pump.”

Then he started doing interviews and he sort of built on it. So, this is a case where statistics aren’t enough. And another thing that I think is effective, is first of all that statistical analysis wasn’t that developed at that time. You know, modern statistics emerges after this point. So it’s not even clear that the statistical tools were there. There weren’t even computers to do the analysis. How would he have done the statistical analysis? He kind of had to do a graph. Today we have some hope. There’s something very powerful about simple graphics or, I don’t want to say simple. There’s a ton of data on that graphic, so it’s not simple.

But there is something powerful about graphics that makes them immediately intelligible to people. So, even government officials then who didn’t know statistics, could see the relationship and understand it immediately. In a way that a statistical relationship might have been much harder for them to grasp. So, this is a particularly effective way of understanding this important relationship.

# How graphics reveal data

The power of visualized data cannot be overstated. Well-designed graphics can be much more effective than extensive tables at communicating patterns and big-picture ideas, not only to other scientists, but also to policymakers and other laypeople. In “The Visual Display of Quantitative Information”, Dr. Tufte presents many examples of how figures can convey important messages. In this video, Dr. Miguel goes through two of them. The first shows how different datasets with nearly identical summary statistics can look wildly different when plotted. The second example illustrates how a real-world problem – cholera in 19th-century London – was addressed after a particularly effective visual presentation of the disease’s geography.
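
To make the first example concrete, here is a minimal Python sketch that computes the summary statistics the video refers to (means, regression slope and intercept, correlation) for four small datasets. It assumes the four datasets shown in the video are Anscombe's quartet (1973), which is not named in the transcript; the values below are the commonly published ones for that quartet. The names `QUARTET` and `summarize` are illustrative only.

```python
from math import sqrt

# Assumption: the four datasets in the video are Anscombe's quartet (1973).
# Sets I-III share the same x values; set IV has its own.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, least-squares slope/intercept, and Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": sxy / sqrt(sxx * syy),
    }


# Print one row of summary statistics per dataset.
for name, (x, y) in QUARTET.items():
    s = summarize(x, y)
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
          f"slope={s['slope']:.2f} intercept={s['intercept']:.2f} r={s['r']:.2f}")
```

All four rows should come out nearly identical, which is exactly why the numbers alone cannot distinguish the datasets. Passing each `(x, y)` pair to a scatter-plot function from any plotting library would show how different they actually are.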