Let’s get to the specifics. What should we do? And here there are a number of very specific items that maybe we can take something away from. I do like the book in one sense in that Tufte at a couple of different points says, “Let’s not be dogmatic.” I’m gonna kind of talk about some principles and things you should do. But, if it looks really bad in your case or doesn’t make sense, don’t do it. It’s just that, here are some principles to keep in mind. Data is most effectively presented when you get rid of extraneous stuff, extra stuff. So you know a summary way of thinking about things is, we want to minimize chart junk. Just like crap that’s in the figure.
So what are some examples? Gridlines are an example. A lot of people’s figures have tons of gridlines. So here he illustrates it with Playfair. So remember Playfair, this famous British economist who kind of innovated in the space of data visualization. 1785, the year before his great commercial atlas that we talked about before. This is how he was presenting the trade balance and you can see it’s pretty good. You see the lines, it’s pretty clear what’s going on. But what you’ll see immediately, there are tons of these really dark gridlines all over the figure and they’re pretty prominent. This line is thicker than the line we care about, actually here.
So, the data in some ways is obscured by the grid. Now, there’s probably a good reason that he did this. People are drawing these things by hand at this point and he wants to have accurate data. So he has a grid and he’s like tracing out the points and then connecting them, so there may have been some rationale for this. But, even by 1786 the next year, he’s presenting something very similar. Again a trade balance figure, and it’s so much better a figure. Just aesthetically. Okay, there are still some gridlines but there are fewer gridlines. They’re much lighter, they’re much thinner and the things that are prominent here are the data. The data is what’s prominent in the figure.
So you know, the same person over time, he kind of learned and iterated and revised to come up with a much more effective, clear presentation of data. And this is something we can all learn from. Like if you need gridlines, one of the things that Tufte says is, “Make them really light and grey and don’t let them swamp the data.” The data has to be front and center in any of our presentations of data.
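If you want to see what that gridline advice looks like in practice, here is a minimal sketch in Python. It assumes matplotlib is what you plot with, which is my assumption and not something from the book, and the function name `plot_with_light_grid` and the grey shade and line widths are just illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt


def plot_with_light_grid(x, y):
    """Draw a line chart where the grid stays in the background."""
    fig, ax = plt.subplots(figsize=(6, 3.5))
    # Thin, pale grey gridlines: there for reference, not competing with the data.
    ax.grid(True, color="0.85", linewidth=0.5)
    ax.set_axisbelow(True)  # draw the grid underneath the data
    # The data line is the darkest, thickest mark in the figure.
    ax.plot(x, y, color="black", linewidth=2.0)
    return fig, ax


# Placeholder values, invented purely for illustration.
x = list(range(10))
y = [3, 4, 4, 6, 5, 7, 9, 8, 10, 12]
fig, ax = plot_with_light_grid(x, y)
```

Calling `fig.savefig(...)` on the result would write the figure out. The only real design choice here is that the grid is lighter and thinner than the data, so the data is what’s prominent.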
That’s one thing. Another thing that you see and you see this a lot in people’s figures is, people use excessive decoration and color and all these really annoying patterns. In Excel or in a lot of statistical programs there are these, sort of like, defaults. So if I have 5 categories of a variable or something, the default is, one is checkered and one is stripes and one is crosshatch and whatever. And people just go with the default, because it’s kind of useful. But a lot of those patterns are just terrible. They’re visually terrible and they really distract from the data. So he gives an example from a long time ago.
This is from the 1920’s, but it was just the ultimate example. You can’t blame computer software for this, this is 1927.
Do you really need these crosshatches here? What is this, like a Formula 1 race? Like, why do I need this starting flag here? All these colors, there are labels and colors and crosshatching. And then, of course, there’s the pie chart, where you can’t visually compare these quantities to each other but it kind of looks pretty and it’s circular. So there is all this, potentially a lot of data here, but it’s very hard to absorb.
Another thing to avoid, avoid computer abbreviations or other unintelligible jargon in figures. Sometimes people are too lazy and they just leave the variable name from their raw data set in a figure. Or, they use some other abbreviation that means something to them.
It’s like shockingly common! Make your figures intelligible to people, don’t do that. That’s chart junk, according to Tufte. The other thing that chart junk does is it fills in space when there is not much data. So, it’s sort of like, when there isn’t that much data the space is filled in with crap. And in this case it’s better to just have a simple table. If you only have a limited number of things – observations, data points. Better just to put it in a simple table. People can digest 5 or 10 data points. So this is an example, and again, this is some of Tufte’s words on top.
He says, “A series of weird 3-dimensional displays appeared in the magazine American Education in the 70’s, delighting connoisseurs of the graphically preposterous.” Here 5 colors report, almost by happenstance, only 5 pieces of data since the top and the bottom are just mirror images of each other. One minus the other. So there’s absolutely no extra data there. You could cut this in half and not lose any data. He says, “This may well be the worst graphic ever to find its way into print.” So again, I had to put what he said was the best one and the worst one. So here you could have a simple table, like in 1973 it was this and now it’s that.
You could probably summarize this pretty well with 2 numbers. Like, that number in 1973 and that number in 1976. So this is truly terrible, don’t do this.
Reduce the use of colors when possible. 5-10% of the population is either color blind or color deficient and you don’t really need the color differences most of the time. You can use line thickness, you can use greyscale, you could use other things. The other thing is, with some exceptions, and Tufte talks about this, but most colors lack a natural hierarchy. Some do, there are these kind of heat map type plots from blue to green to yellow to red. Where your eye’s immediately drawn to some kind of hierarchy. You know, these pollution maps you kind of know the red areas aren’t that good. There’s a lot of pollution there, and your eye is drawn to that.
But if it’s yellow versus blue versus red versus green in no ordered way, what’s the point? And in this case, it may be much better to use shades of grey or shades of blue or something that’s pretty easy to see in black and white. Even people who are color deficient can kind of see.
Okay, add more data into your graphic. Make your graphic more data rich. There are a lot of ways of doing this. You know, sometimes we have a scatterplot. And we could easily put a symbol or an abbreviation denoting different types of observations, and that adds richness to the figure. You know, sometimes people put country observations. Sometimes you might be able to put symbols.
Let’s say you had 500 people and they were different ages or different genders. You might be able to incorporate that into the figure. And if you abstract away from that you’ll still get the scatterplot you had before. But, you might see some interesting patterns. Others may notice patterns for some groups. It’s pretty easy to do, it makes your graphic much more data rich. So we should try to do that when possible.
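Here is a rough sketch of that kind of data-rich scatterplot, again assuming matplotlib. The 500 people, the two group labels "A" and "B", and the helper `grouped_scatter` are all made up for illustration. The groups are told apart by marker shape and shade of grey, not by unordered colors, so the figure still reads in black and white.

```python
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt


def grouped_scatter(points, styles):
    """Scatter (x, y, group) points, one marker shape and grey shade per group.

    points: list of (x, y, group) tuples
    styles: dict mapping group -> (marker, grey shade as a string in "0".."1")
    """
    fig, ax = plt.subplots(figsize=(5, 4))
    for group, (marker, shade) in styles.items():
        xs = [x for x, _, g in points if g == group]
        ys = [y for _, y, g in points if g == group]
        ax.scatter(xs, ys, marker=marker, color=shade, s=18, label=group)
    ax.legend(frameon=False)
    return fig, ax


# Hypothetical data: 500 simulated people in two invented groups.
rng = random.Random(0)
people = []
for i in range(500):
    x_val = rng.gauss(0, 1)
    y_val = x_val + rng.gauss(0, 1)
    people.append((x_val, y_val, "A" if i % 2 == 0 else "B"))

# Dark circles versus mid-grey triangles.
styles = {"A": ("o", "0.15"), "B": ("^", "0.6")}
fig, ax = grouped_scatter(people, styles)
```

If you ignore the shapes you still have the same scatterplot as before, which is the point: the group information comes along for free.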
In some ways, as we bring more and more data, and data points have classifiers and other things, we create something that may look a little bit like a hybrid figure table. And maybe that’s kind of the ideal. So this is an example. A famous 1919 plot, where each data point here, every entry is the number of an army unit for the U.S. army that went to Europe during WW1. Month by month. You can immediately see a bunch of things. You can see the number of units that were in Europe in each month, that’s the height. You can see how long each unit was in Europe, that’s the sort of width of the figure.
So, between when they show up and the end of the war. And you know which ones were there. So this is like an incredibly simple figure. It takes up very little space, it’s incredibly data rich and basically every item here is data.
Replace the full axis going to the origin and back out with the data range. So, let’s say the axis goes from 0-100 but all my data is between 20-70, only plot the axis from 20-70. Now, all of a sudden, you’ve taken something that had no data, and now it is data, it tells you the max and the min. Like, how much did you lose by doing that? Actually you’ve gained. Now we can see the max and the min in the figure, everybody knows where they would intersect at 0. But, who cares? Like, that isn’t data, there’s nothing new there. That’s one way to do it.
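One way to sketch that data-range idea, assuming matplotlib: trim each axis line so it only spans the min and max of the data, and put the tick labels at those two values. The helper name `range_frame` and the numbers are hypothetical; `set_bounds` on a spine is the matplotlib call that does the trimming.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt


def range_frame(ax, x, y):
    """Shrink the axis lines so they cover only the range of the data."""
    # Drop the box edges that carry no information.
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    # Each remaining axis line now runs from the data minimum to the data maximum.
    ax.spines["bottom"].set_bounds(min(x), max(x))
    ax.spines["left"].set_bounds(min(y), max(y))
    # Label the ends, so the axis itself reports the min and the max.
    ax.set_xticks([min(x), max(x)])
    ax.set_yticks([min(y), max(y)])


# Hypothetical data lying between 20 and 70 on the x scale.
x = [20, 28, 35, 41, 50, 58, 63, 70]
y = [3, 5, 4, 8, 6, 9, 7, 11]
fig, ax = plt.subplots(figsize=(5, 3.5))
ax.plot(x, y, "o", color="black")
range_frame(ax, x, y)
```

You could keep a few intermediate ticks as well if readers need them. The sketch uses just the two endpoints to make the point that the axis line is now carrying data.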
The other thing is, there may be simple ways of portraying the distribution of the data on the axes. Here’s a scatterplot, and this would be “x” and “y”. Whatever, we may care about that for some reason. But what he has here are tick marks wherever there’s data. So now I have the univariate distribution of “x” and the univariate distribution of “y”, and I can just see it visually.
Kind of a very similar point, a related point, is to try to integrate graphics, data and text together. Kind of more broadly. You know, famous early scientists, or artist-scientists, like Leonardo da Vinci, Galileo, etc. All their notes were littered with graphics, and they integrated them throughout.
There would be writing, and they would integrate the graphics with what they were talking about. And there was no distinction between, here’s the text and here’s the figure, the way we often write our papers. The early scientists integrated these things and that was the best way to convey their ideas. And the question is, can we do this again? So here’s a couple of examples, and this is where I’m gonna get to some recent figures. So one thing that Tufte advances in this sort of second edition of his book is what he calls “sparklines.” But you know it doesn’t have to be called that or anything else necessarily. But this is sort of word-size bits of data.
He’s basically pushing for integrating little bits of data and text, or little bits of data that are sort of the same scale as a word. So very, kind of simple idea. But if I’m a physician, let’s say this is a screen, I get a lot of summary information here. I care about this patient’s glucose, breathing, their temperature. I get their current value, I get their last 12 hours here, and if the shaded area is the normal range, I get data that they’re out of the normal range. So this is incredibly tight, there’s a lot of data here in a very small amount of space. So this is a pretty effective display of data.
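To make the word-size idea concrete, here is a small sketch in plain Python that turns a series into a one-line text sparkline using Unicode block characters. This is my own stand-in for illustration and not how Tufte draws them. The function names, the readings, and the normal range below are all invented.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest


def sparkline(values):
    """Map each value to one of eight block characters, scaled to the series range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series: every point gets the lowest bar.
        return BARS[0] * len(values)
    top = len(BARS) - 1
    return "".join(BARS[round((v - lo) / span * top)] for v in values)


def summary_line(label, readings, normal_lo, normal_hi):
    """Label, recent history as a sparkline, current value, and an out-of-range flag."""
    current = readings[-1]
    flag = "" if normal_lo <= current <= normal_hi else " (out of range)"
    return f"{label} {sparkline(readings)} {current}{flag}"


# Hypothetical readings for the last 12 hours; the normal range is also made up.
glucose = [92, 95, 101, 98, 110, 123, 131, 128, 140, 152, 149, 158]
print(summary_line("glucose", glucose, 70, 140))
```

Each line is about the size of a short phrase, so you can stack several of them, or drop one into a sentence, and get the history, the current value, and whether it’s out of range at a glance.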
Now, this may not be exactly what we put in our research papers all the time, but maybe it is? And you see some of this in some papers. You know, sometimes when people use little bite-size bits of data on a page, 50 plots together, that’s kind of in the same spirit as this. It’s really a way to show the data in a very transparent way and you may see patterns there that you didn’t realize existed. What too few of us do with our figures and our tables, is sit down critically with them and edit them the way we edit text. And that’s really weird because there are a lot of people who look at our tables and our figures.
And a lot fewer people who actually carefully read every word of an article, at the end of the day. Especially with the rise of graphical abstracts, and with summary figures being the things that get circulated and that people look at. Summary tables and summary figures. We should be spending as much time iterating on and revising our tables and our figures as we do our text, but we don’t. We obsess over words and section 5.2. And spend an hour reworking a paragraph, but rarely, unfortunately, do many social scientists sit down for hours and obsess over every detail of their figures. Some people do, but a lot of people don’t.
And the evidence that a lot of people don’t is, that in almost any seminar you’re in you can look at a figure and immediately think of 3 or 4 ways to improve it.