This content is taken from Bond University's online course, Data Analytics for Decision Making: An Introduction to Using Excel.
## Bond University

Skip to 0 minutes and 0 seconds Now, let's have a think about some different data. Think about a telecommunications company investigating the amount of the first bill for new customers. What size is the average bill for new customers? Is it too high? How are new customers faring? Let's see if we can get some information from this. Maybe we have 200 bills from recent new customers. So, for example, here's what it looks like: we've got 42.19 for the first, then 38.45, 29.23, etc. as we go down. So clearly it's not in yen, it's a bit small for that, but it looks like it's more in Australian dollars or American dollars. Now, this is not really looking like a pie chart, is it?

Skip to 0 minutes and 47 seconds Because there are different numbers, and how does that all work? It's not obvious. It's not really a bar chart either, because every number is different. So, interesting: we need a new approach here. One approach is that we could start to summarise this in terms of categories. So what if we asked how many bills are between, say, zero dollars and thirty dollars? How many are from 30 to 60, 60 to 90, and so on? Or we could go into more detail: how many between 0 and 15? Well, we can't see any here. But how many between 15 and 30? That 29.23 is in there. How many between 30 and 45? Well, the 42 and the 38 are in there, okay?

Skip to 1 minute and 39 seconds So we can start to get frequencies, and that's one way to summarise this. Obviously there were 200 bills, so there's more information here; I was just showing the top of the list. But here we've got classes, so we're saying everything from 0 up to 15, then from 15 up to 30, 30 up to 45, etc. And in Excel, by convention, we call these bins, and we give the upper limit of each: the 15, the 30, the 45, the 60, et cetera. So there are 71 bills from 0 to 15 and 37 from 15 to 30. Now have a look at those frequencies. We could graph those frequencies, couldn't we? And that would look something like this.
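The binning step described here can also be sketched outside Excel. The Python below is only an illustration: it uses the six bill amounts quoted in the video rather than the full sample of 200, the name `frequency_table` is hypothetical, and it assumes that a bill exactly equal to an upper limit belongs to that class.

```python
from bisect import bisect_left

# The six bill amounts quoted in the video; the real sample has 200 bills.
bills = [42.19, 38.45, 29.23, 42.23, 67.25, 14.00]

# Upper limit of each class ("bin" in Excel terms), in dollars.
bin_upper_limits = [15, 30, 45, 60, 75]


def frequency_table(values, upper_limits):
    """Count how many values fall into each class.

    Assumption: a value equal to an upper limit is counted in that class,
    so each class excludes its lower limit and includes its upper limit.
    """
    counts = [0] * len(upper_limits)
    for value in values:
        # Index of the first upper limit that is >= value.
        index = bisect_left(upper_limits, value)
        if index == len(upper_limits):
            raise ValueError(f"{value} is above the last upper limit")
        counts[index] += 1
    return counts


frequencies = frequency_table(bills, bin_upper_limits)
for upper, count in zip(bin_upper_limits, frequencies):
    print(f"up to {upper}: {count}")
```

With only six bills the counts are tiny, but the same function would work unchanged on all 200.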

Skip to 2 minutes and 22 seconds See, we can see the 71: the bar is just above 70 on the frequency axis for the 0 to 15 class. Then we can see the 37, and now you can see there's a pattern here. This starts to give you more information, and I'll direct you to a few things here. Notice the chart title, 'Histogram of first bill size'. This is a histogram, and it is great for numeric data. We break the data into categories and then we plot the frequencies, and that gives us a histogram.
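To make "plot the frequencies" concrete without a charting tool, here is a rough text-only sketch in Python. The function name `text_histogram` is hypothetical, and the frequencies are the counts for just the six bill amounts quoted in the video, binned in 15-dollar classes, not the course's full table.

```python
def text_histogram(upper_limits, frequencies, title, x_label, y_label):
    """Return a plain-text histogram with one row of '#' marks per class."""
    if len(upper_limits) != len(frequencies):
        raise ValueError("need exactly one frequency per class")
    rows = [title, f"{x_label} | {y_label}"]
    for upper, count in zip(upper_limits, frequencies):
        # One '#' per observation, followed by the count itself.
        rows.append(f"{upper:>5} | {'#' * count} ({count})")
    return "\n".join(rows)


chart = text_histogram(
    upper_limits=[15, 30, 45, 60, 75],
    frequencies=[1, 1, 3, 0, 1],  # six quoted bills only (illustrative)
    title="Histogram of first bill size (six quoted bills only)",
    x_label="First monthly bill (upper limit of class)",
    y_label="Frequency",
)
print(chart)
```

The title and both axis labels are passed in explicitly, mirroring the point that a histogram should tell the reader what each axis means.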

Skip to 2 minutes and 57 seconds Notice my y-axis: I have it nicely labelled 'frequency', and my x-axis I have labelled 'first monthly bill'. And notice I'm telling the reader what the information is here: it's the upper limit of the class. It's not that there were 71 bills that were exactly 15 dollars, no no no, so I've told the reader it's the upper limit of the class. Sometimes you might want to write 0 up to 15, 15 up to 30, etc. So what kind of interpretation would we be looking at here? Well, if we break this down, we can see that a huge number of bills are very small, in fact about half.

Skip to 3 minutes and 44 seconds And also there are a few bills in the middle range, but not many: between 30 and 75 there are not many bills, are there? But there's actually a relatively large number that are quite high, in that top area, above 75. So notice what the histogram tells us. This is a great example of data going to information. The data says a bill was 42.23, the next bill was 67.25, the next bill was fourteen; that doesn't tell you anything. But seeing the histogram tells you something: lots of bills are small, but some customers are having really large bills initially. Are they going to leave us? Why is that?

Skip to 4 minutes and 30 seconds Maybe we want to investigate more to see if we need to do something for them. So frequency is one thing, but there's another thing called relative frequency. Sometimes we don't so much care that there are 71 bills; we care that about 36% of bills are in the first category. So it's not the absolute number, the frequency, but the relative frequency that we care about. And relative frequency is simply the frequency, the 71 for the first category, divided by the total number of observations, so 71 out of 200.

Skip to 5 minutes and 11 seconds So it tells us in percentage terms how many, and this can sometimes give us an insight into the whole population of bills, given that we only have a sample of 200. We don't care about the 71 as such, because in fact there are thousands and thousands of bills. But we might say, oh, it looks like about 36 percent are in that 0 to 15 category. You can also compare histograms based on different sample sizes, being careful to consider whether that's appropriate, because you compare in percentage terms rather than absolute numbers. Somebody else might have taken a sample of 300 bills, so they're obviously going to have different numbers in the categories.

Skip to 5 minutes and 56 seconds So what does this look like? Exactly the same as before, but notice now we have relative frequency. So we can see that 36 percent. How is that calculated? 71 divided by 200, which is 0.355. Now let's just do some simple logic checks here. Firstly, we should have already checked that all the frequencies add up to 200. And all the relative frequencies have to add up to 100%, or 1. Exactly, so those are some great logic checks. So, let's have a look at how this looks. Oh, exactly the same. What's the difference? Ah, the only difference here: look at the y-axis, relative frequency.
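The relative frequency calculation and the two logic checks can be sketched in a few lines of Python. This is an illustration only: `relative_frequencies` is a hypothetical helper, and apart from the 71 out of 200 stated in the video, the counts used are those of the six bill amounts quoted earlier, not the course's full table.

```python
import math


def relative_frequencies(frequencies):
    """Divide each class frequency by the total number of observations."""
    total = sum(frequencies)
    if total == 0:
        raise ValueError("no observations")
    return [count / total for count in frequencies]


# The figure from the video: 71 of the 200 bills are in the first class.
first_class_share = 71 / 200
print(f"{first_class_share:.1%}")

# Counts for the six quoted bills in 15-dollar classes (illustrative only).
frequencies = [1, 1, 3, 0, 1]
shares = relative_frequencies(frequencies)

# Logic check 1: the frequencies add up to the number of observations.
assert sum(frequencies) == 6
# Logic check 2: the relative frequencies add up to 1 (that is, 100%).
assert math.isclose(sum(shares), 1.0)
```

`math.isclose` is used for the second check because sums of floating-point fractions can differ from 1 by a tiny rounding error.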

Skip to 6 minutes and 46 seconds So now we can see it's zero, zero point zero five and so on: it's not the actual number, it's the relative frequency. So the only thing that changes is the y-axis there. So this was all well and good, but how did we know we should have done zero to fifteen, fifteen to thirty, thirty to forty-five? How do we know that those are the categories that we want? This is actually an art, not an exact measure. The number of categories depends on the size of the data set, how many observations you want in each category and, more importantly, business sense. What do you care about as a manager? What are the categories that you might care about?

Skip to 7 minutes and 32 seconds If we're looking at people's ages, we often group people by decade, say people in their 20s, so we do twenty to twenty-nine, thirty to thirty-nine, forty to forty-nine. For bills, maybe we care about fifteen-dollar increments; historically that's a key sort of amount. So it's about using a lot of business and common sense, but just remember that there's a general recommendation to have between five and fifteen categories. Because if you have fewer than five, you're not really showing everything; you're showing everything in one big lump. Well, that doesn't show much about the shape or anything, does it?

Skip to 8 minutes and 4 seconds And if you have more than fifteen categories, it's very hard for our minds to conceptualise what's going on. Sometimes you might need more than fifteen, but generally it's a good idea to be between five and fifteen. Notice how it's not a hard and fast rule; it's a bit of an art form, getting these graphs right. So generally between five and fifteen. Sometimes you will need more, and maybe sometimes you only need four, but if you have to go outside this range you're more likely to need more than fifteen.
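As a rough way to apply the five-to-fifteen guideline, the Python sketch below counts how many classes a given class width produces and checks the result against the guideline. Everything here is an assumption for illustration: the helper names are hypothetical, and the range of 0 to 120 dollars is made up, since the video does not state the largest bill.

```python
import math


def class_count(lowest, highest, width):
    """Number of equal-width classes needed to cover lowest..highest."""
    if width <= 0 or highest <= lowest:
        raise ValueError("need width > 0 and highest > lowest")
    return math.ceil((highest - lowest) / width)


def within_guideline(classes, minimum=5, maximum=15):
    """True if the class count is inside the five-to-fifteen guideline."""
    return minimum <= classes <= maximum


# Hypothetical range: bills from 0 to 120 dollars (not from the video).
for width in (30, 15, 5):
    classes = class_count(0, 120, width)
    print(width, classes, within_guideline(classes))
```

A result outside the guideline is a prompt to reconsider, not a failure; as the video says, business sense about which categories matter takes priority over the rule of thumb.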

Skip to 8 minutes and 41 seconds So let's just summarise a histogram here. We collect the data and prepare a frequency distribution or relative frequency distribution; that was just that table. Then we can draw the histogram or the relative frequency histogram. If we care about the absolute numbers, we'll just use a normal histogram; if we care more about the proportion, the percentage in each category, we use the relative frequency histogram.

# Histograms

While pie and bar charts are useful for visualising data where only a few values are possible (such as hair colours or product names), we need a different tool for continuous data where many values are possible.

This is where histograms are useful. In this video, Adrian shows their use with the example of phone bills, but I’m sure you can think of other examples as well.

Share in the comments below some other examples of histograms you can think of or are familiar with.

Next, we’ll look at two final graphical techniques for inspecting two variables and the relationships between them. These are line plots and scatter plots…