Beyond 95%

Learn about confidence intervals other than 95%

So far we’ve generally used 95% for confidence intervals. However, this is only a convention, and we can choose any confidence level we want. We’ve seen that the values for error displays are calculated independently of the plots, so to change how the error bars are drawn we just change how we calculate their sizes.

Do note that we will not go into too much detail on how to change the confidence interval. You should be able to adjust it by tweaking the function you’re using to calculate it.

Plotting with mean and standard error

For example, in our average MPG plots, we can switch to an 80% confidence interval by changing one argument to our interval function call.

Code:

series_names = []
means = []
errors = []

confidence = 0.80

Next, we filter the MPG data to a specific origin, as we have done before.

Code:

for origin in mpg["origin"].unique():
    mpg_for_origin = mpg[mpg["origin"] == origin]

Then we calculate the mean, count, and standard error of the data:

Code:

    mean = mpg_for_origin["mpg"].mean()
    count = len(mpg_for_origin)
    std_error = mpg_for_origin["mpg"].sem()

Next, calculate the confidence interval using the code snippet given here. Because we set confidence to 0.80 earlier, this gives an 80% interval in this example.

Code:

    ci = st.t.interval(confidence, count - 1, loc=mean, scale=std_error)
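
The fragments above can be assembled into a single loop. The following is a minimal sketch rather than the course's exact notebook code: it uses a small made-up DataFrame in place of the real mpg data, and it assumes the error bar size stored in errors is the half-width of the interval (ci[1] - mean), which suits Matplotlib's yerr argument for symmetric bars.

```python
import pandas as pd
import scipy.stats as st

# Made-up stand-in for the course's mpg DataFrame (illustrative values only).
mpg = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "usa",
               "japan", "japan", "japan",
               "europe", "europe", "europe"],
    "mpg": [18.0, 15.0, 22.0, 20.0,
            31.0, 28.0, 33.0,
            26.0, 29.0, 24.0],
})

series_names = []
means = []
errors = []

confidence = 0.80

for origin in mpg["origin"].unique():
    mpg_for_origin = mpg[mpg["origin"] == origin]
    mean = mpg_for_origin["mpg"].mean()
    count = len(mpg_for_origin)
    std_error = mpg_for_origin["mpg"].sem()

    # t-based interval with count - 1 degrees of freedom.
    ci = st.t.interval(confidence, count - 1, loc=mean, scale=std_error)

    series_names.append(origin)
    means.append(mean)
    # Assumption: store the half-width so yerr draws symmetric bars.
    errors.append(ci[1] - mean)
```

Changing confidence to another value, such as 0.95, is the only edit needed to redraw the bars at a different level.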

Lastly, let us draw the plot, setting up the figure, axes, and labels.

Code:

fig, ax = plt.subplots()
fig.set_size_inches(8, 8)
ax.bar(series_names, means, yerr=errors, facecolor="lightgreen", ecolor="red", capsize=3)
ax.set_xlabel("Origin")
ax.set_ylabel("MPG")

Output:

Screenshot of a bar chart with short red error bars on top of each bar, showing errors other than 95% confidence intervals. Y-axis labelled "MPG" reads 0, 5, 10, 15, 20, 25, 30. X-axis labelled "Origin" reads usa, japan, europe. The "usa" bar goes up to 20, the "japan" bar goes up to 30, and the "europe" bar goes up to between 25 and 30.

Do you notice that the error bars are smaller than in the previous plot?

This is what we would expect: we are now only 80% confident that the true mean lies between those points, rather than 95% confident as before, so the interval can be narrower.
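
To see where the narrowing comes from, it can help to compute both intervals from the same inputs. The sketch below uses hypothetical summary statistics (they are not taken from the MPG data) and the same st.t.interval call as above. The interval is the mean plus or minus a critical t value times the standard error, and that critical value is smaller at 80% than at 95%.

```python
import scipy.stats as st

# Hypothetical summary statistics, chosen only for illustration.
mean = 20.0
std_error = 0.5
count = 50

ci_80 = st.t.interval(0.80, count - 1, loc=mean, scale=std_error)
ci_95 = st.t.interval(0.95, count - 1, loc=mean, scale=std_error)

half_width_80 = ci_80[1] - mean
half_width_95 = ci_95[1] - mean

# Critical t values for two-sided intervals at each confidence level.
t_crit_80 = st.t.ppf(1 - (1 - 0.80) / 2, count - 1)
t_crit_95 = st.t.ppf(1 - (1 - 0.95) / 2, count - 1)

print(f"80% half-width: {half_width_80:.3f} (t = {t_crit_80:.3f})")
print(f"95% half-width: {half_width_95:.3f} (t = {t_crit_95:.3f})")
```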

Plotting with mean, standard error, and standard deviation

We could also just plot the error bars using the standard deviation of the data. This assumes the data is normally distributed though, which it probably won’t be, so we’ll just show it as a demonstration here.

Be sure to follow the steps on your Notebook.

Step 1

We will begin just the way we did previously with the following code.

Code:

series_names = []
means = []
errors = []

confidence = 0.80

Step 2

The standard deviation can be calculated from the standard error of the data, which is calculated with the sem method.

Code:

for origin in mpg["origin"].unique():
    mpg_for_origin = mpg[mpg["origin"] == origin]
    mean = mpg_for_origin["mpg"].mean()
    count = len(mpg_for_origin)
    std_error = mpg_for_origin["mpg"].sem()
    # Standard error = sd / sqrt(n), so sd = standard error * sqrt(n)
    sd = std_error * np.sqrt(count)
    series_names.append(origin)
    means.append(mean)
    errors.append(sd)
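
As a quick check of the conversion used in Step 2, the sketch below compares the standard deviation recovered from the standard error with the one pandas computes directly. The values are made up for illustration. It relies on the fact that both sem and std in pandas default to the sample versions (ddof=1), so the two results should agree.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for one origin's MPG column.
values = pd.Series([18.0, 15.0, 22.0, 20.0, 25.0, 17.0])

count = len(values)
std_error = values.sem()                  # sample standard error (ddof=1)
sd_from_sem = std_error * np.sqrt(count)  # the conversion used in Step 2
sd_direct = values.std()                  # sample standard deviation (ddof=1)

print(sd_from_sem, sd_direct)
```

In practice, calling std() directly is the shorter route; the conversion is shown here only because the standard error was already being calculated.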

Step 3

With the standard deviations now in the errors list, we plot them onto the bar chart just as before.

Code:

fig, ax = plt.subplots()
fig.set_size_inches(8, 8)
ax.bar(series_names, means, yerr=errors, facecolor="lightgreen", ecolor="red", capsize=3)
ax.set_xlabel("Origin")
ax.set_ylabel("MPG")

Output:

Screenshot of a bar chart with long red error bars on top of each bar, showing errors other than 95% confidence intervals. Y-axis labelled "MPG" reads from bottom to top: 0, 5, 10, 15, 20, 25, 30, 35. X-axis labelled "Origin" reads from left to right: usa, japan, europe. The "usa" bar goes up to 20, with a red line from midway between 10 and 15 to midway between 25 and 30. The "japan" bar goes up to 30, with a red line from 25 to beyond 35. The "europe" bar goes up to between 25 and 30, with a red line from midway between 20 and 25 to 35.

It’s not advisable to deviate too far from the 95% confidence interval, as readers usually assume that this is what an error bar represents. If you do choose to show something other than 95% confidence intervals or standard deviations, you should make a note of this somewhere.
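
One place to make that note is on the figure itself. The sketch below uses hypothetical means and error sizes rather than the real MPG results, and states what the error bars show in the plot title. The title wording and its placement are only one option; a caption or axis label would serve equally well.

```python
import matplotlib.pyplot as plt

# Hypothetical values, for illustration only.
series_names = ["usa", "japan", "europe"]
means = [20.0, 30.0, 27.5]
errors = [6.0, 6.0, 6.5]

fig, ax = plt.subplots()
fig.set_size_inches(8, 8)
ax.bar(series_names, means, yerr=errors,
       facecolor="lightgreen", ecolor="red", capsize=3)
ax.set_xlabel("Origin")
ax.set_ylabel("MPG")
# Say what the error bars show so readers do not assume 95% confidence intervals.
ax.set_title("Mean MPG by origin (error bars: ±1 standard deviation)")
```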

Additional learning: Bootstrapping

That was about Matplotlib. Seaborn, by contrast, uses bootstrapping to calculate the 95% confidence interval of data. In essence, it’s a method of repeatedly resampling, with replacement, from a sample of the population, which gives good estimates of the true mean and its 95% confidence interval.

We will not go into the details here. If you want to learn more, watch the video from The University of Auckland’s Professor Chris Wild, which gives a great step-by-step introduction to this technique.

Watch: Confidence Intervals from Bootstrap resampling (8:23) [1]
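
For a rough feel of the idea in code, here is a minimal sketch of a percentile bootstrap using NumPy. It is one common variant of the technique and is not claimed to match Seaborn's internal implementation. The sample values, the number of resamples, and the seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so repeated runs match

# Made-up sample standing in for one origin's MPG values.
sample = np.array([18.0, 15.0, 22.0, 20.0, 25.0,
                   17.0, 21.0, 19.0, 23.0, 16.0])

n_resamples = 10_000

# Draw resamples of the same size as the sample, with replacement,
# and record the mean of each resample.
resamples = rng.choice(sample, size=(n_resamples, sample.size), replace=True)
boot_means = resamples.mean(axis=1)

# Percentile bootstrap: take the middle 95% of the resampled means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f"mean = {sample.mean():.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Unlike the t-based interval used earlier, this approach makes no use of the t distribution; the spread of the resampled means stands in for it.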

Do you see any difference?

What difference do you find between the last two outputs (you would have had the same outputs on your Jupyter Notebooks as well)?

Share your observations with your fellow learners in the comments.

References

  1. Confidence Intervals from Bootstrap re-sampling [Video]. Wild About Statistics; 2015 Apr 1. Available from: https://www.youtube.com/watch?v=iN-77YVqLDw

This article is from the free online course Data Visualisation with Python: Seaborn and Scatter Plots, created by FutureLearn.