This content is taken from Coventry University's online course, Get ready for a Masters in Data Science and AI. Join the course to learn more.

Sizing up data

When it comes to data, size is usually described in units of kilobytes (kB), megabytes (MB), gigabytes (GB) and terabytes (TB). What do these terms mean?

Before we consider data measurements, let’s use something we’re all familiar with: distance. When measuring distances, the metre (m) is the metric unit of length. Sometimes we need to use larger units. For example, distances between towns can be given in kilometres, where the prefix kilo means 1,000, i.e. 1,000 metres is one kilometre (km). Sometimes we need to use smaller units. For example, the prefix nano means one billionth, i.e. \(10^{-9}\) metres (or 0.000000001 m) is 1 nanometre (nm).

In the previous step we saw how data is stored in a computer as binary digits (bits), organised into groups of 8 bits called bytes. We saw that a basic (ASCII) character in UTF-8 is stored as one byte. Sometimes we need to use units larger than one byte. For example, the text of the Complete Works of Shakespeare can be downloaded from Project Gutenberg as a 5.5 MB text file. Here “MB” stands for megabytes, where the prefix mega means million, i.e. 1 megabyte (MB) is 1,000,000 bytes.
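To make the link between characters and bytes concrete, here is a minimal Python sketch. The sample strings are chosen purely for illustration; the 5.5 MB figure is the file size quoted above. It also shows that characters outside the basic ASCII set (such as "é") take more than one byte in UTF-8.

```python
# Count how many bytes a piece of text occupies when encoded as UTF-8.
text = "To be, or not to be"        # illustrative sample, plain ASCII
encoded = text.encode("utf-8")      # convert the string to raw bytes

n_chars = len(text)                 # number of characters
n_bytes = len(encoded)              # number of bytes (same as n_chars here)
print(n_chars, n_bytes)

# A character outside the basic ASCII set needs more than one byte.
accented = "café"
accented_bytes = len(accented.encode("utf-8"))   # 4 characters, 5 bytes
print(len(accented), accented_bytes)

# A 5.5 MB plain-text file holds roughly 5.5 million one-byte characters.
shakespeare_bytes = int(5.5 * 1_000_000)
print(shakespeare_bytes)
```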

The table below shows the standard prefixes and how they apply to units of weight (grams) and units of data (bytes).

Multiple      Weight (grams)       Data (bytes)
1             g   gram             B   byte
1000          kg  kilogram         kB  kilobyte
\(1000^2\)    Mg  megagram         MB  megabyte
\(1000^3\)    Gg  gigagram         GB  gigabyte
\(1000^4\)    Tg  teragram         TB  terabyte
\(1000^5\)    Pg  petagram         PB  petabyte
\(1000^6\)    Eg  exagram          EB  exabyte
\(1000^7\)    Zg  zettagram        ZB  zettabyte
\(1000^8\)    Yg  yottagram        YB  yottabyte
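The prefixes in the table can be applied mechanically: keep dividing by 1000 until the number drops below 1000. The sketch below is one possible way to do this in Python; the function name `format_bytes` is hypothetical and is not part of any standard library.

```python
# Decimal (power-of-1000) prefixes, in the same order as the table.
PREFIXES = ["", "k", "M", "G", "T", "P", "E", "Z", "Y"]

def format_bytes(n_bytes):
    """Return a readable size string using powers of 1000."""
    value = float(n_bytes)
    for prefix in PREFIXES:
        # Stop once the value is below 1000, or we run out of prefixes.
        if value < 1000 or prefix == PREFIXES[-1]:
            return f"{value:g} {prefix}B"
        value /= 1000

print(format_bytes(5_500_000))       # the Shakespeare text file
print(format_bytes(16 * 1000**3))    # 16 gigabytes
print(format_bytes(40 * 1000**7))    # 40 zettabytes
```

Note that this sketch uses decimal prefixes only, matching the table; some software reports sizes in powers of 1024 instead.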

It might seem strange to use such terms for weight but, for example, the mass of the Earth is approximately \(5.972\times10^{24}\) kilograms, which is \(5972\times 1000^8\) grams, i.e. 5,972 yottagrams. We are in the so-called zettabyte era, since it is estimated that in 2020 all the data stored in the world added up to approximately 40 zettabytes.
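The conversion of the Earth's mass into yottagrams can be worked through step by step, using only the figures quoted above and the fact that \(1000^8 = 10^{24}\):

```latex
\begin{align*}
5.972\times10^{24}\ \text{kg}
  &= 5.972\times10^{24}\times10^{3}\ \text{g} \\
  &= 5.972\times10^{27}\ \text{g} \\
  &= 5972\times10^{24}\ \text{g} \\
  &= 5972\times1000^{8}\ \text{g} \\
  &= 5972\ \text{Yg}
\end{align*}
```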

When planning to collect data, how do you know how much space it will take up on your filesystem? For example, Wikipedia has pages and pages of information within it, but the Wikipedia article Size of Wikipedia shows that the size of the English Wikipedia database is approximately 16 GB (compressed). Anyone could easily download a complete copy of Wikipedia if they wanted to.

A typical modern laptop or desktop PC might have 8 GB of memory and a 1 TB hard drive. Depending on what we’re using our computer for, we sometimes wish for a computer with more memory or a larger hard drive. Next time you go shopping for a computer and ponder what size hard drive you need, consider that the first hard drive created by IBM in 1956 had a capacity of 5 MB and the first PC created by IBM in 1981 could only hold a maximum of 256 kB of memory. Over the years since then, devices have become smaller and cheaper while memory and storage have become larger. How much data do you think your devices will be able to store in 50 years?
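To get a feel for the scale of this growth, the figures quoted above can be compared directly. The following Python sketch assumes decimal prefixes (powers of 1000) throughout; the variable names are illustrative.

```python
# Decimal unit sizes in bytes.
KB = 1000
MB = 1000**2
GB = 1000**3
TB = 1000**4

# Storage: IBM's first hard drive (1956) versus a typical modern drive.
first_hard_drive = 5 * MB
modern_hard_drive = 1 * TB
storage_growth = modern_hard_drive // first_hard_drive

# Memory: the first IBM PC's maximum (1981) versus a typical modern machine.
first_pc_memory = 256 * KB
modern_memory = 8 * GB
memory_growth = modern_memory // first_pc_memory

print(f"Storage is {storage_growth:,} times larger")
print(f"Memory is {memory_growth:,} times larger")
```

Under these assumptions, storage has grown by a factor of 200,000 and memory by a factor of 31,250.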

Your task

The infographic Data Never Sleeps gives an idea of how much data is generated every minute.

Search for some information on the average length of a tweet and use it to estimate the size of a day’s worth of tweets worldwide.
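One way to structure the estimate is sketched below in Python. The function name `estimate_daily_bytes` is hypothetical, and the input values are placeholders, NOT real statistics: replace them with the figures you find in your own search. The sketch also assumes one byte per character, which only holds for basic (ASCII) characters in UTF-8.

```python
def estimate_daily_bytes(items_per_minute, average_chars, bytes_per_char=1):
    """Estimate the bytes of text generated per day.

    Assumes a constant rate all day and a fixed number of bytes
    per character (1 by default, i.e. plain ASCII text in UTF-8).
    """
    minutes_per_day = 24 * 60
    return items_per_minute * minutes_per_day * average_chars * bytes_per_char

# Placeholder inputs for illustration only; these are NOT real figures.
tweets_per_minute = 100_000
average_tweet_chars = 50

daily_bytes = estimate_daily_bytes(tweets_per_minute, average_tweet_chars)
print(f"{daily_bytes:,} bytes per day")
print(f"{daily_bytes / 1000**3} GB per day")
```

This counts only the text of each tweet; images, video and metadata would add considerably more.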


References

DOMO. (n.d.). Data never sleeps 7.0. Retrieved July 24, 2020, from https://www.domo.com/learn/data-never-sleeps-7

Wikipedia. (2020, July 24). Wikipedia:Size of Wikipedia. https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#Size_of_the_English_Wikipedia_database

Wikipedia. (2020, July 7). Zettabyte Era. https://en.wikipedia.org/wiki/Zettabyte_Era
