This content is taken from the Raspberry Pi Foundation & National Centre for Computing Education's online course, Representing Data with Images and Sound: Bringing Data to Life. Join the course to learn more.
Raspberry Pi Foundation

Now you’ve got a grasp on converting numbers into binary, let’s explore how computers turn these bits into text using character encoding. Morse code is an early form of character encoding. Telegraph technology used Morse code to communicate letters and numbers through a series of dots and dashes. [TAPPING] So “hello” in Morse code looks and sounds like this. [BEEPING] [H ...., E ., L .-.., L .-.., O ---]

However, it’s difficult for computers to mechanically process characters with so many different variations in length. To fix this problem, in 1874, Émile Baudot invented a 5-bit code to represent characters. [An image of a card with a row of small holes, with space for up to two holes above and three holes below. Different holes are punched out above and below each small hole.] Using this [five key] keyboard, you can encode up to 60 different characters. But this was still limiting, as you couldn’t have characters like uppercase letters.

As the decades went by, our technology became more complex. By 1956, computers like the IBM Stretch represented data using 8 bits. 8 bits became known as a byte of data, a term we still use today. As computer hardware got even more complex, people would build upon previous technology by doubling the bit architecture, but it always remained a power of 2. Today we have 64-bit CPUs and some character encoding methods that are 32 bits. In the next step, you’ll learn about two modern-day character encoding standards which use bytes of data: ASCII and Unicode.

The essentials of character encoding

From earlier sections in this course, you know that computers do not store digital media as letters, numbers, sounds, and pictures. Instead, computers work with bits, binary digits that have a value of 1 or 0 (on or off). To get from these bits to all the things you see thanks to computers, characters (for example letters) shown on-screen need to be encoded. Here, we will go through a little history of how computers encode language.

To understand character encoding, let’s go back to 1836, when Morse code was invented. Morse code was sent over telegraph lines, where an operator’s switch turned an electrical signal on and off to carry data over long distances. Instead of using ones and zeros, however, Morse code uses dots (.), dashes (-), and pauses, all of which we can describe as “symbols”. For example, to say “hello” in Morse code, you would use this code:

Character   Morse code
H           ....
E           .
L           .-..
L           .-..
O           ---
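
The lookup in the table above can be sketched in a few lines of Python. This is only an illustration: the dictionary name MORSE and the function to_morse are made up for this example, and the dictionary holds just the four letters needed for “hello”, not the full Morse alphabet.

```python
# Morse codes for only the letters that appear in the table above.
MORSE = {"H": "....", "E": ".", "L": ".-..", "O": "---"}


def to_morse(word):
    """Return the Morse code for each letter of word, separated by spaces."""
    return " ".join(MORSE[letter] for letter in word.upper())


print(to_morse("hello"))  # .... . .-.. .-.. ---
```

The space between codes stands in for the pause that a telegraph operator would leave between letters.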

We have come a long way since representing letters in Morse code, but the principles are still very similar: we encode information in different ways to represent data and bring it to life with computers. Let’s look at how we have progressed from Morse code to all the text you see on displays and screens.

Bits and bytes

If you look at the Morse code representation of ‘HELLO’, you can see different lengths of the codes: H and L are encoded by four symbols, E by one symbol, and O by three symbols. This is OK if you want human operators to understand the code, such as the people who send messages by telegraph. However, when you need alphabets or characters to be processed mechanically by a computer, you can run into issues with so many variations in code length.
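
To see the variation in code length for yourself, you could count the symbols in each code. The short Python sketch below does this for the letters of ‘HELLO’ only; the names MORSE and lengths are chosen just for this example.

```python
# Morse codes for the letters of 'HELLO' (not the full alphabet).
MORSE = {"H": "....", "E": ".", "L": ".-..", "O": "---"}

# Count how many symbols (dots and dashes) each letter needs.
lengths = {letter: len(code) for letter, code in MORSE.items()}

print(lengths)  # {'H': 4, 'E': 1, 'L': 4, 'O': 3}
```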

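In 1874, to fix this problem, Émile Baudot invented a five-bit code to represent characters. Five bits on their own give only 2 to the power of 5 = 32 distinct combinations, but by using shift codes to switch between a set of letters and a set of figures, you can get up to 60 characters encoded using the keyboard below. Beside it is an example of the Baudot code that was produced by this keyboard.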

However, being able to encode 60 characters is not enough to cover characters like uppercase letters and some numbers.
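
The idea of a fixed-length code can be sketched in Python. Note the assumption here: the numbering used below (A = 1, B = 2, and so on) is made up purely to show that every character takes exactly five bits. It is not the real Baudot assignment, and the function name five_bit_pattern is hypothetical.

```python
def five_bit_pattern(number):
    """Return number (0 to 31) as a string of exactly five 0s and 1s."""
    if not 0 <= number < 2 ** 5:
        raise ValueError("a five-bit code only has room for 0 to 31")
    return format(number, "05b")


# Illustrative numbering only (A=1, B=2, ...), NOT the real Baudot code.
for letter in "HELLO":
    number = ord(letter) - ord("A") + 1
    print(letter, five_bit_pattern(number))
```

Unlike Morse code, every pattern printed here has the same length, which is what makes this kind of code easier to process mechanically.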

Because Baudot code represents each character by five bits, five bits was an important measure: in effect, it was the 1874 version of a byte. The term ‘byte’ itself wasn’t defined until 1956, and by then, computers like the IBM Stretch were designed to represent data using a maximum of eight bits. This is why a byte means eight bits!

To this day, a byte is still considered to be eight bits. And as computer hardware became more complex, people created new computer architectures that built upon existing technology by doubling the current bit architecture, making sure it always remained a power of 2. Thus, today we have 64-bit CPUs and some character encoding methods that are 32 bits. In the next section, we will look at how ASCII represents characters by using up to 8 bits, or a byte, of data.

Representing larger amounts of data

A byte of data is defined as 8 bits, and because many files stored on computers are much bigger than this, indicating file size in the number of bytes would be unwieldy and hard to comprehend. Instead we use prefixes before ‘byte’ to represent larger numbers. Because computers use binary, each of the prefixes represents a value that is 2 to the power of 10 = 1024 times the previous one.

Value               Equal to          In bytes
1 kilobyte (KB)     1024 bytes        1024 bytes
1 megabyte (MB)     1024 kilobytes    1048576 bytes
1 gigabyte (GB)     1024 megabytes    1073741824 bytes
1 terabyte (TB)     1024 gigabytes    1099511627776 bytes
1 petabyte (PB)     1024 terabytes    1125899906842624 bytes
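
The ‘In bytes’ column can be reproduced by multiplying by 1024 once for each step down the table. Here is a minimal Python sketch; the names PREFIXES and bytes_in are chosen just for this example.

```python
PREFIXES = [
    "kilobyte (KB)",
    "megabyte (MB)",
    "gigabyte (GB)",
    "terabyte (TB)",
    "petabyte (PB)",
]


def bytes_in(step):
    """Return the number of bytes in one unit that is `step` rows down the table."""
    return 1024 ** step


for step, name in enumerate(PREFIXES, start=1):
    print(f"1 {name} = {bytes_in(step)} bytes")
```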

To convert between these, you need to divide by 1024 if you want to go down a step in the table to a larger prefix. For example, 256000 bytes would be 256000/1024 = 250KB. Working this out in megabytes gives 250/1024 = 0.244MB.

To go up a step in the table to a smaller prefix, you need to multiply by 1024. So a three-terabyte hard drive can store 3 * 1024GB = 3072GB, or 3072 * 1024MB = 3145728MB.
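
Both directions of conversion can be combined into one small Python function. This is a sketch only: the list UNITS and the function convert are names invented for this example, and it assumes the 1024-based meaning of each prefix used in the table.

```python
# Units in order from smallest to largest, each 1024 times the one before.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def convert(value, from_unit, to_unit):
    """Convert value between units, using steps of 1024."""
    steps = UNITS.index(to_unit) - UNITS.index(from_unit)
    if steps >= 0:
        # Moving to a larger prefix: divide by 1024 for each step.
        return value / 1024 ** steps
    # Moving to a smaller prefix: multiply by 1024 for each step.
    return value * 1024 ** -steps


print(convert(256000, "B", "KB"))         # 256000 bytes in kilobytes
print(round(convert(250, "KB", "MB"), 3)) # 250 kilobytes in megabytes
print(convert(3, "TB", "GB"))             # 3 terabytes in gigabytes
print(convert(3, "TB", "MB"))             # 3 terabytes in megabytes
```

The four calls correspond to the worked examples in the text: 250KB, 0.244MB, 3072GB, and 3145728MB.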

These prefix values often lead to confusion, because elsewhere (in scientific subjects, for example), ‘kilo’ means 1000 times, not 1024 times; ‘mega’ means 1000000 times, not 1048576 times; and so on. To try to prevent confusion, the units representing steps of 1024 are sometimes called kibibyte, mebibyte, gibibyte, tebibyte, pebibyte, and so on. However, these terms are not always consistently used.
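
The size of the gap between the two meanings can be checked with a short calculation. In this Python sketch, the function names in_decimal_gigabytes and in_gibibytes are made up for the example; the first uses steps of 1000 and the second uses steps of 1024.

```python
def in_decimal_gigabytes(number_of_bytes):
    """Gigabytes where 'giga' means 1000 * 1000 * 1000."""
    return number_of_bytes / 1000 ** 3


def in_gibibytes(number_of_bytes):
    """Gibibytes, where each step is 1024 rather than 1000."""
    return number_of_bytes / 1024 ** 3


size = 1073741824  # the '1 gigabyte' row of the table, in bytes

print(in_gibibytes(size))          # 1.0
print(in_decimal_gigabytes(size))  # a little over 1.07
```

The same number of bytes is exactly 1 when counted in steps of 1024, but about 7% more than 1 when counted in steps of 1000, and the gap widens with each larger prefix.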