
The essentials of character encoding

Text has been encoded in many different ways over the years. Watch Catilyn Merry give a brief history of these methods.
Now that you've got a grasp of converting numbers into binary, let's explore how computers turn these bits into text using character encoding. Morse code is an early form of character encoding. Telegraph technology used Morse code to communicate letters and numbers through a series of dots and dashes. [TAPPING] So "hello" in Morse code looks and sounds like this. [BEEPING] [.... H . E .-.. L .-.. L --- O] However, it's difficult for computers to mechanically process characters with so many different variations in length. To fix this problem, in 1874, Émile Baudot invented a 5-bit code to represent characters. [An image of a card with a row of small holes, with space for up to two holes above and three holes below. Different holes are punched out above and below each small hole.] Using this five-key keyboard, you can encode up to 60 different characters. But this was still limiting, as you couldn't have characters like uppercase letters. As the decades went by, our technology became more complex. By 1956, computers like the IBM Stretch represented data using 8 bits. 8 bits became known as a byte of data, a term we still use today. As computer hardware got even more complex, people would build upon previous technology by doubling the bit architecture, but it always remained a power of 2. Today we have 64-bit CPUs and some character encoding methods that are 32 bits. In the next step, you'll learn about two modern-day character encoding standards which use bytes of data: ASCII and Unicode.

Computers do not store digital media as letters, numbers, sounds, and pictures. Instead, computers work with bits: binary digits that have a value of 1 or 0 (on or off). To get from these bits to everything you see on a screen, characters (for example, letters) need to be encoded.

Here, we will go through a little history of how computers encode language.

What is character encoding?

To understand character encoding, let's go back to 1836, when Morse code was invented. Morse code, used with telegraph technology, represents data as electrical signals switched on and off and sent over long distances.

Instead of using ones and zeros, however, Morse code uses dots ., dashes -, and pauses, all of which we can describe as “symbols”.

For example, to say “hello” in Morse code, you would use this code:

 

 

Character    Morse code
H            ....
E            .
L            .-..
L            .-..
O            ---
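
As an illustration only (this sketch is not part of the original article), here is how a small lookup table in Python can perform the same encoding; the MORSE dictionary and to_morse function are just illustrative names:

# A minimal sketch: encode text as Morse code using a lookup table.
# Only the letters needed for "HELLO" are included.
MORSE = {
    "H": "....",
    "E": ".",
    "L": ".-..",
    "O": "---",
}

def to_morse(text):
    # Look up each character and join the codes with spaces.
    return " ".join(MORSE[ch] for ch in text.upper())

print(to_morse("hello"))  # prints: .... . .-.. .-.. ---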

 

We have come a long way since representing letters in Morse code, but the principles are still very similar: we encode information in different ways to represent data and bring it to life with computers. Let’s look at how we have progressed from Morse code to all the text you see on displays and screens.

 

Bits and bytes

 

If you look at the Morse code representation of ‘HELLO’, you can see that the codes have different lengths: H and L are encoded by four symbols, E by one symbol, and O by three symbols.

This is OK if you want human operators to understand the code, such as the people who send messages by telegraph. However, when you need alphabets or characters to be processed mechanically by a computer, you can run into issues with so many variations in code length.

 

In 1874, to fix this problem, Émile Baudot invented a five-bit code to represent characters. With this code, you can get up to 60 characters encoded using the keyboard below.

The image below shows the Baudot keyboard and an example of the code it produced.

 

Baudot keyboard and tape

 

However, being able to encode 60 characters is not enough to cover characters like uppercase letters and some numbers.

 

Because Baudot code represents each character with five bits, five bits was an important measure: it was the 1874 version of a byte. In fact, the term ‘byte’ wasn’t defined until 1956, and by then, computers like the IBM Stretch were designed to represent data using a maximum of eight bits. This is why a byte means eight bits!
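
To see why the number of bits matters, here is a small illustrative Python sketch (added for this rewrite, not from the original article) that counts how many distinct patterns a given number of bits can represent:

# Each extra bit doubles the number of distinct patterns available.
for n in (5, 8, 16, 32):
    print(f"{n} bits can represent {2 ** n} distinct values")

# Output:
# 5 bits can represent 32 distinct values
# 8 bits can represent 256 distinct values
# 16 bits can represent 65536 distinct values
# 32 bits can represent 4294967296 distinct values

(Five bits on their own give only 32 combinations; the Baudot system reached around 60 characters by using shift codes to switch between two sets of characters.)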

 

Photograph of the IBM Stretch

 

To this day, a byte is still considered to be eight bits. And as computer hardware became more complex, people created new computer architectures that built easily upon existing technology by doubling the bit width while keeping it a power of 2. Thus, today we have 64-bit CPUs and some character encoding methods that are 32 bits. In the next section, we will look at how ASCII represents characters by using up to 8 bits, or a byte, of data.
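
As a small preview of that, the sketch below (illustrative only) uses Python’s built-in ord and chr functions to show the number, and the 8-bit pattern, behind the ASCII character ‘A’:

# ASCII stores 'A' as the number 65, which fits in a single 8-bit byte.
value = ord("A")
print(value)                  # 65
print(format(value, "08b"))   # 01000001
print(chr(value))             # A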

 

Representing larger amounts of data

 

A byte of data is defined as 8 bits, and because many files stored on computers are much bigger than this, indicating file size in the number of bytes would be unwieldy and hard to comprehend. Instead, we use prefixes before ‘byte’ to represent larger numbers.

Because computers use binary, each prefix represents a value that is 2^10 = 1024 times larger than the previous one.

 

 

Value              Equal to          In bytes
1 kilobyte (KB)    1024 bytes        1,024 bytes
1 megabyte (MB)    1024 kilobytes    1,048,576 bytes
1 gigabyte (GB)    1024 megabytes    1,073,741,824 bytes
1 terabyte (TB)    1024 gigabytes    1,099,511,627,776 bytes
1 petabyte (PB)    1024 terabytes    1,125,899,906,842,624 bytes

 

To convert between these, you need to divide by 1024 if you want to go down a step in the table to a larger prefix. For example, 256,000 bytes would be 256,000 / 1024 = 250 KB. Working this out in megabytes gives 250 / 1024 ≈ 0.244 MB.

 

To go up a step in the table to a smaller prefix, you need to multiply by 1024. So a three-terabyte hard drive can store 3 * 1024 GB = 3072 GB, or 3072 * 1024 MB = 3,145,728 MB.
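
Both calculations can be checked with a few lines of Python (an illustrative sketch, not part of the article):

# Going to a larger prefix: divide by 1024 at each step.
size_bytes = 256_000
print(size_bytes / 1024)           # 250.0   (kilobytes)
print(size_bytes / 1024 / 1024)    # ~0.244  (megabytes)

# Going to a smaller prefix: multiply by 1024 at each step.
size_tb = 3
print(size_tb * 1024)              # 3072     (gigabytes)
print(size_tb * 1024 * 1024)       # 3145728  (megabytes)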

These prefix values often lead to confusion, because elsewhere (in scientific subjects, for example), ‘kilo’ means 1000 times, not 1024 times; ‘mega’ means 1,000,000 times, not 1,048,576 times; and so on.

To try to prevent confusion, the units representing steps of 1024 are sometimes called kibibyte, mebibyte, gibibyte, tebibyte, pebibyte, and so on. However, these terms are not always consistently used.
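
For example, here is a small comparison of the two conventions in Python (illustrative only):

# Decimal (SI) prefixes step by 1000; binary prefixes step by 1024.
gigabyte = 1000 ** 3    # 1 GB  = 1,000,000,000 bytes
gibibyte = 1024 ** 3    # 1 GiB = 1,073,741,824 bytes
print(gibibyte - gigabyte)    # 73741824 bytes of difference
print(gibibyte / gigabyte)    # ~1.074, so the gap is about 7.4%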

This article is from the free online course Data Representation in Computing: Bring Data to Life.
