In 1963 the American Standard Code for Information Interchange, or ASCII, was adopted so that information could be translated between computers. It was designed to create an international standard for encoding the Latin alphabet: turning binary numbers into the text on your computer screen. ASCII encodes characters into seven bits of binary data. Since each bit can either be a 1 or a 0, that gives a total of 128 possible combinations. Each of these binary numbers can be converted to a denary number from 0 through to 127. For example, 1000001 in binary equals 65 in denary. In ASCII, each denary number corresponds to a character that we want to encode, from upper- and lower-case letters to numbers, symbols, and computer commands.

For example, 65 equals uppercase A. Lowercase j equals 106, or 1101010 in binary. And 0100001 equals 33, which encodes the exclamation mark symbol. Here's how HELLO is encoded into binary using ASCII. [H 1001000 E 1000101 L 1001100 L 1001100 O 1001111] But what if we're using 8-bit bytes? We simply put a 0 at the front of the binary number, so in 8-bit, HELLO looks like this. Let's look at all this in practice. On your computer, open up a Notepad text editor. Type a message, "Data is beautiful.", and save it.

Look at the size of the file: 18 bytes. Now add another word: "Data is so beautiful." You've added three new characters: S, O, and a space. If you look at the file size again, you'll see that it's increased by 3 bytes. So ASCII uses 7 bits to represent 128 characters. But when 8-bit computers were developed, the extra digit meant that 256 characters could now be encoded. Problems arose when countries began using these extra characters inconsistently, so different numbers represented different characters in different languages. Japan created multiple systems for encoding its language, which varied depending on the hardware. Messages sent from one Japanese computer to another became garbled and unreadable when the computer translated the data incorrectly.

Mistakes in converting Japanese characters became such a problem that there's even a name for it: mojibake. This problem became far worse with the invention of the World Wide Web. To deal with the issues caused by sending documents in different languages all around the world, a consortium was established to create a worldwide standard, Unicode. Like ASCII, Unicode assigns each character a specific number. Unicode also uses the old ASCII encoding for the English language, so uppercase A is still 65. But Unicode encodes far more than 100,000 characters across most languages. To do this it doesn't use 8 bits of data, it uses 32. But 65 encoded into 32 bits looks like this, which wastes a lot of space.

Also, many older computers interpret eight zeros in a row as the end of a string of characters, also called a null, meaning they won't send any characters that come afterwards. The Unicode encoding method UTF-8 solves these problems. Up until number 127 the ASCII value stays the same, so A is still 01000001. For anything higher than 127, UTF-8 separates the code into two bytes. It adds 110 to the first byte, and 10 to the second byte. Then you just fill in the binary for the bits in between. For example, the number 325 equals 00101000101, which slots in like this. That works for the first 2,048 characters. After that another byte is added.

And another 1 is added at the beginning of the first byte, like this. This gives you 16 bits for your binary code. In fact, you can go up to four bytes of data, which looks like this. So UTF-8 avoids the 8-zero problem, and it's backwards compatible with the old ASCII system. And that's a summary of ASCII and UTF-8, two important standards that have defined how characters are encoded from ones and zeros into the digital text you view every day.

Character encoding now

Two character encoding standards define how characters are decoded from ones and zeros into the text you see on the screen right now, and into the different languages viewed every day on the World Wide Web. These two encoding standards are ASCII and Unicode.

ASCII

The American Standard Code for Information Interchange (ASCII) was developed to create an international standard for encoding the Latin alphabet. In 1963, ASCII was adopted so that information could be translated between computers, representing lower- and upper-case letters, numbers, symbols, and some commands. ASCII encodes each character using seven bits of binary data (ones and zeros, the base 2 number system). Seven bits allow 2 to the power of 7 = 128 possible combinations, so 128 different characters can be encoded.

ASCII therefore made sure that 128 important characters could be encoded:

ASCII table: this table can be downloaded as a PDF at the end of the step

How ASCII encoding works

  • You already know how to convert between denary and binary numbers
  • You now need to turn letters into binary numbers
  • Every character has a corresponding denary number (for example, A → 65)
  • ASCII uses 7 bits
  • We use the first 7 columns of the conversion table to create 128 different numbers (from 0 to 127)

For example, 1000001 gives us the number 65 (64 + 1), which corresponds to the letter ‘A’.

64  32  16  8  4  2  1
 1   0   0  0  0  0  1
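
If you want to check a conversion like this with code, here is a minimal Python sketch (purely illustrative; the variable names are my own):

    # Convert the 7-bit ASCII code 1000001 into a denary number, then into a character.
    bits = "1000001"

    # int(..., 2) reads the string as a base 2 number: 64 + 1 = 65.
    code = int(bits, 2)

    # chr() looks up the character with that code: 65 -> 'A'.
    print(code, chr(code))  # prints: 65 A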

Here’s how ‘HELLO’ is encoded in ASCII in binary:

Latin character    ASCII
H                  1001000
E                  1000101
L                  1001100
L                  1001100
O                  1001111
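
You can reproduce this table in Python. This is a small illustrative sketch: ord() gives the denary code of a character, and format(..., '07b') writes that code as a 7-bit binary number:

    # Print the 7-bit ASCII code for each letter of 'HELLO'.
    for letter in "HELLO":
        code = ord(letter)          # e.g. 'H' -> 72
        bits = format(code, "07b")  # 72 -> '1001000', padded to 7 bits
        print(letter, bits)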

Let’s apply this theory in practice:

  1. Open Notepad, or whichever plain text editor you prefer
  2. Type a message and save it, e.g. ‘data is beautiful’
  3. Look at the size of the file — mine is 18 bytes
  4. Now, add another word, e.g. ‘data is SO beautiful’
  5. If you look at the file size again, you'll see that it has changed: my file is now 3 bytes larger, one byte each for the 'S', the 'O', and the space (you can check this with the sketch below)
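
If you would rather check the byte counts without a text editor, this Python sketch encodes both messages as ASCII and compares their lengths (illustrative only; depending on punctuation and whether your editor saves a trailing newline, your totals may differ by a byte):

    # Count how many bytes each message takes when encoded as ASCII.
    short = "data is beautiful".encode("ascii")
    longer = "data is SO beautiful".encode("ascii")

    print(len(short), len(longer))   # 17 20
    print(len(longer) - len(short))  # 3 extra bytes: the 'S', the 'O', and the space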

Unicode and UTF-8

Because ASCII encodes characters in 7 bits, moving to 8-bit computing technology meant there was one extra bit to be used. With this extra digit, Extended ASCII could encode up to 256 characters. The problem was that different countries did different things with this extra capacity: many added their own additional characters, so the same numbers represented different characters in different languages. Japan even created multiple systems for encoding Japanese, which varied depending on the hardware, and all of these methods were incompatible with each other. When a message was sent from one type of Japanese computer to another, the received text could become garbled and unreadable; this happened so often that it has its own name, 'mojibake':

An example of a Mojibake on a webpage - the text is a mixture of random symbols that make no sense
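
You can produce a small mojibake of your own in Python. In this illustrative sketch, Japanese text is encoded with one system (Shift JIS) and then decoded with a different one (Latin-1), so every byte is mapped to the wrong character:

    # Encode Japanese text ('mojibake') using the Shift JIS encoding...
    data = "文字化け".encode("shift_jis")

    # ...then decode the same bytes as if they were Latin-1 text.
    # Every byte still maps to some character, just not the intended one,
    # so the output is a garbled string of meaningless symbols.
    print(data.decode("latin-1"))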

The problem of incompatible encoding systems became more urgent with the invention of the World Wide Web, as people shared digital documents all over the world, using multiple languages. To address the issue, the Unicode Consortium established a universal encoding system called Unicode. Unicode encodes more than 100,000 characters, covering all the characters you would find in most languages. It assigns each character a specific number (its code point) rather than going straight to a binary representation. But there were some issues with this, for example:

  1. Representing this many characters takes a lot more bits: in a simple fixed-width scheme, each character is stored in 32 binary digits. Unicode uses ASCII for the English language, so A is still 65. However, encoded in 32 bits, the binary representation for the letter A would be 00000000000000000000000001000001. This wastes a lot of valuable space!
  2. Many older computers interpret eight zeros in a row (a null) as the end of a string of characters, so they wouldn't send any characters that came after eight zeros in a row (they wouldn't send an A if it was represented as 00000000000000000000000001000001). Both issues are shown in the sketch below.
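
Both issues are easy to demonstrate in Python. This sketch (illustrative; the big-endian UTF-32 variant is chosen so the zero bytes appear first) compares a fixed 32-bit encoding of 'A' with its UTF-8 encoding:

    # Encode 'A' with a fixed-width 32-bit encoding and with UTF-8.
    fixed = "A".encode("utf-32-be")   # big-endian, no byte-order mark
    compact = "A".encode("utf-8")

    print(fixed)    # b'\x00\x00\x00A' -> three all-zero bytes before the 65
    print(compact)  # b'A'             -> a single byte with the value 65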

The Unicode encoding method UTF-8 solves these problems:

  • Up to character number 127, the regular ASCII value is used (so, for example, A is still 01000001)
  • For any character beyond 127, UTF-8 separates the code into two bytes, adding '110' to the start of the first byte to show that it is the beginning byte, and '10' to the start of the second byte to show that it is a continuation byte

So, for each character beyond number 128, you have two bytes:

[110xxxxx] [10xxxxxx]

And you just fill in the binary for the number in between:

[11000101] [10000101] (that's the number 325 → 00101000101)
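
You can verify this worked example in Python; this quick, illustrative check shows that character number 325 really does encode to the bytes 11000101 and 10000101:

    # Character number 325 is 'Ņ' (U+0145); encode it as UTF-8 and show each byte in binary.
    encoded = chr(325).encode("utf-8")

    print([format(byte, "08b") for byte in encoded])
    # ['11000101', '10000101'] -> payloads 00101 and 000101, i.e. 00101000101 = 325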

This works for the first 2048 characters. For characters beyond that, one more ‘1’ is added at the beginning of the first byte and a third byte is also used:

[1110xxxx] [10xxxxxx] [10xxxxxx]

This gives you 16 bits for your binary code. In this manner, UTF-8 goes up to four bytes:

[11110xxx] [10xxxxxx] [10xxxxxx] [10xxxxxx]

In this way, UTF-8 avoids the problems mentioned above without wasting space, and it is backwards-compatible with the old ASCII system: text encoded in plain ASCII is also valid UTF-8 and decodes to exactly the same characters.
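
A short Python sketch makes the variable width and the backwards compatibility concrete (the characters below are just examples; any others from the same ranges would behave the same way):

    # UTF-8 uses 1 to 4 bytes per character, depending on the character's number.
    for character in ["A", "é", "語", "🎉"]:
        print(character, len(character.encode("utf-8")), "byte(s)")

    # Backwards compatibility: plain ASCII bytes are already valid UTF-8.
    print(b"HELLO".decode("utf-8"))  # prints: HELLO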

Activities in class

There are many fun activities for teaching character encoding. We have included two exercises below for you to try in your classroom. What top tips do you have for teaching character encoding? Share them in the comments!

  • Translating secret messages: post a short secret message in ASCII in the comments section, and translate or respond to other participants’ ASCII messages (a small helper script is sketched after this list)

  • Binary bracelets: create bracelets using different coloured beads to represent ones and zeros and spell out an initial or a name in ASCII
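
For the secret-message activity, a small helper can speed up encoding and decoding. This is a purely illustrative Python sketch (the function names are made up for this example) that converts text to and from space-separated 7-bit ASCII codes:

    def to_binary(message):
        """Encode a message as space-separated 7-bit ASCII codes."""
        return " ".join(format(ord(character), "07b") for character in message)

    def from_binary(bits):
        """Decode space-separated 7-bit ASCII codes back into text."""
        return "".join(chr(int(code, 2)) for code in bits.split())

    secret = to_binary("HELLO")
    print(secret)               # 1001000 1000101 1001100 1001100 1001111
    print(from_binary(secret))  # HELLO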

