Frequency analysis

Despite the huge number of possible substitution ciphers, they're very easy to break using frequency analysis. Let's see how.

We’ve seen that there are over 403 septillion ways of permuting the 26 letters of the English alphabet. That’s roughly 403 followed by 24 zeros. Checking one permutation per second to see if it yields an English decipherment would take about 13 billion billion years.
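If you'd like to check these figures yourself, the short Python sketch below does the arithmetic. The variable names are our own, and we assume a year of 365.25 days; a slightly different year length would not change the conclusion.

```python
import math

# Number of ways to arrange the 26 letters of the alphabet, i.e. 26!
permutations = math.factorial(26)

# Assumption: a year of 365.25 days.
seconds_per_year = 365.25 * 24 * 60 * 60

# Time needed to try every key at one permutation per second.
years = permutations / seconds_per_year

print(f"{permutations:,}")    # 403,291,461,126,605,635,584,000,000
print(f"{years:.2e} years")   # about 1.28e+19, roughly 13 billion billion
```

Even a machine checking a billion permutations per second would still need over ten billion years, so brute force is not a realistic attack.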

However, we know that substitution ciphers were being broken as long ago as the 9th Century AD thanks to al-Kindi’s method based on frequency analysis. This attack was known in Europe by the 1400s.

It comes down to the fact that some letters typically appear more often in English than others do: for example, the letter E gets used much more than the letter X. Since each plaintext letter gets enciphered to the same ciphertext letter, frequency analysis allows us to make a good guess at the permutation that has been used.
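To see why this matters, here is a minimal Python sketch that counts letter frequencies before and after enciphering. The function name letter_frequencies, the key (the letters in keyboard order) and the sample message are all made up for illustration. The point is that a substitution cipher relabels the letters but leaves the pattern of frequencies untouched.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Return each letter's share of the alphabetic characters in text."""
    letters = [c for c in text.upper() if c in string.ascii_uppercase]
    counts = Counter(letters)
    total = len(letters)
    return {letter: counts[letter] / total for letter in counts}

# Hypothetical key: A is enciphered as Q, B as W, C as E, and so on.
key = "QWERTYUIOPASDFGHJKLZXCVBNM"
encipher = str.maketrans(string.ascii_uppercase, key)

plaintext = "MEET ME AT THE SECRET TREE"
ciphertext = plaintext.translate(encipher)

plain_freq = letter_frequencies(plaintext)
cipher_freq = letter_frequencies(ciphertext)

# E is enciphered as T under this key, so T in the ciphertext
# inherits exactly the frequency that E had in the plaintext.
print(plain_freq["E"], cipher_freq["T"])
```

The two printed values are equal: the most common plaintext letter is still the most common ciphertext letter, just under a different name.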

Here are the relative frequencies with which each letter of the alphabet appears in written English.

We can see that E is the most common letter, followed by T and then A. So, if we’ve been given some ciphertext which we think may have been produced by applying a substitution cipher to some English plaintext, we can try to decode it by replacing the most common letter in the ciphertext with E, the next most common letter with T, and so on. Once the most common letters are in place, we can hopefully work out the less common letters “by eye”.
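The first step of this attack can be sketched in Python as follows. This is only an illustration: the names guess_key and apply_guess are our own, and the ordering in ENGLISH_BY_RANK is one commonly quoted approximation of English letter frequencies (different sources disagree beyond the first few letters).

```python
from collections import Counter
import string

# Assumption: an approximate ranking of English letters, most common first.
ENGLISH_BY_RANK = "ETAOINSHRDLUCMWFGYPBVKJXQZ"

def guess_key(ciphertext):
    """Pair the most common ciphertext letter with E, the next with T, and so on."""
    counts = Counter(c for c in ciphertext.upper() if c in string.ascii_uppercase)
    # most_common() lists letters from most to least frequent.
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_BY_RANK))

def apply_guess(ciphertext, guess):
    """Replace each ciphertext letter by its guessed plaintext letter."""
    return "".join(guess.get(c, c) for c in ciphertext.upper())

# Toy example: X is the most common letter, then Y, then Z.
guess = guess_key("XXXYYZ")
print(guess)                          # {'X': 'E', 'Y': 'T', 'Z': 'A'}
print(apply_guess("XXXYYZ", guess))   # EEETTA
```

On a real ciphertext, especially a short one, this first guess will usually get several letters wrong, because letters with similar frequencies are easily swapped. It is a starting point to be corrected by hand, not a complete decipherment.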

In some sense this shouldn’t be a surprise. Children have long been playing hangman, and it’s well known that a good strategy is to try common letters such as E before less common letters such as Z.

For this to work, you need to know what the underlying language of the plaintext is. The distribution of letters in English is very different from the distribution of letters in Welsh:

Finally, it’s worth noting some peculiar examples of texts for which the letter frequencies are far from what would be expected in a piece of English writing.

• Gadsby: A Story of Over 50,000 Words Without Using the Letter “E”, by Ernest Vincent Wright, was published in 1939. More recently, the 1969 novel La Disparition by Georges Perec (which was translated into English by Gilbert Adair as A Void in 1995) tells the story of some friends searching for their missing colleague, Anton Vowl, and does so without any use of the letter E whatsoever.
• Eunoia by Christian Bök is a book in which each chapter uses only a single vowel, so in the first chapter no word contains any vowel other than “A”. The word “Eunoia” is the shortest English word containing all five vowels, and it comes from the Greek for “beautiful thinking”. In rhetoric, the term refers to the goodwill a speaker builds with an audience.