
## UNSW Sydney

[0:12] Inverse proportionalities turn out to have interesting applications in surprising directions. One such was noted by an American linguist, George Zipf, who discovered that if you looked at words in the English language, they satisfied an interesting pattern in terms of frequency and rank. It was basically an inverse proportionality that he was seeing in the words, the most popular words, of the English language. After his discovery, the same pattern was surprisingly discovered in a lot of other different seemingly unrelated phenomena, including populations of cities, numbers of corporations, income levels, and so on. So there's quite a lot of surprising connections, the inverse relationship manifesting itself in surprising ways. So what is Zipf's law?

[1:13] So named after George Kingsley Zipf, who lived from 1902 to 1950. And here's his law, that in many classical texts, the frequency of a given word is inversely proportional to its rank in the frequency table. What does that mean? He's talking about the most popular or most frequently used words in the English language. So if you take a big book and you count the number of times each word appears, then the word “the” will almost certainly appear more than any other word. That's the most popular word in the English language in terms of its usage. And in fact, generally speaking, it occurs about 7% of the time. So 7% of words in a typical text are the word “the”.

[2:03] The next most popular word is “of”, which generally speaking is around the 3.1% level, followed by "to" at 2.7%; "and" at 2.6%; "in" at 1.8%; "is", 1.2%; "for", 1.0%; and "that", 0.8%. So what did Zipf observe about these steadily decreasing numbers? He observed that the 3.1% is roughly half of the top entry, 6.8%. He observed that the next one, 2.7%, is roughly 1/3 of the top entry, 6.8%, that 2.6% is roughly 1/4 of 6.8%, that 1.8% is roughly 1/5 of 6.8%, and so on. So we can capture this relation mathematically by saying that the frequency times the rank seems to be roughly constant at around 6.8%. So here is rank. Here is frequency in that direction.

[3:22] And the product of the two of them is roughly 6.8%, which we could kind of represent with an inverse proportional graph like that.

# Zipf's law

In this video we discuss Zipf’s law: a curious relation that connects distributions of words and populations of cities to inverse relations. It was first noticed by George Kingsley Zipf, an American linguist, when looking at the relative frequencies of words in a large text, like the book Moby Dick.

## The most common words in English

Here are the most frequent words in the English language, along with rough percentages showing how often each word occurs in written texts. For example, the most common word, ‘the’, appears roughly $\normalsize{6.8\%}$ of the time.

Of the $\normalsize{92}$ words in the two previous paragraphs, I counted $\normalsize{9}$ uses of the word ‘the’. That is therefore somewhat above average.

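| Rank | Word | Percentage |
|------|------|------------|
| 1 | the | 6.8 |
| 2 | of | 3.1 |
| 3 | to | 2.7 |
| 4 | and | 2.6 |
| 5 | in | 1.8 |
| 6 | is | 1.2 |
| 7 | for | 1.0 |
| 8 | that | 0.8 |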

Zipf noticed that the second most common word, ‘of’, occurs about half as often as the most common word, ‘the’, while the third most common word, ‘to’, occurs about a third as often as ‘the’. The pattern continues: the seventh most common word, ‘for’, occurs about one seventh as often as ‘the’.

More generally, the frequency of the $\normalsize{n\text{th}}$ most common word is about $\normalsize{\frac{1}{n}}$ times the frequency of the most common word.
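As a rough check of this rule, the following Python sketch compares the percentages from the table with the values that a strict $\normalsize{\frac{1}{n}}$ rule would predict. The helper name `zipf_prediction` and the layout of the output are illustrative choices, not part of the original material.

```python
# Percentages copied from the table of common English words, in rank order.
observed = {
    "the": 6.8, "of": 3.1, "to": 2.7, "and": 2.6,
    "in": 1.8, "is": 1.2, "for": 1.0, "that": 0.8,
}


def zipf_prediction(top_frequency, rank):
    """Frequency expected at `rank` if frequency is proportional to 1/rank."""
    return top_frequency / rank


top = observed["the"]
rows = []
for rank, (word, percent) in enumerate(observed.items(), start=1):
    predicted = zipf_prediction(top, rank)
    # Keep rank, word, observed %, predicted %, and the product rank * observed %.
    rows.append((rank, word, percent, round(predicted, 2), round(rank * percent, 1)))

print("rank  word   obs   pred  rank*obs")
for rank, word, percent, predicted, product in rows:
    print(f"{rank:>4}  {word:<5} {percent:>5.1f} {predicted:>6.2f} {product:>8.1f}")
```

For these eight words the product of rank and percentage ranges from 6.2 to 10.4, so 'about $\normalsize{\frac{1}{n}}$' should be read as a rough pattern rather than an exact law.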

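So a graph of the frequencies of the most common words against their rank looks roughly like the graph of an inverse proportion.

[Figure: frequency plotted against rank for the most common English words]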

This distribution, remarkably, is quite stable over many different publications. For example, data provided by http://norvig.com/mayzner.html support this claim.

Furthermore, it turns out that fewer than $\normalsize{200}$ words account for more than half of all the written words in English.
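One way to explore a claim like this on a text of your own is to count how many of the most frequent words are needed to cover a given share of the text. The sketch below is a minimal illustration: the function name `words_needed_for_coverage` is hypothetical, the way it splits text into words is a simplifying assumption, and the sample sentence is a toy example rather than real data.

```python
import re
from collections import Counter


def words_needed_for_coverage(text, fraction=0.5):
    """Return how many of the most frequent distinct words are needed to
    account for at least `fraction` of all word occurrences in `text`."""
    # Assumption: a "word" is any run of letters (this also works for Cyrillic).
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0
    counts = Counter(words)
    target = fraction * len(words)
    running = 0
    # Walk through the words from most to least frequent, accumulating counts.
    for needed, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= target:
            return needed
    return len(counts)


# Toy example: 8 words in total, 'the' (3) plus 'and' (2) cover more than half.
sample = "the cat and the dog and the bird"
print(words_needed_for_coverage(sample))
```

Running the same function on a long book in plain-text form gives a figure that can be compared with the claim about English.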

## The most common words in Russian

Frequencies for the most common words in the Russian language look roughly like this:

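| Rank | Word | Frequency | Translation |
|------|------|-----------|-------------|
| 1 | и | 7.1 | and, though |
| 2 | в | 5.5 | in, at |
| 3 | не | 4.0 | not |
| 4 | на | 3.2 | on, it, at, to |
| 5 | что | 2.6 | what, that, why |
| 6 | я | 2.3 | I |
| 7 | с | 2.2 | with, and, from, of |
| 8 | он | 2.1 | he |
| 9 | а | 1.5 | while, and, but |
| 10 | как | 1.4 | how, what, as, like |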
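To see how closely the Russian figures follow the same pattern, the sketch below compares each word's frequency, relative to the top word, with the $\normalsize{\frac{1}{n}}$ value a strict inverse proportion would give. The variable names are illustrative, and the numbers are simply those listed in the table.

```python
# Frequencies copied from the Russian table, in rank order.
russian = [
    ("и", 7.1), ("в", 5.5), ("не", 4.0), ("на", 3.2), ("что", 2.6),
    ("я", 2.3), ("с", 2.2), ("он", 2.1), ("а", 1.5), ("как", 1.4),
]

top = russian[0][1]
comparison = []
for rank, (word, freq) in enumerate(russian, start=1):
    ratio = freq / top  # observed frequency relative to the most common word
    comparison.append((rank, word, round(ratio, 2), round(1 / rank, 2)))

print("rank  word  observed/top  1/rank")
for rank, word, ratio, inverse in comparison:
    print(f"{rank:>4}  {word:<4} {ratio:>12.2f} {inverse:>7.2f}")
```

In this table the second word occurs about 0.77 times as often as the first, rather than half as often, so the listed frequencies fall off more slowly than a strict $\normalsize{\frac{1}{n}}$ rule would suggest.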

## Discussion

If you speak a language other than English or Russian, what are the most common words in that language? Is there a pattern similar to the one Zipf noticed?

Is there any sense in applying Zipf’s law to simplify the process of learning a foreign language at a basic level? How about reading books in foreign languages?