Letters, Words and the English Language
In the 1960s, Mark Mayzner culled 20,000 words from newspapers, magazines and books to study the frequency of letters and words, analyze word length and explore where letters appeared within words.
Last month he contacted Google research chief Peter Norvig to see what Norvig could do with Google’s much larger sample size and contemporary computational power.
Norvig obliged: he downloaded the Google Books Ngrams raw data set and, after analyzing 97,565 distinct words that were mentioned more than 743 billion times, came up with the following.
Some takeaways:
- Word Counts: The, Of, And, To, In and A are the English language’s most popular words.
- Word Length: The average length of English words, weighted by how often each one occurs, is 4.79 letters.
- Word Length, Part II: The average length of the 97,565 distinct words, each counted once, is 7.6 letters.
- Popular Letters: E, T and A are the most common letters in English text.
- Popular Letters Within Words: T most frequently begins a word; E most frequently ends one.
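
To make the difference between these statistics concrete, here is a minimal Python sketch of how they could be computed from a table of word counts. The tiny `WORD_COUNTS` table and the function names are made up for illustration; they are not Norvig's data or code, and his write-up describes his actual method.

```python
from collections import Counter

# Hypothetical toy table (word -> number of occurrences); NOT Norvig's data.
WORD_COUNTS = {"the": 50, "of": 30, "and": 20, "letter": 4, "frequency": 1}


def weighted_mean_length(counts):
    """Average word length where each word counts once per occurrence."""
    total = sum(counts.values())
    return sum(len(word) * n for word, n in counts.items()) / total


def distinct_mean_length(counts):
    """Average word length where each distinct word counts exactly once."""
    return sum(len(word) for word in counts) / len(counts)


def letter_counts(counts):
    """Occurrences of each letter, summed over every word occurrence."""
    tally = Counter()
    for word, n in counts.items():
        for ch in word:
            tally[ch] += n
    return tally


def first_last_counts(counts):
    """Occurrences of each letter as the first and as the last letter."""
    first, last = Counter(), Counter()
    for word, n in counts.items():
        first[word[0]] += n
        last[word[-1]] += n
    return first, last


if __name__ == "__main__":
    first, last = first_last_counts(WORD_COUNTS)
    print("weighted mean length:", round(weighted_mean_length(WORD_COUNTS), 2))
    print("distinct mean length:", round(distinct_mean_length(WORD_COUNTS), 2))
    print("top letters:", letter_counts(WORD_COUNTS).most_common(3))
    print("top first letter:", first.most_common(1))
    print("top last letter:", last.most_common(1))
```

Even in this toy table the weighted average comes out shorter than the distinct-word average, because the short words are the ones that occur most often. That is the same effect behind the gap between 4.79 and 7.6 letters.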
Back in the day, Mayzner used IBM punchcards to sort his data. Norvig used his personal computer, and writes:
Here’s where you would typically see a comparison saying that if you punched the 743 billion words one to a card and stacked them up, then assuming 100 cards per inch, the stack would be 100,000 miles high; nearly halfway to the moon. But that’s silly, because the stack would topple over long before then. If I had 743 billion cards, what I would do is stack them up in a big building, like, say, the Vehicle Assembly Building (VAB) at Kennedy Space Center, which has a capacity of 3.6 million cubic meters. The cards work out to only 2.9 million cubic meters; easy peasy; room to spare. And an IBM model 84 card sorter could blast through these at a rate of 2000 cards per minute, which means it would only take 700 years per pass (but you’d need multiple passes to get the whole job done).
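
The figures in that passage can be checked with a few lines of arithmetic. The sketch below takes the card count, cards per inch and sorter rate from the quote. The card face of 7 3/8 by 3 1/4 inches (the standard IBM 80-column card) and the 365-day year are assumptions that the quote does not state.

```python
# Figures taken from the quoted passage.
CARDS = 743e9                 # one card per word occurrence
CARDS_PER_INCH = 100          # stacking density
SORTER_CARDS_PER_MINUTE = 2000

# Assumptions not stated in the quote.
CARD_WIDTH_IN = 7.375         # standard IBM 80-column card, inches
CARD_HEIGHT_IN = 3.25
MINUTES_PER_YEAR = 60 * 24 * 365

# Unit conversions.
INCHES_PER_MILE = 63_360
CUBIC_METERS_PER_CUBIC_INCH = 0.0254 ** 3

# Height of a single stack of all the cards.
stack_inches = CARDS / CARDS_PER_INCH
stack_miles = stack_inches / INCHES_PER_MILE

# Volume: card face area times total stack height.
volume_cubic_inches = CARD_WIDTH_IN * CARD_HEIGHT_IN * stack_inches
volume_cubic_meters = volume_cubic_inches * CUBIC_METERS_PER_CUBIC_INCH

# Time for one pass through the sorter.
years_per_pass = CARDS / SORTER_CARDS_PER_MINUTE / MINUTES_PER_YEAR

if __name__ == "__main__":
    print(f"stack height: {stack_miles:,.0f} miles")
    print(f"volume: {volume_cubic_meters / 1e6:.2f} million cubic meters")
    print(f"sorting: {years_per_pass:.0f} years per pass")
```

Under these assumptions the results land close to the quoted round numbers: a stack a little over 100,000 miles high, about 2.9 million cubic meters of cards, and roughly 700 years per sorting pass.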
Read through for more findings along with Norvig’s methodology for exploring the data.
Peter Norvig, English Letter Frequency Counts: Mayzner Revisited.
Image: Letter Counts by Position Within Words, by Peter Norvig.