14 January 2013

The most common English words

Here are some more interesting bits from that analysis of all of the words in Google Books (23 GB of text!). There were almost 100,000 distinct words, mentioned a total of 743,842,922,321 times. The embed above shows the 50 most common (the "count" is in billions of mentions). Note the paucity of nouns.

And here is the size distribution for distinct words -

- and if you scroll down at his website, you can read a list of the 24 words of 20 or more letters that are mentioned at least 100,000 times each.
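
If you'd like to try this kind of tally on a text of your own, here is a minimal Python sketch. It is not the code behind the analysis; the sample sentence and the function names are made up for illustration, and it assumes a "word" is simply a run of ASCII letters with case ignored.

```python
import re
from collections import Counter


def word_counts(text):
    """Count words, ignoring case. Assumption: a word is a run of ASCII letters."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def length_distribution(counts):
    """Map word length -> number of distinct words of that length."""
    return Counter(len(word) for word in counts)


# Made-up sample text, standing in for a real corpus.
sample = "The cat and the dog saw the bird and the fish"

counts = word_counts(sample)
top = counts.most_common(2)            # the two most frequent words
lengths = length_distribution(counts)  # distinct words grouped by length

print(top)
print(sorted(lengths.items()))
```

Even in this tiny sample the most frequent words are "the" and "and" rather than any of the nouns.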


  1. I was somewhat disappointed by Norvig's analysis! The original took 200 words from a random place in each of 100 works - which ensures that long works are not over-represented. This new analysis seems not to be so careful. As noted elsewhere, it has some indications of including "too many" technical or academic works.

    1. The original work had to take a limited sample so a single author's style did not overwhelm the results, but this is because it only sampled one hundred works. I would be surprised if Google Books had even a tenth of a percent written by any one author. I don't see how large works would skew Norvig's results in such a mass of words, but I don't know much about Google Books' corpus.

  2. Of course if you include teenage dialogue, then "like" would certainly top the list.

  3. I just had a look at the site Stan recommended for the list of 24 words of at least 20 characters that were each mentioned more than 100,000 times, and was extremely surprised to find a German word among them: forschungsgemeinschaft.

    This, translated by Google as "study or research group", is both an official "name" word in German organisations and an informal word. If you look at the first pages of a Google search for it, you see a mixture of English and German sites that mention an organisation called Deutsche Forschungsgemeinschaft. I can only imagine this use of the word would account for its inclusion in the list.

