08 January 2013

ETAOIN SHRDLU is now ETAOIN SRHLDCU

Many are familiar with ETAOIN SHRDLU, the nonsense string that used to appear in print because of the keyboard layout of early-20th-century Linotype typesetting machines, and that now serves as shorthand for the most frequent letters in English.

Now Google’s director of research Peter Norvig has used the vast data from the Google Books corpus – over 743 billion words – to produce updated word- and letter-frequency tables.
As a Scrabble player, I find it interesting that the letter "H" is relatively overvalued, and the "B" undervalued.
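
For anyone who wants to try this on their own text, here is a minimal sketch of how a frequency ordering like ETAOIN SRHLDCU can be derived. It is not Norvig's code; the function name `letter_order` is hypothetical, and it simply counts the letters A-Z in whatever string it is given.

```python
from collections import Counter


def letter_order(text: str) -> str:
    """Return the letters A-Z found in `text`, most frequent first."""
    # Count only the 26 unaccented letters, ignoring case.
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    # Sort by descending count; break ties alphabetically so the
    # result is deterministic.
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))


if __name__ == "__main__":
    sample = "Many are familiar with the nonsense string"
    print(letter_order(sample))
```

A short sample will not reproduce the published ordering; that only emerges from a very large corpus.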

Image and text from Sentence First, citing Norvig's work, which includes a wealth of data deserving of a separate post (later)(sigh).

2 comments:

  1. From this observation, it's a good place to start if you want to Huffman encode text.

  2. I'd be curious whether they've analysed changes in frequency over time (by date of publication).

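
The first commenter's suggestion can be sketched roughly as follows: given a table of letter frequencies, Huffman coding assigns shorter bit strings to more frequent letters. This is a minimal illustration, not a production encoder. The function name `huffman_codes` is hypothetical, and the weights in the usage example are made-up placeholders, not figures from Norvig's tables.

```python
import heapq
from itertools import count


def huffman_codes(freqs: dict[str, float]) -> dict[str, str]:
    """Build a prefix-free binary code from symbol weights."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A lone symbol still needs one bit to be representable.
        return {next(iter(freqs)): "0"}

    # Each heap entry is (weight, tiebreak, {symbol: code-so-far}).
    # The unique tiebreak keeps the dicts from ever being compared.
    tiebreak = count()
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes
        # with 0 and 1 respectively.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))

    return heap[0][2]


if __name__ == "__main__":
    # Placeholder weights for illustration only.
    weights = {"E": 12, "T": 9, "A": 8, "Z": 1}
    for letter, code in sorted(huffman_codes(weights).items()):
        print(letter, code)
```

With real frequency tables the same function would give E and T the shortest codes and rare letters such as Z and Q the longest.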