13 January 2010

Analyzing the "linguistic fingerprints" of authors

Works by Herman Melville, Thomas Hardy, and D.H. Lawrence have been examined to see how many different words an author uses only once in that piece of writing.  Obviously, a longer work would tend to have more unique words, in a pattern that eventually forms a plateau based on the author's vocabulary, and in a shape that may be characteristic of that author.
The team suggests that a work by an unknown author could therefore be compared to prior works, with the curve acting as a linguistic "fingerprint".
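As a rough illustration of the kind of curve being described, here is a minimal sketch that tracks how many distinct words have appeared as a text gets longer. It is not the researchers' method: the function name `vocabulary_curve`, the `step` parameter, and the simple regex tokenizer are all assumptions made for this example.

```python
import re

def vocabulary_curve(text, step=1000):
    """Return (words_read, distinct_words_so_far) pairs, sampled every `step` words.

    Hypothetical helper: tokenization is a crude lowercase regex split,
    which is an assumption and not how the study defined a "word".
    """
    words = re.findall(r"[a-z']+", text.lower())
    seen = set()
    curve = []
    for i, word in enumerate(words, start=1):
        seen.add(word)
        # Record a point at each sampling step and at the very end of the text.
        if i % step == 0 or i == len(words):
            curve.append((i, len(seen)))
    return curve

# Tiny usage example with a sampling step of 4 words.
points = vocabulary_curve("the cat and the dog and the bird", step=4)
```

Plotting the pairs for two texts would let you compare how quickly each curve flattens, which is the comparison the "fingerprint" idea relies on.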

"It doesn't matter if I pull out 10,000 words from a book of 100,000 or from a book of 200,000, I get the same behaviour; you always simply pull a piece out of your very, very big 'meta book', which is just a representation of your style," said Sebastian Bernhardsson, who led the work.
For an interesting comparison piece, see the post I wrote about the aging-related changes in the vocabulary of Agatha Christie.

And on a tangentially related matter, one year ago I tested the "readability level" of this blog, the results of which suggested the readership would be quite well educated.  That particular test is no longer available, so I tried a different one this morning and got the results below.  The Gunning fog index of 14 is a "rough measure of how many years of schooling it would take someone to understand the content" of the blog.  The test apparently just sampled the front page (last 25 posts) of TYWKIWDBI, so the number would change from time to time (and I rather suspect it also samples the sidebar, which would greatly skew the results downward).
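For the curious, the Gunning fog index is usually given as 0.4 × (average words per sentence + percentage of "complex" words, meaning those with three or more syllables). The sketch below follows that formula, but it is only an approximation: I have no idea how the online test counted syllables or split sentences, so `count_syllables` (a crude vowel-group heuristic) and the regex-based splitting are assumptions, and the standard exclusions for proper nouns and common suffixes are skipped.

```python
import re

def count_syllables(word):
    # Crude heuristic (an assumption): each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Approximate Gunning fog index: 0.4 * (words/sentence + 100 * complex/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # "Complex" words are taken here to be those with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

score = gunning_fog("The cat sat. The dog ran.")
```

Both terms matter: long sentences raise the first and multisyllabic words raise the second, so sampled text could push the score either way depending on which effect dominates.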

  1. I think your sidebar may skew the results in the other direction, considering how many multisyllabic words are contained there.

  2. I was actually thinking in terms of sentence length.

  3. Tee hee. 20 years ago, a bureaucracy that employed me decided that, OMG!, our documents were too difficult to understand. So, we used some linguistic calculations to come up with the numbers you've demonstrated. We were told to get our writing down to 7th grade level. (It was government, thus documents must be readable by all levels of readers.)

    That was easy. Just substitute a one- or two-syllable word for anything with more syllables and keep sentences to a noun, verb and object.

    However, when something needed to be hidden, fog factor was emphasized. Never got the hang of that...

