Monday, April 8, 2013

Talkin' Bout Stylometry

A discussion has recently sparked up over on Left in Lowell regarding stylometry.  Stylometry basically involves looking at patterns in someone's linguistic style in order to determine authorship -- it has been used to look at everything from Shakespeare to the Federalist Papers to anonymous letters in corporations.

It is not new, and it is not necessarily technical.

What is new, however, is the number of writing samples that anyone can grab on the Internet.  In the analog era, I would've had to break into someone's attic and steal their hand-written letters; today, however, I just need to play around with Google a bit to get the data.

Stylometry can involve many components, including word choice, average word length, average sentence length, average paragraph length, punctuation, and style.  The last element in particular is something a human reader can quickly and easily see.
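For the curious, here is a rough sketch of how a few of those components could be measured in Python. This is an illustration, not the tool used for the charts in this post; the function name `style_features` and the crude word and sentence splitting rules are my own assumptions, and the function assumes a non-empty sample.

```python
import re
from collections import Counter

def style_features(text):
    """Return a few simple stylometric measurements for a text sample."""
    # Treat runs of letters (and apostrophes) as words.
    words = re.findall(r"[A-Za-z']+", text)
    # Split on runs of sentence-ending punctuation; drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Count how often each punctuation mark of interest appears.
    punctuation = Counter(ch for ch in text if ch in ",;:-()\"'!?.")
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "punctuation": punctuation,
    }

sample = "I came. I saw, and I left; it rained."
features = style_features(sample)
print(features["avg_sentence_length"])  # words per sentence
print(features["punctuation"])
```

Real stylometry tools handle abbreviations, quotations, and paragraph breaks far more carefully, but the basic idea is the same: turn a writing sample into counts and averages that can be compared.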

Some components of stylometry prove to be not-so-useful.  Witness this chart comparing two writing samples for letter usage frequency.  Pretty cool, huh?

Well, not really.  The problem is, all you're really seeing here is a common pattern of English.  This would be great to have with you if you were flying to LA for a "Wheel of Fortune" taping, but other than that, it's not super-useful.

The chart below, however, is far more useful.  What you're seeing there in blue is punctuation usage from the letter recently sent to the Sun in which the writer attempted an imitation attack against Jack Mitchell, a Left in Lowell blogger, using an amateurish combination of well-known "Jack-isms." The green sample below represents punctuation usage from the entire corpus of entries from the Brookside Tom blog.

A chi-square analysis of the punctuation similarity you see above suggests that a match this close would "just sort of occur" in about 1 in 7 samples drawn from the general population.  By no means does it implicate Brookside Tom as the letter's author.  I ran samples involving other well-educated, talented writers and found similar results (for those who are a little fast and loose with their punctuation, or for those who enjoy heavy dash, semicolon, and parenthesis usage, not so much).

Word length carried more meaning.  The chi-square value of the results here showed that maybe 1 in 15 samples pulled from the general population would've looked something like this.
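For anyone who wants to try this kind of comparison at home, below is a minimal sketch of a chi-square statistic for two sets of category counts (punctuation marks, or word-length buckets). It is one common form of the test, a 2-by-k table, and not necessarily the exact calculation behind the figures above. The function name and the sample counts are made up for illustration, and every category is assumed to appear at least once in one of the two samples.

```python
def chi_square_two_samples(counts_a, counts_b):
    """Chi-square statistic for a 2 x k table of category counts.

    counts_a and counts_b map a category (e.g. a punctuation mark) to how
    often it appears in each sample. Returns (statistic, degrees_of_freedom).
    """
    categories = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand_total = total_a + total_b
    statistic = 0.0
    for cat in categories:
        observed_a = counts_a.get(cat, 0)
        observed_b = counts_b.get(cat, 0)
        column_total = observed_a + observed_b
        # Expected counts if both samples shared one underlying pattern.
        expected_a = total_a * column_total / grand_total
        expected_b = total_b * column_total / grand_total
        statistic += (observed_a - expected_a) ** 2 / expected_a
        statistic += (observed_b - expected_b) ** 2 / expected_b
    return statistic, len(categories) - 1

# Made-up counts, for illustration only (not the data behind the charts).
letter_counts = {",": 12, ".": 10, ";": 1}
blog_counts = {",": 120, ".": 100, ";": 10}
stat, dof = chi_square_two_samples(letter_counts, blog_counts)
print(stat, dof)
```

A small statistic means the two samples look alike; a large one means they differ. Turning the statistic into a "1 in N" figure requires the chi-square distribution for the given degrees of freedom, which a statistics library such as SciPy can provide.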

But that's when you have to go beyond the numbers and the graphs.  The letter-writer and Brookside Tom both eschew usage of the Oxford comma.  That can be seen with enough repetition in both the blog posting and even in the short letter that it reveals a pattern.  Both writers demonstrate an odd capitalization of non-proper nouns when they are placed inside of quotation marks.  They don't teach that kind of stuff in English class.  Anywhere.  That's where the randomness that could explain some of graphs 2 and 3 stops being so random.  Because these quirks don't correlate with the punctuation and word-length attributes above, you now have to look at the entire thing differently...and the "random" factors are now diminished by entire orders of magnitude.
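A habit like Oxford comma usage can be roughly tallied by machine, too. The sketch below is a simple heuristic of my own, not the method used for this post: it only catches three-item series of single words, such as "red, white and blue," and it will occasionally miscount phrases that merely look like a list. The pattern names and the function name are made up for illustration.

```python
import re

# "X, Y, and Z" style: a comma before the final "and" / "or".
OXFORD = re.compile(r"\b\w+, \w+, (?:and|or) \w+")
# "X, Y and Z" style: no comma before the final "and" / "or".
NO_OXFORD = re.compile(r"\b\w+, \w+ (?:and|or) \w+")

def oxford_comma_counts(text):
    """Count simple three-item series written with and without the Oxford comma."""
    return {
        "with": len(OXFORD.findall(text)),
        "without": len(NO_OXFORD.findall(text)),
    }

text = "We sell apples, pears and plums. They flew red, white, and blue flags."
counts = oxford_comma_counts(text)
print(counts)
```

A writer who consistently lands in the "without" column across many samples is showing exactly the kind of repeated pattern described above, though a human read-through is still needed to weed out false matches.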

Both writers show a style that involves multi-paragraph screeds that wrap up with single-line conclusions that stand on their own.  Both show repeated, consistent use of independent and subordinate clauses inside of sentences, bookended by commas.  Both draw heavily on martial themes, images, and metaphors.

Admittedly, the last couple of paragraphs rely on subjective analysis that only a grammarian could love.  However, taken in their totality, and bearing in mind Brookside Tom's statements on Left in Lowell that he was not the author, they point to an extremely sophisticated imitation attack against both Jack and Brookside Tom.  Meanwhile, the true author conducted what is known as an obfuscation attack.

The letter-writer indicates that the Sun had the packet attached to the letter "in their possession for some time."  What would now be interesting to see is whether the Sun acknowledges this.  If so, in what format did this material originally arrive?  Was it also sent anonymously?  If not, how might the author have been privy to that bit of information?  Answers to those sorts of questions might help us pinpoint the actual identity of the sender. 


C R Krieger said...

For those, like me, who have no idea what an Oxford Comma is, it is the use of a comma before an "and" or "or" or some other connecting word when writing a series.  Not having done well in grammar in grammar school, I had to look it up.

Otherwise, I liked this analysis.

Regards  —  Cliff

The New Englander said...

Cliff, glad you liked it. Looking back at it, I realize I should've used more numbers in the way I presented it. Unfortunately, there's a correlation between the first two attributes I showed, so it would be wrong to take the 1-in-7 and the 1-in-15 and just offhandedly say, now we're down to less than one percent odds.
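To spell out the arithmetic being warned against, here is the naive calculation as a tiny Python sketch. The variable names are just for illustration, and the multiplication is only valid if the two attributes are independent, which they are not here.

```python
from fractions import Fraction

punctuation_odds = Fraction(1, 7)   # chance match on punctuation usage
word_length_odds = Fraction(1, 15)  # chance match on word length

# Multiplying assumes the two attributes are independent of each other.
naive_joint = punctuation_odds * word_length_odds
print(naive_joint, float(naive_joint))
```

That product works out to 1 in 105, which is just under one percent. Because the two attributes are correlated, the true odds of a chance match on both are higher than that figure.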

It's quirky usage habits that actually do bring the odds of a random mix-up far, far below one percent. We could spend many hours looking at writing samples to see this in more detail, but in this case, that was the non-proper noun capitalization inside of quotes within the body of a text (as opposed to, say, a headline, in which that would make more sense). It's "Prominent [Greeks]" much like "[Bush] Tax Cuts" or "Dew Sweepers."