Imagine, if you will, a young Mark Zuckerberg circa 2003, tapping out e-mail messages from his Harvard dorm room. It’s a safe bet he never would have guessed that eight years later a multibillion-dollar lawsuit might hinge on whether he capitalized the word “Internet,” or whether he spelled “cannot” as one word or two.
But that is exactly the kind of stylistic minutiae being analyzed in a lawsuit filed by Paul Ceglia, owner of a wood-pellet fuel company in upstate New York. Mr. Ceglia says that a work-for-hire contract he arranged with Mr. Zuckerberg, then an 18-year-old Harvard freshman, entitles him to half of the Facebook fortune. He has backed up his claim with e-mails purported to be from Mr. Zuckerberg, but Facebook’s lawyers argue that the e-mail exchanges are fabrications.
When legal teams need to prove or disprove the authorship of key texts, they call in the forensic linguists. Scholars in the field have tackled the disputed origins of some prestigious works, from Shakespearean sonnets to the Federalist Papers. But how reliably can linguistic experts establish that Person A wrote Document X when Document X is an e-mail — or worse, a terse note sent by instant message or Twitter? After all, e-mails and their ilk give us a much more limited purchase on an author’s idiosyncrasies than an extended work of literature. Does digital writing leave fingerprints?
The law firm representing Mr. Zuckerberg called upon Gerald McMenamin, emeritus professor of linguistics at California State University, Fresno, to study the alleged Zuckerberg e-mails. (Normally, other data like message headers and server logs could be used to pin down the e-mails’ provenance, but Mr. Ceglia claims to have saved the messages in Microsoft Word files.) Mr. McMenamin determined, in a report filed with the court last month, that “it is probable that Mr. Zuckerberg is not the author of the questioned writings.” Using “forensic stylistics,” he reached his conclusion through a cross-textual comparison of 11 different “style markers,” including variant forms of punctuation, spelling and grammar.
But Mr. McMenamin’s report has raised eyebrows in the forensic linguistics community. Earlier this month, the outgoing president of the International Association of Forensic Linguists, Ronald R. Butters, publicly questioned whether Mr. McMenamin could actually establish that Mr. Zuckerberg likely did not write the e-mails based on such slender evidence. For example, the would-be Zuckerberg e-mails had one instance of uncapitalized “internet,” while a sample of e-mails known to be sent by Mr. Zuckerberg had two capitalized instances of “Internet.” “Are we really doing ‘scientific’ and ‘linguistic’ analysis at all when we simply note instances or absences of this or that superficial textual feature?” Mr. Butters asked.
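To see how thin such a comparison can be, here is a minimal Python sketch that tallies two of the markers mentioned above in a questioned text and a known text. It is only an illustration: the `MARKERS` table, the `count_variants` function and the sample sentences are hypothetical, and nothing here reproduces the procedure used in Mr. McMenamin’s report.

```python
import re

# Hypothetical style markers: each maps a label to its competing variants.
# "Internet" is matched case-sensitively, so a sentence-initial "Internet"
# counts as capitalized even though the capital may not be a style choice.
MARKERS = {
    "internet": {
        "capitalized": r"\bInternet\b",
        "lowercase": r"\binternet\b",
    },
    "cannot": {
        "one word": r"(?i)\bcannot\b",
        "two words": r"(?i)\bcan not\b",
    },
}


def count_variants(text):
    """Count how often each variant of each marker occurs in the text."""
    return {
        marker: {
            name: len(re.findall(pattern, text))
            for name, pattern in variants.items()
        }
        for marker, variants in MARKERS.items()
    }


# Invented sample sentences, not text from the lawsuit.
questioned = "I found it on the internet. I can not wait."
known = "The Internet is big. The Internet cannot be stopped."

for label, text in (("questioned", questioned), ("known", known)):
    print(label, count_variants(text))
```

With counts this small, a single typo or an autocorrect setting could flip the apparent preference, which is the heart of Mr. Butters’s objection.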
Some experts are more optimistic. Carole E. Chaski, president of Alias Technology and executive director of the Institute for Linguistic Evidence, has taken on what she terms “the keyboard dilemma,” that is, “the problem of identifying the authorship of a document that was produced by a computer to which multiple users had access.” She has developed computer software that categorizes grammatical structures as “marked” and “unmarked”: an unmarked noun phrase, for instance, has its main noun at the end of a simple phrase (“our marriage,” “a divorce”), while a marked one has the noun at the beginning of a phrase (“anything you ask”) or in the middle (“the rest of our lives”). These aspects of a writer’s syntax are relatively stable across different styles of writing, Ms. Chaski argues. They are also less prone to technological intervention than spelling and punctuation, which can be changed on the fly by spell-check and autocorrect features.
Recently, a team of computer scientists at Concordia University in Montreal took advantage of an unusual set of data to test another method of determining e-mail authorship. In 2003, the Federal Energy Regulatory Commission, as part of its investigation into Enron, released into the public domain hundreds of thousands of employee e-mails, which have become an important resource for forensic research. (Unlike novels, newspapers or blogs, e-mails are a private form of communication and aren’t usually available as a sizable corpus for analysis.)
Using this data, Benjamin C. M. Fung, who specializes in data mining, and Mourad Debbabi, a cyber-forensics expert, collaborated on a program that can look at an anonymous e-mail message and predict who wrote it out of a pool of known authors, with an accuracy of 80 to 90 percent. (Ms. Chaski claims 95 percent accuracy with her syntactic method.) The team identifies bundles of linguistic features, hundreds in all. They catalog everything from the position of greetings and farewells in e-mails to the preference of a writer for using symbols (say, “$” or “%”) or words (“dollars” or “percent”). Combining all of those features, they contend, allows them to determine what they call a person’s “write-print.”
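The general idea of combining features into a profile can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Concordia team’s program: it uses only three hand-picked features rather than hundreds, the function names (`features`, `build_profiles`, `attribute`) and the sample messages are invented, and a simple nearest-profile comparison stands in for whatever data-mining method the researchers actually use.

```python
import math
import re

SYMBOLS = re.compile(r"[$%]")
WORDS = re.compile(r"\b(?:dollars?|percent)\b", re.IGNORECASE)
GREETING = re.compile(r"^\s*(?:hi|hello|hey|dear)\b", re.IGNORECASE)
SIGNOFF = re.compile(r"^\s*(?:thanks|regards|best|cheers)\b", re.IGNORECASE)


def features(message):
    """Turn one message into a small vector of style features (all 0 to 1)."""
    symbols = len(SYMBOLS.findall(message))
    words = len(WORDS.findall(message))
    total = symbols + words
    # Share of "$"/"%" among all money and percent mentions; 0.5 if none.
    symbol_share = symbols / total if total else 0.5
    opens_with_greeting = 1.0 if GREETING.match(message) else 0.0
    lines = [line for line in message.splitlines() if line.strip()]
    last_line = lines[-1] if lines else ""
    ends_with_signoff = 1.0 if SIGNOFF.match(last_line) else 0.0
    return [symbol_share, opens_with_greeting, ends_with_signoff]


def build_profiles(known_messages):
    """Average the feature vectors of each author's known messages."""
    profiles = {}
    for author, messages in known_messages.items():
        vectors = [features(m) for m in messages]
        profiles[author] = [sum(col) / len(col) for col in zip(*vectors)]
    return profiles


def attribute(message, profiles):
    """Return the known author whose profile is closest to the message."""
    vector = features(message)
    return min(profiles, key=lambda a: math.dist(vector, profiles[a]))


# Invented sample data for two hypothetical authors.
known = {
    "alice": ["Hi Bob,\nThe fee is $40, up 5%.\nThanks",
              "Hello team,\nWe saved $200.\nRegards"],
    "carol": ["The fee is 40 dollars, up 5 percent.",
              "We saved 200 dollars this year."],
}
profiles = build_profiles(known)
print(attribute("Hi all,\nBudget is $90.\nThanks", profiles))
```

Note that a sketch like this can only choose among the authors it has been given; it says nothing about whether the true writer is in the pool at all.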
Many linguists, however, would challenge the notion that the “fingerprint,” a supposedly unique identifier, can be metaphorically applied to writing. Surely we all have our own written quirks and mannerisms — I tend to overuse em-dashes, for instance. But there is just too much internal variability in any person’s body of writing to imagine that we could take just a bit of it — a handful of e-mails — and recognize some sort of linguistic DNA. That is all the more true when it comes to digital genres like text messages, instant messages and tweets, full of unusual spellings and innovative abbreviations, and often sensitive to the type of device we’re using.
Still, these new quantitative approaches hold out the hope of at least differentiating one author from another with a reasonable degree of confidence. This can provide the kind of reliable foundation for research that forensic stylistics as traditionally practiced cannot. Hmm, or is that “can not”?