The 20-volume historical Oxford English Dictionary is the largest record of words used in English, past and present. It contains words that are now obsolete or rare (such as xenagogue ‘a person who guides strangers’ and vicine ‘neighboring or adjacent’) in addition to the latest coinages such as phishing and podcast.

The second edition of the OED, published in 1989 and consisting of twenty volumes, contains more than 615,000 entries, and the third, available online, is expanding all the time, with batches of 2,500 new and revised words and phrases being added in regular quarterly updates.

How many words are there in English?

It is a question often asked, but not so easily answered. Even the OED does not set out to include every specialized technical term or slang or dialect expression ever used. New words are constantly being invented, developed from existing words, or adopted from other languages. Most will be used rarely, or only by a small group of people. This means that an unlimited number of words may occur in speech and writing which will never be recorded in even the largest dictionary.

Furthermore, what exactly is a word? Clearly we should include single units such as cat and dog. But are the plurals cats and dogs separate words? Should we include compounds such as walking stick, which are made up of two existing words? There are an almost unlimited number of such two-word compounds, which can’t all be included in a dictionary. And what about abbreviations like BBC and Dr, or proper names such as London, Nelson, and Harry Potter: are they words? As you can see, the question is not a straightforward one.

How many words do we use?

Although it may be impossible to know the number of words in English, the Oxford English Corpus can help us assess the number of words in current use.

Instead of talking about words, it’s more useful in this context to talk about lemmas, a lemma being the base form of a word. For example, climbs, climbing, and climbed are all examples of the one lemma climb. Just ten different lemmas (the, be, to, of, and, a, in, that, have, and I) account for a remarkable 25% of all the words used in the Oxford English Corpus. If you were to read through the corpus, one word in four (ignoring proper names) would be an example of one of these ten lemmas. Similarly, the 100 most common lemmas account for 50% of the corpus, and the 1,000 most common lemmas account for 75%. But to account for 90% of the corpus you would need a vocabulary of 7,000 lemmas, and to get to 95% the figure would be around 50,000 lemmas.

The remaining 5% of the corpus consists of a very large number of lemmas which occur rarely: words like moidore or parados, which may occur only once every several million words. Like all natural languages, English consists of a small number of very common words, a larger number of intermediate ones, and then an indefinitely long ‘tail’ of very rare terms.

Vocabulary size (no. of lemmas)    % of content in OEC    Example lemmas
10                                 25%                    the, of, and, to, that, have
100                                50%                    from, because, go, me, our, well, way
1,000                              75%                    girl, win, decide, huge, difficult, series
7,000                              90%                    tackle, peak, crude, purely, dude, modest
50,000                             95%                    saboteur, autocracy, calyx, conformist
>1,000,000                         99%                    laggardly, endobenthic, pomological

The long tail means that to account for 99% of the Oxford English Corpus you would need a vocabulary of more than a million lemmas. This would include some words which may occur only once or twice in the whole corpus: highly technical terms like chondrogenesis or dicarboxylate, and one-off coinages like bootlickingly or unsurfworthy that people would probably understand but would be unlikely to use.
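
The coverage figures discussed here come down to a running total over a frequency-ranked list of lemmas. As a rough sketch only, and not the method used to produce the Oxford English Corpus figures, the Python below assumes the text has already been reduced to a list of lemmas; the function name lemmas_needed and the toy sample are invented for illustration.

```python
from collections import Counter


def lemmas_needed(lemma_tokens, target_share):
    """Return how many of the most frequent lemmas are needed
    to cover at least target_share (0..1) of all tokens."""
    counts = Counter(lemma_tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the lemmas from most to least frequent, keeping a running total.
    for rank, (_lemma, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target_share:
            return rank
    # Empty input, or a target that cannot be reached.
    return len(counts)


# Hypothetical ten-token sample: two common lemmas and a short "tail".
toy = ["the"] * 5 + ["be"] * 3 + ["cat", "moidore"]

print(lemmas_needed(toy, 0.75))
print(lemmas_needed(toy, 0.95))
```

On this toy sample two lemmas are enough to cover 75% of the tokens, but all four are needed to reach 95%. A real corpus shows the same shape on a vastly larger scale, with each extra percentage point of coverage costing far more lemmas than the last.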

If we decide that around 90-95% of the corpus gives a reasonable idea of an average vocabulary, we are left with a figure somewhere in the range of 7,000-50,000 lemmas: say, 25,000. What does a vocabulary of this size represent? It represents the set of most significant words in English: those which occur reasonably frequently and which account for all but a small part of everything we may encounter in speech or writing. It includes all the words that we actively use in general everyday life.

It’s interesting to note that most reasonably sized dictionaries contain significantly more than 25,000 lemmas. The 11th edition of the Concise Oxford English Dictionary, for example, lists more than 75,000 single-word lemmas, which means that the majority of its entries must belong to the long tail of extremely rare words. This makes good sense: such terms occur very infrequently, but when they do they are likely to be crucial to what’s being said, and the reader might well want to look them up. The idea of a quantifiable vocabulary should be seen in this light: the words we ignore for the purposes of the exercise may be very rare, but in context they may be very important.

Are There Hidden Messages in Pronouns?

By Juliet Lapidos via Slate

Some 110 years after the publication of the Psychopathology of Everyday Life, in which Sigmund Freud analyzed seemingly trivial slips of the tongue, it’s become common knowledge that we disclose more about ourselves in conversation—about our true feelings, or our unconscious feelings—than we strictly intend. Freud focused on errors, but correct sentences can betray us, too. We all have our signature tics. We may describe boring people as “nice” or those we dislike as “weird.” We may use archaisms if we’re trying to seem smart, or slang if we’d prefer to seem cool. Every time we open our mouths we send out coded, supplementary messages about our frame of mind.

Although much of this information is easy to decode (“nice” for “boring” won’t fool anyone), linguistic psychologist James Pennebaker suggests in The Secret Life of Pronouns that lots of data remain hidden from even the most astute human observers. “Nice” and “weird” are both content words; he’s concerned with function words such as pronouns (I, you, they), articles (a, an, the), prepositions (to, for, of), and auxiliary verbs (is, am, have). We hardly notice these bolts of speech because we encounter them so frequently. With the help of computer programs to count and scrutinize them, however, patterns emerge.
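
To give a sense of what counting function words by computer might involve, here is a minimal Python sketch. It is not Pennebaker’s program. The word lists contain only the examples named in this paragraph, the tokenizer is deliberately crude, and the names FUNCTION_WORDS and function_word_rates are invented for illustration.

```python
import re
from collections import Counter

# Tiny, hypothetical word lists: only the examples named in the text.
# A real analysis would need far fuller lists for each category.
FUNCTION_WORDS = {
    "pronoun": {"i", "you", "they"},
    "article": {"a", "an", "the"},
    "preposition": {"to", "for", "of"},
    "auxiliary": {"is", "am", "have"},
}


def function_word_rates(text):
    """Return each category's share of all word tokens in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter()
    for token in tokens:
        for category, words in FUNCTION_WORDS.items():
            if token in words:
                counts[category] += 1
    return {
        category: (counts[category] / total if total else 0.0)
        for category in FUNCTION_WORDS
    }


sample = "I think they have a plan for the end of the week"
rates = function_word_rates(sample)
print(rates)
```

In the twelve-word sample, articles make up a quarter of the tokens and pronouns a sixth. A plain lookup like this cannot tell “to” the preposition from “to” the infinitive marker, which is one reason the counts are only a starting point for interpretation.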

Sounds enticing; sounds, in fact, rather like a publisher’s fantasy pitch, combining the strangely long-lasting craze for language books laced with pop psychology, and the added hook, the modern touch, of a computer that observes and catalogs beyond measly human capacity: a Watson for the psychiatric establishment. To Pennebaker’s credit, his claims are fairly modest, especially when compared with those of Deborah Tannen and other practitioners of the word-sleuth genre. (He doesn’t promise that if we change our pronoun usage we’ll see tangible improvements in our social lives.) The problem is that much of what he turns up is even more modest than he seems to notice. Counting function words as they’re used in ordinary life often yields the opposite of what Freud detected in confessions from the couch: confirmation of the obvious.

The most ingenious application Pennebaker proposes for function-word analysis is lie-detection, something of a dark art. Several years ago, Pennebaker and a couple of colleagues recruited 200 students and asked them to write two essays about abortion, one espousing a true belief, the other a falsehood. They asked another group to state their true and false takes in front of a video camera. When judges were called in to figure out which was which, they were accurate 52 percent of the time. (50 percent is chance.) A computer, programmed to look for specific “markers of honesty” gleaned from previous studies, performed much better, with a 67 percent accuracy rate. Truth-tellers, Pennebaker explains, tend to use more words, bigger words, more complex sentences, more exclusive words (except, but, without, as in the sentence “I think this but not that”), and more I-words (I, me, my, etc.). Liars, apparently, trade in simple, straightforward statements lacking in specificity because—Pennebaker posits—it’s actually pretty difficult to make stuff up. They avoid self-reference because they don’t feel ownership of their expressed views.

When Pennebaker dips into the more general field of “emotion detection” (he calls it that), his word-counting feels a bit Rube Goldberg-ish. After Sept. 11, 2001, Pennebaker and a colleague saved the LiveJournal.com postings of over a thousand amateur bloggers. They found that “bloggers immediately dropped in their use of I-words” following the attacks, and that their use of we-words almost doubled. Pennebaker takes these fluctuations to mean that “shared traumas bring people together,” “shared traumas deflect attention away from the self,” and that “shared traumas, in many ways, are positive experiences” (because people feel more socially connected). The brute fact that Sept. 11 influenced pronoun usage may interest readers, but Pennebaker’s analysis merely reiterates long-held psychological dogma. (Try Googling “shared traumas bring people together.”) I can’t help but wonder if Pennebaker—albeit unconsciously—interpreted his results to match the conventional wisdom.

Perhaps that’s harsh: Certainly there’s nothing wrong with devising yet another way to elucidate common human responses, and Pennebaker’s experiments are always imaginative. Yet it’s often the case that his conclusions, especially the ones he draws from I-word usage, are heavily dependent on context and prior knowledge.

In one chapter, Pennebaker notes that Rudolph Giuliani demonstrated a dramatic increase in I-words during the late spring of 2000, when he was still mayor of New York. Pennebaker fills us in that “Giuliani’s life [was] turned upside down. … He was diagnosed with prostate cancer, withdrew from the senate race against Hillary Clinton, separated from his wife on national television … and, a few days later, acknowledged his ‘special friendship’ with Judith Nathan.” Pennebaker adds that “by early June, friends, acquaintances, old enemies, and members of the press all noticed that Giuliani seemed more genuine, humble, and warm.” So it’s reasonable to conclude that Giuliani’s ascending I-word usage reflected a “personality switch from cold and distanced to someone who [due to a few significant setbacks] was more warm and immediate.”

But we already knew that. If we didn’t, where would Pennebaker’s method leave us? He argues, at various points, that the following groups use I-words at higher rates:

1. Women
2. Followers (not leaders)
3. Truth-tellers (not liars)
4. Young
5. Poor
6. Depressed
7. Afraid (but not angry)
8. Sick

The common thread unifying these seemingly random clusters is, roughly, an enhanced focus on personal experience. Sick and depressed people dwell on their conditions and are thus more likely than their healthy counterparts to talk about themselves. Followers, in conversation with leaders, might be after something: “I was wondering if I could have a raise.” That’s pretty close to a tautology, though, and does nothing to solve the problem that, without insider information, it’s impossible to know which condition or attribute I-usage reflects. A word-count-wannabe presented with Giuliani’s speeches might deduce, erroneously, that the mayor had become more truthful, or less leaderly, or had lost money.

For obvious reasons, I’m unusually attuned to my pronoun usage at the moment, and I’ve noticed a thing or two. I start off this essay with lots of we-words (16 in the introduction), and sprinkle them throughout. With the exception of the section you’re currently reading, I drop only one self-referencing I (in the fifth paragraph). I don’t deny that this imbalance might mean something. Perhaps it indicates that, like politicians who drone on about what “we” expect from the president, or how “we” want a return to old-fashioned American values, I’m trying to imply audience agreement when, in truth, I have no clue what the audience thinks. But you don’t need to count pronouns to figure that out. You only need to know that you’re reading a book review.

The Oxford comma

By Warren Clements via The Globe and Mail

It’s official. People care about the Oxford comma. Last week’s column, in which I gave each side its best arguments and observed that I seldom use the Oxford comma, provoked an avalanche of responses. They were either smart, passionate and heartfelt or smart, passionate, and heartfelt.

To recap, the Oxford (or serial) comma is the comma that precedes the concluding “and” or “or” in a list of more than two elements: apples, peaches, and pears. Many people use the comma all the time to avoid any chance of ambiguity. In The Globe and Mail’s impromptu online poll this week, 65 per cent of respondents (5,491 votes) said they used the Oxford comma, 24 per cent said they didn’t, and 11 per cent said they didn’t care.
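
For readers who like to see a definition made mechanical, the rule can be approximated with two regular expressions. This is only a sketch under a strong assumption, namely that every list item is a single word; the pattern and function names are invented for illustration, and real prose would need something much more careful.

```python
import re

# item, item(, item)*, and|or item  -> serial comma present
WITH_COMMA = re.compile(r"\b\w+(?:, \w+)+, (?:and|or) \w+")
# item, item(, item)* and|or item   -> serial comma absent
WITHOUT_COMMA = re.compile(r"\b\w+(?:, \w+)+ (?:and|or) \w+")


def serial_comma_usage(text):
    """Count lists of three or more one-word items, split by whether
    a comma precedes the concluding 'and' or 'or'."""
    return {
        "with": len(WITH_COMMA.findall(text)),
        "without": len(WITHOUT_COMMA.findall(text)),
    }


print(serial_comma_usage("apples, peaches, and pears"))
print(serial_comma_usage("apples, peaches and pears"))
```

The first call counts one list with the serial comma and none without; the second gives the reverse. Lists whose items run to several words, such as “the Dirty Projectors”, fall outside what these patterns can see.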

A few readers referred to the catchy 2008 song Oxford Comma by the pop group Vampire Weekend. The opening line, which I have taken the liberty of bowdlerizing, is: “Who gives a [hoot] about an Oxford comma?” But the target was less the comma itself than it was people with pretensions, “all your diction dripping with disdain.” I couldn’t help noticing that the CD’s liner notes made liberal use of the Oxford comma: “The Dirty Projectors, Ra Ra Riot, Yacht, Sam Rosen, and Nat Baldwin.” Someone clearly gave a hoot.

Since the Oxford comma derived its name from a rule issued in 1893 at the University Press in Oxford, one might assume that the British are particularly keen on it. Lynne Truss says otherwise in Eats, Shoots & Leaves, her bestseller about punctuation. “In Britain, where standard usage is to leave it out, there are those who put it in – including, interestingly, Fowler’s Modern English Usage. In America, conversely, where standard usage is to leave it in, there are those who make a point of removing it (especially journalists).”

Truss says she uses the Oxford comma sparingly, but when she feels like using it she fights for it. When her editor sought to remove the serial comma from a statement that punctuation marks “tell us to slow down, notice this, take a detour, and stop,” Truss “argued for that Oxford comma. It seemed to me that without the comma after ‘detour,’ this was a list of three instructions (the last a double one), not four.”

This is the whimsical treatment that drives proponents of the Oxford comma wild. Be consistent, they say. An online comment this week on The Globe’s website from “Recti” said: “Yes, it [the serial comma] sometimes impedes easy reading where the meaning of a list is understood from context and the comma is optional for understanding. However, literate people often get over that uneasy feeling and are comfortable with the certainties afforded by the serial comma.”

Truss’s book finesses the argument by using an ampersand in its title: Eats, Shoots & Leaves. But the motive has nothing to do with preventing a dust-up and everything to do with the joke from which the title was drawn. That’s why a panda is seen on the cover erasing the comma after “eats”; the punchline depends upon ambiguity. In the joke, a panda with a gun in a bar eats, shoots (someone) and leaves. When the bartender consults the dictionary, he reads: “Panda. Native to China. Eats shoots and leaves.”

Readers’ responses were similarly full of puns. “I can’t stand these leftist Oxford commanists!” wrote online contributor “Skinny Dipper.” Bob McGowan winced at last week’s reference to “the comma before the storm.” He wrote: “Surely this is a misprint. Please confirm that it is. For you to say that you must have been in a comma when you wrote it is not an acceptable answer.”

All I can say is that, when asked how I feel about my inconsistent usage, I reply, “Comma ci, comma ça.”

Sentence Structure

Experienced writers use a variety of sentences to make their writing interesting and lively. Too many simple sentences, for example, will sound choppy and immature, while too many long sentences will be difficult to read and hard to understand.

It is helpful to learn the definitions of simple, compound, and complex sentences so that you can identify the sentence structure in your writing. Then you will be able to incorporate more complex sentence varieties into your essays and reports.

