The 20-volume historical Oxford English Dictionary is the largest record of words used in English, past and present. It contains words that are now obsolete or rare (such as xenagogue ‘a person who guides strangers’ and vicine ‘neighboring or adjacent’) in addition to the latest coinages such as phishing and podcast.
The second edition of the OED, published in 1989 and consisting of twenty volumes, contains more than 615,000 entries, and the third, available online, is expanding all the time, with batches of 2,500 new and revised words and phrases being added in regular quarterly updates.
How many words are there in English? The question is often asked, but not so easily answered. Even the OED does not set out to include every specialized technical term or slang or dialect expression ever used. New words are constantly being invented, developed from existing words, or adopted from other languages. Most will be used rarely, or only by a small group of people. This means that an unlimited number of words may occur in speech and writing which will never be recorded in even the largest dictionary.
Furthermore, what exactly is a word? Clearly we should include single units such as cat and dog. But are the plurals cats and dogs separate words? Should we include compounds such as walking stick, which are made up of two existing words? There are an almost unlimited number of such two-word compounds, which can’t all be included in a dictionary. And what about abbreviations like BBC and Dr, or proper names such as London, Nelson, and Harry Potter: are they words? As you can see, the question is not a straightforward one.
Although it may be impossible to know the number of words in English, the Oxford English Corpus can help us assess the number of words in current use.
Instead of talking about words, it’s more useful in this context to talk about lemmas, a lemma being the base form of a word. For example, climbs, climbing, and climbed are all examples of the one lemma climb. Just ten different lemmas (the, be, to, of, and, a, in, that, have, and I) account for a remarkable 25% of all the words used in the Oxford English Corpus. If you were to read through the corpus, one word in four (ignoring proper names) would be an example of one of these ten lemmas. Similarly, the 100 most common lemmas account for 50% of the corpus, and the 1,000 most common lemmas account for 75%. But to account for 90% of the corpus you would need a vocabulary of 7,000 lemmas, and to get to 95% the figure would be around 50,000 lemmas.
The remaining 5% of the corpus consists of a very large number of lemmas which occur rarely: words like moidore or parados, which may occur only once every several million words. Like all natural languages, English consists of a small number of very common words, a larger number of intermediate ones, and then an indefinitely long ‘tail’ of very rare terms.
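The coverage figures above can be illustrated with a small sketch: group word forms under their lemma, count them, and ask what share of the text the most frequent lemmas account for. The lookup table `LEMMA_OF`, the function names, and the sample sentence are all hypothetical and exist only for illustration; a real corpus such as the OEC would rely on a full lemmatizer and far more text.

```python
from collections import Counter

# Hypothetical lookup from inflected forms to lemmas. A real corpus
# pipeline would use a full lemmatizer rather than a hand-written table.
LEMMA_OF = {
    "climbs": "climb", "climbing": "climb", "climbed": "climb",
    "is": "be", "was": "be", "are": "be",
    "has": "have", "had": "have",
}

def lemma_counts(tokens):
    """Count tokens by lemma, falling back to the lowercased form itself."""
    return Counter(LEMMA_OF.get(t.lower(), t.lower()) for t in tokens)

def coverage_of_top(counts, k):
    """Fraction of all tokens accounted for by the k most frequent lemmas."""
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total

# Toy sample text (an assumption, not taken from the corpus).
tokens = ("She climbs the hill and he climbed the wall "
          "and the cat was climbing").split()
counts = lemma_counts(tokens)

for k in (1, 2, 3):
    print(f"top {k} lemmas cover {coverage_of_top(counts, k):.0%} of the sample")
```

In this toy sample, climbs, climbed, and climbing are all counted under the single lemma climb, which is why the lemma count is a more useful measure here than a count of distinct word forms.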
| Vocabulary size (no. of lemmas) | % of content in OEC | Example lemmas |
|---|---|---|
| 10 | 25% | the, of, and, to, that, have |
| 100 | 50% | from, because, go, me, our, well, way |
| 1,000 | 75% | girl, win, decide, huge, difficult, series |
| 7,000 | 90% | tackle, peak, crude, purely, dude, modest |
| 50,000 | 95% | saboteur, autocracy, calyx, conformist |
| >1,000,000 | 99% | laggardly, endobenthic, pomological |
The long tail means that to account for 99% of the Oxford English Corpus you would need a vocabulary of more than a million lemmas. This would include some words which may occur only once or twice in the whole corpus: highly technical terms like chondrogenesis or dicarboxylate, and one-off coinages like bootlickingly or unsurfworthy that people would probably understand but would be unlikely to use.
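The effect of a long tail on vocabulary size can be sketched by asking the inverse question: how many of the most frequent lemmas are needed to reach a given share of the text? The example below is a rough illustration only. The function name `lemmas_needed` is hypothetical, and the counts are synthetic, generated under the assumption that the lemma at rank r occurs about 1/r as often as the most common one (a Zipf-like distribution). They are not OEC data and will not reproduce the OEC figures.

```python
def lemmas_needed(counts, target):
    """Smallest number of top-ranked lemmas whose tokens reach `target` coverage."""
    total = sum(counts)
    running = 0
    for rank, n in enumerate(sorted(counts, reverse=True), start=1):
        running += n
        if running >= target * total:
            return rank
    return len(counts)

# Synthetic counts (an assumption, not corpus data): the lemma at rank r
# occurs roughly 1,000,000 / r times, for 100,000 distinct lemmas.
synthetic = [1_000_000 // r for r in range(1, 100_001)]

for target in (0.25, 0.50, 0.75, 0.90):
    print(f"{target:.0%} coverage needs {lemmas_needed(synthetic, target)} lemmas")
```

Under this assumption, each further step in coverage should require disproportionately more lemmas than the last, which is the general shape the corpus figures describe.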
If we decide that around 90-95% of the corpus gives a reasonable idea of an average vocabulary, we are left with a figure somewhere in the range of 7,000-50,000 lemmas: say, 25,000. What does a vocabulary of this size represent? It represents the set of most significant words in English: those which occur reasonably frequently and which account for all but a small part of everything we may encounter in speech or writing. It includes all the words that we actively use in general everyday life.
It’s interesting to note that most reasonably sized dictionaries contain significantly more than 25,000 lemmas. The 11th edition of the Concise Oxford English Dictionary, for example, lists more than 75,000 single-word lemmas, which means that the majority of its entries must belong to the long tail of extremely rare words. This makes good sense: such terms occur very infrequently, but when they do they are likely to be crucial to what’s being said, and the reader might well want to look them up. The idea of a quantifiable vocabulary should be seen in this light: the words we ignore for the purposes of the exercise may be very rare, but in context they may be very important.