What makes an Oxford Dictionary?

People find dictionary-making fascinating. The 250th anniversary of Samuel Johnson's Dictionary was widely celebrated last year, and the recent BBC television series Balderdash and Piffle drew a huge response when it called on viewers to help track down elusive word and phrase origins. But how are dictionaries written today? And how do you know that what is included in a dictionary is accurate and up to date?

Oxford English Corpus - language research based on real evidence

Oxford Dictionaries are continually monitoring and researching how language is evolving. The Oxford English Corpus is central to the process and to Oxford's £35 million research programme - the largest language research programme in the world.

What is a corpus?

A corpus is a collection of texts of written (or spoken) language presented in electronic form. It provides the evidence of how language is used in real situations, from which lexicographers can write accurate and meaningful dictionary entries. The Oxford English Corpus is at the heart of dictionary-making in Oxford in the 21st century and ensures that we can track and record the very latest developments in language today. By analysing the corpus and using special software, we can see words in context and find out how new words and senses are emerging, as well as spotting other trends in usage, spelling, world English, and so on.

Using the corpus enables lexicographers to examine one word in detail by looking at all the different contexts in which it occurs. Below is a typical way of viewing the results of a search of the corpus, using a display format called KWIC (or 'key word in context'):

[Image: corpus search example]

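For readers who want to see the idea in miniature, the following is a minimal sketch of how a KWIC display could be produced. It is an illustration only, not the software used on the Oxford English Corpus; the function name `kwic` and the `width` parameter are hypothetical choices made for this example.

```python
import re


def kwic(text, keyword, width=30):
    """Return one aligned 'key word in context' line per match.

    Hypothetical helper for illustration; not Oxford's own tool.
    """
    # Collapse line breaks and repeated spaces so each hit fits on one line.
    text = " ".join(text.split())
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so the keyword lines up in a single column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines


if __name__ == "__main__":
    sample = "The corpus is large. A Corpus shows words in context."
    for line in kwic(sample, "corpus", width=10):
        print(line)
```

Each line of the result shows one occurrence of the search word with a fixed amount of text on either side, so that all the occurrences can be compared at a glance.
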
The full picture

The Oxford English Corpus gives us the fullest, most accurate picture of the language today. It represents all types of English, from literary novels and specialist journals to everyday newspapers and magazines and from Hansard to the language of chatrooms, emails, and weblogs. And, as English is a global language, used by an estimated one third of the world's population, the Oxford English Corpus contains language from all parts of the world - not only from the UK and the United States but also from Australia, the Caribbean, Canada, India, Singapore, and South Africa. It is the largest English corpus of its type: the most representative slice of the English language available.

The corpus reaches new heights

In Spring 2006 the corpus reached a milestone: it now contains over 2 billion words of real 21st-century English. Size is not all that matters, though: it is the size of the corpus, coupled with the careful selection and development of its contents, that makes it a resource unlike any other in the world.

Two billion words?

If all the words in the Oxford English Corpus were laid end to end, each measuring 1cm on average, they would stretch some 20,000km, further than from the northern tip of Scotland to the southern tip of New Zealand. Because the corpus is a collection of texts, it does not contain two billion different words: the humble word 'the', the commonest in the written language, accounts for almost 100 million of all the words in the corpus!
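
The arithmetic behind these figures can be checked with a short sketch. The constants below are the round numbers quoted above (2 billion words, 1cm per word, and 100 million taken as an upper bound for 'the'); the function `count_tokens_and_types` is a hypothetical helper that shows, on a toy sentence, the difference between the total number of words in a text and the number of different words.

```python
import re
from collections import Counter

# Round figures quoted in the text; treated here as assumptions.
WORDS_IN_CORPUS = 2_000_000_000
AVERAGE_WORD_LENGTH_CM = 1
CM_PER_KM = 100_000

# Total length of the corpus laid end to end, in kilometres.
total_km = WORDS_IN_CORPUS * AVERAGE_WORD_LENGTH_CM / CM_PER_KM

# 'the' accounts for almost 100 million words, so at most this share.
THE_COUNT_UPPER_BOUND = 100_000_000
the_share = THE_COUNT_UPPER_BOUND / WORDS_IN_CORPUS


def count_tokens_and_types(text):
    """Return (total word count, Counter of each different word)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(tokens), Counter(tokens)


if __name__ == "__main__":
    print(f"Length end to end: {total_km:,.0f} km")
    print(f"Share of 'the': at most {the_share:.0%}")
    total, counts = count_tokens_and_types("The cat saw the dog and the bird.")
    print(total, "words,", len(counts), "different words")
```

On these figures, roughly one word in every twenty in the corpus is 'the'.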

Keeping track of our language

Meanings of words and phrases change and so do spellings, despite the existence of 'standard' or 'correct' spelling. A strength of the corpus is that it contains not only published works in which the text has been edited (and made to conform to standard spellings and grammar) but also unpublished and unedited writing like emails and weblogs. Some of the most inventive uses and deliberate exploitations of language, not to mention common-or-garden mistakes, start out in this kind of informal and unselfconscious language, so tracking them is an essential part of tracking the language as a whole.




Thu, 13 Apr 2006 11:33:31