Composition and structure
The Oxford English Corpus is based mainly on material collected from pages on the World Wide
Web, and some other online sources. Some printed texts, such as academic journals, have been
used to supplement certain subject areas.
The extensive use of web pages enables us to build a corpus of unprecedented scale and diversity.
The Oxford English Corpus is intended to be as wide-ranging as possible in its representation of
the English language. Development was planned to ensure a balanced range of material from
different subject areas, regions of the world, and text types. Structuring a corpus in this way
produces a panoramic view of language use in every area of human life.
Subject areas
The corpus is divided into 20 major subject areas or subcorpora, as shown below (figures indicate
millions of words):
Each subcorpus is further divided into a series of more specific categories. For example, the sport subcorpus is divided into about 40 individual sports including baseball, basketball, sailing, soccer, etc. This makes it possible to explore the language of a particular subject area, or to compare two subject areas, or to investigate how the behaviour of a word changes in different contexts.
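As a rough illustration of comparing a word across two subject areas, the sketch below normalizes raw counts to occurrences per million words so that subcorpora of different sizes can be compared. The function name `per_million`, the simple tokenizer, and the two sample sentences are assumptions made for this example; they are not part of the Oxford English Corpus or its tools.

```python
import re
from collections import Counter


def per_million(word, text):
    """Return occurrences of `word` per million tokens in `text`."""
    # Very simple tokenizer (an assumption): lowercase alphabetic runs and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] * 1_000_000 / len(tokens)


# Toy stand-ins for two subcorpora (hypothetical samples, not corpus data).
subcorpora = {
    "sport": "The striker took the pitch and the pitch was wet.",
    "business": "Her sales pitch won the contract.",
}

# Normalized frequency of "pitch" in each toy subcorpus.
rates = {name: per_million("pitch", text) for name, text in subcorpora.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} per million words")
```

Normalizing by subcorpus size matters because the subject areas differ in size; raw counts alone would favour the larger subcorpus.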
English around the world
The Oxford English Corpus is dominated by British and US English, which together make up 80%
of all text. The remaining 20% (over 200 million words) is made up of varieties of English from around
the world: Australian, South African, Canadian, Caribbean, etc. It also includes material from regions
like India, Singapore, and Hong Kong, where English is often a second language. The geographical
range of the corpus is crucial for building a detailed picture of English as a global language.
Text types and register
Text type or register refers to the different levels of language that may be used in different contexts. For example, writing about soccer may range from the formal (official regulations) to the very informal (fans' weblogs or chatroom discussions). The Oxford English Corpus has been carefully composed to ensure that the full range of registers is represented. The following list indicates some of the kinds of writing that are represented in the corpus:
- academic papers
- technical manuals
- journals
- newspaper reports, columns, and opinion pieces
- corporate websites
- magazine articles
- novels and short stories
- fanzines
- underground and counterculture websites
- personal websites
- weblogs
- chatroom and newsgroup postings
Journals, newspapers, and magazines are valuable for building a picture of norms and standards in
English usage. Weblogs and newsgroups, which are largely unedited, are a rich resource for examining
non-standard language such as slang, regionalisms, and neologisms. For dictionary editors providing
guidance on standard English, these sources also provide a good way of tracking common errors in
written English (e.g. spelling mistakes or confusion over meaning), which can then be used for writing
properly targeted usage notes. Some of today's 'mistakes' in informal chatroom and weblog
contexts will inevitably lead to changes in standard usage. The range of text types used in the Oxford
English Corpus allows us to identify very precisely how language develops and how standards shift.
Date
The material in the Oxford English Corpus dates from the period 2000-2006. The corpus is unusual
in providing such an extensive representation of very recent English. It gives accurate information
about real language use today.
As the corpus continues to develop, with new material added each year, it will be possible to trace
language change over time: words becoming more or less common, features spreading from one region
to another, and the emergence of new meanings.
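One way to picture tracing a word becoming more or less common is to compute its normalized frequency for each year of dated material. The sketch below is a minimal illustration under stated assumptions: the function `yearly_rate`, the whitespace tokenizer, and the two dated sample texts are all hypothetical and do not reflect how the corpus is actually queried.

```python
from collections import defaultdict


def yearly_rate(word, dated_texts):
    """Return {year: occurrences of `word` per million tokens}, sorted by year."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for year, text in dated_texts:
        # Whitespace tokenization is a simplifying assumption.
        tokens = text.lower().split()
        totals[year] += len(tokens)
        counts[year] += tokens.count(word.lower())
    return {
        year: counts[year] * 1_000_000 / totals[year]
        for year in sorted(totals)
        if totals[year]
    }


# Hypothetical dated samples (not corpus data).
samples = [
    (2006, "my blog links to your blog"),
    (2000, "we keep a weblog online"),
]

trend = yearly_rate("blog", samples)
for year, rate in trend.items():
    print(year, round(rate))
```

A rising or falling rate across years would suggest a change in how common the word is, although real comparisons would need far larger samples for each year.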