The Oxford English Corpus is based mainly on material collected from pages on the World Wide Web and from other online sources. Some printed texts, such as academic journals, have been used to supplement certain subject areas.
The extensive use of web pages enables us to build a corpus of unprecedented scale and diversity. The Oxford English Corpus is intended to be as wide-ranging as possible in its representation of the English language. Development was planned to ensure a balanced range of material from different subject areas, regions of the world, and text types. Structuring a corpus in this way produces a panoramic view of language use in every area of human life.
The corpus is divided into 20 major subject areas or subcorpora, as shown below (figures indicate millions of words):

Each subcorpus is further divided into a series of more specific categories. For example, the sport subcorpus is divided into about 40 individual sports including baseball, basketball, sailing, soccer, etc. This makes it possible to explore the language of a particular subject area, or to compare two subject areas, or to investigate how the behaviour of a word changes in different contexts.
The Oxford English Corpus is dominated by British and US English, which together make up 80% of all text. The remaining 20% (over 200 million words) is made up of varieties of English from around the world: Australian, South African, Canadian, Caribbean, etc. It also includes material from regions like India, Singapore, and Hong Kong, where English is often a second language. The geographical range of the corpus is crucial for building a detailed picture of English as a global language.
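The percentages above imply a rough overall size for the corpus. The short Python sketch below works through that arithmetic. The function name `implied_sizes` is purely illustrative, and because the text gives the 20% share only as "over 200 million words", the results should be read as lower bounds rather than exact counts.

```python
def implied_sizes(remainder_words, remainder_percent):
    """Derive the implied corpus total and the size of the dominant share.

    remainder_words: word count quoted for the smaller share (a lower bound).
    remainder_percent: that share as a whole-number percentage of the corpus.
    Integer arithmetic is used so the results are exact.
    """
    total = remainder_words * 100 // remainder_percent
    dominant = total - remainder_words
    return total, dominant


# Figures taken from the paragraph above: 20% is "over 200 million words".
total_words, british_and_us_words = implied_sizes(200_000_000, 20)
print(f"Implied total: at least {total_words:,} words")
print(f"British and US English: at least {british_and_us_words:,} words")
```

On these figures the corpus as a whole runs to at least a billion words, of which British and US English account for at least 800 million.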
Text type or register refers to the different levels of language that may be used in different contexts. For example, writing about soccer may range from the formal (official regulations) to the very informal (fans' weblogs or chatroom discussions). The Oxford English Corpus has been carefully composed to ensure that the full range of registers is represented. The following list indicates some of the kinds of writing that are represented in the corpus:
Journals, newspapers, and magazines are valuable for building a picture of norms and standards in English usage. Weblogs and newsgroups, which are largely unedited, are a rich resource for examining non-standard language such as slang, regionalisms, and neologisms. For dictionary editors providing guidance on standard English, these sources also provide a good way of tracking common errors in written English (e.g. spelling mistakes or confusion over meaning), which can then inform properly targeted usage notes. Some of today's 'mistakes' in informal chatroom and weblog contexts will inevitably lead to changes in standard usage. The range of text types used in the Oxford English Corpus allows us to identify very precisely how language develops and how standards shift.
The material in the Oxford English Corpus dates from the year 2000 onwards. New text is continuously collected, with a new batch added to the Corpus database every few months. The corpus is unusual in providing such an extensive representation of very recent English. It gives accurate information about real language use today.
As the corpus continues to develop, with new material added each year, it will be possible to trace language change over time: words becoming more or less common, features spreading from one region to another, and the emergence of new meanings.
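One common way to trace a word becoming more or less common is to compare its frequency per million words from year to year, so that batches of different sizes remain comparable. The Python sketch below illustrates the idea on a few invented sentences. It is not the Oxford English Corpus tooling: the function `per_million_by_year`, the `toy_documents` data, and the simple whitespace tokenisation are all assumptions made for illustration.

```python
from collections import Counter


def per_million_by_year(documents, word):
    """Return {year: occurrences of `word` per million tokens}.

    documents: iterable of (year, text) pairs (hypothetical input format).
    Tokenisation is a naive lowercase whitespace split, which a real
    corpus pipeline would replace with proper tokenisation.
    """
    hits = Counter()    # occurrences of the target word, per year
    totals = Counter()  # total tokens seen, per year
    target = word.lower()
    for year, text in documents:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    return {
        year: hits[year] / totals[year] * 1_000_000
        for year in sorted(totals)
        if totals[year] > 0
    }


# Invented toy data, for illustration only.
toy_documents = [
    (2000, "the weblog is a new kind of site"),
    (2000, "a site on the web"),
    (2005, "my blog and your blog"),
    (2005, "the blog is here"),
]

trend = per_million_by_year(toy_documents, "blog")
for year, rate in trend.items():
    print(year, round(rate, 1))
```

Normalising by the number of tokens collected for each year matters because the amount of text added varies from batch to batch. Raw counts alone would make a word look more common simply because more text was collected.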