Technical information

Web crawling and text processing

Documents in the Oxford English Corpus consist of a series of text segments derived from closely related pages within a single website. Pages are considered related when they link to each other and all discuss a particular topic, are all written by a particular author, and so on.

Text is collected using a custom-built web crawler. A configuration file is used to direct the crawler to a particular website (or an area of a website) and to define the behaviour of the crawler within that site: the navigational route it should follow, and the type of pages it should collect along that route.
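The format of the crawler's configuration file is not described here. As an illustration only, the sketch below shows one way such an entry could be modelled in Python. The class name `CrawlEntry`, its fields, the regular-expression approach, and the example URLs are all assumptions made for this sketch, not details of the actual crawler.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlEntry:
    """Hypothetical configuration entry: one entry per corpus document."""
    start_url: str        # where the crawler enters the site
    follow_pattern: str   # regex for links the crawler may follow (the route)
    collect_pattern: str  # regex for pages whose text should be collected
    metadata: dict = field(default_factory=dict)  # defined in advance

    def _same_site(self, url: str) -> bool:
        # Stay within the website named by the start URL.
        return urlparse(url).netloc == urlparse(self.start_url).netloc

    def should_follow(self, url: str) -> bool:
        return self._same_site(url) and re.search(self.follow_pattern, url) is not None

    def should_collect(self, url: str) -> bool:
        return self._same_site(url) and re.search(self.collect_pattern, url) is not None


# Example entry with made-up values.
entry = CrawlEntry(
    start_url="https://example.org/blog/",
    follow_pattern=r"/blog/",
    collect_pattern=r"/blog/\d{4}/",
    metadata={"domain": "weblog", "year": 2006},
)
```

Keeping the route (`follow_pattern`) separate from the collection rule (`collect_pattern`) mirrors the distinction drawn above between the path the crawler follows and the pages it collects along that path.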

Collecting text in this way is more labour-intensive than randomized or exhaustive crawling methods. Collection of each document requires a new entry to be added to the configuration file in order to specify the crawler's route and behaviour. However, this approach has two important benefits. Firstly, it means that metadata (domain, year, author, etc.) can be accurately defined in advance. Secondly, it facilitates removal of 'boilerplate' text: boilerplate can be identified by comparing a cluster of related pages and looking for similar HTML strings.
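The comparison of related pages can be sketched roughly as follows. This is a minimal illustration which assumes that boilerplate shows up as identical HTML lines repeated across most pages of a cluster. The function names and the 80% threshold are choices made for this sketch, and the actual method may compare strings quite differently.

```python
from collections import Counter


def find_boilerplate(pages, min_share=0.8):
    """Return the HTML lines that occur in at least min_share of the pages."""
    counts = Counter()
    for html in pages:
        # Use a set so that each distinct line is counted once per page.
        counts.update({line.strip() for line in html.splitlines() if line.strip()})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_boilerplate(html, boilerplate):
    """Drop blank lines and lines identified as boilerplate from one page."""
    kept = [
        line for line in html.splitlines()
        if line.strip() and line.strip() not in boilerplate
    ]
    return "\n".join(kept)
```

A line-based comparison only works when the site emits its navigation and footers as stable, identical lines. Pages that vary their boilerplate slightly would need a fuzzier comparison.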

Once a series of web pages has been collected and stripped of boilerplate, the pages are stripped of tags, links, and other markup, and normalized to plain-text ASCII. The text is then tokenized, annotated for part of speech, and parsed. Finally, the annotated text is converted to XML, and document metadata is added.
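The first stages of this pipeline (tag stripping, ASCII normalization, and tokenization) might look something like the sketch below, which uses only the Python standard library. The function names, the list of block-level tags, the punctuation mapping, and the tokenization rule are all assumptions made for illustration. In particular, the tokenizer here is far simpler than one that follows Penn Treebank conventions.

```python
import re
import unicodedata
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "br", "li", "td", "h1", "h2", "h3", "h4", "h5", "h6"}
# Map common typographic punctuation to ASCII before dropping other non-ASCII.
PUNCT = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"',
                       "\u201d": '"', "\u2013": "-", "\u2014": "-"})


class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")  # keep adjacent blocks from fusing

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Remove tags and script/style content, collapsing whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def to_ascii(text):
    """Map typographic punctuation, strip accents, drop remaining non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text.translate(PUNCT))
    return decomposed.encode("ascii", "ignore").decode("ascii")


def tokenize(text):
    """Split into words (keeping internal apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
```

Part-of-speech annotation and parsing are left out, since they depend on trained taggers that this page does not identify.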

Metadata

Each document has the following metadata:

  • title
  • author (if known; many websites make this difficult to determine reliably)
  • author gender (if known)
  • language type (e.g. British English, American English)
  • source website
  • year (+ date, if known)
  • date of collection
  • domain + subdomain
  • document statistics (number of tokens, sentences, etc.)

In addition, each page within a document has metadata giving the URL of the source webpage.
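The corpus's XML schema is not given here. Purely as an illustration of how document-level metadata and per-page URLs could sit together in one XML document, the sketch below uses Python's standard `xml.etree.ElementTree` module. The element names (`document`, `metadata`, `page`) and the `url` attribute are hypothetical, and metadata keys are assumed to be valid XML element names.

```python
import xml.etree.ElementTree as ET


def build_document(metadata, pages):
    """Build an XML element holding document metadata and per-page text.

    metadata: dict of field name -> value (None means 'not known')
    pages:    list of (source_url, text) pairs
    """
    doc = ET.Element("document")
    meta = ET.SubElement(doc, "metadata")
    for key, value in metadata.items():
        if value is not None:  # omit fields that could not be determined
            ET.SubElement(meta, key).text = str(value)
    for url, text in pages:
        page = ET.SubElement(doc, "page", url=url)  # URL of the source webpage
        page.text = text
    return doc


doc = build_document(
    {"title": "Example document", "author": None, "year": 2006},
    [("http://example.org/a", "Some text.")],
)
xml_string = ET.tostring(doc, encoding="unicode")
```

Omitting unknown fields, rather than writing empty elements, is one way to represent the "if known" qualifiers in the list above.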

Tagging and parsing

Each token is annotated with its lemma and its part-of-speech tag (drawn from the Penn Treebank tag set). Sentences are then shallow-parsed to bracket token sequences into noun and verb groups.
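To make the idea of shallow parsing concrete, the sketch below brackets a tagged sentence into noun groups and verb groups using Penn Treebank tags. It is a deliberately naive illustration: the grouping rule, the labels `NG`, `VG`, and `O`, and the choice of which tags belong to each group are assumptions for this sketch, not the parser actually used for the corpus.

```python
NOUN_TAGS = {"DT", "PRP$", "CD", "JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
GROUP_STARTERS = {"DT", "PRP$"}  # a determiner always opens a new noun group


def chunk(tagged):
    """Group (token, lemma, tag) triples into labelled token sequences.

    Returns a list of (label, tokens) pairs, where label is "NG" (noun group),
    "VG" (verb group), or "O" (outside any group).
    """
    groups = []
    for token, lemma, tag in tagged:
        label = "NG" if tag in NOUN_TAGS else "VG" if tag in VERB_TAGS else "O"
        extends_previous = (
            groups
            and label != "O"
            and groups[-1][0] == label
            and tag not in GROUP_STARTERS
        )
        if extends_previous:
            groups[-1][1].append(token)
        else:
            groups.append((label, [token]))
    return groups


sentence = [
    ("The", "the", "DT"), ("crawler", "crawler", "NN"),
    ("collects", "collect", "VBZ"),
    ("related", "related", "JJ"), ("pages", "page", "NNS"),
    (".", ".", "."),
]
print(chunk(sentence))
```

Adverbs inside verb groups (as in "has not been seen") and other complications are ignored here, so real text would need a richer rule set or a trained chunker.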

Corpus analysis

The principal tool used for analysis of the Oxford English Corpus is the Sketch Engine, software developed by Lexical Computing Ltd: see www.sketchengine.co.uk.



Content and Graphics © Copyright Oxford University Press, 2008. All rights reserved.