Technical information
Web crawling and text processing
Documents in the Oxford English Corpus consist of a series of text segments derived from closely
related pages within a single website. Pages are considered related when they link to each other
and all discuss a particular topic, or are all written by a particular author, etc. Text is collected
using a custom-built web crawler. A configuration file is used to direct the crawler to a particular
website (or an area of a website) and to define the behaviour of the crawler within that site: the
navigational route it should follow, and the type of pages it should collect along that route.
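The format of the configuration file is not described here, so the following is only an illustrative sketch of what a single entry might express. The class name `CrawlEntry`, its fields, and the example.org URLs are all hypothetical. The sketch shows the two decisions described above: which pages lie on the navigational route, and which pages should be collected.

```python
import re
from dataclasses import dataclass


@dataclass
class CrawlEntry:
    """One hypothetical configuration entry: a site area plus crawl rules."""
    start_url: str  # the website, or area of a website, to crawl
    follow: str     # regex for pages on the navigational route
    collect: str    # regex for pages whose text should be collected

    def action(self, url: str) -> str:
        """Decide what the crawler does with a URL it encounters."""
        if not url.startswith(self.start_url):
            return "skip"  # outside the configured area of the site
        if re.search(self.collect, url):
            return "collect"
        if re.search(self.follow, url):
            return "follow"
        return "skip"


# Assumed example entry; the patterns are invented for illustration.
entry = CrawlEntry(
    start_url="https://example.org/gardening/",
    follow=r"/gardening/archive/\d{4}/$",
    collect=r"/gardening/articles/[a-z0-9-]+\.html$",
)
```

Under these assumptions, an article page would be collected, a yearly archive index would only be traversed for links, and anything else would be ignored.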
Collecting text in this way is more labour-intensive than randomized or exhaustive crawling methods. Collection of each document requires a new entry to be added to the configuration file in order to specify the crawler's route and behaviour. However, this approach has two important benefits. Firstly, it means that metadata (domain, year, author, etc.) can be accurately defined in advance. Secondly, it facilitates removal of 'boilerplate' text: boilerplate can be identified by comparing a cluster of related pages and looking for similar HTML strings.
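The idea of finding boilerplate by comparing related pages can be sketched as follows. This is a simplified illustration, not the method actually used for the corpus: it treats each line of HTML as a string and looks for exact repeats across the cluster, whereas the approach described above looks for similar strings. The function names and the `min_share` threshold are assumptions.

```python
from collections import Counter


def find_boilerplate(pages, min_share=0.8):
    """Return the HTML lines that occur in at least min_share of the pages."""
    counts = Counter()
    for page in pages:
        # Use a set so each distinct line is counted once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_boilerplate(page, boilerplate):
    """Drop blank lines and lines identified as boilerplate."""
    lines = (line.strip() for line in page.splitlines())
    return "\n".join(line for line in lines if line and line not in boilerplate)


# Invented cluster of three related pages sharing a menu and a footer.
NAV = '<div id="nav">Home | About</div>'
FOOTER = '<div id="footer">Copyright</div>'
pages = [
    f"{NAV}\n<p>Roses need pruning in spring.</p>\n{FOOTER}",
    f"{NAV}\n<p>Tulips are planted in autumn.</p>\n{FOOTER}",
    f"{NAV}\n<p>Compost improves the soil.</p>\n{FOOTER}",
]
boilerplate = find_boilerplate(pages)
cleaned = [strip_boilerplate(page, boilerplate) for page in pages]
```

Because the pages in a document come from one site and one topic or author, shared strings such as menus and footers tend to repeat across the cluster while the body text does not.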
Once a series of web pages has been collected and its boilerplate removed, the pages are stripped of tags, links, and other coding, and normalized to plain-text ASCII. The text is then tokenized, annotated for part of speech, and parsed. Finally, the annotated text is converted to XML, and document metadata is added.
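The first steps of this pipeline (tag stripping, ASCII normalization, tokenization) might look roughly like the sketch below. It uses only the Python standard library and is an assumption about one workable approach, not the corpus's actual code. In particular, dropping characters that have no ASCII equivalent and splitting tokens with a single regular expression are simplifications chosen for brevity.

```python
import re
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style blocks."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_ascii(html):
    """Strip markup, then reduce the text to plain ASCII."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Decompose accented characters, then drop anything outside ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split into runs of letters/digits and single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)


plain = html_to_ascii('<p>The <a href="/cafe">café</a> opened.</p>')
tokens = tokenize(plain)
```

Here `plain` is `The cafe opened.` and `tokens` is the list `The`, `cafe`, `opened`, `.`; the link markup is gone and the accented character has been reduced to its ASCII base letter.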
Metadata
Each document has the following metadata:
- title
- author (if known; many websites make this difficult to determine reliably)
- author gender (if known)
- language type (e.g. British English, American English)
- source website
- year (+ date, if known)
- date of collection
- domain + subdomain
- document statistics (number of tokens, sentences, etc.)
In addition, each page within a document has metadata giving the URL of the source webpage.
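As an illustration of how document-level and page-level metadata could sit together in XML, the sketch below builds a small document with Python's standard `xml.etree.ElementTree` module. The element names (`document`, `metadata`, `page`), the field keys, and the sample values are hypothetical; the corpus's real schema is not given here.

```python
import xml.etree.ElementTree as ET


def build_document(metadata, pages):
    """Build an XML document from a metadata dict and (url, text) pages."""
    doc = ET.Element("document")
    meta = ET.SubElement(doc, "metadata")
    for key, value in metadata.items():
        # Fields such as author or author gender may be unknown: omit them.
        if value is not None:
            ET.SubElement(meta, key).text = str(value)
    for url, text in pages:
        # Each page records the URL of its source webpage.
        page = ET.SubElement(doc, "page", url=url)
        page.text = text
    return doc


# Invented sample values, for illustration only.
doc = build_document(
    {
        "title": "Pruning roses",
        "author": None,
        "language_type": "British English",
        "source_website": "example.org",
        "year": 2005,
    },
    [("https://example.org/gardening/articles/pruning-roses.html",
      "Roses need pruning in spring.")],
)
xml_text = ET.tostring(doc, encoding="unicode")
```

Omitting unknown fields, rather than writing empty elements, is one design choice among several; it keeps "unknown" distinct from "known to be empty".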
Tagging and parsing
Each token is annotated with its lemma and its part-of-speech tag (drawn from the Penn Treebank tag
set). Sentences are then shallow-parsed to bracket token sequences into noun and verb groups.
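To make the bracketing step concrete, here is a deliberately naive sketch that groups runs of adjacent tokens by their Penn Treebank tag. It is not the parser used for the corpus. The labels `NG` and `VG`, the choice of which tags count towards each group, and the sample sentence are assumptions. One known limitation is that two adjacent noun groups (as in "gave the boy the book") would be merged into one.

```python
from itertools import groupby

# Penn Treebank tags assumed to belong to noun and verb groups.
NOUN_TAGS = {"DT", "PRP$", "CD", "JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def group_label(tag):
    """Map a part-of-speech tag to a group label, or None if ungrouped."""
    if tag in NOUN_TAGS:
        return "NG"
    if tag in VERB_TAGS:
        return "VG"
    return None


def shallow_parse(tagged):
    """Bracket (word, lemma, tag) triples into runs sharing a group label."""
    chunks = []
    for label, run in groupby(tagged, key=lambda token: group_label(token[2])):
        chunks.append((label, [word for word, _lemma, _tag in run]))
    return chunks


# Each token carries its lemma and its part-of-speech tag.
sentence = [
    ("The", "the", "DT"),
    ("crawler", "crawler", "NN"),
    ("has", "have", "VBZ"),
    ("collected", "collect", "VBN"),
    ("related", "related", "JJ"),
    ("pages", "page", "NNS"),
    (".", ".", "."),
]
chunks = shallow_parse(sentence)
```

For this sentence the result is a noun group (`The crawler`), a verb group (`has collected`), a second noun group (`related pages`), and the ungrouped full stop.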
Corpus analysis
The principal tool used for analysis of the Oxford English Corpus is the Sketch Engine, software developed by Lexical Computing Ltd: see www.sketchengine.co.uk.