The METER Corpus: A corpus for analysing journalistic text reuse

  • Gaizauskas R
  • Foster J
  • Wilks Y
 et al. 
  • 28

    Readers

    Mendeley users who have this article in their library.
  • N/A

    Citations

    Citations of this article.

Abstract

As a part of the METER (MEasuring TExt Reuse) project we have built a new type of comparable corpus consisting of annotated examples of related newspaper texts. Texts in the corpus were manually collected from two main sources: the British Press Association (PA) and nine British national newspapers that subscribe to the PA newswire service. In addition to being structured to support efficient search for related PA and newspaper texts, the corpus is annotated at two levels. First, each of the newspaper texts is assigned one of three coarse, global classifications indicating its derivation relation to the PA: wholly derived, partially derived or non-derived. Second, about 400 wholly or partially derived newspaper articles are annotated down to the lexical level, indicating for each phrase, or even individual word, whether it appears verbatim, rewritten or as new material. We envisage that this corpus will be of use for a variety of studies, including detection and measurement of text reuse, analysis of paraphrase and journalistic styles, and information extraction/retrieval. To illustrate these potential uses we briefly describe some work we have done with the corpus to develop algorithms for detecting text reuse. 1.

Get free article suggestions today

Mendeley saves you time finding and organizing research

Sign up here
Already have an account ?Sign in

Find this document

There are no full text links

Authors

  • Robert Gaizauskas

  • Jonathan Foster

  • Yorick Wilks

  • John Arundel

  • Paul Clough

  • Scott S L Piao

Cite this document

Choose a citation style from the tabs below

Save time finding and organizing research with Mendeley

Sign up for free