This corpus contains paraphrastic sentences with human-annotated word/phrase
alignments. It was created by Trevor Cohn, Mirella Lapata and Chris
Callison-Burch at the University of Edinburgh in 2006/2007.
The corpus has been hand-corrected and extended with extra layers of annotation,
including named entities and syntactic parse structure. This version of the data,
known as Edinburgh++, is available from Scott Martin's site; see also the
accompanying COLING paper, which includes a description of the dataset.
The original sentences were drawn from three sources and annotated by two
annotators. The sentence pairs were drawn at random from the following sources:
- the LDC's multi-translation Chinese corpus
- translations of Jules Verne's 20,000 Leagues Under the Sea
- the MSR paraphrase corpus
The annotators were given annotation guidelines and marked up the
data using a web-based annotation tool.
Both the MTC and MSR texts are covered by licensing agreements. Please ensure
that you are covered by the appropriate licences.
Please refer to the README file for details of the file locations and formats,
and the scripts included for processing the data.
Download the corpus