Continuous Measurement Scales in Human Evaluation of
Machine Translation
Yvette Graham, Tim Baldwin, Alistair Moffat and Justin Zobel
Department of Computing and Information Systems,
The University of Melbourne,
Victoria 3010, Australia.
Status
Proc. 7th Linguistic Annotation Workshop
& Interoperability with Discourse,
Sofia, Bulgaria, August 2013, pages 33-41.
Abstract
We explore the use of continuous rating scales for human evaluation
of machine translation, comparing two assessor-intrinsic
quality-control techniques that do not rely on agreement with expert
judgments.
Experiments employing Amazon's Mechanical Turk service show that
quality-control techniques made possible by the use of the continuous
scale yield dramatic improvements in intra-annotator agreement of up
to +0.101 in the kappa coefficient, with inter-annotator agreement
increasing by up to +0.144 when scores are additionally standardized.
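The abstract does not say how scores are standardized; a common
approach for continuous assessor ratings, sketched below in Python,
is per-annotator z-score standardization, which removes differences
in how harshly or leniently individual assessors use the scale. The
function name standardize_scores and the sample ratings are
illustrative assumptions, not taken from the paper.

    # A minimal sketch, assuming per-annotator z-score standardization
    # (the paper's exact procedure is not given in this abstract).
    import numpy as np

    def standardize_scores(scores_by_annotator):
        """Map each annotator's raw continuous ratings onto z-scores,
        so that scale-use differences between annotators (harsh vs.
        lenient raters) are removed before agreement is computed."""
        standardized = {}
        for annotator, scores in scores_by_annotator.items():
            scores = np.asarray(scores, dtype=float)
            mean, std = scores.mean(), scores.std()
            # Guard against an annotator who gave every item the same score.
            standardized[annotator] = (
                (scores - mean) / std if std > 0 else scores - mean
            )
        return standardized

    # Hypothetical example: two annotators rank five translations
    # identically but use the 0-100 continuous scale differently.
    raw = {
        "a1": [70, 80, 60, 90, 50],   # lenient annotator
        "a2": [35, 40, 30, 45, 25],   # harsh annotator, same ordering
    }
    z = standardize_scores(raw)
    print(np.allclose(z["a1"], z["a2"]))  # True: identical after standardization

After standardization, the two annotators' scores coincide, which is
why score standardization can raise inter-annotator agreement even
when raw ratings diverge.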
Full text
http://www.aclweb.org/anthology/W13-2305