Continuous Measurement Scales in Human Evaluation of Machine Translation

Yvette Graham
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Tim Baldwin
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Justin Zobel
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.


Proc. 7th Linguistic Annotation Workshop & Interoperability with Discourse, Sofia, Bulgaria, August 2013, pages 33-41.


We explore the use of continuous rating scales for human evaluation of machine translation, comparing two assessor-intrinsic quality-control techniques that do not rely on agreement with expert judgments. Experiments on Amazon's Mechanical Turk service show that the quality-control techniques made possible by a continuous scale improve intra-annotator agreement by up to +0.101 in the kappa coefficient, and that inter-annotator agreement increases by up to +0.144 when scores are additionally standardized.
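
The abstract names two quantities without defining them: standardization of scores and the kappa coefficient. The sketch below illustrates one common reading of each, and is not taken from the paper. It assumes that standardization means converting each assessor's raw scores to z-scores using that assessor's own mean and (population) standard deviation, and that kappa is Cohen's kappa over paired categorical judgments. The function names and the sample data are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def standardize(scores):
    """Convert one assessor's raw scores to z-scores.

    Assumption: the assessor's own mean and population standard
    deviation are used; the paper may define this differently.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # An assessor who gives every item the same score carries no signal.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        # Both sequences use a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical data: raw scores on a 0-100 continuous scale.
z_scores = standardize([0, 50, 100])

# Hypothetical data: two passes of categorical judgments over four items.
kappa = cohen_kappa(["a", "b", "a", "b"], ["a", "b", "b", "b"])
```

In this toy data the two passes agree on three of four items (observed agreement 0.75) against a chance agreement of 0.5, so kappa is 0.5. Standardizing per assessor removes differences in how individuals use the scale before their scores are compared.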
