Crowd-Sourcing of Human Judgments of Machine Translation Fluency


Yvette Graham
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Tim Baldwin
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Justin Zobel
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.


Status

Proc. 2013 Australasian Language Technology Workshop, Brisbane, December 2013, pages 16-24.

Abstract

Human evaluation of machine translation quality is a key element in the development of machine translation systems, as automatic metrics are validated through correlation with human judgment. However, achieving consistent human judgments of machine translation quality is not easy, with decreasing levels of consistency reported in annual evaluation campaigns. In this paper we describe experiences gained during the collection of human judgments of the fluency of machine translation output using Amazon's Mechanical Turk service. We gathered a large set of crowd-sourced human judgments for the machine translation systems that participated in the WMT 2012 shared translation task, across eight different assessment configurations, to gain insight into possible causes of, and remedies for, inconsistency in human judgments. Overall, approximately half of the workers carry out the human evaluation to a high standard, but effectiveness varies considerably across different target languages, with dramatically higher numbers of good-quality judgments for Spanish and French, and the reverse observed for German.
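For context, validating an automatic metric against human judgments, as mentioned above, typically means computing a correlation between system-level metric scores and aggregated human scores. The sketch below is a minimal, hypothetical illustration of that step using SciPy; the scores are placeholders and the aggregation (a simple mean of per-segment fluency judgments per system) is an assumption for illustration, not the procedure reported in the paper.

    # Hypothetical illustration: correlating an automatic metric with
    # averaged crowd-sourced fluency judgments. All numbers below are
    # made-up placeholders, not data from the paper.
    from scipy.stats import pearsonr, spearmanr

    # One entry per MT system: system-level metric score and the mean
    # of its crowd-sourced fluency judgments.
    metric_scores = [0.21, 0.25, 0.19, 0.30, 0.27]
    human_fluency = [3.1, 3.4, 2.8, 3.9, 3.5]

    r, r_p = pearsonr(metric_scores, human_fluency)
    rho, rho_p = spearmanr(metric_scores, human_fluency)

    print(f"Pearson r   = {r:.3f} (p = {r_p:.3f})")
    print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")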

Full text

http://aclweb.org/anthology//U/U13/U13-1004.pdf