Measurement of Progress in Machine Translation


Yvette Graham
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Tim Baldwin
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Aaron Harwood
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Justin Zobel
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.


Status

Proc. 2012 Australasian Language Technology Wkshp., Dunedin, New Zealand, December 2012, pages 70-78.

Abstract

Machine translation (MT) systems can only be improved if their performance can be reliably measured and compared. However, measurement of the quality of MT output is not straightforward, and, as we discuss in this paper, relies on correlation with inconsistent human judgments. Even when the assessment is reduced to pairwise comparisons of the form "is translation A better than translation B?", empirical evidence shows that inter-annotator consistency in such experiments is not particularly high; for intra-judge consistency -- computed by showing the same judge the same pair of candidate translations twice -- only low levels of agreement are achieved. In this paper we review current and past methodologies for human evaluation of translation quality, and explore the ramifications of current practices for automatic MT evaluation. Our goal is to document how the methodologies used for collecting human judgments of machine translation quality have evolved; in doing so, we raise key questions about the low levels of judgment agreement and the lack of mechanisms for longitudinal evaluation.
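The agreement figures mentioned above are typically reported using a chance-corrected coefficient such as Cohen's kappa. The following is a minimal sketch, not the paper's exact protocol, using hypothetical judgment data, of how inter-annotator and intra-judge agreement over pairwise "is A better than B?" judgments might be computed.

```python
# Sketch: Cohen's kappa over pairwise translation judgments.
# The judgment data below are hypothetical, for illustration only.

from collections import Counter

def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa between two label sequences over the same items."""
    assert len(labels_1) == len(labels_2)
    n = len(labels_1)
    # Observed agreement: proportion of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Chance agreement expected from each annotator's label distribution.
    dist_1, dist_2 = Counter(labels_1), Counter(labels_2)
    p_expected = sum(
        (dist_1[label] / n) * (dist_2[label] / n)
        for label in set(labels_1) | set(labels_2)
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Pairwise judgments over five (A, B) candidate pairs:
# "A" = A better, "B" = B better, "=" = tie.
judge_1 = ["A", "B", "=", "A", "B"]
judge_2 = ["A", "A", "=", "A", "B"]
print(cohens_kappa(judge_1, judge_2))        # inter-annotator agreement

# Intra-judge consistency: the same judge sees the same pairs twice,
# and kappa is computed between the two passes.
judge_1_repeat = ["A", "B", "B", "A", "B"]
print(cohens_kappa(judge_1, judge_1_repeat))  # intra-judge agreement
```

In this formulation, a kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is why raw percentage agreement alone can overstate consistency when one judgment category dominates.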

Full text

http://www.aclweb.org/anthology/U/U12/U12-2010