Measurement of Progress in Machine Translation
Yvette Graham, Tim Baldwin, Aaron Harwood, Alistair Moffat, Justin Zobel
Department of Computing and Information Systems,
The University of Melbourne, Victoria 3010, Australia.
Status
Proceedings of the 2012 Australasian Language Technology Workshop,
Dunedin, New Zealand, December 2012, pages 70-78.
Abstract
Machine translation (MT) systems can only be improved if their
performance can be reliably measured and compared.
However, measurement of the quality of MT
output is not straightforward, and, as we discuss in this paper,
relies on correlation with inconsistent human judgments.
Even when the question is reduced to a pairwise comparison ("is
translation A better than translation B?"), empirical evidence shows
that inter-annotator consistency in such experiments is not
particularly high, and intra-judge consistency (measured by showing
the same judge the same pair of candidate translations twice) reaches
only low levels of agreement.
In this paper we review current and past methodologies for human
evaluation of translation quality, and explore the ramifications of
current practices for automatic MT evaluation.
Our goal is to document how the methodologies used for collecting
human judgments of machine translation quality have evolved; as a
result, we raise key questions in
connection with the low levels of judgment agreement and the lack of
mechanisms for longitudinal evaluation.
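
The agreement levels referred to above are commonly summarised with a
chance-corrected statistic such as Cohen's kappa over the pairwise
preference labels. The following is a minimal sketch (not the
evaluation code used in the paper), assuming each judge's decisions
are stored as a dictionary mapping a comparison identifier to one of
the labels "A", "B", or "tie"; the same function covers intra-judge
consistency by treating a judge's first and second pass over repeated
pairs as the two label sequences.

from collections import Counter

def cohens_kappa(labels_1, labels_2):
    """Chance-corrected agreement between two parallel label sequences."""
    assert len(labels_1) == len(labels_2) and labels_1
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Expected agreement under independence, from marginal label frequencies.
    freq_1, freq_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(freq_1[k] * freq_2.get(k, 0) for k in freq_1) / (n * n)
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)

def agreement(judgments_1, judgments_2):
    """Kappa over the comparisons present in both sets of judgments.

    Each argument maps a comparison identifier to "A", "B", or "tie";
    pass two different judges for inter-annotator agreement, or the two
    passes of a single judge for intra-judge consistency.
    """
    shared = sorted(set(judgments_1) & set(judgments_2))
    return cohens_kappa([judgments_1[i] for i in shared],
                        [judgments_2[i] for i in shared])

# Illustrative (hypothetical) data: two judges, three shared comparisons.
judge_1 = {"pair-1": "A", "pair-2": "B", "pair-3": "tie"}
judge_2 = {"pair-1": "A", "pair-2": "A", "pair-3": "tie"}
print(agreement(judge_1, judge_2))   # 0.5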
Full text: http://www.aclweb.org/anthology/U/U12/U12-2010