Can Machine Translation Systems be Evaluated by the Crowd Alone?


Yvette Graham
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Tim Baldwin
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Alistair Moffat
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Justin Zobel
Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.


Status

Journal of Natural Language Engineering, 23(1):3-30, 2017.

Abstract

Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd's work be filtered to avoid contamination of results through the inclusion of false assessments. One method is to filter via agreement with experts, but even amongst experts agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality, which allows individual workers to develop their own individual assessment strategy. Agreement with experts is no longer required, and a worker is deemed reliable if they are consistent relative to their own previous work. Individual translations are assessed in isolation from all others in the form of direct estimates of translation quality. This allows more meaningful statistics to be computed for systems and enables significance to be determined on smaller sets of assessments. We demonstrate the methodology's feasibility in large-scale human evaluation through replication of the human evaluation component of WMT shared translation task for two language pairs, Spanish-English and English-Spanish. Results for measurement based solely on crowd-sourced assessments show system rankings in line with those of the original evaluation. Comparison of results produced by the relative preference approach and the direct estimate method described here demonstrate that the direct estimate method has a substantially increased ability to identify significant differences between translation systems.

Full text

http://dx.doi.org/10.1017/S1351324915000339