Student's t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce

Abstract

In natural language processing (NLP), we rely on human judgement as the gold standard for quality evaluation. However, there is an ongoing debate on how best to evaluate inter-rater reliability (IRR) for certain evaluation tasks, such as translation quality evaluation (TQE), especially when the data samples (observations) are very scarce. In practice, practitioners need to assess the reliability of human machine translation (MT) quality evaluation based on the judgements of one, two, or at most three human linguists. In this work, we first introduce a little-known method for estimating the confidence interval of a measurement when only one data (evaluation) point is available. This leads to our example with two human-generated observational scores, for which we describe Student's t-distribution and explain how to use it to measure the IRR score from only these two data points and to calculate the confidence interval (CI) of the quality evaluation. We give a quantitative analysis of how evaluation confidence can be greatly improved by introducing more observations, even just one extra observation. We encourage practitioners and researchers to report their IRR scores and confidence intervals in all evaluations, e.g. using the Student's t-distribution method whenever possible, thus making NLP evaluation more meaningful, transparent, and trustworthy.
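To make the idea concrete, here is a minimal sketch of the textbook t-based confidence interval for the mean of a very small sample. It is not necessarily the exact procedure of the paper: the function name `t_confidence_interval` and the example scores are hypothetical, and the critical values are the standard two-sided 95% table entries for Student's t-distribution.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t-distribution,
# keyed by degrees of freedom (standard table values).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def t_confidence_interval(scores):
    """Return (mean, margin) so that the 95% CI is mean +/- margin.

    Hypothetical helper for illustration; `scores` are the quality
    scores given by different raters to the same material.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("the t-based interval needs at least two observations")
    df = n - 1  # degrees of freedom
    if df not in T_CRIT_95:
        raise ValueError("critical value not tabulated for this sample size")
    mean = statistics.mean(scores)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    margin = T_CRIT_95[df] * sem
    return mean, margin


# Hypothetical scores on a 0-100 scale, invented for this example.
two_raters = [80.0, 90.0]
three_raters = [80.0, 85.0, 90.0]

for label, scores in (("two raters", two_raters), ("three raters", three_raters)):
    mean, margin = t_confidence_interval(scores)
    print(f"{label}: mean = {mean:.1f}, 95% CI = +/- {margin:.1f}")
```

With these invented scores, the margin shrinks from roughly 63.5 points with two raters to roughly 12.4 points with three. Most of that drop comes from the critical value falling from 12.706 to 4.303, which illustrates why even one extra observation can tighten the interval so much.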

Cite

Gladkoff, S., Han, L., & Nenadic, G. (2023). Student’s t-Distribution: On Measuring the Inter-Rater Reliability When the Observations are Scarce. In International Conference Recent Advances in Natural Language Processing, RANLP (pp. 419–428). Incoma Ltd. https://doi.org/10.26615/978-954-452-092-2_047
