How reliable are annotations via crowdsourcing? a study about inter-annotator agreement for multi-label image annotation

by Stefanie Nowak, Stefan Rüger
Proceedings of the international conference on Multimedia information retrieval MIR 10 (2010)

Abstract

The creation of gold standard datasets is a costly business. Ideally, more than one judgment per document is obtained to ensure high-quality annotations. In this context, we explore how much annotations from experts differ from each other, how different sets of annotations influence the ranking of systems, and whether these annotations can be obtained with a crowdsourcing approach. This study is applied to annotations of images with multiple concepts. A subset of the images employed in the latest ImageCLEF Photo Annotation competition was manually annotated by expert annotators and by non-experts via Mechanical Turk. The inter-annotator agreement is computed at an image-based and a concept-based level using majority vote, accuracy and kappa statistics. Further, the Kendall τ and Kolmogorov-Smirnov correlation tests are used to compare the ranking of systems regarding different ground-truths and different evaluation measures in a benchmark scenario. Results show that while the agreement between experts and non-experts varies depending on the measure used, its influence on the ranked lists of the systems is rather small. To sum up, the majority vote, applied to generate one annotation set out of several opinions, is able to filter noisy judgments of non-experts to some extent. The resulting annotation set is of comparable quality to the annotations of experts.

Available from oro.open.ac.uk
Stefanie Nowak
Fraunhofer IDMT
Ehrenbergstr. 31
98693 Ilmenau, Germany
stefanie.nowak@idmt.fraunhofer.de
Stefan Rüger
Knowledge Media Institute
The Open University
Walton Hall, Milton Keynes, MK7 6AA, UK
s.rueger@open.ac.uk
Categories and Subject Descriptors
D.2.8 [Software Engineering]: Metrics - complexity measures, performance measures; H.3.4 [Information Storage and Retrieval]: Systems and Software - Performance evaluation (efficiency and effectiveness)
General Terms
Experimentation, Human Factors, Measurement, Performance
Keywords
Inter-annotator Agreement, Crowdsourcing
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
MIR’10, March 29–31, 2010, Philadelphia, Pennsylvania, USA.
Copyright 2010 ACM 978-1-60558-815-5/10/03 ...$10.00.
1. INTRODUCTION
In information retrieval and machine learning, gold standard databases play a crucial role. They make it possible to compare the effectiveness and quality of systems. Depending on the application area, creating large, semantically annotated corpora from scratch is a time-consuming and costly activity. Usually experts review the data and perform manual annotations. Often different annotators judge the same data and the inter-annotator agreement is computed among their judgments to ensure quality. Ambiguity of the data and of the task has a direct effect on the level of agreement.
The goal of this work is twofold. First, we investigate how much several sets of expert annotations differ from each other in order to see whether repeated annotation is necessary and whether it influences performance ranking in a benchmark scenario. Second, we explore whether non-expert annotations are reliable enough to provide ground-truth annotations for a benchmarking campaign. To this end, we conduct four experiments on inter-annotator agreement, applied to the annotation of an image corpus with multiple labels. The dataset used is a subset of the MIR Flickr 25,000 image dataset [12]. 18,000 Flickr photos of this dataset, annotated with 53 concepts, were utilized in the latest ImageCLEF 2009 Photo Annotation Task [19], in which 19 research teams submitted 74 run configurations. Due to time and cost restrictions, most images of this task were annotated by only one expert annotator. We conduct the experiments on a small subset of 99 images. For our experiments, 11 different experts annotated the complete set, so that each image was annotated 11 times. Further, the set was distributed over Amazon Mechanical Turk (MTurk) to non-expert annotators all over the world, who labelled it nine times. The inter-annotator agreement as well as the system ranking for the 74 submissions is calculated by considering each annotation set as a single ground-truth.
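As a rough illustration of how several annotation sets can be merged into one by majority vote, the following Python sketch may help. The data layout (one 0/1 matrix of images by concepts per annotator), the function name majority_vote and the strict-majority rule are assumptions made for this example, not details taken from the paper. With an odd number of annotators, such as the 11 expert sets or the nine MTurk sets, ties cannot occur.

```python
from typing import List


def majority_vote(annotation_sets: List[List[List[int]]]) -> List[List[int]]:
    """Merge several binary annotation sets into a single set.

    annotation_sets[a][i][c] is 1 if annotator a assigned concept c to
    image i, and 0 otherwise. A concept is kept for an image when strictly
    more than half of the annotators assigned it (assumed tie rule).
    """
    n_annotators = len(annotation_sets)
    n_images = len(annotation_sets[0])
    n_concepts = len(annotation_sets[0][0])

    merged = []
    for i in range(n_images):
        row = []
        for c in range(n_concepts):
            votes = sum(annotation_sets[a][i][c] for a in range(n_annotators))
            row.append(1 if 2 * votes > n_annotators else 0)
        merged.append(row)
    return merged


# Toy data: three hypothetical annotators, two images, three concepts.
annotator_1 = [[1, 0, 1], [0, 0, 1]]
annotator_2 = [[1, 1, 0], [0, 1, 1]]
annotator_3 = [[0, 0, 1], [0, 1, 0]]

merged = majority_vote([annotator_1, annotator_2, annotator_3])
print(merged)  # [[1, 0, 1], [0, 1, 1]]
```

In the toy data, the second concept of the first image is chosen by only one of three annotators and is therefore dropped, which is the kind of noise filtering the majority vote is meant to provide.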
The remainder of the paper is organized as follows. Sec. 2 describes the related work on obtaining inter-annotator agreement and on crowdsourcing approaches for distributed data annotation. Sec. 3 explains the setup of the experiments by illustrating the dataset and the annotation acquisition process. Sec. 4 details the methodology of the experiments and introduces the relevant background. Finally, Sec. 5 presents and discusses the results of the four experiments, and we conclude in Sec. 6.
2. RELATED WORK
Over the years a fair amount of work on how to prepare gold standard databases for information retrieval evaluation has been published. One important point in assessing ground-truth for databases is to consider the agreement among annotators. The inter-annotator agreement describes the degree of consensus and homogeneity in judgments among annotators. Kilgarriff [15] proposes guidelines on how to produce a gold standard dataset for benchmarking campaigns for word-sense disambiguation. He concludes that the annotators and the vocabulary used during annotation assessment have to be chosen with care, while the resources should be used effectively. Kilgarriff states that it requires more than one person to assign word senses, and that one should calculate the inter-annotator agreement and determine whether it is high enough. He identifies three reasons that can lead to ambiguous annotations and suggests ways to resolve them. Essentially, the reasons lie in the ambiguity of the data, a poor definition of the annotation scheme, or mistakes of annotators due to lack of motivation or knowledge.
To assess the subjectivity of ground-truthing in multimedia information retrieval evaluation, several studies have analysed inter-annotator agreement. Voorhees [28] analyses the influence of changes in relevance judgments on the evaluation of retrieval results utilizing the Kendall τ correlation coefficient. Volkmer et al. [27] present an approach that integrates multiple judgments in the classification system and compare them to the kappa statistics. Brants [2] presents a study about inter-annotator agreement for part-of-speech and structural information annotation in a corpus of German newspapers. He uses the accuracy and F-score between the annotated corpora of two annotators to assess their agreement. A few studies have examined the inter-annotator agreement for word sense disambiguation [26, 5]. These studies often utilize kappa statistics for calculating agreement between judges.
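To make the difference between plain accuracy and a chance-corrected kappa statistic concrete, here is a minimal Python sketch for two annotators. It uses Cohen's kappa, which is only one member of the kappa family; picking it here is an assumption, since the text above refers to "kappa statistics" without naming a variant. The function names and the toy label vectors are likewise hypothetical.

```python
from typing import List


def accuracy(a: List[int], b: List[int]) -> float:
    """Fraction of items on which two annotators give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: List[int], b: List[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = accuracy(a, b)
    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently with their own label frequencies.
    labels = set(a) | set(b)
    p_chance = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_chance == 1.0:
        # Both annotators used one identical label throughout; returning 1.0
        # is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy data: one concept judged on eight images by two hypothetical annotators.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 1]

print(accuracy(annotator_a, annotator_b))     # 0.75
print(cohen_kappa(annotator_a, annotator_b))  # 0.5
```

The two annotators agree on six of eight images (accuracy 0.75), but since each assigns the concept to half of the images, an agreement of 0.5 would be expected by chance alone, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.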
Recently, several works have been presented that outsource multimedia annotation tasks to crowdsourcing approaches. According to Howe [10],
crowdsourcing represents the act of a company
or institution taking a function once performed
by employees and outsourcing it to an undefined
(and generally large) network of people in the
form of an open call.
Often the work is distributed over web-based platforms. Utilizing crowdsourcing approaches for assessing ground-truth corpora is mainly motivated by the reduction of costs and time. The annotation task is divided into small parts and distributed to a large community. Sorokin et al. [25] were among the first to outsource image segmentation and labelling tasks to MTurk. The ImageNet database [7] was constructed by utilizing workers at MTurk who validated whether images depict the concept of a certain WordNet node.
Some studies have explored the annotation quality obtained with crowdsourcing approaches. Alonso and Mizzaro [1] examine how well relevance judgments for the TREC topic about the space program can be provided by workers at MTurk. The relevance of a document had to be judged regarding this topic, and the authors compared the results of the non-experts to the relevance assessments of TREC. They found that the annotations of non-experts and TREC assessors are of comparable quality. Hsueh et al. [11] compare the annotation quality of sentiment in political blog snippets from a crowdsourcing approach and expert annotators. They define three criteria, the noise level, the sentiment ambiguity, and the lexical uncertainty, that can be used to identify high-quality annotations. Snow et al. [24] investigate the annotation quality for non-expert annotators in five natural language tasks. They found that a small number of non-expert annotations per item yields performance equal to that of an expert annotator, and propose to model the bias and reliability of individual workers for an automatic noise correction algorithm. Kazai and Milic-Frayling [13] examine measures to assess the quality of collected relevance assessments. They point to several issues, such as topic and content familiarity, dwell time, agreement, or comments of workers, that can be used to derive a trust weight for judgments. Other work deals with how to verify crowdsourced annotations [4], how to deal with several noisy labellers [23, 8], and how to balance pricing for crowdsourcing [9].
Following the work of [25, 7], we obtained annotations for images utilizing MTurk. In our experiments, these annotations are acquired on an image-based level for a multi-label scenario and compared to expert annotations. Extending the work that was performed on inter-annotator agreement [1, 2], we do not just analyse the inter-rater agreement, but also study the effect of multiple annotation sets on the ranking of systems in a benchmark scenario.
3. EXPERIMENTAL SETUP
In this section, we describe the setup of our experiments. First, the dataset used for the experiments on annotator agreement is briefly explained. Next, the process of obtaining expert annotations is illustrated by outlining the design of our annotation tool and the task the experts had to perform. Then, the process of acquiring ground-truth from MTurk is detailed. Finally, the workflow of posing tasks at Amazon MTurk, designing the annotation template, and obtaining and filtering the results is highlighted.
3.1 Dataset
The experiments are conducted on a subset of 99 images from the MIR Flickr Image Dataset [12]. The MIR Flickr Image Dataset consists of 25,000 Flickr images. It was utilized for a multi-label image annotation task at the latest ImageCLEF 2009 [19] competition. Altogether, 18,000 of the images were annotated with 53 visual concepts by expert annotators of the Fraunhofer IDMT research staff. 5,000 images with annotations were provided as a training set, and the performance of the annotation systems was evaluated on 13,000 images. 19 research teams submitted a total of 74 run configurations. The 99 images utilized in our experiments on inter-annotator agreement and its influence on system ranking are part of the test set of the Photo Annotation Task. Consequently, the results of 74 system configurations in automated annotation of these images can serve as a basis for investigating the influence on ranking.
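The influence of the ground-truth on system ranking can be quantified with a rank correlation such as the Kendall τ mentioned in the abstract. The sketch below is a minimal pure-Python version (the tau-a variant, which does not correct for ties) and is not the authors' evaluation code. The run names and scores are invented for illustration.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall tau-a between two score tables for the same systems.

    A pair of systems is concordant if both ground-truths order the two
    systems the same way, and discordant if they order them differently.
    Tied pairs count as neither.
    """
    systems = sorted(scores_a)
    concordant = 0
    discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores of four runs evaluated against two annotation sets.
scores_ground_truth_1 = {"run1": 0.80, "run2": 0.75, "run3": 0.60, "run4": 0.55}
scores_ground_truth_2 = {"run1": 0.78, "run2": 0.79, "run3": 0.58, "run4": 0.50}

tau = kendall_tau(scores_ground_truth_1, scores_ground_truth_2)
print(round(tau, 3))  # 0.667
```

Only the pair (run1, run2) swaps places between the two hypothetical ground-truths, giving five concordant pairs and one discordant pair out of six, hence τ = (5 - 1) / 6. A value close to 1 indicates that changing the annotation set barely changes the ranking.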
3.2 Collecting Data of Expert Annotators
The set of 99 images was annotated with 53 concepts by 11 expert annotators from the Fraunhofer IDMT research staff. We provided the expert annotators with a definition of each concept including example photos (see [18] for a detailed description of the concepts). The 53 concepts to be annotated per image were ordered into several categories. In principle, there were two different kinds of concepts: optional concepts and mutually exclusive concepts. E.g. the category