Evaluation Campaigns and TRECVid

Alan F. Smeaton
Centre for Digital Video Processing & Adaptive Information Cluster
Dublin City University, Glasnevin, Dublin 9, Ireland
firstname.lastname@example.org

Paul Over
Information Access Division, Information Technology Laboratory
National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
email@example.com

Wessel Kraaij
TNO Information and Communication Technology
PO Box 5050, 2600 GB Delft, The Netherlands
firstname.lastname@example.org

ABSTRACT
The TREC Video Retrieval Evaluation (TRECVid) is an international benchmarking activity to encourage research in video information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations(1) interested in comparing their results. TRECVid completed its fifth annual cycle at the end of 2005, and in 2006 TRECVid will involve almost 70 research organizations, universities and other consortia. Throughout its existence, TRECVid has benchmarked both interactive and automatic/manual searching for shots from within a video corpus, automatic detection of a variety of semantic and low-level video features, shot boundary detection, and the detection of story boundaries in broadcast TV news. This paper gives an introduction to information retrieval (IR) evaluation from both a user and a system perspective, highlighting that system evaluation is by far the most prevalent type of evaluation carried out. We also include a summary of TRECVid as an example of a system evaluation benchmarking campaign, and this allows us to discuss whether such campaigns are a good thing or a bad thing. There are arguments for and against these campaigns and we present some of them in the paper, concluding that on balance they have had a very positive impact on research progress.

Categories and Subject Descriptors
H.5.1 [Multimedia Information Systems]: Evaluation/methodology

General Terms
Algorithms, Measurement, Performance, Experimentation

Keywords
Evaluation, Benchmarking, Video Retrieval

(1) Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

Copyright 2006 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
MIR'06, October 26-27, 2006, Santa Barbara, California, USA.
Copyright 2006 ACM 1-59593-495-2/06/0010 ...$5.00.

1. INTRODUCTION

Evaluation campaigns which benchmark IR tasks have become very popular in recent years for a variety of reasons. They are attractive to researchers because they allow comparison of their work with that of others in an open, metrics-based environment. They provide shared data and common evaluation metrics, and often also offer collaboration and sharing of resources. They are also attractive to funding agencies and outsiders because they can act as a showcase for research results.
Analysis, indexing and retrieval of video shots take place each year within the TRECVid evaluation campaign, and this paper presents an overview of TRECVid and its activities. We begin, in section 2, with an introduction to evaluation in IR, covering both user evaluation and system evaluation. In section 3 we present a catalog of evaluation campaigns in the general area of IR and video analysis. Sections 4 and 5 give a retrospective overview of the TRECVid campaign, with attention to the evolution of the evaluation and of participating systems, open issues, etc. In section 6 we discuss whether evaluation benchmarking campaigns like TRECVid, the Text Retrieval Conference (TREC) and others are good or bad. We present a series of arguments for each case and leave the reader to conclude that, on balance, they have had a positive impact on research progress.

2. USER EVALUATION AND SYSTEM EVALUATION OF IR

In the early 1960s, the Cranfield College of Aeronautics wanted to test indexing techniques for text abstracts. They created test queries on a static document collection of some hundreds of documents, and each document was judged as either relevant or not relevant to each of a set of user queries. Based on the combination of documents, user queries and relevance judgments, the researchers were able to evaluate different indexing and retrieval strategies using measures such as precision and recall, which are well known and still used now. That experiment was the first experimental IR evaluation, and the empirical approach to evaluating IR tasks continues today.
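For reference, the two measures popularized by the Cranfield experiments have the following standard definitions for a single query (the textbook formulation, not the notation of the original Cranfield reports): over the set of documents a system retrieves and the set of documents judged relevant,

\[
\text{Precision} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{retrieved}\,|},
\qquad
\text{Recall} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{relevant}\,|} .
\]

Precision thus rewards returning few non-relevant documents, while recall rewards finding as many of the relevant documents as possible.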
When we build an IR system we build it to serve one part or function in an overall information seeking task. We use a search tool, which is what an IR system is, to retrieve documents or images or video clips in response to a specific, formulated search request, but that search request is just one stage of our overall information need. When we use an IR system, we are engaging in information seeking. It follows that what we should evaluate are things like user satisfaction and how well the system fits the task the user is trying to complete. But we cannot do this routinely because it would involve testing with a significant number of real users every time we want to do such an evaluation, which is prohibitively expensive every time we think we have discovered a new indexing or retrieval algorithm or we want to modify and evaluate an existing one. Such evaluations are termed user evaluations; they are performed from an information science viewpoint and are not common. Instead what we do is system evaluation, which is evaluation more from a computer science viewpoint, and that is what is prevalent in IR research. The two approaches are summarized below.

System evaluation:
- tests the quality of an IR system
- processes a high volume of queries
- has no user involvement and simulates an end-user
- is cheap and very popular
- takes place in a highly controlled environment.

User evaluation:
- tests the quality of an IR system and (usually) its interface
- processes a low volume of queries
- has direct user involvement in the evaluation
- is an artificial test.

The development of empirical IR research continues to use test collections of documents, queries and relevance assessments, and has been based on system rather than user evaluation, though a small amount of the latter is carried out. As digital document collections (including texts, web pages, images, videos, music, and others) for personal and for work-related use have exploded in size, IR research has come under increasing pressure to make IR evaluations realistic. The approach of manual relevance judgments carried out in individual laboratories or by individual researchers meant that evaluations on collections of the order of thousands of documents were simply not credible once people started to use collections of millions and then billions of documents. The sheer effort, and cost, of creating a credible evaluation dataset remains beyond the resources of almost all research groups, and so over the last several years we have seen the emergence of the benchmarking evaluation campaigns which we discuss in the next section.

3. BENCHMARKING EVALUATION CAMPAIGNS

Following the realization that benchmarking of IR tasks needed to scale up in size in order to be realistic, the Text Retrieval Conference (TREC) initiative began in 1991 as a reaction to small collection sizes and the need for a more coordinated evaluation among researchers. It was run by NIST and funded by the Disruptive Technology Office (DTO). It set out initially to benchmark the ad hoc search and retrieval operation on text documents and, over the intervening decade and a half, spawned over a dozen IR-related tasks including cross-language, filtering, web data, interactive, high accuracy, blog data, novelty detection, video data, enterprise data, genomic data, legal data, spam data, question-answering and others. 2005 was the 14th TREC workshop and 117 research groups participated. One of the evaluation campaigns which started as a track within TREC but was spun off as an independent activity after two years is the video data track, known as TRECVid, and we shall give further details on TRECVid in the next section of this paper.

The operation of TREC and all its tracks was established from the start and has followed the same basic formula:
- Acquire data and distribute it to participants
- Formulate a set of search topics and release these to participants en bloc
- Allow about 4 weeks before accepting submissions of the top-1000 ranked documents per search topic
- Pool submissions to eliminate duplicates and use manual assessors to make binary relevance judgments
- Calculate Precision, Recall and other derived measures for submitted runs and distribute the results (this scoring step is sketched below)
- Host the workshop at NIST in November
- Make plans, and repeat the process ... for the next 16 years!
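To make the scoring step at the end of this list concrete, the following is a minimal sketch in Python of how Precision, Recall and (non-interpolated) average precision could be computed for one topic from a ranked run and the pooled binary relevance judgments. The function and the toy data are illustrative assumptions for this paper, not NIST's actual evaluation code.

```python
def score_topic(ranked_docs, relevant_docs):
    """Score one topic: ranked_docs is a system's ranked list of document IDs
    (e.g. its top 1000), relevant_docs is the set of document IDs the assessors
    judged relevant after pooling the submitted runs."""
    relevant_found = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant_docs:
            relevant_found += 1
            # precision at the rank of each relevant document retrieved,
            # accumulated for (non-interpolated) average precision
            precision_sum += relevant_found / rank
    precision = relevant_found / len(ranked_docs) if ranked_docs else 0.0
    recall = relevant_found / len(relevant_docs) if relevant_docs else 0.0
    average_precision = precision_sum / len(relevant_docs) if relevant_docs else 0.0
    return precision, recall, average_precision


# Illustrative use on a toy topic: a five-document run, three documents judged relevant.
run = ["doc12", "doc7", "doc3", "doc44", "doc9"]
qrels = {"doc7", "doc44", "doc50"}
print(score_topic(run, qrels))  # -> (0.4, 0.666..., 0.333...)
```

Averaging the last value over all topics gives mean average precision, one of the derived measures commonly reported in such evaluations.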
The approach in TREC has always been metrics-based, focusing on the evaluation of search performance, with measurement typically being some variant of Precision and Recall. Following the success of TREC and its many tracks, many similar evaluation campaigns have been launched in the IR domain; in the video/image area in particular there are evaluation campaigns for basic video/image analysis as well as for retrieval. In all cases these are not competitions with "winners" and "losers"; they are more correctly titled "evaluation campaigns", where interested parties can benchmark their techniques against others, and they normally culminate in a workshop where results are presented and discussed. TRECVid is one such evaluation campaign and we shall see details of it in section 4.

The Cross-Language Evaluation Forum (CLEF) is in its 7th iteration in 2006 and has 74 groups participating, using a total of 12 different languages. CLEF tests aspects of mono- and cross-lingual IR through 8 different tracks including mono-, bi- and multi-lingual document retrieval on news, mono- and cross-lingual retrieval on structured scientific data, interactive cross-lingual retrieval and question-answering, cross-lingual image retrieval, and so on. CLEF is funded by the EU through the DELOS network.

NTCIR is like CLEF, except that it addresses Asian languages (Chinese, Korean and Japanese) and is not as big. 2005 was the 6th running of NTCIR and it follows the TREC model quite faithfully. It covers multi-lingual, bi-lingual and single-language retrieval on three Asian languages as well as question-answering.

INEX is the Initiative for the Evaluation of XML Retrieval and 2006 is the 5th running of the cycle, with 80 participating groups. INEX addresses IR which exploits available structural information (XML elements) to yield more focused retrieval, so a result may be a mixture of paragraphs, sections, etc. The collection used in 2006 is 659,300 Wikipedia articles from 113,483 categories with an average of 161 XML nodes each. Unlike the other evaluation campaigns, and to keep costs down, participants in INEX must create candidate topics in order to gain access to the document collection. The main task in INEX is ad hoc retrieval, plus tasks in natural language queries, heterogeneous documents, interactive retrieval, document mining and Multimedia.