Towards Automatic Evaluation of Metadata Quality in Digital Repositories

Xavier Ochoa¹ and Erik Duval²

¹ Information Technology Center, Escuela Superior Politécnica del Litoral, Vía Perimetral Km. 30.5, Guayaquil, Ecuador
email@example.com
² Dept. Computerwetenschappen, Katholieke Universiteit Leuven, Celestijnenlaan 200A, B-3001 Heverlee, Belgium
Erik.Duval@cs.kuleuven.be

Abstract. Thanks to recent developments in the automatic generation of metadata and in interoperability between repositories, the production, management and consumption of metadata is vastly surpassing the human capacity to review or process this information. However, we need to ensure that low quality metadata does not compromise the performance of the services that a repository provides to its users. We contend that there is a need for automatic assessment of the quality of metadata in digital repositories, so that tools or users can be alerted about low quality records. In this paper, we present several quality metrics for metadata, based on quality evaluation frameworks used for human quality review. We applied these metrics to a sample of records from a real repository and compared the results with the quality assessment given to the same records by a group of human reviewers. Through correlation and regression analysis, we found that one of the metrics, the text information content, could be used as a predictor of the human evaluation. While these metrics are not proposed as a definitive measurement of the complete multi-dimensional quality of a metadata record, we present ways in which they can be used to enhance the functionality of digital repositories.

Key words: Information Quality, Metrics, Metadata, Digital Libraries

1 Introduction

The quality of the metadata records stored in digital repositories is perceived as an important issue for their operation and interoperability. The main functionality of a digital repository, to provide access to resources, can be severely affected by the quality of the metadata. For example, a learning resource indexed with the title "Lesson 1 - Course CS20", without any description or keywords, will hardly appear in a search for materials about "Introduction to Java Programming", even if the described resource is, indeed, a good introductory text to Java. The resource will just be part of the repository but will never be retrieved in relevant searches.
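As a minimal illustration of this retrieval failure, consider the following Python sketch. The records and the naive term-matching search are hypothetical, constructed only for this example; they do not represent any actual repository or search implementation:

# Two hypothetical metadata records: one well described, one carrying
# only an uninformative title.
records = [
    {"title": "Introduction to Java Programming",
     "description": "An introductory text covering Java syntax, classes and objects.",
     "keywords": ["java", "programming", "introduction"]},
    {"title": "Lesson 1 - Course CS20",
     "description": "",
     "keywords": []},
]

def search(query, records):
    """Return the records whose textual fields contain every query term."""
    terms = query.lower().split()
    results = []
    for record in records:
        text = " ".join([record["title"], record["description"],
                         " ".join(record["keywords"])]).lower()
        if all(term in text for term in terms):
            results.append(record)
    return results

# The sparsely described record is never retrieved, even if the resource
# it points to is exactly what the user is looking for.
print([r["title"] for r in search("introduction java programming", records)])
# -> ['Introduction to Java Programming']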
Secondary functions which metadata in a digital repository must fulfill are also heavily compromised by low metadata quality: the metadata record should contain enough information for the user to obtain a good idea of what the described resource is about without the need to directly access the object; incorrect or out-dated information about the URI of the resource could lead to its inaccessibility; repositories with mainly low quality records that belong to a federation could degrade the performance of distributed search; and so on. In consequence, the usefulness of a digital repository is strongly correlated with the quality of the metadata that describe its resources.

Due to its importance, metadata quality assurance has always been an integral part of resource cataloging. Some implementations of digital repositories, nonetheless, have taken a relaxed approach to metadata quality assurance. Most of them relied on the assumption that metadata was created by an expert in the field or by a professional cataloguer, and as such should have an acceptable degree of quality. In reality, experts in a given field are not necessarily experts in metadata creation, and hiring professional indexers to catalog resources is not feasible for most repositories. As repositories grow exponentially (through automatic metadata generation or resource decomposition) and merge (through search federation or metadata harvesting), quality issues become more apparent. This has led to the adaptation of techniques developed for reviewing physical library records to the assessment of digital metadata quality. New techniques that take advantage of computers' ability to perform repetitive calculations have also been proposed to assure a minimum level of quality. A review of previous literature on metadata quality evaluation for digital repositories reveals two general approaches:

– Manual Quality Evaluation. The majority of approaches (see Table 1) manually review a statistically significant sample of metadata records against a predefined set of quality parameters, similarly to the sampling techniques used for quality assurance of library cataloguing. The human evaluations are averaged and an estimation of the quality of the metadata in the repository is obtained. While this is so far the most meaningful way to measure metadata quality in a digital repository, the method has two main disadvantages. First, the manual quality estimation is only valid for the whole repository at a given point in time. The quality of each individual metadata record can only be obtained for those records contained in the sample. Moreover, if a considerable amount of new resources is inserted into the repository, the assessment may no longer be accurate and the estimation must be redone. Second, and more important, obtaining the quality estimation is costly: human experts must review an ever-increasing number of objects. Dushay and Hillman propose the use of visualization tools to help metadata experts in this task, but it remains a mainly manual activity. Because of this last disadvantage, manual review of metadata quality is just a research activity with no practical implications for the functionality or performance of the digital repository.

– Simple Statistical Quality Evaluation. Of the analyzed studies, three follow a different approach (see Table 1): they collect statistical information from all the metadata instances in the repository to obtain an estimation of their quality. Hughes calculates simple automatic metrics (completeness, vocabulary use, etc.) at repository level for each of the repositories in the Open Language Archive. Bui and Park perform a wide study in which more than one million records were reviewed for completeness as a quality measurement. Najjar et al., evaluating the actual use of different metadata fields in the ARIADNE repository, compare the metadata fields that are produced with the metadata fields that are consumed, providing a simplistic estimation of the quality of the metadata in the repository. While all of these approaches can automatically obtain a basic estimation of the quality of each individual metadata record, without the cost involved in manual quality review, they do not provide a level of meaningfulness similar to that of a human generated estimation. They are mainly used as "interesting" information about the repository, without any other real application. (A minimal sketch of the kind of completeness metric these studies compute is given below.)
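The following sketch illustrates, under simplifying assumptions, how a completeness metric of the kind used in the statistical studies above can be computed: the fraction of expected fields in a record that carry a non-empty value. The field list and the example record are hypothetical and do not correspond to any particular metadata standard or repository:

# Hypothetical set of fields expected in each metadata record.
EXPECTED_FIELDS = ["title", "description", "keywords", "author",
                   "language", "format", "rights", "identifier"]

def completeness(record, fields=EXPECTED_FIELDS):
    """Fraction of expected fields with a non-empty value.
    Empty strings and empty lists count as missing."""
    filled = sum(1 for f in fields if record.get(f))
    return filled / len(fields)

def repository_completeness(records):
    """Average per-record completeness over a whole repository."""
    return sum(completeness(r) for r in records) / len(records)

# Example: a record with only a title and an identifier scores 2/8.
sparse = {"title": "Lesson 1 - Course CS20", "identifier": "oai:repo:42"}
print(completeness(sparse))  # -> 0.25

Per-record scores like these can be aggregated at repository level, which is essentially what the statistical studies in Table 1 report. The limitation discussed above is also visible here: a high completeness score says nothing about whether the filled-in values are accurate or useful.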
Table 1. Review of different quality evaluation studies

Study             Approach     # of Records  Main focus of evaluation
Greenberg et al   Manual       11            Quality of non-expert metadata
Shreeves et al    Manual       140           Overall quality of records
Stvilia et al     Manual       150           Identify quality problems
Wilson            Manual       100           Quality of non-expert metadata
Moen et al        Manual       80            Overall quality of records
Hughes            Statistical  27,000        Completeness of records
Najjar et al      Statistical  3,700         Usage of the metadata standard
Bui and Park      Statistical  1,040,034     Completeness of records

An ideal measurement of metadata quality for exponentially growing repositories should comply with two requirements: it should be automatically calculable for each one of the metadata records inserted in the repository, and it should provide a meaningful measurement of quality. None of the reviewed approaches complies with both requirements. The main contribution of this work is the proposal and evaluation of a set of automatically calculable metadata metrics based on the same quality parameters used by human reviewers. This set of metrics can be transformed into an automated metadata quality evaluator that can be used to build tools for any kind of digital repository, and could provide scalable and meaningful metadata quality assurance.

The structure of this paper is as follows. A review is conducted in Section 2 to select an operationalizable framework for measuring metadata quality. In Section 3, several quality metrics, based on the selected framework, are proposed. An experiment is conducted in Section 4 to establish the degree of correlation between the values generated by the metrics and the quality ratings given by human reviewers. Section 5 describes possible applications of the proposed quality metrics. The paper closes with conclusions and ideas for further work.