Incorporating Domain-Specific Information Quality Constraints into Database Queries

SUZANNE M. EMBURY, PAOLO MISSIER, SANDRA SAMPAIO, and R. MARK GREENWOOD
University of Manchester
and
ALUN D. PREECE
Cardiff University

The range of information now available in queryable repositories opens up a host of possibilities for new and valuable forms of data analysis. Database query languages such as SQL and XQuery offer a concise and high-level means by which such analyses can be implemented, facilitating the extraction of relevant data subsets into either generic or bespoke data analysis environments. Unfortunately, the quality of data in these repositories is often highly variable. The data is still useful, but only if the consumer is aware of the data quality problems and can work around them. Standard query languages offer little support for this aspect of data management. In principle, however, it should be possible to embed constraints describing the consumer's data quality requirements into the query directly, so that the query evaluator can take over responsibility for enforcing them during query processing. Most previous attempts to incorporate information quality constraints into database queries have been based around a small number of highly generic quality measures, which are defined and computed by the information provider. This is a useful approach in some application areas but, in practice, quality criteria are more commonly determined by the user of the information, not by the provider. In this article, we explore an approach to incorporating quality constraints into database queries where the definition of quality is set by the user and not the provider of the information. Our approach is based around the concept of a quality view, a configurable quality assessment component into which domain-specific notions of quality can be embedded.
We examine how quality views can be incorporated into XQuery, and draw from this the language features that are required in general to embed quality views into any query language. We also propose some syntactic sugar on top of XQuery to simplify the process of querying with quality constraints.

The Qurator project was supported by a grant from the EPSRC.
Authors' addresses: S. M. Embury, P. Missier, S. Sampaio, and R. M. Greenwood, School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL, UK; email: firstname.lastname@example.org. A. D. Preece, School of Computer Science, Cardiff University, Cardiff, Wales, UK.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from the Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or email@example.com.
© 2009 ACM 1936-1955/2009/09-ART11 $10.00 DOI: 10.1145/1577840.1577846. http://doi.acm.org/10.1145/1577840.1577846.
ACM Journal of Data and Information Quality, Vol. 1, No. 2, Article 11, Pub. date: September 2009.
Categories and Subject Descriptors: H.2.3 [Database Management]: Languages–Query languages

General Terms: Languages

Additional Key Words and Phrases: Information quality, database query languages, XQuery, views

ACM Reference Format:
Embury, S. M., Missier, P., Sampaio, S., Greenwood, R. M., and Preece, A. D. 2009. Incorporating domain-specific information quality constraints into database queries. ACM J. Data Inform. Quality 1, 2, Article 11 (September 2009), 31 pages. DOI = 10.1145/1577840.1577846. http://doi.acm.org/10.1145/1577840.1577846.

1. INTRODUCTION

The modern data consumer is both imaginative and resourceful. New data sources, and new uses for existing data, appear far more rapidly than bespoke analysis tools can be created. Generic data analysis tools, such as spreadsheets, can be used by skilled data consumers, but in many cases information management staff must extract the relevant subsets of the data, in formats suitable for interpretation by consumers with domain rather than technical expertise. Declarative query languages are a valuable support tool for technical personnel in this role, allowing rapid prototyping of reports for ad hoc or novel analyses and easy export of data to generic analysis tools. Queries can also be embedded into programming languages, to offer a concise and high-level substrate for the implementation of bespoke analysis tools, once the need (and justification) for them becomes apparent.

Unfortunately, these same circumstances (new types of data and new applications for existing data) are exactly those in which problems with the quality of data, and therefore the quality of query results, are likely to arise [Blaha 2001]. Missing or inaccurate data, out-of-date or imprecise information will all propagate through queries to produce results that are challenging to interpret, reducing the value of the new analysis tool.
When resources are available and the importance of the analysis justifies the expense, datasets can be cleaned prior to application of queries [Dasu and Johnson 2003]. However, in many cases, especially in the case of ad hoc or novel analyses, prequery cleaning is not cost effective, and the data consumer is left with the responsibility for filtering out bad results using their knowledge of the domain semantics. A preferable solution for such cases would be for the end user's quality requirements to be incorporated into the query by the query writer, so that filtering or cleaning of results could be enforced automatically by the query processor, on only the dataset that is of relevance to the end user's needs.

The ability to express Information Quality (IQ) constraints within a query language offers other advantages. In practice, assessing the quality of data is a complex task, demanding extensive domain knowledge and experience of working with the type of data being assessed.1 If useful constraints on IQ can

1 For examples of the complexity of domain-specific forms of data quality measure, see the work of Burgoon et al., Korn et al., and Heim et al.
be expressed within standard query languages, then the knowledge of domain experts regarding IQ can be packaged in a form that is straightforward for technical staff supporting less expert users to access and reuse. This packaging of domain expertise becomes even more useful when the consumer of query results is not a human but a piece of software, and is therefore even less able to detect mistakes or omissions than a novice user. Data errors can easily spread into new databases through the use of computational processes to derive new knowledge from old (as commonly occurs in e-science), causing the phenomenon known as data pollution [Redman 1996]. In such cases, it can be difficult to discover the source of errors once they have spread across several systems, which makes cleaning even more expensive. Effective, automated assessment of data quality before datasets are used by computational components is one means by which the spread of data pollution can be limited.

1.1 Provider-Centric IQ Assessment

Several researchers have proposed mechanisms by which queries can be expressed over data and IQ metadata (e.g., Naumann et al., Scannapieco et al., and Martinez and Hammer). As we shall discuss (in Section 2), most of this previous work has taken a provider-centric approach to the assessment of IQ. By this, we mean that the task of defining what forms of IQ should be supported by queries and of assessing individual data values against them is the responsibility of the provider(s) of data or the query facility, rather than of the data consumer. In the small number of proposals where this is not the case, data quality measures are precomputed by the execution of custom code or pre-assessed by human intervention, and so are decoupled from the actual action of the query processor.
This provider-centric approach is useful when it is possible for all consumers of a dataset (or collection of datasets) to agree on a small number of widely applicable IQ measures. But, in many domains, and for many datasets, the needs of individual data consumers vary widely, depending on the specific application in hand. This is because IQ, like other forms of quality, is a relative not an absolute concept. According to the most commonly quoted quality definition, information is of high quality if it is fit for purpose [Batini and Scannapieco 2006], something that can only be judged by the consumer of the information. Moreover, what is high quality data for one group of users may be considered poor by others. For example, a common scenario found in both e-business and e-science is that datasets are typically considered to be of acceptable quality for the application for which they were originally created, but are found to be of low quality when applied to a new application [Blaha 2001].

1.2 User-Centric IQ Assessment: A Motivating Example

In this article, we set out to explore the complementary, consumer-centric approach, in which highly domain-specific IQ constraints can be added to database queries by consumers, without imposing any requirements on the owners of the queried sources to provide specific quality metadata or
special-purpose quality measurement functions. This approach is motivated by the observation that, in many information-intensive applications, the decision regarding whether to accept or reject a data item is based on a combination of objective measures (quality indicators) and more subjective, consumer-specific criteria. This is increasingly the case in e-science, for example, where "fitness for purpose" is defined differently by different scientists with different experimental goals in view, even when broadly the same sets of objective quality indicators are used.

The computational unit we use to encapsulate such user-specific definitions of fitness is the quality view: a shareable, reconfigurable IQ component that defines one particular way of assessing the quality of some particular kind of data. Quality views are designed to support users during the quality assessment steps of a quality-aware information lifecycle consisting of (i) information acquisition, (ii) quality assessment, (iii) filtering or editing for quality improvement, and (iv) information use.

To get a flavor of the type of assessments that quality views facilitate, consider the common problem of predicting the correctness of customer address data, when no correctness assurance is provided by the data supplier.2 In such a case, the quality assessment performed by the data consumer will typically be based on heuristics and indirect evidence. Criteria may include, for example, counting the number of addresses recorded for individuals in the dataset, as well as using a trusted reference set to determine the validity of postcodes/zip codes in addresses. Other sources might contain related information, such as a database of records of bill payments, which can be used to cross-check against the address data for consistency and reasonability.
Application of these criteria can be seen as a process consisting of the following main steps:

(1) Issuing a query against the address data to count the number of distinct addresses per individual. These counts are associated with each address as the first quality indicator.

(2) Validating the postcodes in the address data against a reference database of choice. Any invalid or mismatched postcodes are recorded with each address as the second quality indicator.

(3) Issuing of queries to a bill-payment database, where some of the same customers are expected to be found, so that any discrepancies can be recorded against addresses as the third quality indicator.

(4) Combining all three quality indicators into a single quality score, by means of a user-defined quality function, for each address. The quality function encapsulates the "quality knowledge" used to make the assessment, and may be induced as a predictive model using machine learning algorithms or may be the result of user design.

(5) Using the quality score for a given address to decide whether to accept (i.e., trust) it or to reject it.

2 A more complex, real-life example from the life sciences is presented in Section 4.
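The five steps above can be sketched in code. The following Python fragment is our own illustration, not part of the Qurator system; all field names (`person`, `address`, `postcode`), the reference sets, and the scoring function `simple_quality` are invented for the example, standing in for the domain-specific choices the text deliberately leaves open.

```python
def assess_addresses(addresses, valid_postcodes, billing_addresses,
                     quality_function, threshold=0.5):
    """Annotate each address with three quality indicators, combine them
    into a score via a user-defined function, and accept or reject it."""
    # Step 1: count distinct addresses per individual (first indicator).
    per_person = {}
    for a in addresses:
        per_person.setdefault(a["person"], set()).add(a["address"])

    results = []
    for a in addresses:
        indicators = {
            "address_count": len(per_person[a["person"]]),
            # Step 2: postcode validity against a trusted reference set.
            "postcode_valid": a["postcode"] in valid_postcodes,
            # Step 3: cross-check against the bill-payment source.
            "matches_billing": billing_addresses.get(a["person"]) == a["address"],
        }
        # Step 4: combine the three indicators into a single quality score.
        score = quality_function(indicators)
        # Step 5: accept (trust) or reject the address against a threshold.
        results.append({**a, "score": score, "accepted": score >= threshold})
    return results

def simple_quality(ind):
    """A toy hand-designed quality function: penalise duplicate addresses,
    reward corroborating evidence; normalised to [0, 1]."""
    score = 1.0 / ind["address_count"]
    if ind["postcode_valid"]:
        score += 0.3
    if ind["matches_billing"]:
        score += 0.3
    return score / 1.6

# Toy data: one corroborated address, one duplicated pair for "p2".
addresses = [
    {"person": "p1", "address": "1 High St", "postcode": "M13 9PL"},
    {"person": "p2", "address": "9 Low Rd", "postcode": "M1 1AA"},
    {"person": "p2", "address": "2 Low Rd", "postcode": "XXXX"},
]
assessed = assess_addresses(addresses, {"M13 9PL", "M1 1AA"},
                            {"p1": "1 High St"}, simple_quality)
```

In this sketch, `quality_function` is the pluggable element: it could equally be a learned predictive model, as the text notes, without changing the surrounding pipeline.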
Note that steps (2) and (3) as just given may involve the use of similarity measures, and may result in corrections being made to the data where appropriate. The main point, however, is that quality assessment steps are interleaved with data access (i.e., query) steps. Thus, applying this process within a specific user query is a matter of reducing the scope of the quality assessment steps to the data touched by the query, rather than applying them to the entire dataset. More generally, we would like to be able to allow any query Q to be transformed into a quality-aware version Q′ that adds user-specified IQ constraints and that returns only the subset of Q's results that satisfies those constraints.

1.3 Contributions

The main contribution of this article, presented in Section 4, is an analysis of the features that must be supported by a query language in order to allow the incorporation of domain-specific IQ constraints in the form of quality views. (The quality view model itself is described and motivated in Section 3.) In particular, we show how IQ constraints over XML data can be incorporated in XQuery expressions, as well as describing an execution model for the resulting quality-aware XQueries. Quality views were originally designed for access from software, however, and the minimal essential set of language features does little in itself to shield the query writer from the low-level details of the QV API. We therefore further propose some syntactic sugar (tailored for XQuery, but easily adaptable to other contexts) to make queries using quality views shorter and more readable (Section 5). We illustrate the usefulness of the overall approach by giving some example quality-constrained queries from the application domain of proteomics (Section 5.3) and conclude with a discussion of directions for future work (Section 6).
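The transformation of a query Q into a quality-aware counterpart, described in the motivating example above, can be sketched generically. The wrapper below is our own illustration, not the Qurator API: `annotate` stands in for an arbitrary quality view and `constraint` for the consumer's acceptance test, both invented names for this sketch. The key property it demonstrates is that assessment is scoped to the rows Q actually returns, not the whole dataset.

```python
def quality_aware(query, annotate, constraint):
    """Wrap query Q to produce a quality-aware Q': each of Q's results is
    annotated with quality indicators, and only results satisfying the
    user-specified IQ constraint are returned."""
    def q_prime(*args, **kwargs):
        out = []
        for row in query(*args, **kwargs):   # evaluate the original query Q
            annotated = annotate(row)        # assessment scoped to Q's output
            if constraint(annotated):        # consumer's acceptance criterion
                out.append(annotated)
        return out
    return q_prime

# Toy usage: a two-row query, an identity "quality view" over a pre-set
# score field, and a constraint accepting scores above 0.5.
q = lambda: [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.2}]
q_prime = quality_aware(q, lambda r: r, lambda r: r["score"] > 0.5)
filtered = q_prime()
```

In a realistic setting, `annotate` would itself issue the indicator-gathering queries of steps (1)-(3), which is exactly why interleaving assessment with query evaluation, rather than precomputing it, matters.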
Before delving into these details, we present a survey of existing approaches to quality assessment in a query context.

2. RELATED WORK

2.1 Assessing Information Quality

Poor information quality manifests itself in a variety of different costs for organizations that rely on the information for both operational and strategic decision making [Batini and Scannapieco 2006]. Errors in data can directly reduce the amount of productive work that an organization can undertake, due to the need to spend time recovering from errors, but they can also have more far-reaching, less easily quantifiable costs. Customer satisfaction and loyalty can be damaged by errors, reducing the chance for future business, as can employee morale. Similarly, organizations can be prevented from changing their business rules and policies, if software systems cannot easily be adapted, due to unexpectedly poor information quality. Taken together, Redman conservatively estimated these costs at 10% of revenue, but suggested that the actual figure could be closer to 20% for some organizations [Redman 1998].