Dual Assessment of Data Quality
in Customer Databases
ADIR EVEN
Ben-Gurion University of the Negev and
G. SHANKARANARAYANAN
Babson College
Quantitative assessment of data quality is critical for identifying the presence of data defects and the extent of the damage due to these defects. Quantitative assessment can help define realistic quality improvement targets, track progress, evaluate the impacts of different solutions, and prioritize improvement efforts accordingly. This study describes a methodology for quantitatively assessing both impartial and contextual data quality in large datasets. Impartial assessment measures the extent to which a dataset is defective, independent of the context in which that dataset is used. Contextual assessment, as defined in this study, measures the extent to which the presence of defects reduces a dataset's utility, the benefits gained by using that dataset in a specific context. The dual assessment methodology is demonstrated in the context of Customer Relationship Management (CRM), using large data samples from real-world datasets. The results from comparing the two assessments offer important insights for directing quality maintenance efforts and prioritizing quality improvement solutions for this dataset. The study describes the steps and the computation involved in the dual-assessment methodology and discusses the implications for applying the methodology in other business contexts and data environments.
Categories and Subject Descriptors: E.m [Data]: Miscellaneous
General Terms: Economics, Management, Measurement
Additional Key Words and Phrases: Data quality, databases, total data quality management, information value, customer relationship management, CRM
ACM Reference Format:
Even, A. and Shankaranarayanan, G. 2009. Dual assessment of data quality in customer databases. ACM J. Data Inform. Quality 1, 3, Article 15 (December 2009), 29 pages. DOI = 10.1145/1659225.1659228. http://doi.acm.org/10.1145/1659225.1659228.
Authors' addresses: A. Even, Department of Industrial Engineering and Management (IEM), Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel; email: adireven@bgu.ac.il; G. Shankaranarayanan (corresponding author), Technology, Operations, and Information Management (TOIM), Babson College, Babson Park, MA 02457-0310; email: gshankar@babson.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from the Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2009 ACM 1936-1955/2009/12-ART15 $10.00 DOI: 10.1145/1659225.1659228. http://doi.acm.org/10.1145/1659225.1659228.
1. INTRODUCTION
High-quality data makes organizational data resources more usable and, consequently, increases the business benefits gained from using them. It contributes to efficient and effective business operations, improved decision making, and increased trust in information systems [DeLone and McLean 1992; Redman 1996]. Advances in information systems and technology permit organizations to collect large amounts of data and to build and manage complex data resources. Organizations gain competitive advantage by using these resources to enhance business processes, develop analytics, and acquire business intelligence [Davenport 2006]. The size and complexity make data resources vulnerable to data defects that reduce their data quality. Detecting defects and improving quality is expensive, and when the targeted quality level is high, the costs often negate the benefits. Given the economic trade-offs in achieving and sustaining high data quality, this study suggests a novel economic perspective for data quality management. The methodology for dual assessment of quality in datasets described here accounts for the presence of data defects in that dataset, assuming that costs for improving quality increase with the number of defects. It also accounts for the impact of defects on benefits gained from using that dataset.
Quantitative assessment of quality is critical in large data environments, as it can help set up realistic quality improvement targets, track progress, assess impacts of different solutions, and prioritize improvement efforts accordingly. Data quality is typically assessed along multiple quality dimensions (e.g., accuracy, completeness, and currency), each reflecting a different type of quality defect [Wang and Strong 1996]. The literature describes several methods for assessing data quality, and the resulting quality measurements often adhere to a scale between 0 (poor) and 1 (perfect) [Wang et al. 1995; Redman 1996; Pipino et al. 2002]. Some methods, referred to by Ballou and Pazer [2003] as structure-based or structural, are driven by physical characteristics of the data (e.g., item counts, time tags, or defect rates). Such methods are impartial, as they assume an objective quality standard and disregard the context in which the data is used. We interpret these measurement methods as reflecting the presence of quality defects (e.g., missing values, invalid data items, and incorrect calculations). The extent of the presence of quality defects in a dataset, the impartial quality, is typically measured as the ratio of the number of nondefective records to the total number of records. For example, in the sample dataset shown in Table I, let us assume that no contact information is available for customer A. Only 1 out of 4 records in this dataset has missing values; hence, an impartial measurement of its completeness would be (4 − 1)/4 = 0.75.
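To make the computation concrete, the following sketch (ours, not part of the original study; the record values are hypothetical stand-ins for Table I) computes impartial completeness as the ratio of nondefective records to all records:

    # Hypothetical stand-ins for the Table I records; None marks a missing value.
    records = [
        {"id": "A", "contact": None,         "children": 0, "income": 95000},
        {"id": "B", "contact": "b@mail.com", "children": 3, "income": 41000},
        {"id": "C", "contact": "c@mail.com", "children": 1, "income": 88000},
        {"id": "D", "contact": "d@mail.com", "children": 4, "income": 39000},
    ]

    def impartial_completeness(records, attributes):
        # Ratio of records with no missing values (over the listed attributes)
        # to the total number of records.
        non_defective = sum(
            1 for r in records if all(r[a] is not None for a in attributes)
        )
        return non_defective / len(records)

    print(impartial_completeness(records, ["contact", "children", "income"]))  # 0.75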
Other measurement methods, referred to as content-based [Ballou and Pazer 2003], derive the measurement from data content. Such measurements typically reflect the impact of quality defects within a specific usage context and are also called contextual assessments [Pipino et al. 2002]. The data-quality literature has stressed the importance of contextual assessments, as the impact of defects can vary depending on the context [Jarke et al. 2002; Fisher et al. 2003]. However, the literature does not minimize the importance of impartial assessments.
Table I. Sample Dataset
[Table image: four customer records, A-D, with contact, number-of-children, and annual-income attributes; customer A's contact information is missing, B and D have several children and relatively low income, and A and C have relatively higher income.]
In certain cases, the same dimension can be measured both impartially and contextually, depending on the purpose [Pipino et al. 2002]. Given the example in Table I, let us first consider a usage context that examines the promotion of educational loans for dependent children. In this context, the records that matter the most are the ones corresponding to customers B and D: families with many children and relatively low income. These records have no missing values and hence, for this context, the dataset may be considered complete (i.e., a completeness score of 1). For another usage context that promotes luxury vacation packages, the records that matter the most are those corresponding to customers with relatively higher income, A and C. Since 1 out of these 2 records is defective (record A is missing contact information), the completeness of this dataset for this usage context is only 0.5.
In this study we describe a methodology for the dual assessment of quality; dual, as it assesses quality both impartially and contextually and draws conclusions and insights from comparing the two assessments. Our objective is to show that the dual perspective can enhance quality assessments and help direct and prioritize quality improvement efforts. This is particularly true in large and complex data environments in which such efforts are associated with significant cost-benefit trade-offs. From an economic viewpoint, we suggest that impartial assessments can be linked to costs: the higher the number of defects in a dataset, the more effort and time are needed to fix it and the higher the cost of improving the quality of this dataset. On the other hand, depending on the context of use, improving quality differentially affects the usability of the dataset. Hence, we suggest that contextual assessment can be associated with the benefits gained by improving data quality. To underscore this differentiation, in our example (Table I), the impartial assessment indicates that 25% of the dataset is defective. Correcting each defect would cost the same, regardless of the context of use. However, the benefits gained by correcting these defects may vary, depending on the context of use. In the context of promoting luxury vacations, 50% of the relevant records are defective, and correcting them will increase the likelihood of gaining benefits. In the context of promoting educational loans, all the relevant records appear complete, so the likelihood of increasing the benefits gained from the dataset by correcting defects is low.
Using the framework for assessing data quality proposed in Even and Shankaranarayanan [2007] as a basis, this study extends the framework into a methodology for dual assessment of data quality.
Table II. Attributing Utility to Records in a Dataset
[Table image: each of the four customer records is assigned two utility values, one proportional to annual income (luxury-vacation context) and one proportional to the number of children (educational-loan context).]
To demonstrate the methodology, this study instantiates it for the specific context of managing alumni data. The method for contextual assessment of quality (described later in more detail) is based on utility, a measure of the benefits gained by using data. The information economics literature suggests that the utility of data resources is derived from their usage and integration within business processes and depends on specific usage contexts [Ahituv 1980; Shapiro and Varian 1999]. The framework defines data utility as a nonnegative measurement of value contribution attributed to the records in the dataset based on the relative importance of each record for a specific usage context. A dataset may be used in multiple contexts and contribute to utility differently in each; hence, each record may be associated with multiple utility measures, one for each usage context.
We demonstrate this by extending the previous example (see Table II). In the context of promoting luxury vacations, we may attribute utility, reflecting the likelihood of purchasing a vacation, in a manner that is proportional to the annual income; that is, higher utility is attributed to records A and C than to records B and D. In the context of promoting educational loans, utility, reflecting the likelihood of accepting a loan, may be attributed in a manner that is proportional to the number of children. In the latter case, the utilities of records B and D are much higher than those of A and C. We hasten to add that the numbers stated in Table II are for illustration purposes only. Several other factors that may affect the estimation of utility are discussed further in the concluding section.
The presence of defects reduces the usability of data resources [Redman 1996] and hence, their utility. The magnitude of reduction depends on the type of defects and their impact within a specific context of use. Our method for contextual assessment defines quality as a weighted average of per-record defect indicators, where the weights are context-dependent utility measures. In the preceding example (Table II), the impartial completeness is 0.75. In the context of promoting luxury vacations, 40% of the dataset's utility (contributed by record A) is affected by defects (the missing contact information). The estimated contextual completeness is hence 0.6. In the context of promoting educational loans, utility is unaffected (as record A contributes 0 to utility in this context) and the estimated contextual completeness is 1. Summing up both usages, 16% of the utility is affected by defects; hence, the estimated contextual completeness is 0.84. This illustration highlights a core principle of our methodology: high variability in utility-driven scores, and large differences between impartial and contextual
scores may have important implications for assessing the current state of a data resource and prioritizing its quality improvement efforts.
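The utility-weighted computation behind these scores can be sketched as follows. The utility numbers below are illustrative assumptions chosen only to reproduce the 0.6, 1.0, and 0.84 scores quoted above; they are not the actual Table II values.

    # Illustrative utility attributions per usage context; the numbers are chosen
    # only to reproduce the scores in the text, not taken from Table II.
    utility = {
        "vacations": {"A": 16, "B": 4,  "C": 14, "D": 6},   # proportional to income
        "loans":     {"A": 0,  "B": 30, "C": 5,  "D": 25},  # proportional to children
    }
    defective = {"A"}  # record A has a missing contact value

    def contextual_completeness(utility_by_record, defective):
        # One minus the share of utility contributed by defective records.
        total = sum(utility_by_record.values())
        lost = sum(u for rec, u in utility_by_record.items() if rec in defective)
        return 1 - lost / total

    for context, u in utility.items():
        print(context, contextual_completeness(u, defective))  # vacations 0.6, loans 1.0

    # Pooling both usages: 16 of 100 utility units are affected by the defect.
    pooled = {rec: utility["vacations"][rec] + utility["loans"][rec] for rec in "ABCD"}
    print(contextual_completeness(pooled, defective))  # 0.84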
In this study, we demonstrate dual assessment in a real-world data environment and discuss its implications for data quality management. We show that dual assessment offers key insights into the relationships between impartial and contextual quality measurements that can guide quality improvement efforts. The key contributions of this study are: (1) it extends the assessment framework proposed in Even and Shankaranarayanan [2007] and illustrates its usefulness by applying it in a real-world Customer Relationship Management (CRM) setting. (2) It provides a comparative analysis of both impartial and contextual assessments of data quality in the context of managing alumni data. Importantly, it highlights the synergistic benefits of the dual assessments for managing data quality, beyond the contribution offered by each assessment alone. (3) Using utility-driven analysis, this study sheds light on the high variability in the utility contribution of individual records and attributes in a real-world data environment. Further, the study also shows that different types of quality defects may affect utility contribution differently (specifically, missing values and outdated data). The proposed methodology accounts for this differential contribution. (4) The study emphasizes the managerial implications of assessing the variability in utility contribution for managing data quality, especially for prioritizing quality improvement efforts. Further, it illustrates how dual assessment can guide the implementation and management of quality improvement methods and policies.
In the remainder of this article, we first review the literature on quality assessment and improvement that influenced our work. We then describe the methodology for dual assessment and illustrate its application using large samples of alumni data. We use the results to formulate recommendations for quality improvements that can benefit administration and use of this data resource. We finally discuss managerial implications and propose directions for further research.
2. RELEVANT BACKGROUND
We first describe the relevant literature on managing quality in large datasets and assessing data quality. We then discuss, specifically, the importance of managing quality in a Customer Relationship Management (CRM) environment, the context for this study.
2.1 Data Quality Improvement
High-quality data is critical for successful integration of information systems within organizations [DeLone and McLean 1992]. Datasets often suffer defects such as missing, invalid, inaccurate, and outdated values [Wang and Strong 1996]. Low data quality lowers customer satisfaction, hinders decision making, increases costs, breeds mistrust towards IS, and deteriorates business performance [Redman 1996]. Conversely, high data quality can be a
unique source for sustained competitive advantage. It can be used to improve customer relationships [Roberts and Berger 1999], find new sources of savings [Redman 1996], and empower organizational strategy [Wixom and Watson 2001]. Empirical studies [Chengalur-Smith et al. 1999; Fisher et al. 2003; Shankaranarayanan et al. 2006] show that communicating data quality assessments to decision makers may positively impact decision outcomes.
Data Quality Management (DQM) techniques for assessing, preventing, and reducing the occurrence of defects can be classified into three high-level categories [Redman 1996].
(1) Error Detection and Correction. Errors may be detected by comparing data to a correct baseline (e.g., real-world entities, predefined rules/calculations, a value domain, or a validated dataset). Errors may also be detected by checking for missing values and by examining time-stamps associated with data (a minimal sketch of such rule-based checks appears after this list). Correction policies must consider the complex nature of data environments, which often include multiple inputs, outputs, and processing stages [Ballou and Pazer 1985; Shankaranarayanan et al. 2003]. Firms may consider correcting defects manually [Klein et al. 1997] or hiring agencies that specialize in data enhancement and cleansing. Error detection and correction can also be automated; the literature proposes, for example, the adoption of methods that optimize inspection in production lines [Tayi and Ballou 1988; Chengalur et al. 1992], integrity rule-based systems [Lee et al. 2004], and software agents that detect quality violations [Madnick et al. 2003]. Some ETL (Extraction, Transformation, and Loading) tools and other commercial software also support the automation of error detection and correction [Shankaranarayanan and Even 2004].
(2) Process Control and Improvement. The literature points out a drawback with implementing error detection and correction policies. Such policies improve data quality, but do not fix root causes or prevent the recurrence of data defects [Redman 1996]. To overcome this issue, the Total Data Quality Management (TDQM) methodology suggests a continuous cycle of data quality improvement: define quality requirements, measure along these definitions, analyze results, and improve data processes accordingly [Wang 1998]. Different methods and tools for supporting TDQM have been proposed, for example, systematically representing data processes [Shankaranarayanan et al. 2003], optimizing quality improvement trade-offs [Ballou et al. 1998], and visualizing quality measurements [Pipino et al. 2002; Shankaranarayanan and Cai 2006].
(3) Process Design. Data processes can be built from scratch or existing processes redesigned to better manage quality and reduce errors. Process design techniques for quality improvement are discussed in a number of studies (e.g., Ballou et al. [1998], Redman [1996], Wang [1998], and Jarke et al. [2002]). These include embedding controls in processes, supporting quality monitoring with metadata, and improving operational efficiency. Such process redesign techniques can help eliminate root causes of defects, or greatly reduce their impact.
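As a minimal illustration of category (1), the sketch below applies detection rules of the kinds mentioned there: a missing-value check, a value-domain check, and a time-stamp check. The rule thresholds and field names are our own assumptions, not those of any cited system.

    from datetime import datetime, timedelta

    VALID_STATES = {"MA", "NY", "CA"}      # assumed value domain
    MAX_AGE = timedelta(days=3 * 365)      # assumed staleness threshold

    def detect_defects(record, now):
        # Returns the list of rule violations found in one record.
        defects = []
        if record.get("email") in (None, ""):
            defects.append("missing email")
        if record.get("state") not in VALID_STATES:
            defects.append("state outside value domain")
        if now - record["last_updated"] > MAX_AGE:
            defects.append("record not updated recently")
        return defects

    record = {"email": None, "state": "ZZ", "last_updated": datetime(2004, 5, 17)}
    print(detect_defects(record, datetime(2009, 12, 1)))
    # ['missing email', 'state outside value domain', 'record not updated recently']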
Fig. 1. Dimension and fact tables.
Organizations may adopt one or more quality improvement techniques, based on the categories stated previously, and the choice is often influenced by economic cost-benefit trade-offs. Studies have shown that substantial benefits were gained by improving data quality [Redman 1996; Heinrich et al. 2007], although the benefits from implementing a certain technique are often difficult to quantify. On the other hand, quality improvement solutions often involve high costs, as they require investments in labor for monitoring, software development, managerial overheads, and/or the acquisition of new technologies [Redman 1996]. To illustrate one such cost, if the rate of manual detection and correction is 10 records per minute, a dataset with 10,000,000 records will require approximately 16,667 work hours, or approximately 2,083 work days. Automating error detection and correction may substantially reduce the work hours required, but requires investments in software solutions. We suggest that the dual-assessment methodology described here can help understand the economic trade-offs involved in quality management decisions and identify economically superior solutions.
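The back-of-the-envelope cost figure above can be reproduced as follows (the 8-hour work day is our assumption):

    records = 10_000_000        # dataset size
    rate_per_minute = 10        # manual detection-and-correction rate
    hours_per_work_day = 8      # assumed length of a work day

    hours = records / rate_per_minute / 60
    days = hours / hours_per_work_day
    print(round(hours), round(days))  # 16667 2083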
2.2 Improving the Quality of Datasets
This study examines quality improvement in a tabular dataset (a table), a data storage structure with an identical set of attributes for all records within. It focuses on tabular datasets in a Data Warehouse (DW). However, the methods and concepts described can be applied to tabular datasets in other environments as well. Common DW designs include two types of tables: fact and dimension (Figure 1) [Kimball et al. 2000]. Fact tables capture data on business transactions. Depending on the design, a fact record may represent a single transaction or an aggregation. It includes numeric measurements (e.g., quantity and amount), transaction descriptors (e.g., time-stamps, payment and shipping instructions), and foreign-key attributes that link transactions to associated business dimensions (e.g., customers, products, locations). Dimension tables store dimension instances and associated descriptors (e.g., time-stamps, customer names, demographics, geographical locations, products, and categories). Dimension instances are typically the subject of the decision (e.g., target a specific subset of customers), and the targeted subset is commonly defined along dimensional attributes (e.g., send coupons to customers between 25 and 40 years of age and with children). Fact data provide numeric measurements that characterize dimension instances (e.g., the frequency and the total
amount of past purchases). This study focuses on improving the quality of dimensional data. However, in real-world environments, the quality of fact data must be addressed as well, as defective fact data will negatively impact decision outcomes.
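To make the fact/dimension distinction concrete, the sketch below defines a targeted subset along hypothetical dimensional attributes and attaches a fact-table aggregate to each targeted instance; the schema and values are illustrative only.

    # Hypothetical dimension (customers) and fact (transactions) records.
    customers = [
        {"cust_id": 1, "age": 32, "children": 2},
        {"cust_id": 2, "age": 55, "children": 0},
        {"cust_id": 3, "age": 28, "children": 1},
    ]
    transactions = [
        {"cust_id": 1, "amount": 120.0},
        {"cust_id": 1, "amount": 80.0},
        {"cust_id": 3, "amount": 45.0},
    ]

    # Targeted subset defined along dimensional attributes
    # (e.g., customers aged 25-40 with children).
    target = [c for c in customers if 25 <= c["age"] <= 40 and c["children"] > 0]

    # Fact data supply the numeric measurements per dimension instance.
    total_purchases = {}
    for t in transactions:
        total_purchases[t["cust_id"]] = total_purchases.get(t["cust_id"], 0.0) + t["amount"]

    for c in target:
        print(c["cust_id"], total_purchases.get(c["cust_id"], 0.0))  # 1 200.0 / 3 45.0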
Improving the quality of datasets (dimension or fact) has to consider the targeted quality level and the scope of quality improvement. Considering the quality target, at one extreme we can opt for perfect quality and, at the other, opt to accept quality as is without making any effort to improve it. In between, we may consider improving quality to some extent, permitting some imperfections. Quality improvement may target multiple quality dimensions, each reflecting a particular type of quality defect (e.g., completeness, reflecting missing values; accuracy, reflecting incorrect content; and currency, reflecting how up-to-date the data is). Studies have shown that setting multiple targets along different quality dimensions has to consider possible conflicts and trade-offs between the efforts targeting each dimension [Ballou and Pazer 1995; 2003]. Considering the scope of quality improvement, we may choose to improve the quality of all records and attributes identically. Alternatively, we may choose to differentiate: improve only certain records and/or attributes, and make no effort to improve others.
From these considerations of target and scope, different types of quality improvement policies can be evaluated.
(a) Prevention. Certain methods can prevent data defects or reduce their occurrence during data acquisition, for example, improving data acquisition user interfaces, disallowing missing values, validating values against a value domain, enforcing integrity constraints, or choosing a different (possibly more expensive) data source with inherently cleaner data.
(b) Auditing. Quality defects also occur during data processing (e.g., due to miscalculation, or mismatches during integration across multiple sources), or after data is stored (e.g., due to changes in the real-world entity that the data describes). Addressing these defects requires auditing records, monitoring processes, and detecting the existence of defects.
(c) Correction. It is often questionable whether the detected defects are worth correcting. Correction might be time consuming and costly (e.g., when a customer has to be contacted, or when missing content has to be purchased). One might hence choose to avoid correction if the added value cannot justify the cost.
(d) Usage. In certain cases, users should be advised against using defective data, especially when the quality is very poor and cannot be improved.
Determining the target and scope of quality improvement efforts has to consider the level of improvement that can be achieved, its impact on data usability, and the utility/cost trade-offs associated with their implementation [Even et al. 2007]. Our dual-assessment methodology can provide important inputs for such evaluations.
2.3 Managing Data Quality in CRM Environments
We apply the dual-assessment methodology in a CRM setting. The efficiency of CRM and the benefits gained from it depend on the data resources: customer profiles, transaction history (e.g., purchases, donations), past contact efforts, and promotion activities. CRM data supports critical marketing tasks, such as segmenting customers, predicting consumption, managing promotions, and delivering marketing materials [Roberts and Berger 1999]. It underlies popular marketing techniques such as RFM (Recency, Frequency, and Monetary) analysis for categorizing customers [Petrison et al. 1997], estimating Customer Lifetime Value (CLV), and assessing customer equity [Berger and Nasr 1998; Berger et al. 2006]. Blattberg and Deighton [1996] define customer equity as the total asset value of the relationships which an organization has with its customers. Customer equity is based on customer lifetime value, and understanding customer equity can help optimize the balance of investment in the acquisition and retention of customers. A key concern in CRM is that customer data is vulnerable to defects that reduce data quality [Khalil and Harcar 1999; Coutheoux 2003]. Datasets that capture customer profiles and transactions tend to be very large (e.g., the Amazon Web site (www.amazon.com), as of 2007, is reported to manage about 60 million active customers). Maintaining such datasets at high quality is challenging and expensive.
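As a rough illustration of how RFM summaries are derived from transaction-level CRM data (our simplified sketch; the cited RFM literature defines full scoring schemes in more detail), consider:

    from datetime import date

    # Hypothetical transaction history (date, amount) per customer.
    history = {
        "A": [(date(2009, 11, 2), 250.0), (date(2009, 6, 15), 100.0)],
        "B": [(date(2007, 3, 9), 40.0)],
        "C": [(date(2009, 10, 1), 75.0), (date(2009, 1, 20), 60.0)],
    }
    today = date(2009, 12, 1)

    for cust, txns in history.items():
        recency = min((today - d).days for d, _ in txns)  # days since last transaction
        frequency = len(txns)                             # number of transactions
        monetary = sum(amount for _, amount in txns)      # total amount
        print(cust, recency, frequency, monetary)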
We examine two quality defects that are common in CRM environments:
(a) Missing Attribute Values: some attribute values may not be available when initiating a customer profile record (e.g., income level and credit score). The firm may choose to leave these unfilled and update them later, if required. Existing profiles can also be enhanced with new attributes (e.g., email address and a mobile number), and the corresponding values are initially null. They may remain null for certain customers if the firm chooses not to update them due to high data acquisition costs.
(b) Failure to Keep Attribute Values Up to Date: some attribute values are likely to change over time (e.g., address, phone number, and occupation). If not maintained current, the data on customers becomes obsolete and the firm loses the ability to reach or target them. A related issue in data warehouses is referred to as slowly changing dimensions [Kimball et al. 2000]. Certain dimension attributes change over time, causing the transactional data to be inconsistent with the associated dimensional data (e.g., a customer is now married, but the transaction occurred when s/he was single). As a result, analyses may be skewed.
In this study, we focus on assessing data quality along two quality dimensions that reflect the quality defects discussed before: completeness, which reflects the presence of missing attribute values, and currency, which reflects the extent to which attribute values or records are outdated.
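One simple way to operationalize the currency dimension, sketched below under assumed volatility thresholds (not necessarily the measurement used later in this article), is to score each attribute as current if its last update falls within an attribute-specific time window:

    from datetime import datetime, timedelta

    # Assumed volatility thresholds: how long each attribute value is trusted to stay current.
    THRESHOLDS = {
        "address": timedelta(days=2 * 365),
        "phone": timedelta(days=2 * 365),
        "occupation": timedelta(days=5 * 365),
    }

    def currency_score(last_updated, now, thresholds=THRESHOLDS):
        # Fraction of tracked attributes whose last update falls within its threshold.
        current = sum(
            1 for attr, limit in thresholds.items()
            if now - last_updated[attr] <= limit
        )
        return current / len(thresholds)

    last_updated = {
        "address": datetime(2008, 7, 1),     # current
        "phone": datetime(2005, 2, 10),      # outdated
        "occupation": datetime(2006, 3, 3),  # current
    }
    print(currency_score(last_updated, datetime(2009, 12, 1)))  # ~0.67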
With large numbers of missing or outdated values, the usability of certain attributes, records, and even entire datasets is considerably reduced. Firms may consider different quality improvement treatments to address such defects, for example, contacting customers to verify data or hiring agencies to find and validate the data. Some treatments can be expensive and/or fail to achieve the desired results. A key purpose of the dual-assessment method proposed