Introduction Systematic approaches to dealing with missing values in
record linkage are still lacking. This article compares the ad-hoc
treatment of unknown comparison values as 'unequal' with other, more
sophisticated approaches. The methods were evaluated empirically on
real-world data and on simulated data derived from them.
Material and Methods Cancer registry data and artificial data with
increased numbers of missing values in a relevant variable are used for
empirical comparisons. Classification and regression trees serve as the
classification method. On the resulting binary comparison patterns,
the following strategies for dealing with missingness are considered:
imputation with unique values, sample-based imputation, reduced-model
classification and complete-case induction. These approaches are
evaluated according to the amount of training data needed for induction
and the F-scores achieved.
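To make the strategies above concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that a comparison pattern is a list with 1 for agreement, 0 for disagreement and `None` for a missing comparison, and that the F-score is the balanced F1 measure. The data and all function names are hypothetical. Reduced-model classification, which would fit a separate tree for each set of observed attributes, is omitted for brevity.

```python
import random

# Hypothetical comparison patterns: 1 = agree, 0 = disagree, None = missing (NA).
patterns = [
    [1, 1, None],
    [1, 0, 1],
    [None, 1, 1],
    [0, 0, 0],
]

def impute_unique(rows, value):
    """Unique-value imputation: replace every missing entry with one fixed value."""
    return [[value if x is None else x for x in row] for row in rows]

def impute_sample(rows, rng):
    """Sample-based imputation (one simple variant, assumed here):
    draw each missing entry from the observed values of the same column."""
    columns = list(zip(*rows))
    observed = [[x for x in col if x is not None] for col in columns]
    return [
        [rng.choice(observed[j]) if x is None else x for j, x in enumerate(row)]
        for row in rows
    ]

def complete_cases(rows):
    """Complete-case selection: keep only patterns without missing entries,
    e.g. as the training set for tree induction."""
    return [row for row in rows if None not in row]

def f_score(tp, fp, fn):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

zero_imputed = impute_unique(patterns, 0)    # NA treated as 'unequal'; data stay binary
half_imputed = impute_unique(patterns, 0.5)  # NA coded as a third, intermediate value
sampled = impute_sample(patterns, random.Random(0))
training_set = complete_cases(patterns)
```

Imputing zero keeps every pattern binary, whereas imputing 0.5 introduces a third value that a classifier can split on separately.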
Results The evaluations reveal that unique-value imputation leads to the
best results. Imputation with zero is preferred to imputation with 0.5,
although the latter shows the highest median F-scores: imputation with
zero needs considerably less training data, yields only slightly worse
results, and simplifies the computation by maintaining the binary
structure of the data.
Conclusions The results support the ad-hoc solution for missing values,
'replace NA by the value of inequality'. This conclusion is based on a
limited amount of data and on a specific deduplication method.
Nevertheless, the authors are confident that other empirical analyses
and applications will confirm their results.