Data sampling approaches with severely imbalanced big data for Medicare fraud detection


Abstract

Class imbalance is an important problem in machine learning. With increases in available information and the growing use of Big Data sources to extract meaning from data, the challenges associated with class imbalance continue to influence research and shape business value. In this paper, we focus on using highly imbalanced Big Data from Medicare to detect provider claims fraud. We combine three Medicare parts and generate fraud labels using real-world excluded providers. The number of known fraudulent providers is very small, with 0.062% of the combined dataset being labeled as fraud, indicating severe class imbalance. To address class imbalance concerns, we provide experimental results incorporating six different data sampling methods (undersampling and oversampling) to create datasets for five class ratios (imbalanced to balanced), as well as using the full dataset (with no sampling). Three state-of-the-art machine learning models with Apache Spark are used to assess Medicare fraud detection performance across data sampling methods and class ratios. We demonstrate that data sampling, in particular random undersampling, presents good results across all learners, whereas oversampling provides no benefit versus models built using the full dataset.
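As a rough illustration of the random undersampling idea described in the abstract, the sketch below keeps every minority-class (fraud) instance and randomly discards majority-class instances until the minority class reaches a chosen fraction of the data. It is not the authors' Apache Spark implementation: the function name `random_undersample`, the `minority_fraction` parameter, the toy label list, and the 50% target are all assumptions made for illustration, and it uses plain Python rather than Spark.

```python
import random


def random_undersample(labels, minority_fraction, seed=0):
    """Return sorted indices that keep all positives (label 1) and a random
    subset of negatives (label 0) so positives make up roughly
    `minority_fraction` of the result.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must be strictly between 0 and 1")

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    # Solve pos / (pos + keep) = minority_fraction for the number of negatives.
    keep = round(len(pos) * (1 - minority_fraction) / minority_fraction)
    keep = min(keep, len(neg))  # cannot keep more negatives than exist

    return sorted(pos + rng.sample(neg, keep))


# Toy data (assumed): 5 positive labels among 1,000 instances.
labels = [1] * 5 + [0] * 995
idx = random_undersample(labels, minority_fraction=0.5)
balanced = [labels[i] for i in idx]
print(len(balanced), sum(balanced))
```

Oversampling works in the opposite direction, adding minority instances instead of removing majority ones, which grows the dataset rather than shrinking it.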


Bauder, R. A., Khoshgoftaar, T. M., & Hasanin, T. (2018). Data sampling approaches with severely imbalanced big data for Medicare fraud detection. In Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI (Vol. 2018-November, pp. 137–142). IEEE Computer Society. https://doi.org/10.1109/ICTAI.2018.00030
