In this paper we propose a new generic method for handling categorical variables in sequential data. Our main contributions are: (1) the use of unsupervised methods to extract sequential information, and (2) the generation of embeddings that encode this sequential information for categorical variables, using the well-known Word2Vec neural network. Compared with the commonly used One-Hot encoding, the embeddings not only reduce memory usage but also improve the capacity of machine learning algorithms to learn from the data. We applied these processes to a real-world credit card fraud dataset containing more than 400 million transactions over a one-year time window. We show that memory usage is reduced by 50% and performance improves by 3 percentage points while using only a small subset of features.
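To make the idea of extracting sequential information concrete, the sketch below shows one plausible way to turn transactions into ordered token sequences that a Word2Vec-style model could consume. The abstract does not say how the sequences are built, so the grouping by card, the field names (`card_id`, `timestamp`, `merchant_category`), and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy transactions: (card_id, timestamp, merchant_category).
# The real dataset and its field names are not described in the abstract.
transactions = [
    ("card_a", 3, "grocery"),
    ("card_a", 1, "fuel"),
    ("card_b", 2, "travel"),
    ("card_a", 2, "restaurant"),
    ("card_b", 1, "grocery"),
]


def build_sequences(rows):
    """Group rows by card and return each card's categories in time order."""
    by_card = defaultdict(list)
    for card_id, timestamp, category in rows:
        by_card[card_id].append((timestamp, category))
    # Sorting the (timestamp, category) pairs orders each card's events by time.
    return {
        card: [category for _, category in sorted(events)]
        for card, events in by_card.items()
    }


sequences = build_sequences(transactions)
# Each list plays the role of a "sentence" whose "words" are category values.
sentences = list(sequences.values())
```

Under these assumptions, `sentences` could be passed to a Word2Vec implementation (for example, the `Word2Vec` class in gensim accepts a list of token lists), giving each categorical value a dense vector whose width is chosen independently of the number of distinct categories, unlike a One-Hot encoding.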
CITATION STYLE
Russac, Y., Caelen, O., & He-Guelton, L. (2018). Embeddings of Categorical Variables for Sequential Data in Fraud Context. In Advances in Intelligent Systems and Computing (Vol. 723, pp. 542–552). Springer Verlag. https://doi.org/10.1007/978-3-319-74690-6_53