Research in data mining and knowledge discovery in databases has mostly concentrated on developing good algorithms for various data mining tasks (see, for example, the recent proceedings of the KDD conferences). Some parts of the research effort have gone into investigating the data mining process, user interface issues, database topics, or visualization [7]. Relatively little has been published about the theoretical foundations of data mining. In this paper I present some possible theoretical approaches to data mining. The area is in its infancy, and there are probably more questions than answers in this paper.

First of all, one has to answer questions such as "Why look for a theory of data mining? Data mining is an applied area, so why should we care about having a theory for it?" Probably the simplest answer is to recall the development of the area of relational databases. Databases already existed in the 1960s, but the field was considered a murky backwater of different applications without any clear structure and without any interesting theoretical issues. Codd's relational model was a nice and simple framework for specifying the structure of data and the operations to be performed on it. The mathematical elegance of the relational model made it possible to develop advanced methods for query optimization and transactions, and these in turn made efficient general-purpose database management systems possible. The relational model is a clear example of how theory in computer science has transformed an area from a hodgepodge of unconnected methods into an interesting and understandable whole, and at the same time enabled an area of industry.

Given that theory is useful, what properties should a theoretical framework satisfy in order for it to be called a theory for data mining? The example of the relational model can serve us here as well.
First of all, the theoretical framework should be simple and easy to apply; it should (at least some day) give us useful results that we could apply to the development of data mining algorithms and methods. A theoretical framework should also be able to model typical data mining tasks (clustering, rule discovery, classification), be able to discuss the probabilistic nature of the discovered patterns and models, be able to talk about data and inductive generalizations of the data, and accept the presence of different forms of data (relational data, sequences, text, web). Also, the framework should recognize that data mining is an interactive and iterative process, where comprehensibility of the discovered knowledge is important and the user has to be in the loop, and that there is no single criterion for what an interesting discovery is. (We could also ask, "What actually is a theory?" For that I have a simple answer: we recognize a theory when we see it.)

I start by discussing reductionist approaches, i.e., ways of looking at data mining as a part of some existing area, such as statistics or machine learning; in this case, of course, there is little need for new theoretical frameworks. Then I discuss the probabilistic approach, which is of course closely linked to statistics: it views data mining as the activity of understanding the underlying joint distribution of the data. After that, I review the data compression approach to the theory of data mining. The very interesting microeconomic viewpoint on data mining is considered next, and finally I look at the concept of inductive databases and show how it can perhaps be used to understand and develop data mining.
Mannila, H. (2000). Theoretical frameworks for data mining. ACM SIGKDD Explorations Newsletter, 1(2), 30–32. https://doi.org/10.1145/846183.846191