The need to analyse, process and extract knowledge from large amounts of data has been a central concern for computer scientists and researchers since the early years of database systems. In the last decade, however, the rate at which data is created and stored has increased dramatically, and everything indicates that it will continue to grow. It is estimated that almost 2.5 quintillion bytes of data are created daily.
The main reasons this deluge of data is growing so fast are: (a) the development of efficient software and management systems that have made it feasible to store and process large amounts of data, (b) the growth of the global internet population, (c) the growth of cloud-based services and platforms, (d) the evolution of mobile technology and, most of all, (e) the extensive use of the internet in everyday life, including social media applications.
In all modern societies, storing and maintaining historical data, as well as extracting knowledge from databases, has become a matter of great importance. This affects both the private and the public sector, in every industry. As technologies and methodologies for storing and processing large amounts of data keep evolving, extracting knowledge from that data has become a major concern for data scientists.
So, what exactly is Knowledge Discovery in Databases (KDD), and how does it compare to Data Mining (DM)?
KDD is described as the automatic, exploratory analysis and modelling of large data repositories. More precisely, KDD is “the organised process of identifying valid, novel, useful, and understandable patterns from large and complex data sets” (Maimon & Rokach, 2005). Data Mining (DM), as one part of the KDD process, is a term coined to describe the process of sifting through large databases for interesting patterns and relationships. Although contemporary researchers tend to identify DM with the whole KDD process, DM is more precisely the core of the KDD process, “involving the inferring of algorithms that explore the data, develop the model for understanding phenomena from the data, analysis and prediction and discover previously unknown patterns” (Maimon & Rokach, 2005). DM has also been described as the “non-trivial process of discovering interesting, implicit, and previously unknown knowledge from large databases” (Han, Kamber & Tung, 2001).
The KDD process consists of five distinct stages: (a) data selection (also known as data extraction), (b) pre-processing, (c) transformation, (d) data mining and (e) evaluation.
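To make the five stages more tangible, here is a minimal sketch of them as a chain of plain Python functions. Everything in it is an assumption made for illustration: the `RAW` records, the field names, the function names and, in particular, the trivial "above or below average" rule standing in for a real data mining technique. It shows only how the output of each stage feeds the next.

```python
from statistics import mean, pstdev

# Hypothetical raw records; customers and values are invented for illustration.
RAW = [
    {"customer": "a", "age": 23, "spend": 120.0},
    {"customer": "b", "age": None, "spend": 80.0},
    {"customer": "c", "age": 41, "spend": 560.0},
    {"customer": "d", "age": 39, "spend": 610.0},
    {"customer": "e", "age": 25, "spend": 95.0},
]


def select(records, fields):
    """(a) Selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def preprocess(records):
    """(b) Pre-processing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records, field):
    """(c) Transformation: standardise one numeric field to z-scores."""
    values = [r[field] for r in records]
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def mine(scores):
    """(d) Data mining: a deliberately trivial stand-in for a real technique."""
    return ["high" if z > 0 else "low" for z in scores]


def evaluate(labels):
    """(e) Evaluation: summarise the result so a person can judge it."""
    return {label: labels.count(label) for label in sorted(set(labels))}


def kdd_pipeline(records):
    selected = select(records, ["age", "spend"])
    clean = preprocess(selected)
    scores = transform(clean, "spend")
    return evaluate(mine(scores))


print(kdd_pipeline(RAW))
```

In a real project each stage would be far more involved, and the stages are usually revisited several times rather than run once in a straight line.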
Data mining is the step immediately before evaluation. In this step, the transformed data are processed with data mining techniques, such as clustering algorithms, to search for existing patterns. DM can be an extremely complex process, particularly when it is applied to large databases or big data.
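As one concrete instance of a clustering algorithm, the sketch below implements a bare-bones k-means in plain Python. The choice of k-means is ours, not something the text prescribes, and the `POINTS` data, the function name `k_means` and the naive seeding (the first `k` points) are assumptions made to keep the example short and deterministic.

```python
from math import dist

# Hypothetical, already-transformed 2-D records (invented for illustration).
POINTS = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]


def k_means(points, k, iterations=10):
    """Repeatedly assign points to the nearest centroid, then recompute centroids."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


centroids, clusters = k_means(POINTS, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

On this toy data the two groups separate cleanly. On large databases the same loop becomes expensive, since every iteration compares every point with every centroid, which is one reason production systems rely on optimised library implementations.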
In summary, the KDD process, and data mining in particular, have become extremely popular fields in computer science over the past 10 years. The evolution of database management systems and data visualisation software, their emerging functionality, the interconnectivity among them and the rapidly increasing amount of data coming from many sources have made KDD and data mining methodologies more important than ever.
- Han, J., Kamber, M., & Tung, A. K. H. (2001). Spatial clustering methods in data mining: A survey.
- Maimon, O., & Rokach, L. (2005). Decomposition methodology for knowledge discovery and data mining. Data mining and knowledge discovery handbook, 981-1003.