An introduction to large-scale data processing: Installation Guide of Apache Spark 2.3.1 and Hadoop 3

Given that reporting tools and related technologies evolve constantly, the need to process data close to real time is growing. The evolution of mobile technology, the heavy use of social media applications, the development of location-based services and the rising rate of daily data production all confirm that large-scale data processing is steadily gaining in importance.

One of the most widely used platforms for big data processing is Apache Hadoop. Briefly, Hadoop is a distributed computing framework that consists of two main components, HDFS and MapReduce. HDFS handles data storage across multiple machines, whereas MapReduce handles the processing of that data across the machines of the cluster. Because MapReduce jobs run as batches and take a significant amount of time to execute, MapReduce is poorly suited to real-time data processing. Streaming analytics platforms were introduced to fill this gap.
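To make the MapReduce model more concrete, here is a minimal single-machine sketch of a word count in plain Python. It only illustrates the map, shuffle and reduce steps and is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big clusters", "data moves fast"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps would run in parallel on different machines and the input would be read from HDFS; here everything runs in one process for clarity.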

Comparison of Clustering Algorithms: K-Means, DBSCAN and Ward’s method

Several clustering algorithms have been introduced in the literature in the last 10 years. The choice of clustering method depends on its complexity, the amount of data, the purpose of the clustering and the predefined parameters. This case study presents three of the most widely used clustering algorithms: K-means, DBSCAN and Ward’s method.


K-means belongs to the partitioning family of spatial clustering algorithms. It is a frequently used clustering method and one of the simplest unsupervised learning algorithms. K-means defines clusters by partitioning all observations into groups, so that each observation belongs to the group with the nearest mean. The algorithm iterates, reducing the sum of squared distances from the points to their assigned cluster centres, until the assignments stop changing, which yields a local minimum. The end result of the k-means algorithm is a partitioning of the data space into Voronoi cells.
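The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard assign-then-update loop (Lloyd's algorithm), assuming points are tuples of numbers and initial centres are drawn at random from the data; the names `k_means`, `squared_distance` and `mean`, and the `seed` and `max_iter` parameters, are choices made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centres, labels) after iterating until the labels stop changing."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # assumption: random data points as initial centres
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centre.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so a local minimum has been reached
        labels = new_labels
        # Update step: move each centre to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = mean(members)
    return centres, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centres, labels = k_means(data, k=2)
    print(centres, labels)
```

Because the result depends on the initial centres, practical implementations usually run the algorithm several times with different starting points and keep the best partition.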

Overview of Spatial Data Mining Techniques and Spatial Clustering Algorithms

Many definitions of the term spatial data have been proposed in the literature. One of the simplest describes spatial data as “information related to the space occupied by objects” (Kolatch, 2001). Spatial data can also be defined as any structured or unstructured data that refers to a specific location within a certain area. The area could be a two-dimensional or a multidimensional space, for example the surface of the earth or an imaginary multidimensional space.

In data science and computer science, spatial data differ from ordinary data. Spatial data are stored in databases with spatial extensions. As a result, they use specific data types (point, polygon, line, geometry collection, etc.), formats and functionalities, according to the capabilities of each database management system. Thus, Spatial Data Mining (SDM) methods differ from those used in mining regular data.
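As a rough illustration of what such spatial data types look like, the following sketch models a point and a polygon in plain Python and serialises them to Well-Known Text (WKT), a common text format for geometries. The class and method names (`Point`, `Polygon`, `wkt`, `area`) are hypothetical and do not correspond to any particular database system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Point:
    x: float
    y: float

    def wkt(self) -> str:
        """Well-Known Text representation of the point."""
        return f"POINT ({self.x} {self.y})"


@dataclass(frozen=True)
class Polygon:
    # Exterior ring, listed without repeating the first vertex at the end.
    vertices: List[Point]

    def wkt(self) -> str:
        """Well-Known Text representation; WKT rings repeat the first vertex."""
        ring = self.vertices + [self.vertices[0]]
        coords = ", ".join(f"{p.x} {p.y}" for p in ring)
        return f"POLYGON (({coords}))"

    def area(self) -> float:
        """Planar area of a simple polygon via the shoelace formula."""
        total = 0.0
        n = len(self.vertices)
        for i in range(n):
            a = self.vertices[i]
            b = self.vertices[(i + 1) % n]
            total += a.x * b.y - b.x * a.y
        return abs(total) / 2.0


if __name__ == "__main__":
    rectangle = Polygon([Point(0, 0), Point(4, 0), Point(4, 3), Point(0, 3)])
    print(rectangle.wkt())
    print(rectangle.area())
```

A spatial database offers the same kind of types natively, together with indexes and functions (distance, containment, intersection) that ordinary column types lack.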

Creating powerful dashboards: Analytics and Data Storytelling

Storytelling is one of the oldest forms of art. Since the very beginning of human history, storytelling has been the most powerful and communicative way to share information. This type of communication differs from reading and writing in that the result is adapted to the audience and, even more, shaped by the teller’s skills. An event that has already happened, or one that is invented in the case of imaginary stories, is reproduced in a specific way by the teller for a specific audience. Thus, the story takes on characteristics that make the message easier to transmit and more efficient to share (Smith, 2015).

However, the word “storytelling” is used in many ways. This article defines storytelling as a way of transmitting a message in an entertaining and memorable manner.

List of the most widely used tools for Data Mining

As data mining tasks become more crucial day by day, the number of data mining tools and techniques is rapidly increasing. A significant number of software packages have been developed that provide scientists and analysts with the appropriate tools to perform data mining tasks and apply mining algorithms.

Some of the most frequently used technologies for data mining are programming languages such as R, Python, Java, Scala and Julia. There are also plenty of desktop applications that are used for data mining activities. The list describes five of the most widely used tools for data mining.

An introduction to Knowledge Discovery in Databases and Data Mining

The need to analyse, process and extract knowledge from large amounts of data has been a critical subject for computer scientists and researchers since the early years of databases. In the last decade, however, the speed at which data is created and stored has increased exponentially, and everything indicates that it will continue to grow. It is estimated that almost 2.5 quintillion bytes of data are created daily.

The main reasons this deluge of data is growing so fast are: (a) the development of efficient software and management systems that made it possible to store and process large amounts of data, (b) the growth of the global internet population, (c) the growth of cloud-based services and platforms, (d) the evolution of mobile technology and, most of all, (e) the heavy use of the internet in everyday life, including social media applications.