Comparison of Clustering Algorithms: K-Means, DBSCAN and Ward’s method

Several clustering algorithms have been introduced to literature in the last 10 years. Clustering methods usage depends on their complexity, the amount of data, the purpose of clustering and the predefined parameters. This case study, presents three of the most used clustering algorithms, K-means, DBSCAN and Ward’s method.

K-means

K-means belongs to partitioning spatial clustering algorithms. It is a frequently used clustering method and it is one of the simplest unsupervised learning algorithms. K-means defines clusters by partitioning all observations into groups, in which each observation belongs to the group with the nearest mean. The algorithm operates in iterations until the sum of squares from points to the assigned cluster centres is minimised. The end result of k-means algorithm is the partitioning of the data space into Voronoi cells.

In general, the algorithm aims to minimise an objective function where the key parameter is the distance between each data point and its cluster centre (Varghese & Unnikrishnan, 2013). K-means, in its standard version, uses one input parameter that is the number of the expected clusters. The cluster centres, which are initially defined, are selected randomly among all points. Thus, k-means may produce different results in every execution.

Ward’s method

Ward’s method belongs to hierarchical spatial clustering methods. This method involves an agglomerative clustering algorithm. The algorithm starts from the leaves and works up to the root, or in other words operates in a bottom-up approach.

Ward’s method starts with n clusters of size 1, where n is the total number of observations or total number of points, and continues until all the observations are included into one or the preferable number of clusters. Ward’s algorithm uses one input parameter that is the number of the expected clusters.

DBSCAN

DBSCAN it is a commonly used clustering algorithm. It belongs to density based spatial clustering methods. DBSCAN is a fundamental data clustering technique for finding arbitrary shape clusters and for detecting outliers. DBSCAN de fines clusters depending on density, where density is defined as the number of points within a certain distance of each other (Varghese & Unnikrishnan, 2013).

Moreover, DBSCAN de fines clusters depending on density reachability and density connectivity among data points. Density reachability depends on the distance (Eps) between point A from point B and the number of points in point’s A neighbours, which are within this distance. Density connectivity depends on the existence of a point A, which has sufficient number of points in its neighbours, and both the points A and B are within the Eps distance. DBSCAN uses two parameters, Eps and MinPts to control the density of the cluster. Minpts, indicates the minimum number of data points in a neighbourhood to define a cluster and Eps indicates the radius of the neighbourhood around a data point.

Case study

This case study presents the results on using the above spatial clustering  algorithms to define clusters in a set of sample data. Each one is used to produce a predefined or expected number of clusters from a set of points.

K-means and Ward’s clustering methods are able to produce a specific number a clusters from a given dataset, as the number of clusters is an  necessary input parameter in both algorithms. On the other hand, with DBSCAN the numbers of clusters are not predefined. Thus, after a serial of executions and after setting different values to Eps and MinPts, the input parameters of DBSCAN where defined in order to produce the expected number of clusters.

Data Mining Tools

For the execution of k-means algorithm, ward’s clustering algorithm and dbscan clustering algorithm the functions kmeans(), hclust() and dbscan() were used respectively and implemented with RStudio. Leaflet library was used for visualisation purposes. Also, there is an R package that made it easy to integrate and configure Leafet.js library in R. Except from the visualisation of spatial data with Leaflet, plots and diagrams were produced using RStudio environment, and especially ggplot and magrittr libraries. The tools used for data extraction, data clearance, spatial transformations and spatial data storage tasks were QGIS software and PostGIS/PostgreSQL database.

Sample Data

Data concerned points of interest located in the wider area of Luxemburg. The points were collected randomly from OSM data and afterwards were selected based on density criteria. The objective was to create three equal samples, that means samples with equal number of points included in areas with equal surface, which would differ in terms of points density.    Therefore, three set of 200 points were created and de fined as samples. In sample C points can be separated in groups more easily than in other samples, in sample B points are uniformly distributed in the area of study and thus it is very difficult to be partitioned in groups and in sample A points are in an intermediate state between sample B and sample C.

Sample A is characterised by disparities in the distribution of points. In the northwest and western parts of the study area, some groups of points are distinguished. This means that any of the algorithms applied should easily identify these groups of points. In the eastern part of sample A, the points are widely spread out and some among them can be identified as outliers or else noisy data. Thus, the method that is expected to define clusters in the most efficient way is DBSCAN.

On the other hand, sample B is characterised by an even distribution of the points in the study area. In the southern part of the area, a group of points are distinguished, which should be as in the previous sample identified by the algorithm. Except for this group, the density of the points is almost the same within the boundaries of the sample. Therefore the results of sample B are expected to differ in each of the three algorithms applied.

Sample C is characterised by distinct groups of points. The distribution of the points composes a picture of 10 groups, which can be easily distinguished by the human eye. The algorithms should be able to locate and separate each and every group in a similar format. However, the density of each group differs in such a way that no group has the same number of points. In this case, the method that is expected to separate the sample in 10 clusters and have more actual results is the hierarchical clustering method.

The objective at this point is to execute k-means, DBSCAN and Ward’s algorithms in every sample in order to create 10 clusters of points, compare the results and con firm the assumptions described above.

K-means, Ward’s and DBSCAN clustering algorithms are applied to each sample of points in order to produce 10 clusters out of 200 points. The number of points inside each cluster is not taken into consideration.

Experimental results    The first sample contains the most arbitrary shapes of grouped points. As expected, DBSCAN algorithm provides with the most efficient clusters among all three methods and the initial assumption is confirmed.

The key parameter of its effectiveness is the fact that DBSCAN takes into account noisy data and furthermore defines clusters depending on connectivity and reachability among points. Therefore, DBSCAN is the only method that is able to identify the group of points in the southern location of the study area (orange cluster in DBSCAN’s output) as a single cluster.    The results of the second sample vary in all cases. The plots show that k-means is the algorithm that defines clusters with relatively uniform sizes whereas DBSCAN identifies different sized clusters with more convex shapes. In contrary to DBSCAN, Ward’s method results are relative to k-means output.

The only method that identifies clusters with larger size is DBSCAN. Thus, in this case it is more difficult to conclude which algorithm produces the most efficient clusters. However, all three methods produced very different groups as it was expected.    The third sample is the one that can be more easily separated into the exact number of 10 groups. However, the results indicate that k-means algorithm separates a group of points in the centre of the study area into two different groups. In addition to k-means, DBSCAN’s sensitivity to noisy data identifies some points in the western side of the area as outliers and thus excludes them from the clusters.

The only algorithm that produces the expected clusters is the Ward’s hierarchical clustering method. Unlike DBSCAN and k-means, Ward’s can identify concentric clusters. Therefore provides with better clustering results and con firms the initial expectations.

References

• Varghese, B. M., & Unnikrishnan, A. (2013). Spatial clustering algorithms – An overview. Asian journal of computer science & information technology, 3 (1).