Overview of Spatial Data Mining Techniques and Spatial Clustering Algorithms

Many definitions have been stated in literature for the term of spatial data. One of the simplest definition of spatial data, describes spatial data as “information related to the space occupied by objects” (Kolatch, 2001). Moreover, spatial data can be defined as any structured or unstructured data that refers to a specific location of a certain area. The area could be a two-dimensional or a multidimensional space, as for example the surface of the earth or an imaginary multidimensional space.

In data science and computer science, spatial data differ from ordinary data. Spatial data are stored in databases with spatial extension. In this way, they use specific data types (point, polygon, line, geometry collection etc.), formats and functionalities, according to the capabilities of each database management system. Thus, Spatial Data Mining (SDM) methods differ from those used in mining regular data.

SDM is defined as the process of extracting knowledge, spatial relationships and previously unknown patterns from spatial data. SDM techniques can be classified into two main categories, the descriptive data mining techniques and the predictive data mining techniques. Furthermore, the basic tasks proposed for SDM include: (a) classification, (b) association rules, (c) characteristics rules, (d) discriminant rules, (e) clustering and (f) trend detection (Kumar, C. N. S., Ramulu, Reddy, Kotha, & Kumar, C. M., 2012; Sumathi, Geetha & Bama, 2008).

Classification

Classification is a technique where each object of a data collection is assigned a class, according to a set of rules that determine every class. The number of classes are predefined. The classification rules are based in a certain set of attributes. The scope of a classification algorithm is to create the set of rules that will determine the class of an object. Classification is considered as a predictive supervised data mining method, as it first creates a model according to which the whole dataset is analysed and the number of classes are predefined (Sumathi, Geetha & Bama, 2008).

Association rules

Association rules is a data mining technique where given a collection of objects and their occurrences, creates the rules that will predict the occurrence of an item based on the occurrences of other objects in the collection. The scope of association rules is to find patterns, which often exist in the database. Association rules in SDM finds rules from the database that are spatially related. Types of spatial relationships includes topological and distance relations among spatial objects combined with logical operators (Bembenik & Protaziuk, 2004). As for example: object A completely encloses object B, object A is within a radius of 10 meters from object B, object A comes into contact with the boundary of object B.

Characteristics rules

Characteristics rules, also called as data characterisation, is a summarisation of general features of objects in a target class and creation of a set of rules for each class. In the case of a spatial database, this technique “takes into account not only of the properties of objects, but also of the properties of their neighbourhood up to a given level” (Kumar, C. N. S., Ramulu, Reddy, Kotha, & Kumar, C. M., 2012).

Discriminant rules

A step further, discriminant rules is a technique that describes differences between two parts of a database, as for example in spatial data to find differences between cities with high and low unemployment rate (Kumar, C. N. S., Ramulu, Reddy, Kotha, & Kumar, C. M., 2012). Data discrimination is also used in characterisation in order to identify the rules that will partition the database into classes.

Trend detection

Trend detection is a technique that finds a temporal pattern in time series data. In case of spatial data, trend detection is used for finding patterns of the changes of non-spatial attributes with respect to the spatial relationship among the data, as for example the distance.

Clustering

Last, clustering is the technique of grouping a set of objects into classes, also called clusters, so that each class contains objects with high similarity to one another and as high dissimilarity as possible to objects of other classes.

In general, spatial clustering algorithms, based on cluster definition techniques, can be separated into four main categories: (a) partitional clustering algorithms, (b) hierarchical clustering algorithms, (c) density-based clustering algorithms and (d) grid-based clustering algorithms. Other related work (El-Zawawy, 2012; Padmavati, 2013; Swathi & Rajesh, 2012) classify spatial clustering algorithms into more categories, which include the previous four and the addition of the model-based clustering algorithms, the constraint-based clustering algorithms and the locality-based clustering algorithms. Most of clustering algorithms, especially those that have recently proposed, are a combination of more than one methods.

However, the majority of papers and academic work related to the subject of spatial clustering refer to the four general categories as the main categorisation of spatial clustering algorithms:

(a) Partitioning method

The partitioning method is an approach of clustering where a set of n spatial objects is grouped into k clusters based on optimising an objective criterion or else called a similarity function, which defines the level of similarity among the points. In most cases, the similarity function between two points corresponds to the distance between them. Partitioning was the earliest approach that appeared in spatial clustering and thus is still one of the most cited approach in literature (Varlaro, 2008). Most commonly used partitional clustering algorithms include k-means, k-medoids, PAM and CLARA.

(b) Hierarchical clustering

The hierarchical clustering algorithms organise a tree data structure of the clusters, also called dendrogram, using an hierarchy structure. In this tree, the root is considered as a single cluster which involves all the spatial objects in the spatial area whereas the nodes are considered as clusters with only one object. These algorithms operate repeatedly to achieve “merging or splitting until a stopping condition is satisfied or the clustering process encompassed all objects” (Otair, 2013). Hierarchical algorithms are classified into divisive and agglomerative algorithms. In divisive algorithms the decomposition of the tree is formed in a top-down approach, from root to leaves, whereas in agglomerative algorithms (Ward’s method) the decomposition is formed the opposite way, which is a bottom-up approach (El-Zawawy, 2012). Clustering algorithms that belong to this category are BIRCH, CURE, STING and DIANA.

(c) Density-based method

In the Density-based method, clusters are defined based on a mechanism of density-connected points. Clusters are defined as dense regions of objects that are separated by low density regions in the data space (Otair, 2013). Algorithms belonging to density-based category of spatial clustering algorithms have the ability to detect noise and clusters or arbitrary styles and shapes. The most known method belonging to density-based clustering is DBSCAN.

(d) Grid-based clustering

In the grid-based clustering algorithms, data objects are replaced with a number of cells that that represent the elements that will be clustered. Cells are defined with a specific grid size and form the grid structure. Each cell includes a number of data objects and summarised information, as for example the total number of objects within the cell. The operations of the algorithm for clustering are performed on the cells rather than the whole dataset. In this way the process becomes faster than in other clustering methods (El-Zawawy, 2012;Otair, 2013). Among grid-based clustering algorithms are CLIQUE, WAVECLUSTER and DENCLUE.

References

  • Andritsos, P. (2002). Data clustering techniques. Rapport technique, University of Toronto. Department of Computer Science.
  • Bembenik, R., & Protaziuk, G. (2004). Mining spatial association rules. In Intelligent Information Processing and Web Mining (pp. 3-12). Springer Berlin Heidelberg.
  • El-Zawawy, M. A. (2012). Efficient techniques for mining spatial databases. arXiv preprint arXiv:1206.0217.
  • Kolatch, E. (2001). Clustering algorithms for spatial databases: A survey. PDF is available on the Web, 1-22.
  • Kumar, C. N. S., Ramulu, V. S., Reddy, K. S., Kotha, S., & Kumar, C. M. (2012). Spatial data mining using cluster analysis. International Journal of Computer Science & Information Technology, 4(4), 71.
  • Otair, D. (2013). Approximate k-nearest neighbour based spatial clustering using kd tree. arXiv preprint arXiv:1303.1951.
  • Padmavati, V (2013). Review of Spatial Algorithms in Data Mining.
  • Sumathi, N., Geetha, R., & Bama, S. S. (2008). Spatial data mining-techniques trends and its applications. Journal of Computer Applications, 1(4), 28-30.
  • Swathi, K. G., & Rajesh, K. N. V. S. S. K. (2012). Comparative analysis of clustering of spatial databases with various DBSCAN Algorithms. IJRCCT, 1(6), 340-344.
  • Varlaro, A. (2008). Spatial clustering of structured objects (Doctoral  dissertation, University of Bari, Italy).

Leave a Reply

Your email address will not be published. Required fields are marked *