Clustering categorical data based on the relational analysis approach and MapReduce
 Yasmine Lamari^{1} and
 Said Chah Slaoui^{1}
Received: 10 August 2017
Accepted: 18 September 2017
Published: 22 September 2017
Abstract
Traditional clustering methods are unable to cope with the exploding volume of data that the world is currently facing. As a solution to this problem, research has intensified in the direction of parallel clustering methods. Although there is a variety of parallel programming models, the MapReduce paradigm is considered the most prominent model for large-scale data processing problems, clustering among them. This paper introduces a new parallel design of a recent heuristic for hard clustering using the MapReduce programming model. In this heuristic, clustering is performed by efficiently partitioning large categorical data sets according to the relational analysis approach. The proposed design, called PMRTransitive, is a single-scan and parameter-free heuristic which determines the number of clusters automatically. The experimental results on real-life and synthetic data sets demonstrate that PMRTransitive produces good quality results.
Keywords
Categorical data, Clustering, MapReduce, Relational analysis approach
Introduction
Over the past decade, the amount of information accumulated every second has become a treasure of inestimable value. Social media sites, sensors, transactions records and many other sources that come from everywhere are behind the Big Data phenomenon. Consequently, considerable efforts have been devoted to exploring such massive data in order to gain the maximum benefit from this treasure. To cope with the huge volume of data, various parallel programming frameworks have recently emerged.
Clearly, MapReduce is the most prominent model for large-scale data processing problems. It was proposed by Dean and Ghemawat [1] at Google, where it was successfully used for various purposes. The main strength of this model is that it provides automatic parallelization and distribution. In addition to the fault-tolerance mechanism that helps in overcoming failures, it also provides tools for status reporting, monitoring, and load balancing. Locality optimization is ensured by storing data on local disks to avoid network bandwidth consumption. Thereby, MapReduce allows us to focus on the problem rather than on the complex details of parallel programming.
Recently, research in data mining has become increasingly interested in parallel programming. Data mining covers a wide variety of data analysis procedures, including classification, regression, clustering, and so on. In this paper, we focus on clustering, which aims to partition data into groups of similar objects so as to maximize the similarity between objects in the same group and minimize the similarity between objects in different groups [2]. To solve this problem, we propose PMRTransitive, a new parallel heuristic that adapts a recent method, named the Transitive heuristic [3], to the MapReduce programming model. In this heuristic, clusters are obtained by partitioning large categorical data sets according to the relational analysis approach [4]. The relational analysis approach provides a mathematical formalism in which the clustering problem takes the form of a linear program with \(n^2\) integer attributes (with n the number of instances). Heuristics are the most convenient way to produce satisfactory clustering results quickly, particularly in the context of Big Data, where the number of instances is large and the response time is a critical factor. Since the original heuristic is sequential, it needs to be adjusted to the MapReduce model. This paper provides a detailed description of the new design based on the key methods of the MapReduce model, namely Map and Reduce. Advantageously, most of the computationally expensive steps of the Transitive heuristic can be processed in parallel.
The remainder of this paper is organized as follows: "Motivation and related work" briefly presents the MapReduce model and some related work. In "Relational analysis approach", the problem statement is formalized. The Transitive heuristic and its new version based on MapReduce are described in "Transitive heuristic" and "MapReduce-based implementation of transitive", respectively. Subsequently, the experimental results are presented in "Results and discussion". Finally, conclusions and propositions for future studies are drawn briefly in "Conclusions".
Motivation and related work
With the continuous increase of the data volume, the traditional methods of clustering have reached their limitations giving rise to the parallel clustering. In this section, we review the MapReduce frameworks and some related clustering algorithms.
MapReduce [1] is considered one of the most prominent programming models for large-scale data processing problems, cluster analysis among them. It consists of two phases: Map and Reduce. The Map phase is responsible for filtering and sorting, while the Reduce phase is in charge of summarizing the outputs of the previous phase. The Map function receives records from the input files as key-value pairs and produces intermediate key-value pairs. When the Map phase has completely finished, the Reduce phase starts. Each reducer works on the values of a specific intermediate key and produces one final value for that key.
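The two phases described above can be imitated in a few lines of ordinary Python. The following is a minimal single-process sketch, not Hadoop code: the helper `run_mapreduce` and the example job (counting how often each attribute value occurs) are illustrative assumptions and do not come from the method described in this paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def run_mapreduce(
    records: Iterable[Tuple[Hashable, object]],
    map_fn: Callable[[Hashable, object], Iterable[Tuple[Hashable, object]]],
    reduce_fn: Callable[[Hashable, List[object]], object],
) -> Dict[Hashable, object]:
    """Hypothetical in-memory stand-in for a MapReduce job."""
    # Map phase: every input key-value pair yields intermediate key-value pairs,
    # which are grouped by intermediate key (the "shuffle").
    intermediate: Dict[Hashable, List[object]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: starts only once all map output has been collected and
    # produces one final value per intermediate key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def count_values_map(record_id, record):
    # Emit ((attribute index, attribute value), 1) for each attribute of a record.
    for attribute, value in enumerate(record):
        yield (attribute, value), 1


def count_values_reduce(key, values):
    return sum(values)


records = [(0, ("red", "round")), (1, ("red", "square")), (2, ("blue", "round"))]
counts = run_mapreduce(records, count_values_map, count_values_reduce)
```

In a real framework the grouping step is distributed across nodes; here it is a plain dictionary, which is enough to show the contract between the two functions.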
There are several implementations of the MapReduce programming model. Hadoop offers the most popular framework in Java. Authored by Apache Software Foundation, the project includes modules enabling a reliable and scalable distributed computing as an open source framework. For the qualities that it provides, namely, its organized architecture, scalability, cost effectiveness, flexibility and resilience to failure, Hadoop MapReduce framework is used for the implementation of the proposed method.
Since there are several MapReduce-based clustering algorithms, we mention in this section only a few relevant works, such as the PKMeans [5] algorithm. PKMeans is a MapReduce-based implementation of the k-means algorithm. It is designed as a single MapReduce job, in which the Map function assigns each sample to the nearest center and the Reduce function updates the centers. In addition, a combiner function partially aggregates the values of the points assigned to the same cluster in order to reduce the communication cost. The experiments, which were performed on a cluster of four nodes, demonstrate that this approach can process large data sets.
In 2011, He et al. [6] proposed the MR-DBSCAN algorithm, a MapReduce-based implementation of the well-known DBSCAN algorithm. The proposed parallel method consists of four steps. In the first step, the size and the general spatial distribution of all the records are summarized, and a list of dimensional indexes indicating an approximate grid partitioning is generated for the next step. The second step performs the main DBSCAN process for each subspace defined by the partition profile. The third step handles the cross-border issues when merging the subspaces. In the final step, a cluster ID mapping from local clusters to global ones is built for the entire data set, based on the pair lists collected in the previous step. The local IDs are then replaced by the global ones for points from all partitions in order to produce a unified output. The experiments, which were performed on a cluster of 13 nodes, demonstrated that MR-DBSCAN is efficient on large data sets, since it was tested with data sets of up to 50.4 GB.
In the same context, Kim et al. [7] suggested a new density-based clustering algorithm, called DBCURE, together with its parallel version, called DBCURE-MR, which is implemented using the MapReduce programming model. DBCURE works like DBSCAN by iterating two steps. The first step selects an unvisited point in the data set, considers it as a seed, and inserts it into the seed set. In the second step, all points that are density-reachable from the seed set are retrieved. This process produces clusters one at a time and stops when the seed set becomes empty. The parallel version, by contrast, finds several clusters at the same time by treating each core point in parallel through four steps. The first step estimates the neighborhood covariance matrices and is performed using two MapReduce algorithms. The second step computes the ellipsoidal \(\tau\)-neighbourhoods and is performed using two other MapReduce algorithms. The third step discovers core clusters using a single MapReduce algorithm. Finally, the last step merges the core clusters, again with a single MapReduce algorithm. The experiments, which were performed on a cluster of 20 nodes with data sets of up to 0.5 GB, demonstrated that the proposed approach scales up well with the MapReduce programming model.
Most of the proposed MapReduce-based clustering algorithms focus on the k-means and DBSCAN methods, which deal only with numerical data (points). It is therefore not straightforward to compare the results produced by PMRTransitive, which operates on categorical data (records), with such methods. So, in terms of quality, we compare the clustering results obtained by the proposed method with two serial algorithms well known for clustering categorical data: MMR [8] and some enhanced versions of k-modes [9]. The MMR (Min-Min-Roughness) algorithm is based on rough set theory; it requires the number of clusters as an input and uses a similarity method based on the roughness concept to produce stable results. This algorithm is distinguished by its ability to handle uncertainty in the clustering process. Bai and Liang [9] proposed using between-cluster information to improve the effectiveness of some existing versions of the k-modes algorithm. Clustering results on categorical data sets have demonstrated that the improvements brought to the k-modes algorithms are effective and scalable.
The PMRTransitive, which is proposed in this paper, is a new parallel design of the Transitive heuristic implemented using the Hadoop MapReduce framework. As mentioned in the "Introduction" section, the Transitive heuristic is a fast heuristic which finds clusters by partitioning categorical large data sets according to the relational analysis approach. The proposed method presents some relevant points; it processes categorical large data sets rapidly, without any prior settings or sampling method, and guarantees a good quality solution in a reasonable time. Indeed, contrary to other algorithms, the number of clusters is automatically detectable by the Transitive heuristic and its new parallel design.
Relational analysis approach
The relational analysis approach is a mathematical data analysis model used in different fields including clustering. It was conceived by J F Marcotorchino and P Michaud in the late 1970s at the IBM European Center of Applied Mathematics [4]. The relational analysis is defined as an optimization problem under linear constraints of the Condorcet’s criterion. The detailed mathematical representation of the relational analysis approach can be found in [10]. In this subsection, we review some basic notions of this model, since it is the basis of the proposed parallel method and its original version.
Let \(E=\{1,2,...,n\}\) be the data set of n instances described by the set \(V=\{v_1,v_2,...,v_k\}\) of k categorical attributes. We denote by \(v_l(i)\) the value taken by the categorical attribute \(v_l\) for the instance i.
In other words, c(i, j) is the number of attributes for which the instances i and j share the same value.
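Read literally, the sentence above corresponds to \(c(i,j)=\sum_{l=1}^{k}[v_l(i)=v_l(j)]\), where the bracket equals 1 when the two values are equal and 0 otherwise. This formula is our reading of the verbal description rather than a quotation. The short Python sketch below computes this count; the function name `agreement_count` and the sample records are illustrative assumptions.

```python
from typing import Sequence


def agreement_count(x: Sequence[str], y: Sequence[str]) -> int:
    """c(i, j): number of attributes on which two instances take the same value."""
    if len(x) != len(y):
        raise ValueError("both instances must be described by the same k attributes")
    return sum(1 for a, b in zip(x, y) if a == b)


# Two hypothetical instances described by k = 4 categorical attributes.
instance_i = ("a", "x", "p", "m")
instance_j = ("a", "x", "q", "n")
c_ij = agreement_count(instance_i, instance_j)  # agreement on the first two attributes
```

The count is symmetric, and c(i, i) equals k, the number of attributes.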
Transitive heuristic
The purpose of the Transitive heuristic is to transform a structure of covering, which is not transitive, into a transitive solution. In this section, we present the overview of Transitive heuristic and we formalize the definitions and concepts related to this method.
Preliminary
Let \(M=\{m_1,m_2,...,m_q\}\) be the set formed by all the values of the k categorical attributes. We then denote by M(i) the set of values taken by the instance i.
Profile definition
Cluster definition
The representative element is the generator of the cluster; it helps to speed up the heuristic in the phase of separation of non-disjoint clusters. Indeed, it allows the similarity of a shared instance to be calculated with the representative of each cluster only, instead of with all the instances contained in the cluster.
Contribution function
Overview of transitive heuristic
The process of the Transitive heuristic is shown in Fig. 1. This heuristic consists of four main steps: initialization, construction, intersection, and evaluation.
Firstly, in order to build the first cluster, an instance, called the representative, is selected at random and used to cluster the instances that resemble it according to the coefficient function. The identifiers of all clustered instances are saved so that an instance which is already clustered is not selected as a new representative in the following iterations.
The repetition of iterations can generate a fuzzy clustering. So, in order to obtain distinct clusters, the intersection of clusters is calculated. Then, for each shared instance, the most suitable cluster is chosen. This decision consists of computing the contribution of the shared instance with respect to the representative of each cluster. The highest value identifies the representative that is closest to the shared instance; the instance is kept in the corresponding cluster and removed from the others.
By virtue of these features, the Transitive heuristic provides good quality results in reasonable computing time, without using traditional sampling methods or setting input parameters such as the number of clusters or thresholds. However, in its serial version, it cannot take advantage of distributed systems to process big data. To make the Transitive heuristic run in a parallel environment, some adjustments are necessary, which we discuss in the next section.
MapReduce-based implementation of transitive
The new design of Transitive heuristic based on the MapReduce framework is illustrated in Fig. 2. Multiple mappers run in parallel and produce partial clustering. Then, a single reducer runs and transforms the initially obtained partitions into a final result of hard clustering.
Map function
The Map function performs an initial clustering of the input data block. It gathers the input instances (pairs) in clusters using the Condorcet’s criterion as a similarity measure and it allows the instances to belong to more than one cluster. Then, it recalculates the representatives of clusters in the initial solution (partition). The representative of a cluster is calculated on the basis of the frequency of occurrence of features in the cluster. When the mapper is complete, it returns as an intermediate result a fuzzy clustering of the received part of the input file that will be refined later in the Reduce phase.
The Map function outputs a collection of cluster structures. Each cluster structure contains the representative instance and only the identifiers of the instances that are members of the cluster with their similarity scores. The similarity scores are useful in the Reduce phase because they avoid the recalculations of similarities in the evaluation of the contribution of shared instances. This technique allows dispensing with the data of clusters’ members in the Reduce phase, thus, we decrease the amount of data sent from the Map phase to the Reduce phase. The pseudo code of the function of the Map phase is given below.
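The authors' pseudo code is not reproduced in this text, so the following Python sketch only illustrates the behaviour described in prose above: a single pass over the block, overlapping clusters, representatives recomputed from value frequencies, and an output that keeps only identifiers and similarity scores. Several points are assumptions made for illustration: the membership rule (agreement with the representative on at least half of the attributes, in the spirit of Condorcet's majority criterion), the use of the most frequent value per attribute as the updated representative, and all names (`map_block`, `Cluster`, `agreement_count`).

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, ...]


def agreement_count(x: Record, y: Record) -> int:
    """Number of attributes on which two records take the same value."""
    return sum(1 for a, b in zip(x, y) if a == b)


@dataclass
class Cluster:
    representative: Record
    # instance identifier -> similarity score with the representative
    members: Dict[int, int] = field(default_factory=dict)


def map_block(block: Sequence[Tuple[int, Record]]) -> List[Cluster]:
    """Single-pass, overlapping clustering of one input block (illustrative)."""
    rows = dict(block)
    clusters: List[Cluster] = []
    for instance_id, record in block:
        joined = False
        for cluster in clusters:
            score = agreement_count(record, cluster.representative)
            # ASSUMPTION: join when the record agrees on at least half the attributes.
            if 2 * score >= len(record):
                cluster.members[instance_id] = score
                joined = True
        if not joined:
            # The record fits no cluster, so it becomes a new initial representative.
            clusters.append(Cluster(record, {instance_id: len(record)}))
    for cluster in clusters:
        # ASSUMPTION: updated representative = most frequent value per attribute.
        columns = zip(*(rows[i] for i in cluster.members))
        cluster.representative = tuple(Counter(c).most_common(1)[0][0] for c in columns)
        # Refresh the stored scores against the updated representative.
        cluster.members = {
            i: agreement_count(rows[i], cluster.representative) for i in cluster.members
        }
    return clusters


block = [
    (0, ("a", "x", "p", "m")),
    (1, ("a", "x", "p", "n")),
    (2, ("b", "y", "q", "n")),
    (3, ("b", "y", "q", "m")),
    (4, ("a", "x", "q", "n")),  # close enough to both groups to join both
]
clusters = map_block(block)
```

In this toy block, instance 4 ends up in both clusters with different scores, which is exactly the kind of overlap that the Reduce phase has to resolve.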
Reduce function
The clusters produced during the Map phase on each host may share some instances. However, the aim of the proposed method is to produce a hard clustering. The Reduce phase is responsible for separating the clusters. This is achieved by computing the intersection of clusters in order to determine the shared members. Then, for each shared member, we compare its similarity to the clusters to which it belongs. This information lies in the similarity scores of this member. The highest score indicates the most suitable cluster for the shared member, which is kept in this cluster and removed from all others. This process stops when all clusters are disjoint.
In the proposed parallel design of the Transitive heuristic, we considered a single reducer, which might be expected to be expensive in terms of run time. However, the Reduce phase involves merely some comparisons between the members of clusters, and there is no need to recalculate the similarities between members and representatives. We thus obtain consistent and accurate clusters as a final, stable result. The pseudo code of the function of the Reduce phase is given below.
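The authors' pseudo code is not reproduced in this text. As a rough illustration of the separation step described above, the sketch below keeps each shared instance only in the cluster where its stored similarity score is highest. Each cluster is represented simply as a dictionary from instance identifier to score. The function name `separate_clusters`, the tie-breaking rule (the first cluster wins) and the removal of clusters left empty are assumptions.

```python
from typing import Dict, List, Tuple


def separate_clusters(clusters: List[Dict[int, int]]) -> List[Dict[int, int]]:
    """Turn overlapping clusters into disjoint ones using stored scores only."""
    # For every instance, remember the best (score, cluster index) seen so far.
    best: Dict[int, Tuple[int, int]] = {}
    for index, members in enumerate(clusters):
        for instance_id, score in members.items():
            # Strict comparison: on a tie the earlier cluster is kept (assumption).
            if instance_id not in best or score > best[instance_id][0]:
                best[instance_id] = (score, index)
    # Rebuild the clusters so that each instance appears exactly once.
    separated: List[Dict[int, int]] = [dict() for _ in clusters]
    for instance_id, (score, index) in best.items():
        separated[index][instance_id] = score
    # Clusters that lost all their members are dropped (assumption).
    return [members for members in separated if members]


# Instance 4 is shared: score 3 in the first cluster, score 2 in the second.
overlapping = [{0: 3, 1: 4, 4: 3}, {2: 4, 3: 3, 4: 2}]
disjoint = separate_clusters(overlapping)
```

No similarity is recomputed here; the function only compares scores that were already attached to the members, which is why this step stays cheap even with a single reducer.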
Discussion

Unlike the Transitive heuristic, PMRTransitive does not select the representatives of clusters at random. Instead, the first key-value pair passed to the Map function is taken as an initial representative. Each subsequent input is either added to existing clusters or taken as a new initial representative; the representatives are updated after the construction step.

In the Transitive heuristic, the initialization step takes place before the construction step: first the representative is selected, and then its cluster is built. PMRTransitive swaps these steps because random selection has been abandoned, as it does not fit the MapReduce framework.

In the Transitive heuristic, the representatives are instances that are selected randomly from the data set, while in the PMRTransitive heuristic they are fictive and computed based on the profile of the cluster’s members.

Another important point is that the Transitive heuristic is an iterative method, while the proposed parallel design is a single-pass heuristic.
Results and discussion
In this section, we evaluate the quality of the clustering results and the performance of PMRTransitive on some commonly used categorical real-life and synthetic data sets. The clustering result is assessed by the purity (also called accuracy) measure, the normalized mutual information (NMI), and the adjusted Rand index (ARI) [12].
The performance experiments were run on a single node, which has a quadcore processor of 3.60 GHz and 8 GB of memory and using Hadoop version 2.2.0 and Java 1.7.0. The size of blocks used for the experiments is 64 MB.
Clustering evaluation metrics
Purity
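The formula used by the authors is not shown in this text. Under the usual definition, each cluster is assigned to its majority class and purity is the fraction of instances that fall in the majority class of their cluster. The sketch below applies this definition to the cluster-by-class counts reported for the zoo data set later in this section; the function name `purity` is an illustrative choice.

```python
from typing import Sequence


def purity(contingency: Sequence[Sequence[int]]) -> float:
    """Overall purity from a cluster-by-class table of counts."""
    total = sum(sum(row) for row in contingency)
    # Each cluster (row) contributes the size of its majority class.
    return sum(max(row) for row in contingency) / total


# Rows: clusters 1 to 7 found on the zoo data set; columns: classes C_1 to C_7.
zoo = [
    [41, 0, 1, 0, 0, 0, 0],
    [0, 4, 1, 0, 0, 0, 0],
    [0, 16, 0, 0, 0, 1, 0],
    [0, 0, 3, 13, 1, 0, 0],
    [0, 0, 0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0, 5, 0],
    [0, 0, 0, 0, 0, 2, 10],
]
zoo_purity = purity(zoo)  # 92 majority-class instances out of 101
```

Rounded to two decimals, this gives 0.91, which agrees with the purity reported for the zoo data set.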
Normalized mutual information
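The formula used by the authors is not shown in this text. A common definition divides the mutual information between the clustering and the class partition by the mean of their entropies. The sketch below follows that definition; the choice of normalization and the function name `nmi` are assumptions.

```python
import math
from typing import Sequence


def nmi(contingency: Sequence[Sequence[int]]) -> float:
    """NMI = I(clusters; classes) / ((H(clusters) + H(classes)) / 2)."""
    n = sum(sum(row) for row in contingency)
    row_sums = [sum(row) for row in contingency]        # cluster sizes
    col_sums = [sum(col) for col in zip(*contingency)]  # class sizes

    mutual = 0.0
    for i, row in enumerate(contingency):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # empty cells contribute nothing
                mutual += (n_ij / n) * math.log(n * n_ij / (row_sums[i] * col_sums[j]))

    def entropy(sizes: Sequence[int]) -> float:
        return -sum((s / n) * math.log(s / n) for s in sizes if s > 0)

    mean_entropy = (entropy(row_sums) + entropy(col_sums)) / 2
    # Convention (assumption): two trivial one-group partitions agree perfectly.
    return mutual / mean_entropy if mean_entropy > 0 else 1.0


perfect = nmi([[5, 0], [0, 5]])      # clusters coincide with classes
unrelated = nmi([[2, 2], [2, 2]])    # clusters carry no class information
```

Unlike purity, this measure does not reward splitting the data into many small clusters, since the entropy of the clustering grows with the number of clusters.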
Adjusted rand index
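The formula used by the authors is not shown in this text. The sketch below implements the standard adjusted Rand index of Hubert and Arabie from a cluster-by-class table of counts: the number of agreeing pairs is corrected by its expected value under random labelling. The function name `adjusted_rand_index` is an illustrative choice.

```python
from math import comb
from typing import Sequence


def adjusted_rand_index(contingency: Sequence[Sequence[int]]) -> float:
    """Adjusted Rand index computed from a cluster-by-class table of counts."""
    n = sum(sum(row) for row in contingency)
    if n < 2:
        raise ValueError("at least two instances are required")
    # Pairs placed together in both partitions.
    sum_cells = sum(comb(v, 2) for row in contingency for v in row)
    # Pairs placed together by the clustering, and by the classes.
    sum_rows = sum(comb(sum(row), 2) for row in contingency)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*contingency))
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions consist of one single group.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Four instances: the clustering splits the second class into two clusters.
split_one_class = adjusted_rand_index([[2, 0, 0], [0, 1, 1]])  # equals 4/7
```

The small example shows how the index drops below 1 as soon as a class is split over several clusters, even though every cluster is pure.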
Data sets description
Table 1 Description of real-life data sets
Data set  Size  Number of attributes  Number of classes  Missing values
Soybean  47  35  4  No
Zoo  101  17  7  No
Mushroom  8124  22  2  Yes
Clustering results
Table 2 Clustering result of PMRTransitive applied to the soybean data set
Columns \(C_1\) to \(C_4\) give the distribution of each cluster's members over the classes.
Cluster  Size  \(C_1\)  \(C_2\)  \(C_3\)  \(C_4\)  Purity
1  10  10  0  0  0  1
2  10  0  10  0  0  1
3  9  0  0  9  0  1
4  7  0  0  1  6  0.85
5  11  0  0  0  11  1
Table 3 Clustering result of PMRTransitive applied to the zoo data set
Columns \(C_1\) to \(C_7\) give the distribution of each cluster's members over the classes.
Cluster  Size  \(C_1\)  \(C_2\)  \(C_3\)  \(C_4\)  \(C_5\)  \(C_6\)  \(C_7\)  Purity
1  42  41  0  1  0  0  0  0  0.98
2  5  0  4  1  0  0  0  0  0.80
3  17  0  16  0  0  0  1  0  0.94
4  17  0  0  3  13  1  0  0  0.76
5  3  0  0  0  0  3  0  0  1
6  5  0  0  0  0  0  5  0  1
7  12  0  0  0  0  0  2  10  0.83
Table 4 Clustering result of PMRTransitive applied to the mushroom data set
Columns \(C_1\) and \(C_2\) give the distribution of each cluster's members over the classes.
Cluster  Size  \(C_1\)  \(C_2\)  Purity
1  2010  1937  73  0.96
2  768  768  0  1
3  307  307  0  1
4  89  48  41  0.54
5  185  185  0  1
6  192  192  0  1
7  17  17  0  1
8  48  48  0  1
9  963  706  257  0.73
10  1719  0  1719  1
11  291  0  291  1
12  36  0  36  1
13  1296  0  1296  1
14  7  0  7  1
15  8  0  8  1
16  188  0  188  1
Table 5 Evaluation of clustering results of PMRTransitive according to the purity, NMI, and ARI metrics
Data set  Purity  NMI  ARI
Soybean  0.97  0.95  0.78
Zoo  0.91  0.85  0.88
Mushroom  0.96  0.82  0.44
Table 5 shows the evaluation of clustering results of PMRTransitive according to the purity, the NMI, and the ARI metrics. When the results are assessed using the ARI measure, the number of clusters must meet the number of classes in the clustering partition recognized as the ground truth in order to maximize the value of ARI. This is not applicable for the PMRTransitive heuristic since it automatically detects the number of clusters. This explains why the values of ARI decrease somewhat.
Figure 3 presents a comparison of the overall purity obtained by applying the proposed method to the real-life data sets described above. For comparison purposes, we present some results reported in [8] and [9] concerning, respectively, the performance of the MMR algorithm and of some enhanced k-modes versions applied to the same data sets.
The original Transitive heuristic outperforms its new parallel version, PMRTransitive, on the mushroom data set. This can be explained by the fact that the Transitive heuristic is a multiple-scan method, while PMRTransitive is a single-scan method. However, it must not be forgotten that the results produced by the original method represent the best of 100 runs, since the random start of the Transitive heuristic produces a different solution for each run, while the proposed method produces good-quality results that are stable and reproducible.
In general, the proposed method and its predecessor perform better than the MMR and k-modes algorithms, except on the soybean data set, where the enhanced Ng’s k-modes algorithm produces a high-quality result reaching 99% purity.
It should be noted that Transitive and its new parallel design differ from other methods of clustering categorical data, including MMR and k-modes, in some relevant respects. First, PMRTransitive is a parameter-free method which determines the number of clusters automatically. Second, the proposed method operates on the entire data set and does not use any kind of data sampling or data preprocessing.
Performance results
As described in "MapReduce-based implementation of transitive", PMRTransitive is designed with multiple parallel Map tasks and a single Reduce task that may appear to be a bottleneck. However, Fig. 5 shows that the larger the data set, the more the time consumed by the Reduce task decreases.
Conclusions
In this paper, we presented a new parallel clustering design of a recent method, named the Transitive heuristic. The proposed method processes large categorical data based on the relational analysis approach. We described in detail the Transitive heuristic and its new parallel design, named PMRTransitive, and then we discussed the adjustments that were made to the original heuristic in order to adapt it to the MapReduce framework. Finally, we evaluated the quality of the clustering results produced using real-life data sets. The results demonstrate that both methods, Transitive and PMRTransitive, are efficient and produce clusters of good quality.
In our future work, we plan to extend the PMRTransitive heuristic to manage multiple data types and to treat outliers while keeping its characteristics and benefits. We also hope to test this method in a multi-node cluster environment in order to evaluate its performance with respect to Big Data.
Declarations
Authors' contributions
SCS first proposed the design of the original serial method, which is called Transitive. YL carried out its implementation and evaluation. Additionally, she suggested the parallel implementation of the Transitive heuristic as a single-pass method using the MapReduce model. Both authors read and approved the manuscript.
Acknowledgements
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The real-life data sets supporting the conclusions of this article are available in [UCI Machine Learning Repository. http://archive.ics.uci.edu/ml].
The synthetic data sets supporting the conclusions of this article were produced using the generator available in [The datgen Dataset Generator. http://www.datasetgenerator.com].
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
Not applicable.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design and implementation: 06–08 December 2004; San Francisco. Berkeley: USENIX Association. 2004. p. 137–50.
2. Berkhin P. A survey of clustering data mining techniques. In: Kogan J, Nicholas C, Teboulle M, editors. Grouping multidimensional data. Heidelberg: Springer; 2006. p. 25–71.
3. Slaoui SC, Lamari Y. Clustering of large data based on the relational analysis. In: Proceedings of the international conference on intelligent systems and computer vision. 25–26 March 2015; Fez. Washington: IEEE Computer Society. 2015. p. 1–7.
4. Michaud P, Marcotorchino JF. Modèles d’optimisation en analyse des données relationnelles. Math Sci Hum. 1979;67:7–38.
5. Zhao W, Ma H, He Q. Parallel k-means clustering based on mapreduce. In: Proceedings of the first international conference on Cloud Computing. 1–4 December 2009; Beijing. Heidelberg: Springer-Verlag. 2009. p. 674–79.
6. He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J. Mr-dbscan: an efficient parallel density-based clustering algorithm using mapreduce. In: Proceedings of the 17th international conference on parallel and distributed systems: 7–9 December 2011; Tainan. Washington: IEEE Computer Society. 2011. p. 473–80.
7. Kim Y, Shim K, Kim MS, Lee JS. Dbcure-MR: an efficient density-based clustering algorithm for large data using mapreduce. Inf Syst. 2014;42:15–35.
8. Parmar D, Wu T, Blackhurst J. MMR: an algorithm for clustering categorical data using rough set theory. Data Knowl Eng. 2007;63(3):879–93.
9. Bai L, Liang J. The k-modes type clustering plus between-cluster information for categorical data. Neurocomputing. 2014;133:111–21.
10. Ah-Pine J, Marcotorchino JF. Overview of the relational analysis approach in data-mining and multi-criteria decision making. In: Usmani ZUH, editor. Web intelligence and intelligent agents. Rijeka: InTech; 2010. p. 325–46.
11. Michaud P. Condorcet - a man of the avant-garde. Appl Stoch Models Data Anal. 1987;3(3):173–89.
12. Manning CD, Raghavan P, Schütze H. Evaluation in information retrieval. In: Introduction to information retrieval. New York: Cambridge University Press. 2008. p. 151–75.
13. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218.
14. UCI Machine Learning Repository: data sets. https://archive.ics.uci.edu/ml/datasets.html.
15. Guha S, Rastogi R, Shim K. Rock: a robust clustering algorithm for categorical attributes. Inf Syst. 2000;25(5):345–66.
16. He Z, Xu X, Deng S. Squeezer: an efficient algorithm for clustering categorical data. J Comput Sci Technol. 2002;17(5):611–24.