With the recent growth and advancement of information technology, data is being produced at a very high rate in a variety of fields and is presented to users in structured, semi-structured, and unstructured forms [1]. New technologies for storing and extracting useful information from this volume of data (big data) are needed, because discovering and extracting useful information and knowledge from such a volume is difficult and traditional relational databases cannot meet users’ needs [2]. If you are dealing with data beyond the capabilities of existing software, you are, in fact, dealing with big data. Big data commonly refers to a collection of data that exceeds what standard management tools and databases can extract, refine, manage, and process. In other words, the term “big data” refers to data that is so complex in terms of volume and variety that it cannot be managed with traditional tools, and therefore its hidden knowledge cannot be extracted within predetermined time frames [3, 4].
Big data is therefore defined by three attributes, volume, velocity, and variety, known as Gartner’s 3Vs; some scholars have added further dimensions. IBM cited a fourth attribute, adding ‘veracity’ to big data. Zikopoulos et al. [5] described this “V”, the veracity dimension, as being “in response to the quality and source issues our clients began facing with their Big data initiatives”. Microsoft, in order to maximize business value, added three more dimensions to Gartner’s 3Vs, namely variability, veracity, and visibility, yielding 6Vs [6]. Yuri Demchenko added the value dimension to IBM’s 4Vs [7].
In view of the challenges explained above, researchers are trying to create new structures, methodologies, and approaches for managing, controlling, and processing this volume of data, which has led to the use of data mining tools. One of the important data mining methods is clustering (cluster analysis), an unsupervised method that finds clusters such that similarity within a cluster is maximized and similarity between clusters is minimized, without prior assumptions about the similarity of objects. As databases grow, researchers’ efforts are focused on finding efficient and effective clustering methods that provide a quick and consistent basis for decision-making and can be applied in real-world scenarios [8]. Clustering methods are divided into five categories: partitioning-based, hierarchical, density-based, model-based, and grid-based [9, 10].
The emphasis of this paper is on the density-based clustering method DBSCAN, presented by Ester et al. in 1996, in which a cluster is defined on the basis of two parameters, ε and minPts [11]. In this method, clusters are defined as dense regions of the data set, separated by objects in low-density regions; these objects are referred to as noise or border points. Density-based clustering algorithms use the density of points to partition them into separate clusters, to discover clusters of arbitrary shape, and to distinguish noise in large spatial datasets. DBSCAN defines a cluster as a region of densely connected points separated by regions of non-dense points [12]. It accepts two parameters, eps (the radius ε) and minPts (a minimum-points threshold). The density of a point is estimated by counting the number of points within the specified radius (eps), which allows any point to be classified as either a core point, a border point, or a noise point. The main idea is that, for each point of a cluster, the neighborhood of the given radius (eps) has to contain at least a minimum number of points (minPts) [13].
Density-based algorithms provide advantages over other methods through their noise handling capabilities and their ability to determine clusters with arbitrary shapes.
Algorithm description [13]:
1. Choose a random point p.
2. Fetch all points that are density-reachable from p with respect to eps and minPts.
3. A cluster is formed if p is a core point.
4. If p is a border point and no point is density-reachable from p, visit the next point of the dataset.
5. Repeat the above process until all points have been examined.
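To make the steps above concrete, the following is a minimal, single-machine Java sketch of this procedure; the class and method names (DbscanSketch, cluster, regionQuery) are illustrative choices, the distance is assumed to be Euclidean, and the neighborhood search is a brute-force linear scan. It is intended only to illustrate the core/border/noise logic, not the MapReduce-based method proposed in this paper.

import java.util.ArrayList;
import java.util.List;

// Minimal single-machine DBSCAN sketch (illustrative only).
// Assumes Euclidean distance and a brute-force neighborhood search.
public class DbscanSketch {

    static final int NOISE = -1;      // label for noise points
    static final int UNVISITED = 0;   // label for points not yet processed

    // Returns a cluster label (1, 2, ...) per point; -1 marks noise.
    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] label = new int[points.length];          // all start as UNVISITED (0)
        int clusterId = 0;
        for (int p = 0; p < points.length; p++) {
            if (label[p] != UNVISITED) continue;       // already assigned or marked as noise
            List<Integer> seeds = regionQuery(points, p, eps);
            if (seeds.size() < minPts) {               // not a core point
                label[p] = NOISE;
                continue;
            }
            clusterId++;                               // p is a core point: start a new cluster
            label[p] = clusterId;
            for (int i = 0; i < seeds.size(); i++) {   // expand the cluster from p
                int q = seeds.get(i);
                if (label[q] == NOISE) label[q] = clusterId;   // former noise becomes a border point
                if (label[q] != UNVISITED) continue;
                label[q] = clusterId;
                List<Integer> qSeeds = regionQuery(points, q, eps);
                if (qSeeds.size() >= minPts) seeds.addAll(qSeeds);  // q is also a core point
            }
        }
        return label;
    }

    // Indices of all points within distance eps of points[p] (linear scan).
    static List<Integer> regionQuery(double[][] points, int p, double eps) {
        List<Integer> neighbors = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            double sum = 0.0;
            for (int d = 0; d < points[p].length; d++) {
                double diff = points[i][d] - points[p][d];
                sum += diff * diff;
            }
            if (Math.sqrt(sum) <= eps) neighbors.add(i);
        }
        return neighbors;
    }
}

The outer loop corresponds to steps 1, 2, and 5, while the inner expansion loop covers steps 3 and 4; for example, cluster(points, 0.5, 4) labels each row of points with a cluster id and marks noise points with -1.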
As mentioned above, big data is cumbersome to operate on, so performing analytics on it is a major challenge. Cloud computing is becoming the foundation for big data needs.
Cloud computing is a powerful technology that enables convenient, on-demand network access to a shared pool of configurable computing resources [1, 14]. It can be defined as a parallel and distributed system consisting of a collection of interconnected and virtualized computer systems, presented as one or more unified computing resources. Cloud computing services are usually provided in three categories: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). In cloud computing, providers cooperate to provide cloud services and resources for customers. A customer acquires and releases cloud resources by requesting and returning virtual machines (VMs) in the cloud [3, 14].
Cloud computing is a scalable technology well suited to adoption in the developing world, helping to lower costs, expand operational flexibility, and improve the speed of service [3, 15]. In addition, it is a viable technology for big data and complex computing. It is the IT foundation for big data needs and is becoming a necessity for big data processing and analysis [1, 16].
Apache Hadoop is a Java-based open-source software framework for the distributed processing of very large datasets across thousands of nodes. A Hadoop cluster divides data into small parts and distributes them across the nodes. Doug Cutting and Mike Cafarella originally created the Hadoop framework in 2005 [10, 17]. Apache Hadoop is designed to scale up from a single server to a cluster of many machines, each offering its own (local) computation and storage capabilities [1]. Structurally, Hadoop is a software infrastructure for the parallel processing of big data sets in large clusters of computers. The inherent property of Hadoop is the partitioning and parallel processing of massive data sets. Hadoop is based on the MapReduce programming model, which is suitable for any kind of data.
MapReduce is a framework for implementing distributed and parallel algorithms on datasets [18]. It was introduced by Google in 2004 to support distributed processing of large datasets across clusters of computers. The model follows the divide-and-conquer principle: the input data sets are divided into separate chunks that are processed in parallel by the map phase. The framework then sorts the map outputs, which are used as inputs for the reduce phase. These operations are thus carried out in three phases: the map phase, the sort phase, and the reduce phase [19]. The main idea of MapReduce is to divide the data into fixed-size chunks that can be processed in parallel, and the framework is also designed to tolerate node failures (fault tolerance) [20, 21].
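As a brief illustration of these three phases, the listing below sketches the canonical word-count job written against the Hadoop MapReduce Java API: the map phase emits (word, 1) pairs, the framework sorts and groups them by key, and the reduce phase sums the counts per word. The input and output paths passed via args are placeholders, and the listing is only an illustrative example, not the clustering job developed in this paper.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Canonical word-count job: map emits (word, 1), the framework sorts and groups
// by key, and reduce sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();                  // sum the counts for this word
            }
            result.set(sum);
            context.write(key, result);            // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory (placeholder)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (placeholder)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as a combiner performs local pre-aggregation on each node, which reduces the volume of intermediate data shuffled during the sort phase.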
Spark is an open-source big data framework [22]. It provides a faster and more general-purpose data processing engine. Spark is basically designed for fast computation, and it covers a wide range of workloads, for example batch, interactive, iterative, and streaming. Although Hadoop and Spark share the same purpose of processing big datasets, Spark is faster than Hadoop at executing map and reduce jobs. However, the overall performance of a platform cannot be judged by a single indicator. Hadoop may be preferred over Spark for the following reasons: low-cost hardware, its processing speed on very large data sources, fault tolerance, and the capacity to manage large datasets with the help of HDFS [23]. The selection of one big data platform over the others comes down to the specific application requirements; since our data set is large, and Hadoop outperforms Spark in processing speed for larger sets, it fits our case. Furthermore, for massively large datasets, Spark has been reported to crash with JVM heap exceptions while Hadoop still performed its task, which favors Hadoop in terms of fault tolerance. Therefore, in this study, we prefer to use Hadoop instead of Spark [24, 25].
Despite its benefits, the DBSCAN algorithm is not very effective at detecting clusters of varied density, and in big data clustering it is challenging both to set minPts for each dataset and to do so within the processing power of a single machine. Consequently, the performance and resource implications of running density-based clustering on big data with varied density, particularly on Hadoop in a cloud environment, have not yet been well studied.
In this paper, we have two goals: first, to propose a method for clustering big data sets with varied density; second, to run the algorithm in the MapReduce environment. With the development of big data, cluster analysis is widely used in areas such as finance, marketing, information retrieval, and data filtering.
The rest of the paper is structured as follows. Section 2 covers the literature on clustering algorithms related to this research. The proposed method, MR-VDBSCAN, is introduced in Section 3. Section 4 presents the results and discussion of the proposed method, followed by the conclusion in Section 5.