On K-means clustering-based approach for DDBSs design

Amer, Ali A.

doi:10.1186/s40537-020-00306-9

Research
Open access
Published: 11 May 2020

Ali A. Amer ORCID: orcid.org/0000-0002-2002-948X¹

Journal of Big Data volume 7, Article number: 31 (2020)

Abstract

In Distributed Database Systems (DDBS), communication costs and response time have long been open-ended challenges. Nevertheless, when a DDBS is carefully designed, the desired reduction in communication costs can be achieved. Data fragmentation (data clustering) and data allocation are the prime strategies in constant use to design DDBSs. Based on these strategies, several design techniques have been presented in the literature to improve DDBS performance using either empirical results or data statistics, which makes most of them imperfect or invalid, at least at the initial stage of DDBS design. In this paper, therefore, a heuristic k-means approach for vertical fragmentation and allocation is introduced. This approach is primarily focused on DDBS design at the initial stage. Several techniques are combined into a single approach to this end. A brief yet effective experimental study, on both artificially-created and real datasets, has been conducted to demonstrate the optimality of the proposed approach in comparison with its counterparts, and the obtained results are encouraging.

Introduction

In recent years, significant progress has been made in DDBS design. This progress has mostly concentrated on fragmentation and allocation techniques due to their critical impact on DDBS productivity, particularly in relational databases. On the one hand, the fragmentation process (horizontal, vertical or mixed) describes how each relation can be split into different data fragments (smaller relations). On the other hand, data allocation seeks to promote DDBS performance by placing the resulting fragments at the sites where they are most needed. Consequently, when data fragmentation and allocation are well performed, DDBS throughput is substantially improved. This improvement is usually achieved by minimizing irrelevant access (i.e. transmission minimization) to data stored at different sites while a distributed query is being processed. Briefly, the paper’s contributions are summarized as follows:

1. Developing a K-means clustering-based vertical fragmentation method in the relational database context. Unlike most earlier techniques, this work does not need data statistics, empirical results, minterm predicates, attribute affinity, an attribute affinity matrix or even a query frequency matrix to perform data fragmentation and allocation, at least at the initial stage. All it takes as input are the queries considered to be the most frequently used and the Query Usage Matrix (QUM), in which each entry indicates whether a specific site issues the relevant query or not. This step marks the novelty of the proposed work, as it is essentially committed to the initial stage of DDBS design.
2. Proposing a novel algorithm for the fragment refinement process. This algorithm produces non-overlapping schemes out of the overlapping schemes generated by the clustering process.
3. Combining several techniques into the proposed approach so that data locality maximization and communication cost reduction are met. Among these techniques are: a K-means-based process for query clustering, a scheme refinement process, a fragmentation evaluation technique, a site clustering process, and data allocation and replication. Consequently, a competitive DDBS design approach is expected to meet the required DDBS performance in either a static or a dynamic environment.
4. Evaluating the proposed work, on both artificially-created and real datasets, against two counterparts in the DDBS design literature. Experimental results showed significantly better performance for the proposed work compared with its peers.

The rest of this paper is structured as follows. In the “Related work” section, earlier relevant studies of DDBS design are explored. The proposed methodology, including the approach’s heuristics and architecture, the fragmentation and allocation cost model and the clustering process, is given in the “Proposed methodology” section. Results and discussion are presented in the “Results” and “Discussion” sections. Finally, “Conclusions and future work” draws the conclusions along with future work.

Related work

According to the literature, fragmentation techniques are horizontal, vertical and hybrid. While fragmentation is often done independently of data allocation, the data allocation process is always heavily contingent on the fragmentation process. In other words, it is done on the assumption that fragmentation has already taken place. Vertical fragmentation (VF), in turn, has attracted the attention of DDBS researchers because of its positive effects on DDBS performance. As the first of its kind in the DDBS field, [1] presented a fine-grained taxonomy of DDBS design. The basic issues examined in this taxonomy were data fragmentation and allocation. Data replication was scrutinized as well. All these issues were considered in order to classify and analyze a large number of previous works. The driving aim of this taxonomy was to take note of the drawbacks of earlier works and thereby increase the likelihood of producing more productive methods to improve DDBS performance. The reduction of transmission costs (TC), including communication costs, has been the objective that most DDBS design works have striven to meet. To meet this objective, however, a DDBS design has to maximize data locality and minimize access to remote data significantly. It was observed in [1] that most of the studied works failed to draw a clear “unified or consensually agreed-upon” definition of TC as a metric for DDBS performance, which is considered a major shortcoming.

A simultaneous relational-model-based vertical fragmentation and allocation technique was proposed along with a cost model in [2]. Minimizing communication costs was the prime motivation of the work. However, when the authors performed fragmentation, no cost model was involved to evaluate the resulting fragmentation solution(s), because only a single solution is produced regardless of its quality. Moreover, the authors did not consider site clustering or distinguish between read and write queries. Finally, the replication strategy was not addressed either. In [3], the data allocation problem (DAP) was tackled through a hybrid solution using the differential evolution (DE) algorithm along with the variable neighborhood search (VNS) technique. The key intention was to improve DE performance through the selection and crossover operators. The work was experimentally shown to be effective, as it explored the search space by DE along with neighborhood search. On the other hand, hypothesizing the existence of interlocking horizontally-fragmented data, the data replication problem (DRP) was treated in depth in [4] as an integer linear problem. Hence, data replication was addressed as an optimization problem for the sake of keeping copies of fragments and sites at a minimum. Along the same line, [5] developed a particle swarm optimization (PSO)-based method to reduce TC. The aim was to use PSO to find a solution for the DAP only.

While most of the earlier work used attribute affinity as the key element to fragment data, there have been many clustering-based fragmentation techniques. In [6], a heuristic technique for vertical fragmentation and data allocation was developed. The work was the first of its kind that sought to incorporate several techniques into one single work with the aim of maximizing DDBS performance. Extensive evaluation on several data allocation scenarios was performed to assert the proposed work’s effectiveness. As a follow-up optimization, [7] added a new data allocation scenario to the work of [6]. This scenario was shown to be inefficient, though, in some cases in which update queries grow steadily. Moreover, [8] sought to further enhance DDBS performance by proposing a new approach based on an aggregated similarity measure used to cluster queries. The authors proposed a greedy algorithm to solve the data allocation problem. A comprehensive evaluation against [6] was promised to assert the proposed work’s superiority, but no such evaluation has been given yet. Along the same line, [9] developed an enhanced technique to design DDBS. This work was also evaluated against [6] and shown to behave slightly better in most cases. Moreover, an acceptable experimental study was conducted to prove the concept. Following the same pattern as [6, 9], [10] proposed an enhanced scheme for vertical fragmentation and allocation. The aim of the work was to improve DDBS performance by finding an effective solution for minimizing the round-trip response time. Compared with related works, the authors claimed that their proposed scheme decreased the round-trip response time by 23%.

On the other hand, [11] worked on finding a vertical fragmentation method. An algorithm based on differential bond energy (DBE) was proposed. Based on the global affinity measure (GAM), the proposed algorithm was compared with the classical bond energy algorithm (BEA) in terms of performance. The experimental results attested that the developed DBE was suitable for high-dimensional problems, achieving a high GAM value compared with BEA on several datasets. To improve DDBS performance, [12] proposed a non-redundant dynamic fragment allocation approach. This approach amended the Threshold Time Volume and Distance Constraints algorithm with the read and write data volume. Fragments were allocated based on the access patterns made to each fragment. Similarly, an enhanced approach to split data at the initial stage of DDBS design and then assign data at runtime over the cloud environment was proposed in [13]. The data replication scenario was adopted in a way that allowed DDBSMs to work simultaneously to meet the client’s orders. In [14], a method to boost the performance and reliability of distributed systems was presented. The method sought to find the optimum placement, in network nodes, of the information and technological reserve (ITR). This method used information and software redundancy in the form of distributed copies of the ITR. It was mathematically shown to be able to boost the reaction of DDPS. It established that, after ITR copies are set up and allocated over the network, these copies serve as an information base for DDPS when user requests are being processed.

Finally, along the same line of solving the DAP, [15] presented a greedy algorithm called ASGOP. Data allocation was treated as an optimization problem and the cost model was solved using the knapsack algorithm. Each fragment was allocated to the intended site only when it was guaranteed that this site was the prime container based on its transmission costs. Two data allocation scenarios were addressed: replication-based and non-replication-based. The experimental results showed that ASGOP outperformed its counterparts in terms of data allocation due to its greedy nature.

Proposed methodology

Requirements

To perform data fragmentation and allocation, the following information is required:

A set of database relations (R₁, R₂, …, R_r), where (r) represents the number of considered relations.
For each relation R_i, the data schema R_i (A₁, A₂, …, A_n), which consists of (n) attributes.
A set of queries running against R_i, Q (Q₁, Q₂, …, Q_q), where (q) is the number of running queries.
Query Access Matrix (QAM), given by the DBA: each QAM value indicates whether query Q_k is issued from site S_j or not, where (k) and (j) are the indices for query and site respectively. This matrix is referred to elsewhere in the paper as the Query Usage Matrix (QUM).

Motivations to prefer K-means over hierarchical clustering (HC)

Hierarchical clustering (and its variations) is efficiently used in applications where points/patterns number in the tens or even hundreds, and it has long proven effective [6, 7, 9]. Nevertheless, as the size of data sets increases, HC becomes infeasible due to its non-linear time complexity (at least quadratic in the number of patterns for standard agglomerative algorithms) and its growing space requirements. As a matter of fact, it is not an easy task to visualize a dendrogram for, say, 1000 patterns (not to mention the complexity involved when patterns number in the thousands). Accurately examining a large number of patterns in HC therefore takes prohibitively long. To sum up, HC (and its variations) does not scale well to large-scale applications that involve thousands or even millions of patterns. K-means, on the other hand, is a hard clustering algorithm that is often applied to large datasets and has long proven efficient in the literature [16]. Moreover, a dominant property of the k-means algorithm is its ability to successively minimize the sum of squared deviations of the patterns from the center of each cluster (called the squared-error criterion in the literature). Formally speaking, assuming a cluster X_i with center ${\mathcal{M}}_{i}$, the criterion function that k-means seeks to minimize is given in Eq. (1):

$$\mathop \sum \limits_{i = 1}^{\# clusters} \mathop \sum \limits_{j = 1}^{\# X_{i}} \left( {X_{i} \left( j \right) - {\mathcal{M}}_{i} } \right)^{T} \left( {X_{i} \left( j \right) - {\mathcal{M}}_{i} } \right)$$

(1)
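
As an illustration of Eq. (1), the following minimal Python sketch computes the squared-error criterion for a given hard clustering. The function names and the toy data are hypothetical and not taken from the paper; each cluster center is assumed to be the component-wise mean of its members.

```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(points: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def squared_error(clusters: Sequence[Sequence[Vector]]) -> float:
    """Sum, over all clusters, of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for members in clusters:
        center = centroid(members)
        for x in members:
            total += sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return total


# Hypothetical toy data: two clusters of two 2-D patterns each.
toy_clusters = [[(0, 0), (0, 2)], [(5, 5), (7, 5)]]
print(squared_error(toy_clusters))  # 4.0
```

Each K-means iteration can only keep this value the same or lower it, which is why the quality of the final clustering depends on where the iterations start.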

On the other hand, as a crucial drawback, the k-means algorithm does not guarantee globally-optimal fragments, for two basic reasons: (1) a poor selection of initial seeds (centers), and (2) the “winner-take-all” technique of the traditional k-means algorithm, in which each pattern is given to one and only one winning cluster so that hard fragments are eventually generated. To tackle this shortcoming and enhance the results, K-means is applied as follows: initial seeds are chosen heuristically. The widely-known heuristic is to pick the initial “k” centers so that they are as far away from each other as possible. In the literature, this heuristic has worked well in practice. In particular, picking the pair of patterns that are most dissimilar in the set as the initial seeds decreases the dependency on the initialization process. Experimentally, this mechanism substantially serves the proposed work in finding a competitive solution for DDBS design. This is reflected in the better results obtained by the K-means-based work compared with its counterparts. It is worth indicating that there are other alternatives for centroid selection in the literature, such as random generation, the Buckshot approach and the ranking technique [17, 18]. Some of these strategies were also tested in our work, but none produced better results than the “greatly dissimilar” strategy adopted here.
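
The “greatly dissimilar” seeding heuristic can be sketched as follows. This is only an illustration under stated assumptions: patterns are binary query vectors compared with the Hamming distance, the first two seeds are the most dissimilar pair, and any further seed is the pattern farthest from its nearest already-chosen seed (a farthest-first rule that the paper does not spell out). The name `pick_seeds` and the toy patterns are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def pick_seeds(patterns: Sequence[Sequence[int]], k: int) -> List[int]:
    """Return the indices of k seed patterns (2 <= k <= number of patterns)."""
    if not 2 <= k <= len(patterns):
        raise ValueError("k must be between 2 and the number of patterns")
    # Start from the most dissimilar pair of patterns.
    first, second = max(
        combinations(range(len(patterns)), 2),
        key=lambda pair: hamming(patterns[pair[0]], patterns[pair[1]]),
    )
    seeds = [first, second]
    # Assumption: each further seed maximises its distance to the nearest chosen seed.
    while len(seeds) < k:
        candidates = [i for i in range(len(patterns)) if i not in seeds]
        nxt = max(
            candidates,
            key=lambda i: min(hamming(patterns[i], patterns[s]) for s in seeds),
        )
        seeds.append(nxt)
    return seeds


# Hypothetical query patterns over four attributes.
toy_patterns = [(1, 1, 0, 0), (1, 0, 0, 0), (0, 0, 1, 1), (0, 1, 1, 1)]
print(pick_seeds(toy_patterns, 2))  # [0, 2]
```

Ties are broken by position here, so the outcome is deterministic for a fixed ordering of the patterns.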

Finally, the time complexity of K-means is O(NKID), where N is the number of considered patterns, K is the number of generated clusters, I is the number of iterations and D is the dimensionality. The space requirement is O(KD). This complexity is lower than that of HC, which makes K-means appealing for DDBS design.

Heuristics

Our work is a three-fold approach, as drawn in Fig. 1 and detailed as follows.

Phase (1): the query set, i.e. the most frequently run queries, was identified. For each query in the set, every attribute was replaced by a binary value (1 or 0) marking its presence or absence in the query. Doing so for all queries, the Attribute Access Matrix (AAM) was constructed so that its rows represented queries and its columns represented attributes (see Table 4). This matrix was used, with the help of the Hamming distance metric [19], to find the difference values among the recognized query patterns. These difference values were drawn into a matrix called the Query Difference Matrix (QDM). Then, using QDM as the initial input, the K-means clustering process was activated as presented in the “Clustering methodology” section.
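
A minimal sketch of this phase is given below, assuming each query is represented simply by the set of attribute names it references. The helper names `build_aam` and `build_qdm`, as well as the three toy queries, are hypothetical and serve only to show how the binary matrix and the Hamming-based difference matrix relate.

```python
from typing import Dict, List, Sequence, Set


def build_aam(queries: Dict[str, Set[str]], attributes: Sequence[str]) -> List[List[int]]:
    """One row per query, one column per attribute; 1 marks presence, 0 absence."""
    return [[1 if a in used else 0 for a in attributes] for used in queries.values()]


def build_qdm(aam: Sequence[Sequence[int]]) -> List[List[int]]:
    """Pairwise Hamming distances between the query rows of the AAM."""
    return [[sum(x != y for x, y in zip(r1, r2)) for r2 in aam] for r1 in aam]


# Hypothetical query set over four attributes.
attributes = ["A1", "A2", "A3", "A4"]
queries = {
    "Q1": {"A1", "A2"},
    "Q2": {"A2", "A3"},
    "Q3": {"A1", "A2", "A4"},
}
aam = build_aam(queries, attributes)
qdm = build_qdm(aam)
print(aam)  # [[1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 0, 1]]
print(qdm)  # [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
```

The resulting matrix is symmetric with a zero diagonal, so only its upper triangle actually needs to be stored.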

Phase (2): the refinement process was designed to guarantee non-overlapping fragments. All fragmentation combinations had to be created for each cluster. The input parameters for this process were all overlapping partitioning schemes (PSs) produced by the fragmentation phase, and the expected results were non-overlapping schemes. However, when query clusters were examined, it happened that some attribute(s) went missing in some clusters. Such attribute(s) were missed during the clustering process due to the loss of some queries in each cluster, chiefly as clusters grew aggressively. Therefore, since the prime goal of this approach was to keep as high a percentage of binding (“connection”) among attributes with regard to their original queries as possible, the missing attribute(s) were added according to the proposed affinity function (aff_func(partition, attribute/attributes), Eq. (2)). For each cluster, this function checked the strength of the connection of the attribute(s) with all partitions of each Partitioning Schema (PS) individually, based on Eqs. (3) and (4). Then, whenever a certain partition had the maximum connection with the attribute(s) in question, it was the prime candidate container to store them. Nevertheless, if the attribute(s) were equally required by “N” partitions in a PS, they were added “N” times. Each time, the attribute(s) were added to a different one of these partitions, making a new PS with each addition.

A clear manifestation is given in Table 15, where p3 and p4 were derived from the original PS3. The connection was calculated based on the attributes’ appearance, in each partition of each cluster, with regard to their original queries. That is, the partition that yielded the highest connection with the missing attribute(s) was the candidate partition to store them. However, if the attribute(s) had zero connection, they were placed in a new partition of their own inside the underlying PS. The proposed affinity function is presented in Eq. (2) as follows:

$$aff\_func\left( {partition,{\text{A/As}}} \right) = \begin{cases} add\left( {{\text{A}},partition} \right), & con = T \\ CNP\left( {{\text{A/As}}} \right), & con = F \end{cases}$$

(2)

where partition is the concerned partition, A (or As) is shorthand for attribute/attributes, con is a logical factor that indicates whether a connection exists or not, and CNP stands for creating a new partition. T and F stand for the true and false flags. By strictly following this procedure, the likelihood of obtaining PSs with minimum remote access costs and maximum data locality is increased. In other words, among all generated combinations, only schemes with a high connection rate are passed to the fragmentation evaluator. To select those schemes, an optimality measure (OM) parameter was proposed. OM is a criterion that reflects the recorded correlation rate of access costs between each PS and all its relevant queries, Eqs. (3) and (4).

$$Optimality \;Measure \;\left( {OM} \right) = 1 - Correlation \;Rate$$

(3)

$$Correlation \;Rate = \frac{{\mathop \sum \nolimits_{i = 1}^{ps} \mathop \sum \nolimits_{j = 1}^{q} Cr_{ + + } }}{Q}$$

(4)

where Cr is an integer counter. The correlation rate, in turn, was used to reflect the maximum remote access needed by all queries to reach the relevant PS. That is, a PS that incurred higher remote access was neglected. The bigger OM was, the more remote access was minimized and local access maximized. Consequently, part of the design objective of the technique was met. Finally, the ultimate decision to include a PS in, or exclude it from, the FE was made as per Eq. (5). Figure 2 depicts the steps of the process.

$$Decision \;Making \left( {PS} \right) = \begin{cases} {\text{include PS in FE}}, & OM \ge 50\% \\ {\text{exclude PS from FE}}, & {\text{otherwise}} \end{cases}$$

(5)
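
The selection rule of Eqs. (3) to (5) can be sketched in a few lines of Python. The sketch assumes that the counter Cr has already been accumulated for a PS and that Q is the number of considered queries, so that the correlation rate is a fraction between 0 and 1; the function names are hypothetical.

```python
def correlation_rate(cr: int, num_queries: int) -> float:
    """Eq. (4) under the stated assumption: accumulated counter divided by the query count."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return cr / num_queries


def optimality_measure(rate: float) -> float:
    """Eq. (3): OM = 1 - correlation rate."""
    return 1.0 - rate


def include_in_fe(rate: float, threshold: float = 0.5) -> bool:
    """Eq. (5): keep the PS for the fragmentation evaluator when OM >= 50%."""
    return optimality_measure(rate) >= threshold


# Hypothetical counters for two schemes evaluated against 8 queries.
print(include_in_fe(correlation_rate(3, 8)))  # True  (OM = 0.625)
print(include_in_fe(correlation_rate(6, 8)))  # False (OM = 0.25)
```

The threshold is exposed as a parameter only for illustration; the paper fixes it at 50%.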

Finally, for this phase, whenever schemas were duplicated, only one copy of each was kept.
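
The attribute-placement rule of Eq. (2) used in this phase can be illustrated with the sketch below. The connection measure is an assumption made for the example: it counts the original queries in which the missing attribute appears together with at least one attribute of the partition. The names `connection` and `aff_func`, and the toy schema, are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Set

Partition = FrozenSet[str]
Schema = List[Partition]


def connection(partition: Partition, attribute: str, queries: Sequence[Set[str]]) -> int:
    """Assumed measure: queries using the attribute together with the partition."""
    return sum(1 for q in queries if attribute in q and partition & q)


def aff_func(schema: Schema, attribute: str, queries: Sequence[Set[str]]) -> List[Schema]:
    """Return the schema(s) obtained by placing a missing attribute."""
    scores = [connection(p, attribute, queries) for p in schema]
    best = max(scores, default=0)
    if best == 0:
        # con = F: create a new partition (CNP) holding only the attribute.
        return [list(schema) + [frozenset({attribute})]]
    # con = T: add the attribute to every partition with maximum connection,
    # producing one new schema per tied partition.
    results = []
    for idx, score in enumerate(scores):
        if score == best:
            new_schema = list(schema)
            new_schema[idx] = schema[idx] | {attribute}
            results.append(new_schema)
    return results


# Hypothetical example: A5 is equally connected to both partitions.
queries = [{"A1", "A2", "A5"}, {"A3", "A5"}, {"A3", "A4"}]
schema = [frozenset({"A1", "A2"}), frozenset({"A3", "A4"})]
print(len(aff_func(schema, "A5", queries)))  # 2
```

In the tie case two candidate schemes are returned, mirroring the way a single PS can give rise to several refined schemes.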

Phase (3): the evaluation step was done using the fragmentation evaluator (FE) presented in [6]. FE uses two measures to assess schemes: the relevant remote access and the irrelevant local access. According to [6], the successful partitioning schema is the one with the lowest FE value. This schema is then considered for the allocation process.

Fragmentation and allocation cost model

Based on the running queries, the attribute access matrix (AAM) is derived. In this matrix, each aam_ik signifies whether A_i is accessed by Q_k. It is assumed that the query usage matrix (QUM) has already been supplied by the DBA [6], so that each qum_jk indicates whether Q_k was launched from site S_j. Using these requirements, the process of data fragmentation and allocation is done based on the following functions:

$${\text{Similarity}}\left( {Q_{k1} ,Q_{k2} } \right) = \mathop \sum \limits_{k1 = 1}^{q} \mathop \sum \limits_{k2 = 1}^{q} \left( {1 - {\text{difference}}\left( {P\left( {Q_{k1} } \right),P\left( {Q_{k2} } \right)} \right)} \right),$$

(6)

where Similarity, difference and P(Q) stand for the similarity, the difference between queries, and the numerical pattern of Q_k respectively. Using Eq. (6), the query difference matrix (QDM) is constructed as seen in Eq. (7). Each qdm_k1k2 represents the similarity calculated for a query pair.

$${\text{QDM}}_{k1k2} = \mathop \sum \limits_{k1 = 1}^{q} \mathop \sum \limits_{k2 = 1}^{q} {\text{Similarity (Q}}_{k1} , Q_{k2} )$$

(7)

Proposed allocation and replication model

Suppose we have a set of “K” queries, Q = {Q₁, Q₂, …, Q_k}, that access N attributes A = {A₁, A₂, …, A_n}. These queries are grouped into CN clusters {Cq₁, Cq₂, …, Cq_cn}. Query clusters are placed at a set of M sites S = {S₁, S₂, …, S_m}. Sites, in turn, are clustered into CM clusters Cs = {Cs₁, Cs₂, …, Cs_cm}, and F = {F₁, F₂, …, F_m} is the set of disjoint fragments produced by the query clustering process. The proposed data allocation model then seeks to optimally distribute each fragment (F) over the cluster set, Cs, and then over the sites of each cluster.

Allocation scenarios

First scenario, first phase (replication adopted): each fragment (F) was replicated over all clusters. It is worth indicating that the proposed work adopts the replication principles given in [4] to replicate data when needed.

Second scenario, first phase (no replication adopted): each fragment (F) was assigned to the cluster (C) with the maximum access cost.

Both scenarios, second phase (no replication inside each cluster): the total cost for each S_j to reach all attributes (A) of F_i was the assignment controller. So, the total access cost of each S_j (NACS_ij) has to be precisely computed. The NACS matrix was created using both the site attribute access matrix (SAAM), computed in Eq. (8), and the communication costs matrix (CMS), see Eq. (9). In NACS, the site with the maximum cost for the intended fragment was selected as the candidate site to store the fragment in question.
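
The second-phase rule can be sketched as follows. The sketch makes several assumptions: AAM and QUM are stored with one row per query (attributes and sites as columns respectively), SAAM counts how often each attribute is accessed from each site, and the NACS matrix is taken as a precomputed input indexed by attribute and site rather than derived from CMS here. The function names and all numbers are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[int]]


def build_saam(aam: Matrix, qum: Matrix) -> List[List[int]]:
    """saam[i][j] = sum over queries k of aam[k][i] * qum[k][j]."""
    n_attrs, n_sites = len(aam[0]), len(qum[0])
    return [
        [sum(aam[k][i] * qum[k][j] for k in range(len(aam))) for j in range(n_sites)]
        for i in range(n_attrs)
    ]


def choose_site(fragment: Sequence[int], nacs: Matrix) -> int:
    """Index of the site whose summed NACS over the fragment's attributes is largest."""
    n_sites = len(nacs[0])
    totals = [sum(nacs[i][j] for i in fragment) for j in range(n_sites)]
    return max(range(n_sites), key=totals.__getitem__)


# Hypothetical inputs: two queries, three attributes, two sites.
aam = [[1, 1, 0], [0, 1, 1]]   # rows: Q1, Q2; columns: A1..A3
qum = [[1, 0], [1, 1]]         # rows: Q1, Q2; columns: S1, S2
print(build_saam(aam, qum))    # [[1, 0], [2, 1], [1, 1]]

nacs = [[4, 9], [6, 2], [1, 1]]  # assumed attribute-by-site cost matrix
print(choose_site([0, 1], nacs))  # 1
```

When two sites tie, the sketch returns the first one; the paper does not state a tie-breaking rule.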

Allocation cost functions

$$SAAM = \mathop \sum \limits_{k = 1}^{q} \mathop \sum \limits_{j = 1}^{m} \mathop \sum \limits_{i = 1}^{n} AAM_{ik} * QUM_{jk}$$

(8)

$$NACS = \mathop \sum \limits_{j1 = 1}^{m} \mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{m} SAAM_{ij1} * CMS_{ji}$$

(9)

$$NACC = \mathop \sum \limits_{k = 1}^{cn} \mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{m} SAAM_{ij1} + 1$$

(10)

$$PACM = \mathop \sum \limits_{k1 = 1}^{cn} \mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{k = 1}^{cn} NACC_{ik1} * CCM_{ki}$$

(11)

where AAM, QUM, CMS, and CCM stand for the attribute access matrix, the query usage matrix, the communication costs between sites and the communication costs between clusters respectively. Equation (8) built SAAM, which was later used to build the NACS matrix through Eq. (9). Equation (10) accumulated the access costs for each attribute over its relevant clusters with respect to the sites contained in each cluster. Lastly, Eq. (11) drew the last step of fragment allocation over each site cluster in the second scenario. It is worth indicating that this cost model (including its equations) has been solved using integer linear programming (ILP), as the objective of the whole work is to minimize transmission costs. The following constraints were maintained throughout the data allocation process.

$$\mathop \sum \limits_{k = 1}^{cq} Size \left( {F_{k} } \right) \le Capacity \left( {S_{j} } \right), \quad \forall j = 1, \ldots , m.$$

(12)

$$LA_{j} \le \mathop \sum \limits_{i = 1}^{n} X_{ij} \le UA_{j} , \quad \forall j = 1, \ldots , m.$$

(13)

$$X_{ij} \in \left\{ {0,1} \right\}$$

(14)

Constraint (12) ensured that the total size of the fragments assigned to one site does not exceed the site capacity, as shown in Table 1. Constraint (13) guaranteed that the number of assigned attributes lies between the lower limit of allowed attributes (LAL) and the upper limit (UAL). Finally, constraint (14) defined the decision variable in binary form. Table 1 lists these constraints as capacity (measured in megabytes), LAL, and UAL.

Table 1 Site constraints
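
A feasibility check for constraints (12) to (14) might look like the sketch below. It is written at the attribute level as a simplifying assumption: the size stored at a site is taken to be the sum of the sizes of the attributes assigned to it. The function name and all values are hypothetical and are not the contents of Table 1.

```python
from typing import Sequence


def satisfies_constraints(
    x: Sequence[Sequence[int]],   # x[i][j] == 1 when attribute i is assigned to site j
    attr_size: Sequence[float],   # size of each attribute, in megabytes
    capacity: Sequence[float],    # capacity of each site, in megabytes
    lower: Sequence[int],         # LAL per site
    upper: Sequence[int],         # UAL per site
) -> bool:
    """Check the binary, capacity and attribute-count constraints for an assignment."""
    # Constraint (14): every decision variable is binary.
    if any(v not in (0, 1) for row in x for v in row):
        return False
    for j in range(len(capacity)):
        assigned = [i for i, row in enumerate(x) if row[j] == 1]
        # Constraint (12): stored size must not exceed the site capacity.
        if sum(attr_size[i] for i in assigned) > capacity[j]:
            return False
        # Constraint (13): attribute count must lie between LAL and UAL.
        if not lower[j] <= len(assigned) <= upper[j]:
            return False
    return True


# Hypothetical assignment of three attributes to two sites.
x = [[1, 0], [1, 0], [0, 1]]
print(satisfies_constraints(x, [10, 20, 5], [40, 10], [1, 1], [2, 2]))  # True
print(satisfies_constraints(x, [10, 20, 5], [25, 10], [1, 1], [2, 2]))  # False
```

A check of this kind only validates a candidate assignment; finding the cost-minimizing assignment is left to the ILP solver mentioned above.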
