Minimum threshold determination method based on dataset characteristics in association rule mining

Association rule mining is a technique that is widely used in data mining. This technique is used to identify interesting relationships between sets of items in a dataset and predict associative behavior for new data. Before the rule is formed, it must be determined in advance which items will be involved or called the frequent itemset. In this step, a threshold is used to eliminate items excluded in the frequent itemset which is also known as the minimum support. Furthermore, the threshold provides an important role in determining the number of rules generated. However, setting the wrong threshold leads to the failure of the association rule mining to obtain rules. Currently, user determines the minimum support value randomly. This leads to a challenge that becomes worse for a user that is ignorant of the dataset characteristics. It causes a lot of memory and time consumption. This is because the rule formation process is repeated until it finds the desired number of rules. The value of minimum support in the adaptive support model is determined based on the average and total number of items in each transaction, as well as their support values. Furthermore, the proposed method also uses certain criteria as thresholds, therefore, the resulting rules are in accordance with user needs. The minimum support value in the proposed method is obtained from the average utility value divided by the total existing transactions. Experiments were carried out on 8 specific datasets to determine the association rules using different dataset characteristics. The trial of the proposed adaptive support method uses 2 basic algorithms in the association rule, namely Apriori and Fpgrowth. The test is carried out repeatedly to determine the highest and lowest minimum support values. The result showed that 6 out of 8 datasets produced minimum and maximum support values for the apriori and fpgrowth algorithms. This means that the value of the proposed adaptive support has the ability to generate a rule when viewed from the quality as adaptive support produces at a lift ratio value of > 1. The dataset characteristics obtained from the experimental results can be used as a factor to determine the minimum threshold value.


Introduction
The increase in data usage led to large quantities of data growth which were accessible from various locations everywhere at all times [1]. Data availability is a huge asset for an organization because it contains a lot of useful information. Therefore, Data Mining is a way to extract this useful information and patterns [1][2][3][4][5][6]. Furthermore, the Association rule is one of the most basic study topics widely used in Data Mining [1][2][3][4][5][7][8][9][10][11]. It is an important data analysis method that discovers and identifies interesting relationships between sets of items in a dataset and predicts association relationships for new data [7,8,12].
The basic concept of the association rule is to generate rules based on items with frequent occurrence in a transaction. This includes two main processes, namely the determination of frequent itemset and the process of forming rules. A frequent itemset is a collection of items that have a higher frequency of occurrence compared to the threshold value specified in the transaction. This value is also known as the minimum support. The real-time applications of the association rule have several challenges ranging from the presence of very large, multiple, and heterogeneous data sources to difficulty in the determination the value of minimum support [7]. Furthermore, the first step in the association rule is to determine the minimum support value [13,14]. Currently, this value is determined by the user. Currently, this value is determined by the user. This is done because the user is considered to know the most about the dataset. In addition, the user can set limits to what extent and how much of the desired output. In fact, not a few users have difficulty in determining the value of minimum support. Another difficulty is that the user is ignorant of the dataset characteristics to be searched during this process [3,6,11,[15][16][17][18].
The minimum support value has an important role in determining the value of several rules generated [3]. However, determining a wrong value leads to the failure of the association rule to obtain the required rule [19]. Furthermore, determining a value that is too low leads to the involvement of too many items in the rule formation process. Conversely, when a value that is too high is obtained, fewer items are involved which leads to the loss of a lot of information. The minimum support values when compared in terms of two factors namely, time and memory showed that a value that is too low will require more of these two factors when compared to a higher value [6,11,20]. Currently, the method used to determine the value of minimum support is based on the intuitiveness of the user. The association rule process is repeated through the change of the value of minimum support when the rule obtained does not match. However, this does not guarantee that a rule will be generated with the input minimum support value.
Decision-makers are required to make the right strategic choices which lead to the adjustment of current conditions which are full of volatility, uncertainty, complexity, and ambiguity. This is known as the VUCA (Volatility, Uncertainty, Complexity, Ambiguity) world [21]. Furthermore, the current conditions also demand the adaptive transformation of the association rules to meet the needs of the users. Decision-makers have different criteria which are used to obtain the information they need. Therefore the rule formation process should pay attention to the criteria desired by the users to obtain adaptive Rules [22]. Furthermore, to meet these needs, the determination of frequent itemset should not only be based on the frequency of occurrence of an item but also involve another criterion called item utility [10,[23][24][25]. Therefore, the resulting rule will be adaptive according to the criteria desired by the user.
This study which proposes a method to determine the value of minimum support based on the characteristics of the dataset and other criteria can also be used as assessments in the process of forming the rule. In the proposed method, the user is not required to determine the minimum support value at the beginning, because the minimum support value is calculated automatically based on the characteristics of the dataset. In addition, the determination of the minimum threshold value is not only based on the frequency of occurrence of items but involves other criteria that are factors in the rule formation process. With this method the rule formation process becomes more adaptive according to user needs. The proposed method is only up to the process of determining the minimum threshold for selecting frequent itemset. After the frequent itemset is formed, the rule formation process can use various existing algorithms. With this proposed method, the user is not bothered with determining the value of minimum support and must repeat the rule formation process many times because the selection of the value of minimum support is not appropriate. The main contributions of this study include: 1. A new method for determining the value of minimum support based on the characteristics of the dataset, so that the user does not need to determine the minimum support value at the beginning. This method will automatically read and calculate each item as well as its frequency in the dataset and propose the appropriate minimum threshold value. 2. The generation of a minimum support value-based not only on the frequency of occurrence of items but also based on certain criteria affecting the rule formation process such as price, profit, or item usability value. By involving certain criteria, the rule formation process becomes more adaptive according to the needs and desires of the user. Users can focus on rules based on items that have high utility.
This study is divided into several parts. Part I contains the introduction and background of this research. The related work will be described in part II, proposed methods will be explained in part III, experimental results and discussion will be explained in part IV, conclusions will be explained in part V.

Association rule
The concept of association rule mining was initially introduced in a research paper by Agrawal [26,27]. This study developed a focus on different topics ranging from high utility itemset [23,25,[28][29][30][31], Top-K [5,11,32], Skyline [6,10,33], Multicriteria [34][35][36] and Meta association rules [37][38][39]. Furthermore, the following is a formal definition of the main concept of association rule mining: I = {i1, i2, . . ., im} is a set of items and D is a set of transactions T, where a set of transactions T is also a set of items, therefore, T ⊆ I. Furthermore, provided that A is a set of items, Transaction T is said to contain A if and only if A ⊆ T. Association rules are from A B, where A ⊆ I, B ⊆ I, and A ∩ B = Ø. In addition, rule A B has support in transaction set D provided that s% of transactions in D contain A ∪ B [4,22,40]. This support is shown by the occurrence frequency of items which is calculated by observing the ratio between the frequency of transactions containing itemset A divided by the total transactions. In general, it can be seen in the formula below [37]: Where, Supp is the support value; A is the antecedent of the rule in the form of itemset; B is consequent in the form of itemset; t is a transaction containing A and B; D is the total transaction.
Confidence is another threshold used in determining the rule apart from the support value. It is the ratio between the number of transactions containing items A and B divided by transactions containing item A for rule AB. The confidence value of rule A B is obtained by the formula [37]: Where, Conf is Confidence; A is the antecedent of the rule in the form of itemset; B is consequent in the form of itemset; t is a transaction containing A and B; D is the total transaction.
The general framework of the association rule is to extract a rule with a support value for an item that exceeds the minimum support and confidence value. Therefore, this rule exceeds the minimum confidence value specified by the user. In this case, it can be stated that A B is included in the frequent and confidence category (strong rule) provided that Supp(AB) ≥ minsupp and Conf(AB) ≥ minconf respectively.
To evaluate rule, lift ratio can be used [35,36,[41][42][43][44][45][46][47][48]. Lift ratio is the ratio between the support value of the rule with the antecedent and consequent support value. The higher of lift ratio value, the more interest rule or called the strong rule. A rule is called interest and strong rule if the lift ratio value > 1 because it shows that is a positive correlation between the premise and conclusion of this association rule [36,48]. The lift ratio value can be calculated by the formula [36, 41-43, 47, 48]: Where, Lift is Lift Ratio value; A is the antecedent of the rule in the form of itemset; B is consequent in the form of itemset; Supp is the support value; Conf is the confidence value.

Minimum support
The basic concept of the association rule requires the user to specify a minimum support value at the beginning. This value usually applies uniformly to all items, although in reality, different items may have different criteria for assessing them. Therefore, studies about multiple minimum support which state that the support value should vary for different items have emerged [9,23,49,50]. However, the implementation of this system adds a task for the user, which is to determine the minimum support for each item. Furthermore, the difficulty of determining the minimum support value by the user has led to a new field of study, namely association rules without the use of minimum support such as Top-K [5,11,19,32,51] and Skyline [6,10,33,52]. The Top-k association rule does not require the minimum support value because in this method the user is only asked to determine the k value, which is the number of rules that will be generated in the rule formation process. Therefore, it is easier for users to determine the k value because they explicitly know the number of rule results they want to obtain.
The Skyline algorithm was first proposed by Borzsony [52] and later developed by Goyal [33]. It is a point that is not dominated by other points [52]. This algorithm was combined by Jerry Chun-Wei Lin [10] and Jeng-Shyang Pan [6] to produce association rules. Furthermore, this study does not use minimum support but instead makes use of the maximum utility (utilmax) the result of each iteration of the utility list structure.
The association rules require a minimum support value to decrease the number of items used in the rule creation process, according to the results of the previous evaluation [20]. In addition, the presence of this threshold can lower the amount of time and memory required throughout the rule-making process. Therefore, regardless of the term, starting from Minimum Utility, Maximum Utility to Minimum Support, this threshold is required for the association rule process. Meanwhile, processes without the threshold will involve all the items present and require more time and memory, which will lead to rules that may not be as desired.

Literature review automate minimum support
Choosing a minimum support value is one of the most difficult aspects of applying association rules. This is because most methods presume that all database items are comparable and occur at the same frequency. However, this assumption is incorrect because some items appear frequently in the database, compared to others [53].
Furthermore, existing algorithms such as apriori and fpgrowth do not have the ability to determine the minimum support and threshold values, therefore, the user estimates these parameters intuitively. The association rule mining algorithm can generate a large number of rules, thereby causing the algorithm to experience long execution times and large memory consumption and vice versa. However, this is dependent on the threshold choice [9].
Users find it difficult to set minimum support, which led to the creation of Aprioribased mining algorithms, a frequent and attractive itemset. This causes a challenging problem due to the performance of this algorithm is highly dependent on some userdefined threshold. For example, assuming the minimum support value is too large, the database becomes empty. Small minimum support, on the other hand, leads to poor mining performance and a slew of unappealing association rules. As a result, users are being asked to identify the specifics of the database to be mined as well as the suitable threshold in an unreasonable manner. Although the minimum support was explored under the supervision of experienced miners, the results were not in accordance with users' needs [15].
Zhang [15], carried out a study with the main contribution of providing a strategy to convert fuzzy (user-defined) thresholds into actual minimum support. As a result, a strategy capable of recognizing some aspects of the database to be mined is required in order to construct a conversion function. Users must still define the real minimum support that corresponds to the database to be mined when using existing Apriori algorithms. However, without proper knowledge, it is impossible to establish the minimal support that matches to the database. Zhang proposed a computational strategy to overcome the problem of minimum support settings. This strategy differs from the existing Apriori algorithm because it allows users to define their mining requirements in a commonly used mode and automatically converts the specified threshold into actual minimum support.
Trivedi [53] carried out a study on the Semi-Apriori algorithm by integrating the average support threshold. This was followed by checking frequent items to determine the data using an automatically generated support threshold to create the itemset more frequently. This reduces time complexity as well as space complexity.
Dahbi [9] designed a method for determining an appropriate minimum threshold value for effective support. The initial contribution was that instead of using user-defined constant values, this study determined the minimum support (minsup) automatically for each data set. Meanwhile, the second made dynamic adjustments (updates) to this minsup by applying a single, standardized minimum support threshold to each level. However, not all objects in an itemset work in the same way; some were used frequently while others were used infrequently. As a result, the minsup threshold must vary depending on the item level.
Kanimozhi [3], stated that a technique with a suitable automatic support threshold at each level is one of the right choices to overcome the problems associated with the minimum support. Therefore, to achieve this task, a technique that uses the automated support system to generate the appropriate rules without losing the rules of interest was proposed based on a Confidence-Lift Measure. This approach was used to determine the initial minsup value by analyzing the itemset and its frequency. It also proposes a cumulative support threshold at the next level using items considered at the previous and current levels.
Based on previous research, most of the determination value of the minimum support is determined by the user. This becomes a problem when the user does not know the characteristics of the dataset. This causes the rule formation process to be repeated to obtain the appropriate number of rules. In this study, an automatic minimum threshold determination method is proposed based on the characteristics of the dataset, so that the user does not need to determine the value of minimum support at the beginning. In addition, if the minimum threshold value is only determined based on frequency, it is unfair for items that have other advantages, so that in this study, the determination of the minimum threshold value also involves other criteria that influence the formation of the rule. In the adaptive support method, the value of minimum support is not determined by the user. This can overcome the difficulties and problems that exist in the current association rule. In addition, in adaptive support the threshold value is not only based on frequency but also involves certain criteria that can give different weights to each item. The proposed threshold value does not need to be recalculated at each level so that it consumes memory and time more efficiently. The adaptive support method for determining the minimum threshold can be implemented in other association rule algorithms.

The proposed model
To determine the value of minimum support in the association rule, a special method is needed to determine the value according to the characteristics of the dataset which involves certain criteria according to the desires of the user. Therefore, this study developed a method for calculating the minimum support [15], which is similar to the previously used, as shown in Fig. 1.
The adaptive support method comprises 2 types of input, namely transaction datasets and criteria values, which are used to calculate the utility of each item by multiplying their support. This method is also used to determine the frequent itemset, which also involves other predetermined criteria. From the utility results of each item, the average overall utility in the dataset is calculated and the minimum threshold value is obtained, which is further divided by the number of transactions. Furthermore, the algorithm used to determine the value of minimum support based on the characteristics of the database and certain items criteria (utility) is shown in algorithm 1.
In the adaptive support model, the determination the value of minimum support is based on several factors which include:

Characteristics of the dataset
In determining the value of minimum support, the user should know in advance, the characteristics of the dataset that will be processed for rule formation. This is because several characteristic factors of the dataset will affect the suitable minimum support value. Furthermore, for an adaptive support mode, the factor used to deter-mine the value of minimum support is shown from the number of items contained in the transaction, the number of transactions, the average number of items in each transaction, and the support value for each item. The occurrence frequency of items is inadequate to be used as a threshold in order to produce adaptive rules. Therefore, a criterion known as items utility is required as an assessment for an item known as a frequent itemset with the right to be a part of the rule-making process. Furthermore, these criteria can be determined by the user and each item has its own criteria values. For example, a user who wants to get a rule for the most expensive item has an involved criterion such as, the item price. In addition, another example is when a user wants a rule with the biggest profit then, the criterion for the item involved is the profit from each item. Based on both types of inputs, the calculation process is carried out according to algorithm 1 to obtain adaptive support. The calculation stages are as follows: 1. Calculation of the support value for each item in the dataset with the following formula: 2. Calculation of the utility for each item in the dataset with the following formula: 3. Calculation of the average utility for the entire transaction with the following formula: 4. Calculation of the minimum threshold value used for the rule formation process which is the average utility value divided by the total existing transactions through the following formula: 5. Performance of the rule formation process with existing methods through the apriori algorithm, fpgrowth, or other algorithms Where, Sup(d) = support value for an item; n(d) = number of occurrences for an item; |D|= total transaction; |N|= total item; U(d) = utility value for an item; Util(d) = utility and support value for an item; Avesup = Average utility of the item; Minsup = minimum threshold value (item density level). The determination of the minimum threshold value can be calculated automatically based on the characteristics of the dataset from the proposed adaptive support method. Furthermore, the user does not need to determine the value of minimum support at the beginning or repeat the experiment several times to determine the appropriate value of minimum support.

Data experiment
The special dataset for the association rule case used in this study was obtained from SPMF, a Java-based open-source software and data mining library [54]. It is a pattern mining program that is released under the GPL v3 license. SPMF is also linked to 224 data mining algorithms, including itemset mining, sequential pattern rule mining, associated rule mining, sequence prediction, periodic pattern mining, high utility pattern mining, time-series mining, clustering, and classification. Furthermore, it produces larger dataset used to evaluate and compare algorithm performance. Description of the dataset can be seen in Table 1.
The dataset characteristics used can be seen in Table 2.
The study environment used was a laptop with an Intel Core i7-8550U CPU @ 1.80 GHz 1.99GHz, 16 GB of Installed memory (RAM), and a 500GB SSD Hard drive. The tool used for the simulation is using the SPMF [54] software that has been described previously

Result
The proposed adaptive support method was carried out on 8 datasets according to Table 1. In this study, other criteria used to determine the minimum threshold were the same for all items, namely valued 1. Furthermore, the results from all 6 of the 8 datasets obtained the appropriate minimum support value and rules with lift ratio >1. The results of this adaptive support calculation process are shown in Table 3. The trial of the proposed adaptive support method used two basic algorithms in the association rule, namely Apriori and Fpgrowth [10]. The Apriori algorithm used a level-wise approach and generated candidate items for each level [26]. Meanwhile, the fpgrowth algorithm used the pattern growth method but did not generate candidates for each level. After the elimination of itemset with minimum support, the next step was to obtain a Frequent Pattern-Tree [55]. The trial was carried out from the highest minimum support value at 100% and then repeated while a reduction of this value occurred. All minimum support values are not tested on all datasets because the number of rules is different. The more rules that are generated, the more difficult it will be in the decisionmaking process. For example, in the connect dataset with a minimum support value of 70%, it produces a rule 392,469,141 using the fpgrowth algorithm and consumes a lot of resources (runtime and memory) so the test is stopped. Tables 4, 5, 6, 7 shows the test results of each dataset for each minimum support value using the a priori and fpgrowth algorithms. The blue color in the table is the minimum support value proposed by the adaptive support method.     Based on Tables 4-7, it was shown that for each different dataset, the same minimum support value led to a varied number of rules. This was because of the dataset characteristics consisting of the number of items, the average item in each transaction, and the different density values. Furthermore, the trials that have been carried out showed that the higher minimum support value led to a smaller number of rules and runtimes and vice versa. The minimum support value generated by the adaptive rule method when compared with the results of trials using the fpgrowth and apriori algorithms is shown in Fig. 2. Figure 2 shows that 6 out of 8 datasets or about 75% of the total experimental data obtained a minimum support value between the minimum and maximum values for each algorithm. This means that the value of the proposed adaptive support can generate a rule and if viewed from the quality, adaptive support produces a rule that has a lift ratio value > 1. In addition, with the proposed adaptive support value, users no longer need to guess the appropriate minimum support value that can generate rules. So that there is no longer a possibility that with the specified minimum support value, it does not get a rule, or the number of rules is 0 and must repeat the experiment with another minimum support value. From the experiments that have been carried out, it can be seen that this method is suitable for data sets that have a number of items less than 150 items and are not too dense. with the proposed algorithm can save time for execution because there is no need to guess the appropriate minimum support value. Figure 3 shows that the smaller the minimum support value, the more rules are generated. The red line in Fig. 3 shows the proposed minimum support value based on the adaptive support method. Figure 3 shows no significant difference between the apriori and fpgrowth algorithms related to the number of rules generated for each minimum support value. Therefore, this study aims to determine the value of the proposed minimum support using the adaptive support method (red line). The result showed that the proposed minimum support value produces a number of limited rules.

Conclusion and future work
Based on the dataset's characteristics, the minimal support value can be established. The number of transactions, items, and the average number of item frequencies in each transaction are all aspects of the dataset that can be utilized to compute the minimal support value (item density). Datasets with high-density levels require a larger minimum support value compared to those with low-density. Therefore, based on this process, an adaptive support model capable of determining the value of minimum support based on the characteristics of the dataset and certain criteria is proposed. From the experimental results obtained from 8 and 6 datasets, a rule with a lift ratio value > 1 based on the minimum support value is determined. Association rules provide many benefits in real life, one of which is the recommender system. In addition, Fig. 3 Test results of the number of rules for each minimum support value association rules can be implemented in classification, information enhancement, promo design, cross selling design, customer loyalty improvement and segmentation. These rules can also be implemented for decision support and recommendation systems.
The minimum support value is important and has a big influence on the rule formation process in the association rule. The errors obtained when determining this resulting rule were not desired. Furthermore, the determination of this value was difficult especially if the user is ignorant of the dataset characteristics as a factor which was used as a reference in determining the value of minimum support. This is because datasets with a high level of density required a larger minimum support value when compared to datasets with low density. This density level is the average value of the occurrence frequency of each item in the dataset. Therefore, an adaptive support model which determined the minimum support value based on the characteristics of the dataset was proposed. In addition, certain criteria leading to adaptive rules following the desires of the user were included.
From the study results on 8 datasets, 6 datasets obtained a rule based on the minimum support value suggested by the adaptive support model. In the future, the author will integrate the multicriteria system model [56] and this adaptive support method to obtain a completely adaptive rule model [22]. In addition, in future research, the author will conduct experiments using datasets that have utility and will use more diverse evaluation methods.