Association rule
The concept of association rule mining was first introduced by Agrawal [26, 27]. Subsequent work has branched into several topics, including high utility itemsets [23, 25, 28,29,30,31], Top-K [5, 11, 32], Skyline [6, 10, 33], multicriteria [34,35,36], and meta association rules [37,38,39]. The main concept is defined formally as follows. Let I = {i1, i2, . . ., im} be a set of items and D a set of transactions, where each transaction T is a set of items such that T ⊆ I. Given a set of items A, a transaction T is said to contain A if and only if A ⊆ T. An association rule has the form A → B, where A ⊆ I, B ⊆ I, and A ∩ B = Ø. The rule A → B has support s in the transaction set D if s% of the transactions in D contain A ∪ B [4, 22, 40]. Support therefore measures how frequently the items occur together: it is the number of transactions containing A ∪ B divided by the total number of transactions, as given by the formula below [37]:
$$Supp(A \to B) = \frac{{|\{ t \in D|A \cup B \subseteq t\} |}}{|D|}$$
(1)
Where Supp is the support value; A is the antecedent of the rule, in the form of an itemset; B is the consequent, in the form of an itemset; t is a transaction containing both A and B; D is the set of all transactions; and |D| is the total number of transactions.
Confidence is a second threshold, alongside support, used to decide which rules to keep. For the rule A → B, it is the number of transactions containing both A and B divided by the number of transactions containing A. The confidence value of the rule A → B is obtained by the formula [37]:
$$Conf(A \to B) = \frac{{|\{ t \in D|A \cup B \subseteq t\} |}}{{|\{ t \in D|A \subseteq t\} |}}$$
(2)
Where Conf is the confidence value; A is the antecedent of the rule, in the form of an itemset; B is the consequent, in the form of an itemset; t is a transaction in D; and D is the set of all transactions.
The general framework of association rule mining is to extract rules whose support exceeds the minimum support and whose confidence exceeds the minimum confidence, both specified by the user. A rule A → B is called frequent and confident (a strong rule) provided that Supp(A → B) ≥ minsupp and Conf(A → B) ≥ minconf.
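As an illustration of Eqs. (1) and (2), the following minimal Python sketch computes support and confidence over a small in-memory list of transactions and checks the strong-rule condition. The function names (`support`, `confidence`, `is_strong`) and the toy grocery transactions are assumptions made for this example and do not come from the cited works.

```python
def support(itemset, transactions):
    """Eq. (1): fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Eq. (2): transactions containing A and B divided by those containing A."""
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    covered = sum(1 for t in transactions if a <= t)
    if covered == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = sum(1 for t in transactions if ab <= t)
    return both / covered


def is_strong(antecedent, consequent, transactions, minsupp, minconf):
    """A rule is strong when it meets both user-specified thresholds."""
    ab = frozenset(antecedent) | frozenset(consequent)
    return (support(ab, transactions) >= minsupp
            and confidence(antecedent, consequent, transactions) >= minconf)


# Hypothetical toy data: four transactions over three items.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
```

On this toy data, bread → milk has support 0.5 and confidence 2/3, so it counts as a strong rule for minsupp = 0.5 and minconf = 0.6 but not for minconf = 0.7.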
The lift ratio can be used to evaluate a rule [35, 36, 41,42,43,44,45,46,47,48]. It is the ratio of the support of the rule to the product of the supports of its antecedent and consequent. The higher the lift ratio, the more interesting the rule. A rule is considered interesting and strong if its lift ratio is greater than 1, because this indicates a positive correlation between the premise and the conclusion of the rule [36, 48]. The lift ratio can be calculated by the formula [36, 41,42,43, 47, 48]:
$$Lift(A \to B) = \frac{Supp(A \cup B)}{Supp(A) \times Supp(B)} = \frac{Conf(A \to B)}{Supp(B)}$$
(3)
Where, Lift is Lift Ratio value; A is the antecedent of the rule in the form of itemset; B is consequent in the form of itemset; Supp is the support value; Conf is the confidence value.
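Eq. (3) can be sketched in a few lines of Python. The helper names and the toy transactions below are assumptions for illustration only; returning 0.0 when an itemset never occurs is a choice made for this sketch, not something the source specifies.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def lift(antecedent, consequent, transactions):
    """Eq. (3): Supp(A u B) / (Supp(A) * Supp(B))."""
    a, b = frozenset(antecedent), frozenset(consequent)
    sup_a = support(a, transactions)
    sup_b = support(b, transactions)
    if sup_a == 0 or sup_b == 0:
        return 0.0  # sketch-level convention for itemsets that never occur
    return support(a | b, transactions) / (sup_a * sup_b)


# Hypothetical toy data: four transactions over three items.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

lift_bread_butter = lift({"bread"}, {"butter"}, transactions)
lift_bread_milk = lift({"bread"}, {"milk"}, transactions)
```

Here bread → butter has a lift of 4/3, which indicates a positive correlation, while bread → milk has a lift of 8/9, which is below 1.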
Minimum support
The basic concept of the association rule requires the user to specify a minimum support value at the beginning. This value usually applies uniformly to all items, although in reality, different items may have different criteria for assessing them. Therefore, studies about multiple minimum support which state that the support value should vary for different items have emerged [9, 23, 49, 50]. However, the implementation of this system adds a task for the user, which is to determine the minimum support for each item.
Furthermore, the difficulty of determining the minimum support value by the user has led to a new field of study, namely association rules without the use of minimum support such as Top-K [5, 11, 19, 32, 51] and Skyline [6, 10, 33, 52]. The Top-k association rule does not require the minimum support value because in this method the user is only asked to determine the k value, which is the number of rules that will be generated in the rule formation process. Therefore, it is easier for users to determine the k value because they explicitly know the number of rule results they want to obtain.
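The idea of asking only for k can be illustrated with a brute-force Python sketch that scores every single-item rule and keeps the k best. This is not the algorithm of the cited Top-K works; the ranking used here (support first, confidence as a tie-breaker) and the name `top_k_rules` are assumptions made for the example.

```python
from heapq import nlargest
from itertools import permutations


def top_k_rules(transactions, k):
    """Return the k best rules a -> b as (support, confidence, a, b) tuples."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    scored = []
    for a, b in permutations(items, 2):
        count_a = sum(1 for t in transactions if a in t)
        count_ab = sum(1 for t in transactions if a in t and b in t)
        if count_ab == 0:
            continue  # the two items never co-occur, so there is no rule
        scored.append((count_ab / n, count_ab / count_a, a, b))
    # No minimum support is supplied: the user only chooses k.
    return nlargest(k, scored)


# Hypothetical toy data: four transactions over three items.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

best = top_k_rules(transactions, 2)
```

The sketch enumerates all item pairs, so it is only practical for very small item sets; it is meant to show the interface (k in, k rules out), not an efficient search.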
The Skyline algorithm was first proposed by Borzsony [52] and later developed by Goyal [33]. A skyline point is a point that is not dominated by any other point [52]. Jerry Chun-Wei Lin [10] and Jeng-Shyang Pan [6] combined this approach with association rule mining. Their work does not use minimum support but instead uses the maximum utility (utilmax) obtained at each iteration over the utility-list structure.
According to a previous evaluation [20], association rules require a minimum support value to reduce the number of items used in the rule creation process. This threshold also lowers the time and memory required for rule generation. Therefore, whatever it is called (minimum utility, maximum utility, or minimum support), a threshold is required for the association rule process. Without one, the process involves every item present, requires more time and memory, and may yield rules that are not the ones desired.
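The notion of a non-dominated point can be made concrete with a short Python sketch. It assumes each candidate is described by a tuple of scores where larger is better in every dimension, for example (frequency, utility); this pairing and the function names are assumptions for illustration, and the quadratic scan is not the utility-list procedure of the cited works.

```python
def dominates(p, q):
    """p dominates q if p is at least as good everywhere and better somewhere."""
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))


def skyline(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (frequency, utility) pairs for four candidates.
points = [(5, 10), (3, 40), (2, 30), (5, 8)]
result = skyline(points)
```

In this example (2, 30) is dominated by (3, 40) and (5, 8) is dominated by (5, 10), so the skyline consists of (5, 10) and (3, 40).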
Literature review on automating minimum support
Choosing a minimum support value is one of the most difficult aspects of applying association rules, because most methods presume that all database items are comparable and occur with the same frequency. This assumption is incorrect: some items appear more frequently in the database than others [53].
Furthermore, existing algorithms such as Apriori and FP-Growth cannot determine the minimum support and other threshold values themselves, so the user estimates these parameters intuitively. Depending on the threshold chosen, an association rule mining algorithm can generate a large number of rules, with long execution times and large memory consumption, or very few rules [9].
Because users find it difficult to set the minimum support, Apriori-based algorithms for mining frequent and interesting itemsets face a challenging problem: their performance depends heavily on user-defined thresholds. For example, if the minimum support value is too large, nothing is found in the database. A small minimum support, on the other hand, leads to poor mining performance and a large number of uninteresting association rules. Users are thus unreasonably required to know the details of the database to be mined and to specify a suitable threshold. Even when the minimum support was explored under the supervision of experienced miners, the results did not match users’ needs [15].
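The sensitivity to the threshold can be seen even on a toy dataset. The following Python sketch counts the frequent itemsets found at three minimum support values by brute-force enumeration; the data and the function name `count_frequent_itemsets` are assumptions for illustration, and the exhaustive search is exponential in the number of items, so it only suits tiny examples.

```python
from itertools import combinations


def count_frequent_itemsets(transactions, minsup):
    """Count all itemsets whose support is at least `minsup` (brute force)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    count = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            c = frozenset(candidate)
            if sum(1 for t in transactions if c <= t) / n >= minsup:
                count += 1
    return count


# Hypothetical toy data: four transactions over three items.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

counts = {m: count_frequent_itemsets(transactions, m) for m in (0.9, 0.5, 0.25)}
```

With a threshold of 0.9 nothing is found, with 0.5 five itemsets qualify, and with 0.25 every one of the seven possible itemsets is reported.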
Zhang [15] carried out a study whose main contribution was a strategy for converting fuzzy (user-defined) thresholds into an actual minimum support. Constructing such a conversion function requires a strategy capable of recognizing some aspects of the database to be mined. With existing Apriori algorithms, users must still define the actual minimum support appropriate to the database to be mined, and without proper knowledge of the database this is impossible. Zhang proposed a computational strategy to overcome this problem. It differs from existing Apriori algorithms because it allows users to state their mining requirements in a commonly used mode and automatically converts the specified threshold into an actual minimum support.
Trivedi [53] studied the Semi-Apriori algorithm by integrating an average support threshold. Items were then checked against the automatically generated support threshold to determine the frequent itemsets. This reduces both time complexity and space complexity.
Dahbi [9] designed a method for determining an appropriate minimum support threshold. The first contribution was that, instead of using a user-defined constant value, the study determined the minimum support (minsup) automatically for each data set. The second was to update this minsup dynamically rather than applying a single, uniform minimum support threshold at every level. Not all items in an itemset behave in the same way; some occur frequently while others occur infrequently. As a result, the minsup threshold must vary depending on the item level.
Kanimozhi [3] stated that a technique with a suitable automatic support threshold at each level is one of the right choices for overcoming the problems associated with minimum support. To this end, a technique based on a Confidence–Lift measure was proposed that uses automated support to generate the appropriate rules without losing the rules of interest. This approach determines the initial minsup value by analyzing the itemsets and their frequencies. It also proposes a cumulative support threshold at the next level using the items considered at the previous and current levels.
In most previous research, the minimum support value is determined by the user. This becomes a problem when the user does not know the characteristics of the dataset, because the rule formation process must then be repeated until an appropriate number of rules is obtained. This study proposes a method that determines the minimum threshold automatically from the characteristics of the dataset, so that the user does not need to set the minimum support at the beginning. In addition, a threshold based only on frequency is unfair to items that have other advantages, so the proposed method also involves other criteria that influence rule formation and can give a different weight to each item. Because the minimum support in the adaptive support method is not determined by the user, it can overcome the difficulties that exist in current association rule mining. The proposed threshold does not need to be recalculated at each level, so it uses memory and time more efficiently, and the method can be implemented in other association rule algorithms.
The proposed model
Determining the minimum support value in association rule mining requires a method that derives the value from the characteristics of the dataset and involves criteria chosen by the user. Therefore, this study developed a method for calculating the minimum support, similar to the one used previously [15], as shown in Fig. 1.
The adaptive support method takes two types of input, namely the transaction dataset and the criteria values. The utility of each item is calculated by multiplying its criterion value by its support, so the determination of frequent itemsets involves predetermined criteria as well as frequency. From the utility of each item, the average utility over the dataset is calculated, and the minimum threshold value is obtained by dividing this average by the number of transactions. The algorithm used to determine the minimum support from the characteristics of the database and the item criteria (utility) is shown in Algorithm 1.
In the adaptive support model, the value of minimum support is determined from the following factors:
1. Characteristics of the dataset
To choose a minimum support value, the user would normally need to know in advance the characteristics of the dataset to be processed, because several of these characteristics affect which value is suitable. In the adaptive support model, the factors used to determine the minimum support are the number of items contained in the transactions, the number of transactions, the average number of items per transaction, and the support value of each item.
2. Specific criteria (utility for each item)
The occurrence frequency of items alone is inadequate as a threshold for producing adaptive rules. Therefore, a criterion known as item utility is also used to assess whether an item qualifies as frequent and so takes part in the rule-making process. These criteria can be determined by the user, and each item has its own criterion value. For example, a user who wants rules for the most expensive items would use the item price as the criterion, while a user who wants rules with the biggest profit would use the profit from each item.
Based on both types of inputs, the calculation process is carried out according to algorithm 1 to obtain adaptive support. The calculation stages are as follows:
1. Calculation of the support value for each item in the dataset with the following formula:
$$Sup(d) = \frac{n(d)}{{|D|}}$$
(4)
2. Calculation of the utility for each item in the dataset with the following formula:
$$Util(d) = Sup(d) \times U(d)$$
(5)
3. Calculation of the average utility over all items in the dataset with the following formula:
$$avesup = \frac{\sum Util(d)}{|N|}$$
(6)
4. Calculation of the minimum threshold value used for the rule formation process, which is the average utility value divided by the total number of transactions, through the following formula:
$$minsup = \frac{avesup}{|D|}$$
(7)
5. Execution of the rule formation process with an existing method such as Apriori, FP-Growth, or another algorithm.
Where Sup(d) = support value of an item; n(d) = number of occurrences of an item; |D| = total number of transactions; |N| = total number of items; U(d) = utility value of an item; Util(d) = utility of an item weighted by its support; avesup = average utility of the items; minsup = minimum threshold value (item density level).
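Calculation stages 1 to 4 (Eqs. 4 to 7) can be sketched in Python as follows. This is a minimal illustration rather than the authors' Algorithm 1: the function name `adaptive_minsup`, the toy transactions, and the price values are hypothetical, n(d) is taken to be the number of transactions containing the item, and |N| is taken to be the number of distinct items.

```python
from collections import Counter


def adaptive_minsup(transactions, criteria):
    """Compute the adaptive minimum support from Eqs. (4) to (7).

    `criteria` maps every item to its utility value U(d), e.g. price or profit.
    """
    n_transactions = len(transactions)  # |D|
    # n(d): number of transactions in which each item occurs (assumption).
    counts = Counter(item for t in transactions for item in set(t))
    # Eq. (4): Sup(d) = n(d) / |D|
    sup = {d: counts[d] / n_transactions for d in counts}
    # Eq. (5): Util(d) = Sup(d) * U(d)
    util = {d: sup[d] * criteria[d] for d in sup}
    # Eq. (6): avesup = sum of Util(d) over items / |N|
    avesup = sum(util.values()) / len(util)
    # Eq. (7): minsup = avesup / |D|
    return avesup / n_transactions


# Hypothetical toy data: four transactions and an item price as the criterion.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
prices = {"bread": 2.0, "milk": 1.0, "butter": 4.0}

minsup = adaptive_minsup(transactions, prices)
```

For this toy input the item utilities are 1.5 (bread), 0.75 (milk), and 2.0 (butter), so avesup = 4.25 / 3 and minsup = avesup / 4, which is about 0.354.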
The determination of the minimum threshold value can be calculated automatically based on the characteristics of the dataset from the proposed adaptive support method. Furthermore, the user does not need to determine the value of minimum support at the beginning or repeat the experiment several times to determine the appropriate value of minimum support.
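Once a threshold is available, it can be handed to any frequent-itemset miner. The sketch below uses a simplified level-wise search in the spirit of Apriori as a stand-in; it is not an optimized implementation of Apriori or FP-Growth, and the function name, toy data, and the threshold value 0.35 are assumptions chosen for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Level-wise search: returns {itemset: support} for all frequent itemsets."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    result = {}
    while level:
        for itemset in level:
            result[itemset] = sup(itemset)
        size = len(level[0]) + 1
        previous = set(level)
        # Join step: merge frequent itemsets that differ by exactly one item.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        level = [
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, size - 1))
            and sup(c) >= minsup
        ]
    return result


# Hypothetical toy data and a threshold such as an adaptive method might yield.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

found = frequent_itemsets(transactions, 0.35)
```

With this threshold the three single items plus {bread, milk} and {bread, butter} are frequent, while {butter, milk} (support 0.25) is discarded, which also prunes the three-item candidate.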