Improving efficiency for discovering business processes containing invisible tasks in non-free choice

Process discovery helps companies automatically discover their existing business processes based on the vast, stored event log. The process discovery algorithms have been developed rapidly to discover several types of relations, i.e., choice relations, non-free choice relations with invisible tasks. Invisible tasks in non-free choice, introduced by α$\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha ^{\$ }$$\end{document} method, is a type of relationship that combines the non-free choice and the invisible task. α$\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha ^{\$ }$$\end{document} proposed rules of ordering relations of two activities for determining invisible tasks in non-free choice. The event log records sequences of activities, so the rules of α$\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha ^{\$ }$$\end{document} check the combination of invisible task within non-free choice. The checking processes are time-consuming and result in high computing times of α$\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha ^{\$ }$$\end{document}. This research proposes Graph-based Invisible Task (GIT) method to discover efficiently invisible tasks in non-free choice. GIT method develops sequences of business activities as graphs and determines rules to discover invisible tasks in non-free choice based on relationships of the graphs. The analysis of the graph relationships by rules of GIT is more efficient than the iterative process of checking combined activities by α$\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha ^{\$ }$$\end{document}. This research measures the time efficiency of storing the event log and discovering a process model to evaluate GIT algorithm. Graph database gains highest storing computing time of batch event logs; however, this database obtains low storing computing time of streaming event logs. Furthermore, based on an event log with 99 traces, GIT algorithm discovers a process model 42 times faster than α++ and 43 times faster than α$. GIT algorithm can also handle 981 traces, while α++ and α$ has maximum traces at 99 traces. Discovering a process model by GIT algorithm has less time complexity than that by α$\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha ^{\$ }$$\end{document}, wherein GIT obtains O(n3)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(n^{3} )$$\end{document} and α$\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\alpha ^{\$ }$$\end{document} obtains O(n4)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(n^{4} )$$\end{document}. Those results of the evaluation show a significant improvement of GIT method in term of time efficiency.


Introduction
The rapid development of information systems leads to a growing amount of information [1] that obtains massive stored data [2,3].In the presence of various types of data, events that are happened in the systems are stored in so-called event log [4].Monitoring business processes from a massive event log brings a challenge related to Big Data.Process mining is a discipline of gathering the event log and processing the log into a process model for monitoring, including capturing anomalies [5,6] or bottlenecks [7].The technique of constructing a process model by process mining is called process discovery.
There are several types of relationships which are formed by process discovery algorithms.The sophisticated relationship is invisible tasks in non-free choice.This condition occurs when choice activities which involve invisible tasks are not a choice-free relation (their execution depends on previous activities).This research presents two business processes to give usage overview of invisible tasks in non-free choice.
A port implements two processes, i.e., handling imported goods and handling exported goods.In handling imported goods, several cranes lift containers on to the truck based on the types of containers.Cranes can put dry containers in the truck directly.On the other hand, there are previous tasks of moving reefer and uncontainer.For cranes to carry out appropriate activities, the process model must have makers for mapping container determination and tasks to put on the truck.Those markers are called non-free choice.Each non-free choice relationship connects an activity with another activity; however, dry containers did not have a specific task before lifting on the truck.Therefore, an invisible task is added to raise a non-free choice relationship for each type of containers.This condition is called invisible tasks in non-free choice (denoted as a black circle that relates to a non-free choice relationship in Fig. 1).
Another example is shown in Fig. 2.An invisible task should be added before finish transaction activity, so a non-free choice that connects payment with e-facture to print tax invoice and a non-free choice that connects direct payment to finish transaction are depicted in a process model of purchasing medical equipment.A black circle with a non-free choice relationship is an implementation of invisible tasks in non-free choice.
The process discovery algorithms have been developed rapidly to discover several types of relations [8][9][10][11][12][13]; however, seldom algorithms concern with invisible tasks in non-free choice.α $ [14] is the development of α algorithm that discovers a new issue, i.e., invisible tasks in non-free choice.α $ stores the dependencies of activities in the form of pairs of activities and collects traces of activities based on the event log and combines α ++ rules and α # rules to construct invisible task in non-free choice.α $ has several rules for ordering relations.The rules check all pairs of activities because there is no information of relations of those activities.The checking process produces high computing time.

Fig. 1 A process model of port container handling
Graph-based process discovery algorithm was developed to minimize checking all pairs of activities for discovering a process model.This algorithm utilizes capabilities of a graph database to store activities and their relationships.Instead of observing pairs of activities, the graph-based process discovery algorithm observes their relationships directly.This method is effective because a relationship can be detected from other relationships.For example, the graph-based algorithm discovers parallel relationships based on occurrences of sequence relationships [15].An invisible task is detected based on parallel relationships.Storing relationships can accelerate the computing time for discovery.
Existing graph-based process discovery algorithms have several drawbacks.First, they did not concern to discover invisible tasks in non-free choice.Secondly, those algorithms do not directly store event logs of the system in a graph database.Those algorithms need to convert the event log to graph database.
According to the difficulty of existing graph-based process discovery algorithms, the main contribution of this research is to propose GIT, the expansion of Graph-based Invisible Task, for discovering invisible tasks in non-free choice by extending rules of existing graph-based algorithms and integrating an Enterprise Resource Planning (ERP) system with a graph database to store the event logs directly in the graph database.
GIT algorithm is evaluated based on the quality of obtained process models and the scalability of algorithm.GIT algorithm is compared with α $ and α ++ algorithms based on fit- ness, precision, generalization, and simplicity [16] to measure the quality of the process models.Purchasing medical equipment processes are used to evaluate the quality of the model.The time complexity and computing time are used to measure the scalability.There are several steps to measuring the computing time.First, a graph database that is used by GIT is compared with other databases, i.e., SQL and MongoDB, to stores streaming event logs and batch event logs.Secondly, the computing time of GIT algorithm is compared with the time of α $ and α ++ based on several event logs with different number of activities.The aim of using several event logs to gain the capacity of processes which can be handled by those algorithms.

Existing methods of process discovery
Nowadays, many methods of process discovery are developed.There are two well-know methods, i.e. α [17] and Heuristic Miner [8].
α algorithm forms all unique sequences of activities into a process model.This algo- rithm is expanded to other algorithms to handle several issues in discovering relationships of activities.α + [18] add rules of α for handling looping activities.Furthermore, α ++ [14] and α # [19] expands rules of α + with different purposes.α ++ focuses to depict non-free choice; however, α # appends additional activities, called invisible prime tasks, in a process model to discover particular situations, i.e. skip, redo and switch.Thereafter, α $ [4] introduces invisible tasks in non-free choice and combines α ++ with α # to over- come that recent issue.α and its expended algorithms do not consider noises, so those algorithms are suitable for noise-free event logs.Heuristic Miner [8] is process discovery algorithm that modifies α by adding a thresh- old to filter activities which will be formed in a process model.There are two aims of adding a threshold.First, the action can simplify an obtained process model based on logs containing huge activities and relationships.Activities which have appearance times less than the threshold will be abandoned, so the obtained process model only discover frequently executed activities.Secondly, the disposal of activities with occurrences less than the threshold can eliminates noises because noises raise rarely.Fodina [11] improves Heuristic Miner to create robust heuristic process discovery algorithm.In addition, Fodina also develops rules to handle invisible prime tasks.
α algorithm, as a pioneer process discovery algorithm, inspires the formation of new methods.HMM-Parallel Tasks [20] and CHMM-Invisible Tasks [21] modifies rules of α in the form of Hidden Markov Models.There are also graph-based α algorithm, such as Graph-based Parallel [15] and Graph-based Invisible Task [22].Other algorithms are developed, such as Inductive Miner [9] and RPST [10].Table 1 describes the existing algorithms which can discover types of relations depicted by the columns.The tick signs explain that the algorithms can discover the related relations depicted by the columns.
According to Table 1, existing graph-based process discovery algorithms had not concern in invisible task in non-free choice.GIT algorithm as proposed algorithm of this research handles this type of relation.

Event log
An event log consists of several cases, which identify the executed processes.Each case has several attributes, such as identification of case (CaseId), name of activities (Activity), the execution time of the activities (Timestamp), and actors who carried out activities (Resource) [23].

Control-flow patterns
The process model consists of two types of relations, namely sequence relations and parallel relations.Sequence relations are used to describe processes that run straight in sequence or sequentially, whereas the parallel relations are used to describe a branched process.There are three types of control-flow patterns [24] to represent parallel relations, namely AND, OR, and XOR [20,25].Table 2 shows examples of relations and differences between control-flow patterns in graph models.

Methodology
This research proposes an event log storage algorithm into a graph-database for direct process discovery without exporting event logs from the ERP database and transforming .csvfile into .mxmlor .xes.The storage algorithm will make the discovery process more efficient than previous methods.Figure 3 shows a comparison of the previous method with the proposed method.The first one shows the steps of process discovery proposed by Aalst [17,26].The event log needs to be extracted from the database in the form of .csv.Then, the .csvfile needs to be converted into .mxmlor .xesby using Disco Tool..Mxml file is supported by ProM 5, and .xesfile is supported by ProM 6.Then, the .mxmlor .xesfile is imported to ProM Tool to conduct process discovery.The second one shows process discovery steps using graph-database Neo4J proposed by Darmawan et al. [27].The event log needs to be extracted from the database in the form of .csv.Then, the .csvfile is imported to Neo4J to conduct process discovery.The last one shows the steps for discovering the process model of the proposed method.By integrating Neo4J and ERP using Laravel, the event log stored in the database ERP is directly processed to execute process discovery.

The integration stage
The integration stage is connecting the ERP system with Neo4J, a graph database platform, in storing an event log of the system directly in the form of a graph database.The used ERP system in this research is built by Laravel.This research uses ClientBuilderfrom the GraphAwareLibrary to start the integration and ClientBuilder to make a connection between the system and the graph database.

Process discovery a. The algorithm of logging stage
The logging stage in a graph-database starts when the user logs into the ERP system until the user logs out.The recording is done using the algorithm in Table 3. First, the algorithm checks the log that has been saved in Neo4J.If the last log is found in the graph-database, the login activity is entered in the last case + 1.However, if the last log is not found, then the login activity is recorded as the first case.The Neo4J con- Fig. 3 The comparison process discovery methods.a The pioneer of process discovery method by Aalst et al. [17,26], b the process discovery using graph-database [27], and c GIT process discovery model Table 3 The algorithm of logging stage nection is made using the GraphAwarelibrary by connecting ClientBuilder from the GraphAware.At that moment, the current case will be included in the session to be used in later stages.b.Constructing sequence relation and invisible task Table 4 is used to discover sequence relations of the event log.GIT stores all variations of activities as nodes in the graph database.Then, GIT creates the sequence relations of those nodes based on activities of the event log (called an event).A sequence relation is discovered between an event and its next event with the same case id.Cypher queries in Table 5 are a part of the GIT method.Based on Table 5, the steps of detecting skip, switch and redo invisible tasks are same.GIT algorithm chooses relations of activities based on its rules and inserts invisible tasks in the middle of those relations.The algorithm is implemented before control-flow pattern discovery.The next step is the discovery of the control-flow patterns, i.e., AND, OR, and XOR, as can be seen in Table 6.Control-flow patterns, that is described in Table 6, are used to determine types of parallel relations.Parallel relations can be divided into three types, which are XOR, AND, and OR.XOR relations only allows one activity to be executed in a process.AND relations allow all activities to be executed.Lastly, OR relations are relations between XOR and AND, and it only allows parallel activities that are not categorized into XOR and AND relations.Each relation is divided into two forms, which are split and join.A split happens when there are multiple activities to be chosen.A join happens when multiple activities that can be chosen in split relations are closing into one or more activities.The last step is the discovery of invisible tasks in non-free choice.Non-free-choice relation connects an activity in one choice relation with another activity in the next choice relation.This relation shows that the next activity of choice relation can-Table 4 The algorithm of GIT for discovering sequence relations where: event : activity in the event log nodeAsact : node of graph database as activity count_event-1: the number of events minus 1 in the event log count_variants_activities: the number of variaties of activities that are arranged into a process model in the graph database event[i] CaseID : CaseID of an i-order event nodeAsact j Name : name of activity in the graph database not be freely chosen and is influenced by the activity of the previous choice relation.Figure 2 is an example of non-free-choice relation and discovery by using GIT.
As depicted in Fig. 2, a node Print Tax Invoice and an additional node, i.e., Invisible Task, cannot be freely chosen even though the relation between Make Payment to Invisible Task and Make Payment to Print Tax Invoice are XORSPLIT; instead, the choice of the node Print Tax Invoice can only be taken if the previous activity taken is a node Payment with E-Facture and choice of the Invisible Task can only be taken if the previous activity taken is a node Direct Payment.Non-free-choice relation is represented by NONFREECHOICE relation.Using the algorithm in Table 7, GIT algorithm finds some nodes with outgoing relation of XORJOIN, some nodes with ingoing relation of XORSPLIT, detect invisible task, and lastly match them against some nodes inside activity label which has the same name.

Results
This research evaluates GIT algorithm based on two aspects, which are the quality of obtained model and the scalability.The quality is measured based on fitness [28], precision [29], generalization, simplicity.The fitness value is obtained by counting the number Table 5 The algorithm of GIT for invisible prime task discovery where: x: node of graph database as invisible task  of cases described in the model divided by the total cases in the event log.The precision value is obtained by counting the number of traces discovered in the model divided by the number of traces present in the event log.Generalization and simplicity are measured based on equations in [16].
The scalability is measured according to computing time and maximum number of traces algorithm could handle.This research divides computing time as computing time of storing event logs and computing time of discovering process models.A graph database as the chosen database of GIT is compared with SQL and MongoDB to measure computing time of storing logs.The discovery computing time of GIT is compared with those times of other algorithms: α $ and α ++ .
There are four event logs which are used in the evaluation.The first event log is generated from a medical store that already implemented Enterprise Resource Planning (ERP) system.The reason of choosing this event log is the processes are executed in the ERP system which is integrated with graph-based platform, i.e., Neo4j by using integration methods (explained in "The integration stage" section).Thereupon, the event log is noise-free.GIT algorithm refers to rules of α algorithm, so this algorithm is suitable with noise-free logs.The other event logs are obtained from BPI Challenge [30][31][32].Those event logs are chosen to measure scalability because they consist of high number of traces and cases.
The result of GIT algorithm to discover processes in the first event log is shown in Fig. 4. The full name of activities presented in Fig. 4 is shown in Table 8.In Fig. 4, Table 7 The algorithm of GIT for discovering invisible task in non-free choice the non-free choice relations occur between activity Direct Payment and an invisible task and between activity Payment with E-facture and Print Tax Invoice.Thus, this research is verified to be able to detect invisible task in non-free choice.
Figure 5 shows the comparison of results from the previous methods with the method proposed in this study.All fitness and simplicity values of all algorithms are 1.000.α ++ obtains lowest precision value, at 0.8644.On the other hand, α ++ gains highest generalization value, at 0.9682.Both of α $ and GIT has same values for preci- sion and generalization, accounting for 1.000 and 0.9681, respectively.
In addition to comparing the quality of obtained results, the time complexity of the algorithms is measured.The time complexity of α $ is O(n 4 ) [33], while the time com- plexity of GIT is O(n 3 ), with the explanation steps as listed in Table 9.
Tables 10 and 11 show the result of storing computing time and discover computing time based on six event logs.Those event logs are three event logs from BPI Challenge, processes of a medical store, and chosen processes of the medical store.This research calculates storing computing time by two types of event logs, i.e., batch event logs and streaming event logs.MySQL obtains lower computing times in batch event logs with average 1.092 s and streaming event logs with 0.0068 s.On the other hand, graph database has highest storing computing time in batch event logs, at 6071.5 s and has low storing computing time, at 0.0730 s.Contrary to graph database, Based on Table 11, GIT can handle more numbers of event logs than α $ and α ++ .The highest number traces which can be discovered by GIT algorithm is 981.On the other hand, α $ and α ++ can discovery processes with maximum number 99 traces.

Quality of obtained process model
The quality result of obtained process models by GIT, α ++ and α $ is presented in Fig. 5. α ++ has the lowest precision because this algorithm cannot describe non-free choice relationships, so several traces generated by its process model are not stored in the event log.Those traces decrease the precision value of α ++ .GIT and α $ has high precision because those algorithms can depict invisible tasks in non-free choice.Generalization values of GIT and α $ are smaller than α ++ because invisible tasks.

Steps Complexity
Storing and creating sequence relations of activities (Table 4) O n 2 Creating invisible prime tasks (Table 5) Creating parallel relations, i.e., XOR, OR, and AND (Table 6) O n 3 Creating parallel relations, i.e., XOR, OR, and AND (Table 7) Table 10 Result of storing computing time where: UC1: an event log of processes starting user login activity to open selling activity in medical store UC2: an event log of processes starting user login activity to open purchasing activity in medical store UC3: an event log of domestic declarations [32] UC4: an event log of whole processes in medical store UC5: an event log of hospital log [31] UC6: BPI challenge 2012 [30] Type of event log

Used cases Total case Total traces Total events
Computing time of storing (s)

Scalability
This research measures the scalability based on storing processes and discovery processes.The results of Table 10 shows graph database can handle huge event logs; however, the consuming time is fantastic.In batch event logs, the storing computing time by graph database is 6000 times of MySQL and 59 times of MongoDB.The long storing computing time by graph database will resulting in a spike in overall GIT runtime.Optimizing storing computing time by graph database is an important thing that must be studied further in the future.Table 11 shows the discovery computing time and also maximum traces algorithm could discover.GIT algorithm has highest number of traces that can be handled among other algorithms; however, the limit traces of GIT is only 981 traces.The main reason of GIT inability to discover more than 900 traces because GIT determine sequence relationships based on all cases, not all traces.In the future work, clustering cases to obtain traces is needed before discovery processes are executed.

Conclusion
This research proposes an algorithm called GIT algorithm, which utilizes a graph database to model invisible tasks in non-free choice efficiently.The capability of storing relations in a graph database can simplify the rules of process discovery because several relations of a process model can be constructed based on other discovered relations.GIT algorithm improves Graph-based Invisible Task by combining non-free choice rules with invisible task rules from Graph-based Invisible Task.
This research compares GIT algorithm to α $ and α ++ based on the qualities of their models.The chosen quality measurements are fitness, precision, generalization and simplicity.The experiments show that GIT and α $ has highest values of fitness and preci- sion, which are 1.This research calculates computing time and time complexity of GIT and α $ and computing time of graph database, MySQL, and MongoDB.Graph database gains highest storing computing time of batch event logs; however, this database obtains low storing computing time of streaming event logs.The complexity of α $ is O(n 4 ) ; meanwhile, the complexity of GIT is O(n 3 ) .Based on 99 traces, GIT algorithm discov- ers a process model 42 times faster than α ++ and 43 times faster than α $ .GIT algorithm can also handle 981 traces, while α ++ and α $ has maximum traces at 99 traces.It can be concluded that the process discovery of GIT algorithm is more efficient than that of α $ algorithm, and the graph database is suitable on the streaming event log.The future work of this research is optimizing storing processes in the graph database and clustering processes as the input of discovery processes.Those subjects should be analyzed further to increase the scalability of GIT algorithm.

[
:SPLIT]: a split relation activities (XORSplit, ORSplit or ANDSplit) [:JOIN]: a join relation of activities (XORJoin, ORJoin or ANDJoin) [:INVISIBLE]: a temporary relation to determine the need for invisible tasks size(): the number of occurrences of the specified relation in the graph database of the graph database as activity a, b: other node in the model path: all path from b to a and connected by n on case[j]

Fig. 4
Fig.4 The process model by GIT algorithm

Fig. 5
Fig. 5 Result comparison between the previous methods and the proposed method

Table 1
Algorithms of process discovery

Table 2
Control-flow patterns in process model

Table 6
The algorithm of GIT for discovering control-flow patterns a, b: other node in the model nodesAfter: all nodes after node n nodesBefore: all nodes before node n

Table 8
List of activity names initialized in Fig.4

Table 11
Result of discovering computing time