In this section, we present and analyze our parallel version of the EA using the MapReduce framework to extract a set of short distinguishing sequences from a complete observable nondeterministic FSM. Our method outperforms previous parallel approaches in that it efficiently provides a short distinguishing sequence for each pair of states of a large FSM.
Framework overview
The proposed solution consists of two MapReduce steps, namely the intersection step and the derivation of short distinguishing sequences step (the derivation step, for short).
Figure 3 illustrates the workflow of our solution, which receives a large FSM as input and produces a set of short distinguishing sequences for each pair of states using two MapReduce algorithms. Initially, an input FSM \(M=(S,I,O,E)\) is preprocessed to produce a text file where every line (value) represents a transition \(t =(s[t], I[t], O[t], d[t])\). This file is the input of the intersection step. The MapReduce framework is composed essentially of a map function, which performs filtering and sorting, and a reduce function, which performs a summary operation. The map function produces a set of \(\langle key,value\rangle \) pairs; in our case, for a transition t, it generates a set of associated keys. Following that, the pairs produced by the map function are grouped and sent to the reduce function. The latter receives the transitions having the same key and computes their intersection, i.e., for two transitions \(t_i\) and \(t_j\) having the same input/output, the result is an edge in the so-called truncated successor tree \((\{(s[t_i],s[t_j])\}, I[t_i], \{(d[t_i],d[t_j])\})\); otherwise, it is \((\{(s[t_i],s[t_j])\}, I[t_i], \{ \})\). In the second step, we developed an iterative MapReduce algorithm to derive a short distinguishing sequence if one exists. The input of this step is the output of the previous intersection step. The map function of this step takes an edge of the truncated successor tree \((n_s, l, n_d)\), where \(n_s\) denotes the source node, l the label, and \(n_d\) the destination node, and produces a set of associated keys. For a given edge, if its destination node is empty, the associated key is its source node; otherwise, each pair of states in its destination node becomes a new associated key. Next, the reduce function receives the set of edges having the same key and divides it into two subsets: the first contains all edges having an empty destination node, and the other contains the remaining edges.
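To make the intersection concrete, the following Python sketch mimics what a reducer could do with the transitions it receives. The function name, the tuple encoding of a transition, and the aggregation of destination pairs per input symbol are our illustrative assumptions, not the paper's exact Algorithm 2.

```python
from collections import defaultdict
from itertools import combinations

# A transition is encoded as a tuple (s, i, o, d): source state,
# input symbol, output symbol, destination state.

def intersect_reduce(transitions):
    """Build truncated-successor-tree edges from transitions sharing a key.

    For each pair of distinct source states and each input symbol, the
    destination node collects the pairs (d1, d2) of transitions agreeing
    on the output; if no outputs agree, the node is the empty set.
    """
    dests = defaultdict(set)
    seen = set()
    for t1, t2 in combinations(transitions, 2):
        if t1[0] == t2[0] or t1[1] != t2[1]:
            continue  # need distinct sources and a common input symbol
        pair = tuple(sorted((t1[0], t2[0])))
        seen.add((pair, t1[1]))
        if t1[2] == t2[2]:  # same output: the pair is not yet separated
            dests[(pair, t1[1])].add(tuple(sorted((t1[3], t2[3]))))
    return [(pair, i, frozenset(dests[(pair, i)])) for pair, i in sorted(seen)]
```

On a two-state fragment where input a produces different outputs in \(s_0\) and \(s_1\), the reducer emits the separating edge \(((s_0,s_1),a,\{\})\), while an input with identical outputs yields a non-empty destination node.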
Then, the cartesian product of these two subsets is performed, i.e., for two edges e and \(e'\), if the source node of \(e'\) is a subset of the set of pairs in the destination node of e, then the resulting edge is \((n_s (e), l(e)l(e'), n_d(e)\setminus n_s(e'))\). This process is equivalent to a BFS in the truncated successor tree of the EA, where we concatenate edge labels to construct DSs if they exist. Figure 4 shows the derivation step applied to the successor tree presented in Fig. 2. The first round of the derivation step processes the successor tree received from the intersection step and maps its edges to different reducers following the previously described mapping schema. For example, from Fig. 4, the pair \((s_2,s_3)\) in the node \(n_3\) maps all edges having the forms \(((s_2,s_3),*,\bot )\), \(((s_2,s_3),*,\{\})\) or \((*,*,(s_2,s_3))\) to a given reducer. Then, the cartesian product between edges having the form \((*,*,(s_2,s_3))\) and edges having the form \(((s_2,s_3),*,\{\})\) or \(((s_2,s_3),*,\bot )\) is computed inside this reducer to produce compact edges having the form \((*,**,\{\})\) or \((*,**,\bot )\). Thus, the reducer mapped by \((s_2,s_3)\) produces the edge \(((s_0,s_1),ba,\{\})\) from the two edges \(((s_0,s_1),b,(s_2,s_3))\) and \(((s_2,s_3),a,\{\})\). This edge will then be mapped, in the second round, to the reducer associated with the key \((s_0,s_1)\). The same process is repeated iteratively in the subsequent MapReduce rounds of the derivation step until the stop condition is true.
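The cartesian-product step just described can be sketched as follows. The edge encoding (nodes as frozensets of state pairs, an empty frozenset marking a resolved destination) and the function name are illustrative assumptions, not the paper's Algorithm 3.

```python
# An edge is (n_s, label, n_d), where n_s and n_d are frozensets of
# state pairs; frozenset() marks an empty (resolved) destination node.

def derive_reduce(edges):
    """Compose edges inside one reducer of the derivation step.

    Splits the received edges into those with an empty destination node
    and the rest, then forms the cartesian product: for edges e, e2 with
    n_s(e2) a subset of n_d(e), emit
    (n_s(e), label(e) + label(e2), n_d(e) - n_s(e2)).
    """
    resolved = [e for e in edges if not e[2]]   # empty destination node
    pending = [e for e in edges if e[2]]        # everything else
    out = []
    for e in pending:
        for e2 in resolved:
            if e2[0] <= e[2]:                   # n_s(e2) is a subset of n_d(e)
                out.append((e[0], e[1] + e2[1], e[2] - e2[0]))
    return out
```

Running it on the two edges from the example in Fig. 4, \(((s_0,s_1),b,(s_2,s_3))\) and \(((s_2,s_3),a,\{\})\), reproduces the compacted edge \(((s_0,s_1),ba,\{\})\).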
Finally, if no DS has been found yet and the number of iterations is less than the maximum bound, we repeat the derivation step by considering the output as the input of the next iteration; otherwise, the final output is the set of pairs and their short DSs if they exist.
In the next section, we analyze the communication cost of this problem in a MapReduce framework using the model of Afrati et al. [58].
Communication cost analysis
The communication cost model introduced by Afrati et al. [58] provides a principled way to analyze problems and optimize performance in a distributed computing environment by explicitly studying the inherent tradeoff between communication cost and degree of parallelism. By applying this model in a MapReduce framework, we can determine the best algorithm for a problem by analyzing the tradeoff between reducer size and communication cost in a single round of MapReduce computation. Two parameters capture this tradeoff: the first is the reducer size, denoted by q, which is the size of the largest list of values associated with a key that a reducer can receive. The second is the amount of communication between the map step and the reduce step, measured by the replication rate, denoted by r, and defined as the average number of key-value pairs that the mappers create from each input.
Formally, suppose that we have p reducers and \(q_i \le q\) inputs are assigned to the \(i^{th}\) reducer. Let In be the total number of different inputs, then the replication rate is given by the expression \(r= \sum \nolimits _{i=1}^{p}q_i/In\) [58].
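As a toy numeric check of this expression (an illustration of the definition from [58], not part of the paper's algorithms):

```python
# Replication rate r = (sum_i q_i) / In, where p reducers receive
# q_i inputs each out of In distinct inputs overall.

def replication_rate(reducer_loads, total_inputs):
    """Average number of key-value pairs the mappers create per input."""
    return sum(reducer_loads) / total_inputs

# Three reducers holding 4 inputs each, drawn from 6 distinct inputs,
# means every input was sent to two reducers on average:
print(replication_rate([4, 4, 4], 6))  # 2.0
```

A replication rate of 1 (every input sent to exactly one reducer) is the communication-optimal case, which the input-alphabet mapping below achieves.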
From [58], we compute a lower bound on the replication rate for the intersection of FSMs as a function of q using the following expression:
$$\begin{aligned} r\ge \frac{q \times Out}{g(q) \times In} \end{aligned}$$
where In denotes the input size, Out denotes the output size, and g(q) is the number of outputs that can be produced by a reducer of size q. Since we consider a complete observable nondeterministic FSM, we have \(In= |E|\) and \(Out=\frac{n(n-1)}{2}\times |I|\). Thus
Proposition 1
The lower bound on the replication rate is
$$\begin{aligned} r\ge \frac{n(n-1)}{2}\times \frac{q\times |I|}{g(q)\times |E|} \end{aligned}$$
It is worth noting that limiting the reducer size enables more parallelism. A small reducer size forces us to redefine the notion of a key so as to allow more, smaller reducers, and thus more parallelism using the available nodes.
MapReduce algorithm for the intersection step
Let us present the MapReduce implementation of the intersection step using a modified version of the algorithms proposed in [49]. Notice that our approach produces a truncated successor tree, also called a successor table, for all pairs of states of a complete observable nondeterministic FSM. The experiments conducted in [62] show that, when deriving distinguishing sequences, the construction of the successor tree takes 96% of the EA's total time. That is why three methods for the construction of the truncated successor tree are presented later in this section.
Algorithm 2 below contains the definitions of the map and reduce functions of the intersection step. The map function produces a set of keys from the input FSM transitions according to a defined schema. The reduce function performs, inside the reducers, the intersection of the transitions received from the mapper tasks.
The three proposed mapping schemes emit a transition to a set of reducers w.r.t. a key defined from some hash functions. Our mapping methods are based, respectively, on states, on input alphabet symbols, and on both states and input alphabet symbols.
Formally, let \(M=(S,I,O,E)\) be a complete observable nondeterministic FSM having n states and \(t=(s,i, o,d)\) be a transition in E. A mapper produces a set of keys from the transition t based on some hash function h. This hash function is integrated into the definition of these keys as part of the subfunction getKeysFromTransition() in Line 3 of Algorithm 2. Let us explain the three mapping methods in more detail by designing the subfunction getKeysFromTransition().
Mapping based on states
In the first mapping method, from a transition \(t\in E\), the mappers produce a set of key-value pairs having the form \(\langle key, t\rangle \), where \(key = \langle h_S(s[t]), s\rangle \), for all \(s\in S\) such that \(s[t] \ne s\), and \(h_S\) is a hash function from S to \(\{1,\cdots , n\}\). In this case, we have \(\frac{n(n-1)}{2}\) reducers.
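A minimal sketch of getKeysFromTransition() for this method, under our illustrative encoding; we model \(h_S\) simply by the lexicographic order of the two states rather than an explicit hash:

```python
# Illustrative state-based key generation: a transition t = (s, i, o, d)
# is replicated to the n-1 reducers indexed by the unordered pairs
# {s[t], s'} for every other state s'.

def keys_state_based(t, states):
    return [tuple(sorted((t[0], s))) for s in states if s != t[0]]

keys_state_based(("s0", "a", "0", "s1"), ["s0", "s1", "s2", "s3"])
# -> [("s0", "s1"), ("s0", "s2"), ("s0", "s3")]
```

Each transition is thus emitted n-1 times, which matches the replication bound of Proposition 2.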
In this method, the function g(q), i.e., the number of outputs that can be produced by a reducer of size q, can be affected by the presence of transitions with different alphabet symbols inside the same reducer. Formally, since we consider a complete observable nondeterministic FSM, one has \(q \le 2 \times (|I|+|I| \times |O|)\) and \(g(q)=|I|\). Thus, the following proposition gives the upper bound on the replication rate for this method.
Proposition 2
The replication rate r in the state-based mapping scheme is \(r\le (n-1)\).
Mapping based on input alphabets
In the second method, we have one reducer for each input alphabet symbol. Thus, the number of reducers is equal to the input alphabet size |I|. The mappers send each transition t to the reducer corresponding to its input symbol I[t]. More precisely, from a transition \(t\in E\), the mappers produce a key-value pair of the form \(\langle key, t\rangle \), where \(key = h_{In}(I[t])\) and \(h_{In}\) is a hash function from I to \(\{1,\cdots , |I|\}\). We now have \(g(q) = \frac{n(n-1)}{2}\), where \(q \le n + n \times |O|\). Assuming that the alphabet symbols are uniformly distributed, we have
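Under the same illustrative encoding, the input-based variant of getKeysFromTransition() is a one-liner: the transition's own input symbol is its only key, so no replication occurs.

```python
# Illustrative input-based key generation: each transition t = (s, i, o, d)
# goes to exactly one reducer, the one indexed by its input symbol.

def keys_input_based(t):
    return [t[1]]

keys_input_based(("s0", "a", "0", "s1"))  # -> ["a"]
```

The price of this optimal replication rate is large reducers: every transition on the same symbol lands in one reducer, limiting parallelism to |I| tasks.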
Proposition 3
The replication rate in the input-alphabet-based mapping scheme is optimal and equal to 1.
Mapping based on both states and input alphabets
In the last method, we propose a hybrid mapping between the first and second methods. In other words, keys are based on the states and the input alphabet symbols at the same time. Then, we consider keys of the form \(key=(s[t],s,I[t])\), where \(s\in S\) such that \(s\ne s[t]\). The number of reducers in this case is equal to \(\frac{n(n-1)}{2} \times |I|\), the reducer size is \(q \le 2 \times |O|\), and each reducer produces no more than one edge of the truncated successor tree. Thus, we can deduce an upper bound on the replication rate in the following proposition.
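The hybrid variant of getKeysFromTransition() can be sketched in the same illustrative encoding, combining the two previous key forms:

```python
# Illustrative hybrid key generation: keys pair an unordered state pair
# with the input symbol, giving n-1 keys per transition but much smaller
# reducers, since each one handles a single (pair, input) combination.

def keys_hybrid(t, states):
    return [(tuple(sorted((t[0], s))), t[1]) for s in states if s != t[0]]

keys_hybrid(("s0", "a", "0", "s1"), ["s0", "s1", "s2"])
# -> [(("s0", "s1"), "a"), (("s0", "s2"), "a")]
```

The replication rate stays at n-1 as in the state-based method, but the fine-grained keys allow far more reducers to run in parallel.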
Proposition 4
The replication rate in the hybrid mapping method is \(r \le (n-1)\).
Theorem 1
Algorithm 2 correctly computes the successor tree for all pairs of states of an FSM \(M = (S,I,O,E)\) in a single MapReduce round using at least \(\frac{n(n-1)}{2}\times \frac{q\times |I|}{g(q)\times |E|}\) communication, where \(n=|S|\).
Proof
In the map function of Algorithm 2, getKeysFromTransition(t) returns the set of keys associated with the transition t w.r.t. a given mapping method. Then, the transition t is sent to all reducers indexed by these keys. In order to ensure that the algorithm correctly constructs the successor tree, it is necessary to have the property (*): all transitions with the same input and output symbols lie inside the same reducer. The reduce function then computes their pairwise intersection to extend the successor tree. Using the proposed mapping methods, we have:

for the mapping based on states, a reducer \( \langle s_i, s_j \rangle \) receives from the mappers all the transitions starting from state \(s_i\) or \(s_j\); as a consequence, all outgoing transitions of these two states are inside this reducer, and the property (*) is verified for this mapping method.

for the mapping based on input alphabets, a reducer receives from the mappers the transitions having the same input symbol c; hence, inside a reducer, we also have all the transitions with the same input and output symbols.

for the hybrid mapping, a reducer \( \langle s_i, s_j, c \rangle \) receives from the mappers transitions having the same input symbol c and starting from the state \(s_i\) or \(s_j\); then the property (*) is obviously verified.
The proof of the communication complexity follows from Proposition 1. \(\square \)
MapReduce algorithm for the derivation step
In this step, multiple MapReduce rounds are used to derive a set of shortest distinguishing sequences for each pair of states of an observable nondeterministic FSM. In each round, the mappers run in parallel and produce a collection of pairs \(\langle key, edge\rangle \), while the reducers process this collection and derive a set of short DSs if they exist.
In this step, we derive a shortest DS for each pair of states. It receives from the intersection step \(\frac{n(n-1)}{2} \times |I|\) truncated successor tree edges and produces \(\frac{n(n-1)}{2}\) pairs of states, each with its DS if one exists and a "not found" notation otherwise. To that end, we use a single mapping method based on states. Each map function takes as input an edge e from the truncated successor tree and produces a set of \(\langle key, e\rangle \) pairs. The key is the pair of states in the source node \(n_s(e)\) if the destination node \(n_d(e)\) is empty; otherwise, each pair of states in the destination node of e serves as a key.
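The map-side keying just described can be sketched as follows; the edge encoding (nodes as frozensets of state pairs) and the function name are our illustrative assumptions, not the paper's Algorithm 3.

```python
# Illustrative derivation-step map function: an edge (n_s, label, n_d)
# is keyed by the pairs in its source node when the destination node is
# empty, and by each pair in its destination node otherwise.

def derive_map(edge):
    n_s, label, n_d = edge
    target = n_s if not n_d else n_d
    return [(pair, edge) for pair in sorted(target)]
```

For example, the edge \(((s_0,s_1),b,(s_2,s_3))\) is keyed by \((s_2,s_3)\), while the resolved edge \(((s_0,s_1),ba,\{\})\) is keyed by its source pair \((s_0,s_1)\), which routes both kinds of edge for a pair into the same reducer.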
Let us compute the replication rate in each MapReduce round of the derivation step based on Algorithm 3. We have \(\frac{n(n-1)}{2}\) available reducers, and each reducer cannot contain more than \(q \le \frac{n(n-1)}{2} \times |I|\) edges. The number of outputs that can be produced is \(g(q)= 1\) in the last iteration; then the following proposition holds.
Proposition 5
The replication rate r in each MapReduce round of the derivation step is \( r \le \frac{n(n-1)}{2} \).
Theorem 2
Algorithm 3 correctly derives a short DS, if it exists, for all pairs of states of an FSM \(M = (S,I,O,E)\) in at most mn MapReduce rounds, where \(m=|E|\) and \(n=|S|\).
Proof
It is easy to see that a DS is the label of a path from the root node to a leaf node indexed by the empty set \(\{\}\) in the successor tree. During each MapReduce round of Algorithm 3, the successor tree Tree is compacted level by level until the stop condition holds. Without loss of generality, consider a leaf node \(ne_k\) indexed by the empty set and located at the kth level of the successor tree, and let prec(n) be the set of predecessor nodes of a node n. In each MapReduce round, the node \(ne_k\) replaces all nodes located at level \(k-1\) belonging to the set \(prec(ne_k) = \{ne^1_{k-1},\cdots , ne^l_{k-1}\}\). As a consequence, the successor tree is compacted in the following way: the label x of an edge \((ne^i_{k-1},x, ne_k)\) is concatenated with all labels of the set of edges \(\{(n,y,ne^i_{k-1}) \mid n\in prec(ne^i_{k-1})\}\), for all \(1\le i\le l\), to produce the set of edges \(\{(n,yx,ne_k) \mid n\in prec(ne^i_{k-1})\}\). The number of MapReduce rounds in Algorithm 3 is determined by the stop condition, which becomes true when the root node of the successor tree is reached and a set of DSs, if any exist, has been derived in fewer than mn iterations. \(\square \)