Composition of weighted finite transducers in MapReduce

Weighted finite-state transducers have been shown to be a general and efficient representation in many applications such as text and speech processing, computational biology, and machine learning. The composition of weighted finite-state transducers constitutes a fundamental and common operation across these applications. The high computational cost of composition, whose result can grow exponentially when more than two transducers are considered, challenges us to devise efficient large-scale algorithms. This paper describes a parallel computation of weighted finite-state transducer composition in the MapReduce framework. To the best of our knowledge, this paper is the first to tackle this task using MapReduce methods. First, we analyze the communication cost of this problem using the model of Afrati et al. Then, we propose three MapReduce methods based respectively on input alphabet mapping, state mapping, and hybrid mapping. Finally, extensive experiments on a wide range of weighted finite-state transducers are conducted to compare the proposed methods and show their efficiency on large-scale data.


Introduction
Weighted finite-state transducers (WFSTs) have been used in a wide range of applications such as digital image processing [1], speech recognition [2], large-scale statistical machine translation [3], cryptography [4], recently computational biology [5], where pairwise rational kernels are computed for metabolic network prediction, and many other applications [6][7][8]. Weighted finite-state transducers are finite-state machines in which each transition, in addition to its input symbol, is augmented with an output symbol from a possibly different alphabet, and carries a weight that is an element of a semiring. Transducers can be used to define a mapping between two languages. In some cases, the weight represents the uncertainty or the variability of information. For example, weighted transducers introduced in speech recognition [9] are used to assign different pronunciations to the same word with different ranks or probabilities. In classification and learning, kernel methods such as support vector machines are widely used [10]. In [11], Mohri et al. introduced the theory of rational kernels. The computation of a rational kernel can be done efficiently using a fast algorithm for the composition of weighted transducers. Computing the composition of WFSTs is based on the standard composition of unweighted finite-state transducers. It takes as input two or more WFSTs (T_i)_{1≤i≤n} and outputs the composed WFST T realizing the composition of all input WFSTs, such that the input alphabet of T_{i+1} coincides with the output alphabet of T_i. The time complexity of this operation is O(∏_{i=1}^{n} |T_i|), where |T_i| represents the number of states of T_i [11, 12]; in other words, if there are n input WFSTs, each having m states, the resulting WFST can reach the exponential bound of m^n states. This complexity issue leads us to devise efficient methods at large scale.
In this work, we tackle the problem of the composition of WFSTs in the MapReduce framework, which was introduced by Google as a simple parallel programming model [13].
MapReduce can be considered a Single Instruction Multiple Data (SIMD) architecture [14] that can be easily implemented over the Apache Hadoop framework [15]. We also analyze the cost model for this problem following the optimization approach introduced by Afrati et al. [16]. The cost model includes the communication cost, which is the amount of data transmitted during MapReduce computations, and the replication rate, which is the number of key-value pairs generated by all the mapper functions divided by the number of inputs. A growing number of papers deal with MapReduce algorithms for various problems [17][18][19][20]. Recently, Grahne et al. [21, 22] efficiently implemented the intersection and minimization operations of finite automata in MapReduce. To the best of our knowledge, this paper is the first approach for computing the composition of WFSTs in MapReduce. We propose three methods to perform this operation at large scale, based respectively on state mapping, input alphabet mapping, and a hybrid mapping of both input and output alphabets.
The remainder of the paper is structured as follows. The rest of this section includes the necessary technical definitions. "MapReduce framework" section is a reminder of the fundamentals of the MapReduce framework. "The composition of WFSTs in MapReduce" section presents a cost model analysis of the composition operation in MapReduce using the model of Afrati et al. "Methods" section presents and analyzes three MapReduce methods that compute the composition of many WFSTs. "Results and discussion" section reports comparative and extensive experiments showing the efficiency of our methods at large scale. "Conclusion" section concludes the paper.
Weighted finite-state transducers. Let A and B be two alphabets. WFSTs, sometimes called K-transducers over A* × B*, are transducers endowed with weights in the semiring K. The transition labels of a WFST are in (A ∪ {ε}) × (B ∪ {ε}) (ε is the empty word).
In this work, we restrict the transition labels to be in A × B. Formally, a WFST T is an 8-tuple (A, B, Q, I, F, E, λ, ρ) where: A is the finite input alphabet of the transducer; B is the finite output alphabet; Q is a finite set of states; I ⊆ Q is the set of initial states; F ⊆ Q is the set of final states; E ⊆ Q × A × B × K × Q is a finite set of transitions; λ : I → K is the initial weight function; and ρ : F → K is the final weight function. Given a transition t ∈ E, we denote by s[t] its origin or start state, d[t] its destination state, and w[t] its weight. The weight function w can be extended to paths by defining the weight of a path π = t_1 … t_k as the ⊗-product over the semiring K of the weights of its constituent transitions: w[π] = w[t_1] ⊗ ⋯ ⊗ w[t_k]. Given q, q′ ∈ Q, we denote by P(q, q′) the set of paths from q to q′ and by P(q, x, y, q′) the set of paths from q to q′ with input label x ∈ A* and output label y ∈ B*. These definitions can be extended to subsets R, R′ ⊆ Q by P(R, x, y, R′) = ⋃_{q ∈ R, q′ ∈ R′} P(q, x, y, q′). A WFST T is regulated if the output weight associated by T to any pair of input-output strings (x, y), given by [T](x, y) = ⊕_{π ∈ P(I, x, y, F)} λ(s[π]) ⊗ w[π] ⊗ ρ(d[π]), is well defined and in K. We have [T](x, y) = 0 when P(I, x, y, F) = ∅.
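To make the semiring operations concrete, here is a minimal Python sketch of the path-weight computation over the tropical semiring (where ⊗ is + and ⊕ is min); the `Transition` type and field names are illustrative assumptions, not the paper's code.

```python
from typing import NamedTuple

class Transition(NamedTuple):
    src: str      # s[t], the start state
    inp: str      # input symbol from A
    out: str      # output symbol from B
    weight: float # weight w[t], an element of the semiring K
    dst: str      # d[t], the destination state

def path_weight(path):
    """⊗-product of the weights of the path's transitions (tropical ⊗ = +)."""
    w = 0.0  # identity element of ⊗ in the tropical semiring
    for t in path:
        w += t.weight
    return w

path = [Transition("q0", "a", "b", 0.1, "q1"),
        Transition("q1", "b", "c", 0.2, "q2")]
print(round(path_weight(path), 2))  # 0.3
```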
Composition of weighted finite transducers. Composition is a fundamental operation used to create complex weighted transducers from simpler ones. Let K be a commutative semiring and let T_1 and T_2 be two WFSTs such that the input alphabet of T_2 coincides with the output alphabet of T_1. Assume that the (possibly infinite) sum ⊕_z T_1(x, z) ⊗ T_2(z, y) is well defined and in K for all (x, y) ∈ A* × B*. Then the composition of T_1 and T_2 produces a WFST denoted by T_1 ∘ T_2 and defined for all x, y by [11]: (T_1 ∘ T_2)(x, y) = ⊕_z T_1(x, z) ⊗ T_2(z, y). There exist efficient composition algorithms for WFSTs [26]. States in the composition T_1 ∘ T_2 of two WFSTs T_1 and T_2 are identified with pairs of a state of T_1 and a state of T_2. Its initial state is the pair of the initial states of T_1 and T_2. Its final states are pairs of a final state of T_1 and a final state of T_2. The following rule specifies how to derive a transition of T_1 ∘ T_2 from matching transitions of T_1 and T_2: transitions (q_1, a, b, w_1, q_1′) of T_1 and (q_2, b, c, w_2, q_2′) of T_2 yield the transition ((q_1, q_2), a, c, w_1 ⊗ w_2, (q_1′, q_2′)). In the worst case, all transitions of T_1 leaving a state q match all those of T_2 leaving a state q′, thus the space and time complexity of composition is quadratic: O(|T_1||T_2|). See [9] for a detailed presentation of the algorithm. Figure 1 illustrates WFST composition.
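The pairwise transition-matching rule above can be sketched as a naive nested-loop composition over the tropical semiring (⊗ = +); the 5-tuple transition layout is an assumption for illustration, not the paper's implementation.

```python
def compose(T1, T2):
    """T1, T2: lists of transitions (src, a, b, w, dst).
    Returns the transitions of T1 ∘ T2, whose states are pairs of states."""
    result = []
    for (q1, a, b, w1, r1) in T1:
        for (q2, b2, c, w2, r2) in T2:
            if b == b2:  # output symbol of T1 matches input symbol of T2
                # tropical semiring: w1 ⊗ w2 is w1 + w2
                result.append(((q1, q2), a, c, w1 + w2, (r1, r2)))
    return result

T1 = [("p0", "a", "b", 1, "p1")]
T2 = [("s0", "b", "c", 2, "s1")]
print(compose(T1, T2))
# [(('p0', 's0'), 'a', 'c', 3, ('p1', 's1'))]
```

The double loop makes the worst-case quadratic O(|T_1||T_2|) behavior explicit: every transition pair is examined once.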
In the general case, when considering the composition of many WFSTs, noted (T_i)_{1≤i≤n}, the space and time complexity of composition is O(∏_{i=1}^{n} |T_i|). In this work, we propose simple and efficient parallel methods to compute the composition of many WFSTs in the MapReduce framework.

MapReduce framework
Big Data refers to large and heterogeneous collections of datasets that are difficult to process using traditional data processing tools. Nowadays, the collected datasets come mostly from social networks and scientific applications. To overcome the computational and storage challenges of Big Data, various solutions have been successfully proposed. Among the popular approaches is the well-known Hadoop MapReduce framework [15].
In this section, we focus on how a distributed computing program works over the Hadoop MapReduce model. First, the Hadoop framework and the MapReduce programming model are briefly presented. Then, we describe how this system processes units of data, ⟨key, value⟩ pairs, in a parallel MapReduce fashion.

Hadoop framework
Apache Hadoop is one of the most popular open-source frameworks for clustered environments, providing reliable, scalable, and distributed storage. It also supports the processing of large datasets through simple programming models. It manages computer clusters built from a single machine up to thousands of machines, each offering local computation and storage. The failure of a node in the cluster is handled automatically by re-assigning its tasks to another node [15].
The Apache Hadoop project includes four fundamental modules: Hadoop Common (Hadoop Core), HDFS, YARN, and Hadoop MapReduce. Figure 2 schematizes these components.
• Hadoop Common or Hadoop Core provides essential services and basic processes such as abstraction of the underlying operating system and its file system. It also contains the Java libraries and scripts required to start Hadoop. The Hadoop Common package also provides source code and documentation, as well as a contribution section that includes different projects from the Hadoop community [15].
• Hadoop Distributed File System (HDFS) is a distributed file system developed by Apache Hadoop. It ensures high-throughput storage and access to application data on the cluster machines, thus providing very high aggregate bandwidth across the cluster [15], high fault tolerance, and native support for large data sets [27].
Fig. 1 (1) Weighted transducer T_1 and (2) weighted transducer T_2 over the tropical semiring. (3) Composition of T_1 and T_2
• Hadoop Yet Another Resource Negotiator (YARN) is a platform for managing cluster resources and scheduling tasks. It was added in the Hadoop 2.0 version to increase capacity by removing the 4000-node limit and Hadoop's inability to perform fine-grained resource sharing between multiple computation frameworks [28].
• Hadoop MapReduce is an implementation of the MapReduce programming model based on the YARN system for parallel processing of large data [29].

The MapReduce programming model
MapReduce is a programming model proposed by Google in 2004 [13] that provides parallel processing of large-scale data. It is easy to use and expresses a large variety of problems as MapReduce computations in a flexible way, which simplifies large-scale data processing [13]. The MapReduce programming model processes the basic unit of information as a ⟨key, value⟩ pair, where key and value are two objects. This computational model has three principal steps, Map, Shuffle, and Reduce, as schematized in Fig. 3.
During the Map step, the model reads each ⟨key, value⟩ pair from the given input files. The Mapper then operates on one pair at a time by calling the user-defined map function, produces as output a finite multiset of new ⟨key, value⟩ pairs, and assigns the new pairs to reducers through a hash function. This allows different machines to process different map inputs in an easily parallel way. The Shuffle step occurs automatically; it is done by Hadoop to manage the exchange of intermediate data from the map tasks to the reduce tasks. This step can be divided into three phases. The sort phase orders the set of intermediate keys received from the buffered mappers; it lets a reducer know that a new reduce task should start when the next key in the sorted input data differs from the previous one. The merge phase groups all intermediate values having the same key key into one list, creating the pair ⟨key, list of values⟩. The partitioner phase determines to which reducer a ⟨key, list of values⟩ pair is sent, based on a hash function that assigns each pair to a reducer. In the last step, the reducers, which receive the sorted ⟨key, list of values⟩ pairs and can execute simultaneously while operating on different keys, reduce each set of intermediate values sharing a key to a new, smaller set of values. In our problem, each reducer emits zero, one, or multiple outputs for each input ⟨key, list of values⟩ pair.
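The three steps above can be simulated in a few lines. This toy driver (all names are hypothetical, not Hadoop's API) makes the map, shuffle, and reduce data flow explicit, illustrated with the classic word-count functions.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map: each input record yields a multiset of (key, value) pairs
    intermediate = []
    for record in inputs:
        intermediate.extend(map_fn(record))
    # Shuffle: sort pairs and group values by key
    groups = defaultdict(list)
    for k, v in sorted(intermediate):
        groups[k].append(v)
    # Reduce: each (key, list of values) yields zero or more outputs
    output = []
    for k, vs in groups.items():
        output.extend(reduce_fn(k, vs))
    return output

# Word count as a sanity check of the driver
pairs = run_mapreduce(["a b a", "b"],
                      lambda line: [(w, 1) for w in line.split()],
                      lambda k, vs: [(k, sum(vs))])
print(sorted(pairs))  # [('a', 2), ('b', 2)]
```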

The composition of WFSTs in MapReduce
In this section, we present a generic MapReduce algorithm to perform WFST composition. We also study the communication cost based on the model of Afrati et al. [16] and analyze the replication rate for combining WFSTs.

Generic MapReduce algorithm for the composition of WFSTs
In the following, we present a general approach to perform the composition of WFSTs using a single round of MapReduce, and we define and detail the map and reduce functions respectively.
In the preprocessing phase, our algorithm stores in a text file all the transitions of the WFSTs (T_i)_{1≤i≤n}, each having m states. A transition t from a WFST T_i is represented as a 4-tuple (t, type(s[t]), type(d[t]), index(t)), where type() is a function that maps a state to an element of the set {i (initial), f (final), if (initial and final), s (simple)} and index(t) = i gives the order index in the composition of the WFST containing t.
Fig. 3 The principal steps of MapReduce [30]
The Map function takes as input a ⟨key, value⟩ pair, where key represents an arbitrary instance identifier and value is the 4-tuple form associated with a transition t, and outputs a collection of ⟨k, t⟩ pairs, where k is a key associated with the transition t.
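The 4-tuple encoding of the preprocessing phase might be serialized as follows; the comma-separated line format and helper names are assumptions for illustration, not the paper's actual file format.

```python
def state_type(q, initial, final):
    """Map a state to {i, f, if, s} as in the 4-tuple encoding."""
    if q in initial and q in final:
        return "if"
    if q in initial:
        return "i"
    if q in final:
        return "f"
    return "s"

def encode(t, initial, final, index):
    """Serialize transition t = (src, a, b, w, dst) from WFST T_index
    as one text line: the transition, both state types, and the index."""
    src, a, b, w, dst = t
    fields = [src, a, b, str(w), dst,
              state_type(src, initial, final),
              state_type(dst, initial, final),
              str(index)]
    return ",".join(fields)

line = encode(("q0", "a", "b", 0.5, "q1"), {"q0"}, {"q1"}, 1)
print(line)  # q0,a,b,0.5,q1,i,f,1
```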
The Reduce function performs the composition of the different transition lists produced by the Shuffle step as follows: one transition is selected from each WFST such that the input symbol of the transition from T_{i+1} coincides with the output symbol of the transition from the previous WFST T_i. Concretely, the Reduce function groups the transitions w.r.t. their WFST index into a list of sets (line 2 of Algorithm 2) and computes the Cartesian product of those sets (line 4 of Algorithm 2).
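A sketch of this Reduce logic: group transitions by WFST index, join the groups via a Cartesian product, and keep only the chains whose symbols match. The 6-tuple transition layout and the tropical ⊗ = + are assumptions for illustration.

```python
from itertools import product

def reduce_compose(values, n):
    """values: transitions (src, a, b, w, dst, index) sharing one key.
    Emits composed transitions of T_1 ∘ ... ∘ T_n found in this reducer."""
    groups = [[] for _ in range(n)]
    for t in values:
        groups[t[5] - 1].append(t)        # group by 1-based WFST index
    out = []
    for combo in product(*groups):        # Cartesian product of the groups
        # keep only chains where output of T_i equals input of T_{i+1}
        if all(combo[i][2] == combo[i + 1][1] for i in range(n - 1)):
            src = tuple(t[0] for t in combo)
            dst = tuple(t[4] for t in combo)
            w = sum(t[3] for t in combo)  # tropical ⊗-product of weights
            out.append((src, combo[0][1], combo[-1][2], w, dst))
    return out

vals = [("p0", "a", "b", 1, "p1", 1), ("s0", "b", "c", 2, "s1", 2)]
print(reduce_compose(vals, 2))
# [(('p0', 's0'), 'a', 'c', 3, ('p1', 's1'))]
```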
In the following, we discuss the communication cost of the proposed algorithm according to the model of Afrati et al. [16].

The communication cost model
The communication cost model introduced by Afrati et al. [16] is powerful and simple. It gives a good way to analyze problems and to optimize performance on any distributed computing environment by explicitly studying the inherent trade-off between communication cost and degree of parallelism. Applied to the MapReduce framework, this model lets us determine the relevant algorithm for a problem by analyzing the trade-off between reducer size and communication cost in a single round of MapReduce computation. Two parameters capture the trade-off involved in designing a good MapReduce algorithm. The first is the reducer size, denoted by q, which represents the size of the largest list of values associated with a key that a reducer can receive. The global cost is the sum of the computation costs over all reducers, each processing all of its associated values. The second parameter is the amount of communication between the map step and the reduce step. The replication rate, denoted by r, is defined as the average number of key-value pairs that the mappers create from each input. Formally, suppose that we have p reducers and that q_i ≤ q inputs are assigned to the i-th reducer. Let |I| be the total number of distinct inputs; then the replication rate [16] is given by r = Σ_{i=1}^{p} q_i / |I|. Notice that limiting the reducer size enables more parallelism: a small reducer size forces us to redefine the notion of a key in order to allow more, smaller reducers and thus more parallelism across the available nodes.
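The replication rate formula r = Σ_{i=1}^{p} q_i / |I| is a one-liner; the example numbers below are illustrative only.

```python
def replication_rate(reducer_inputs, total_inputs):
    """r = (sum of the inputs received by all reducers) / |I|."""
    return sum(reducer_inputs) / total_inputs

# p = 4 reducers each receive q_i = 6 inputs drawn from |I| = 8 distinct
# inputs, so on average each input is replicated to 3 reducers:
print(replication_rate([6, 6, 6, 6], 8))  # 3.0
```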

Lower bound on the replication rate
The replication rate is intended to model the communication cost, which is the total amount of information sent from the mappers to the reducers. The trade-off between the reducer size q and the replication rate r is expressed through a function f such that r = f(q). The first task in designing a good MapReduce algorithm for a problem is to determine the function f, which gives a lower bound on the replication rate r [16].
Let us now derive a tight upper bound, namely g(q), on the number of outputs that can be produced by a reducer of size q for WFST composition. Suppose there are n deterministic WFSTs (T_i)_{1≤i≤n}, each having |T_i|/|A_i| transitions for each input symbol. Let T be the result of the composition T_1 ∘ T_2 ∘ ⋯ ∘ T_n. To compute a transition of T, a reducer needs to receive n transitions, one from each WFST T_i; hence the WFST T can reach ∏_{i=1}^{n} |T_i| transitions. Assume that a reducer of size q receives q/n transitions from each WFST T_i, evenly distributed such that the input alphabet of T_{i+1} coincides with the output alphabet of T_i. The following lemma gives an upper bound on the output of a reducer.

Lemma 1 When computing the composition T_1 ∘ T_2 ∘ ⋯ ∘ T_n, a reducer of size q can produce no more than g(q) = (q/n)^n outputs.
From [16], one can compute a lower bound on the replication rate for the composition of WFSTs as a function of q using the expression r ≥ q · |O| / (g(q) · |I|), where |I| denotes the input size and |O| denotes the output size. The input size for our problem is the total number of transitions of the input WFSTs, that is |I| = Σ_{i=1}^{n} |T_i|, and the output size is the size of T, i.e. |O| = ∏_{i=1}^{n} |T_i|.
Consequently, the lower bound on the replication rate for the composition of WFSTs is as follows.

Proposition 1 The replication rate for the composition of WFSTs satisfies r ≥ (n^n / q^{n-1}) · ∏_{i=1}^{n} |T_i| / Σ_{i=1}^{n} |T_i|.

Methods
This section describes three mapping methods designed to give a suitable key format that maps a set of transitions to the same reducer. Explicitly, we define the key-generation function called in line 2 of Algorithm 1 (getTransitionSetKeys). We also present a formal analysis of the communication cost by computing an upper bound on the replication rate for each mapping method. Recall that we consider the composition of n WFSTs, each having m states.

States based mapping method
In our first method, for a transition t ∈ E_i from a WFST T_i, the map function generates the set of keys of the form K_state = (i_1, i_2, …, h(s[t]), …, i_n), where h is a hash function from Q_i to {1, …, m} whose value occupies the i-th position, and i_j ∈ {1, …, m} for j ≠ i. Consequently, the mappers produce key-value pairs of the form ⟨(i_1, i_2, …, h(s[t]), …, i_n), t⟩. In other words, supposing we have m^n reducers, each transition is sent to m^{n-1} of them.
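The state-based key generation can be sketched as follows; the toy hash `h` and all names are illustrative assumptions. The slot for this WFST is fixed to h(s[t]) and every other slot ranges over {1, …, m}.

```python
from itertools import product

def state_keys(src_state, wfst_index, n, m, h):
    """All keys (i_1, ..., h(s[t]), ..., i_n) for a transition of T_i:
    the component at position wfst_index is fixed to h(s[t]); the other
    n-1 components range over {1, ..., m}."""
    fixed = h(src_state)
    slots = [range(1, m + 1)] * n
    return list(product(*slots[:wfst_index - 1], [fixed],
                        *slots[wfst_index:]))

h = lambda q: (hash(q) % 2) + 1      # toy hash into {1, 2}
ks = state_keys("q0", 2, 3, 2, h)    # n = 3 WFSTs, m = 2 states
print(len(ks))  # 4 = m^(n-1) reducers receive this transition
```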
The function g(q) is affected by the presence, inside the same reducer, of transitions that cannot be combined. This gives a new upper bound on the number of outputs each reducer can produce. The following proposition gives an upper bound on the replication rate.

Proposition 2 The replication rate r in the state-based mapping scheme is r = m^{n-1}, since each transition is replicated to m^{n-1} of the m^n reducers.

Notice that, from Propositions 1 and 2, the upper bound on the replication rate exceeds the lower bound by a factor of q / (n × |A_1|). As a consequence, this mapping scheme is suitable when the input alphabets are small.

Input alphabet based mapping
In the second method, transitions are mapped according to their input symbols. Let us define the key K_input = (j_1, j_2, …, j_i, …, j_n) associated with input symbols, and let h_a be a hash function with range {1, 2, …, k}. A transition t ∈ E_i from the WFST T_i, with input symbol a_i, is associated with a key K_input whenever j_i = h_a(a_i). Thus, we have ∏_{j=1}^{n} |A_j| available reducers (taking k = |A_j|), and each transition from T_i is sent to ∏_{j≠i} |A_j| of them. Since the map task processes Σ_{i=1}^{n} |T_i| transitions and each WFST T_i has |T_i|/|A_i| transitions associated with each input symbol c ∈ A_i, the total number of transitions sent to each reducer is n × |T_i|/|A_i|, which can be approximated by n × m. However, the function g(q) is influenced by the presence of incompatible output and input symbol combinations inside a reducer. This gives a new upper bound on the number of outputs each reducer can produce.
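A sketch of the input-alphabet key generation, fixing the component j_i = h_a(a) for the transition's input symbol a and letting the other components range over {1, …, k}; the names are illustrative assumptions.

```python
from itertools import product

def input_alphabet_keys(a, wfst_index, n, k, h_a):
    """All keys (j_1, ..., j_n) over {1, ..., k}^n whose component at
    position wfst_index equals h_a(a); the transition carrying input
    symbol a is therefore replicated to k^(n-1) reducers."""
    slots = [range(1, k + 1)] * n
    return [key for key in product(*slots)
            if key[wfst_index - 1] == h_a(a)]

h_a = lambda sym: (ord(sym) % 2) + 1  # toy hash into {1, 2}
ks = input_alphabet_keys("a", 1, 3, 2, h_a)
print(len(ks))  # 4 = k^(n-1)
```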

Proposition 3 The replication rate in the input-alphabet-based mapping scheme is r = k^{n-1} when all alphabets have size k, since each transition is replicated to k^{n-1} reducers.

From Propositions 1 and 3, the upper bound on the replication rate exceeds the theoretical lower bound by a factor of q / (n × |Q_1|). Therefore, the input-alphabet-based mapping is best suited to situations where the considered WFSTs are small.

Hybrid mapping based on both input and output alphabets
In the last method, we propose a hybrid mapping based on both the input and output alphabets. In other words, keys are associated with a pair of input and output symbols. Formally, a transition t ∈ E_i from a WFST T_i is paired with the set of keys of the form K_hybrid = (j_1, j_2, …, h^i_a(t), h^o_b(t), …, j_n), where h^i_a is an input-symbol hash function from ⋃_{i=1}^{n} A_i to {1, 2, …, k} and h^o_b is an output-symbol hash function from ⋃_{i=1}^{n} B_i to {1, 2, …, k}. Transitions from T_1 are mapped according to their output symbols only, and those of T_n according to their input symbols only. Consequently, the number of reducers is k^n, and the function g(q) reduces to |Q_1|.
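A sketch of the hybrid key generation; the exact positions of the two fixed components are an assumption about the key layout, since only the general form of K_hybrid is given above. A middle transducer's transition fixes two adjacent components (its input hash and its output hash), so it is replicated to k^(n-2) reducers, while T_1 and T_n fix only one component each.

```python
from itertools import product

def hybrid_keys(t_in, t_out, i, n, k, h_in, h_out):
    """All keys of length n over {1, ..., k} compatible with a transition
    of T_i: middle WFSTs (1 < i < n) fix the input-hash and output-hash
    slots, T_1 fixes only its output slot, T_n only its input slot."""
    keys = []
    for combo in product(range(1, k + 1), repeat=n):
        if 1 < i < n:
            ok = (combo[i - 2] == h_in(t_in)
                  and combo[i - 1] == h_out(t_out))
        elif i == 1:
            ok = combo[0] == h_out(t_out)
        else:  # i == n
            ok = combo[n - 1] == h_in(t_in)
        if ok:
            keys.append(combo)
    return keys

h = lambda s: (ord(s) % 2) + 1  # toy hash into {1, 2}
print(len(hybrid_keys("a", "b", 2, 4, 2, h, h)))  # 4 = k^(n-2)
```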
Since transitions from T_2 through T_{n-1} are sent to k^{n-2} reducers, the following proposition holds.

Proposition 4
The replication rate in the hybrid mapping method is strictly less than the replication rate in the input symbol mapping method.
By comparing Propositions 2, 3, and 4, we deduce that the upper bound on the replication rate in the hybrid mapping method is the closest to the theoretical lower bound. Thus, this method is best suited to situations where the alphabet size is less than or equal to the number of states.

Results and discussion
This section includes extensive experiments to evaluate the efficiency and effectiveness of the proposed methods in terms of communication cost and execution time for computing the composition of five WFSTs, T_1 ∘ T_2 ∘ ⋯ ∘ T_5. Experiments are conducted on a large variety of WFST data sets randomly generated using the FAdo library [31] with various combinations of attributes, including the number of states |Q|, the input alphabet size |A|, and the output alphabet size |B|.

Cluster configuration
Our experiments were run on Hadoop on the French scientific testbed Grid'5000 [32] at the Lille site. We used a cluster of 15 nodes with 30 CPUs and 300 cores in total. Each node is a machine equipped with two 10-core Intel Xeon E5-2630 v4 processors, 256 GB of main memory, and two 300 GB hard disk drives (HDD). The machines are connected by a 10 Gbps Ethernet network and run 64-bit Debian 9. The Hadoop version installed on all machines was 2.7.

Data generation method
We randomly generated a large variety of WFST data sets in two phases. First, we generated a set of deterministic finite automata using the FAdo library [31], an open-source project providing a set of tools for the symbolic manipulation of automata. FAdo is based on the enumeration and generation of initially connected deterministic finite automata [33]. In the second phase, we implemented a function that randomly adds weights and some degree of nondeterminism on the output symbols over the transitions of the finite automata from the first phase, with a uniform distribution. Our generation technique produces WFSTs based on two parameters: the number of states m and the input alphabet size k. One can thus define the transition density of the generated WFST as the ratio |E|/(k × m) and the final-state density as the ratio |F|/m, and consider a unique initial state as in [34].
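The two-phase generation can be mimicked (without FAdo) by the following sketch; the structure, parameter names, and uniform sampling are assumptions that only imitate the shape of the generated data, not the paper's actual generator.

```python
import random

def random_wfst(m, k, density=1.0, seed=0):
    """Phase 1 stand-in: a complete transition skeleton over an input
    alphabet of size k. Phase 2: random output symbols and weights are
    added uniformly. density = |E| / (k * m) controls how many of the
    possible (state, input symbol) slots carry a transition."""
    rng = random.Random(seed)
    A = [f"a{i}" for i in range(k)]   # input alphabet
    B = [f"b{i}" for i in range(k)]   # output alphabet
    E = []
    for q in range(m):
        for a in A:
            if rng.random() <= density:
                E.append((q, a, rng.choice(B), rng.random(),
                          rng.randrange(m)))
    return {"states": m, "E": E, "initial": {0}, "final": {m - 1}}

T = random_wfst(4, 3, density=1.0)
print(len(T["E"]) / (3 * 4))  # transition density |E| / (k * m) = 1.0
```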

Communication cost analysis
Let us evaluate the communication cost of the proposed methods on large-scale WFST data sets. The communication cost is defined as the total number of key-value pairs transferred from the map phase to the reduce phase. It can be optimized by minimizing the replication rate, i.e. the number of input copies sent to the reducers.
The following table gives us the relationship between the considered data set sizes and the communication cost:

[Table: input file size (kB), communication cost (GB), and output size for each method]
The obtained results clearly show that the communication cost using the hybrid mapping method is, in general, minimal in all situations, i.e. for all combinations of state sizes and alphabet sizes. This is due to the fact that the number of transition copies sent to the reducers with this method is smaller than with the other ones, as proved formally in Proposition 4. In some particular cases of WFSTs, when the state size is less than the alphabet size, the state-based mapping has a lower communication cost; this coincides with Propositions 2 and 3.

Computation cost analysis
The computation cost is the time required to execute a MapReduce job. The graphs in Figs. 4, 5, 6 and 7 show comparative results in terms of the execution time of the three methods: state-based mapping (State), input-alphabet-based mapping (Input), and hybrid mapping (InOut).
Figures 4, 5 and 6 present the execution times of the three proposed methods for different data set sizes when the alphabet size is 16, 32 or 64. In the three cases, the growth rates of the hybrid and input-alphabet mappings are close together and much lower than that of the state mapping. Figure 7 compares the three methods when the alphabet size equals the number of states. As foreseen, the hybrid method is clearly more efficient when the alphabet size is less than or equal to the number of states, and the per-transition execution time of this method is lower than that of the two other methods.
Minimizing the replication rate decreases the time used by the mappers to replicate each transition and avoids the presence of transitions that cannot be combined inside the same reducer. At the same time, it reduces the number of transitions assigned to each reducer. On the other hand, using an adequate number of reducers diminishes the time a reducer spends waiting for a CPU.

Conclusion
In this paper, we presented a new parallel approach to compute the composition of WFSTs at large scale in the MapReduce framework. We described in detail three methods to carry out this task using a single round of MapReduce. Moreover, we analyzed the communication and computation cost of each method. Finally, we evaluated the performance of the three methods on different data sets. The obtained results show that the best method in terms of execution time is the one that minimizes the number of reducers and optimizes the input replication rate.
As a perspective, this work is a first step towards applying the approach to real-world problems. The first target application is the design of a new distributed cryptosystem based on finite automata, the so-called finite automaton public key cryptosystem. In this application, the public key is a composition of n + 1 finite automata, and the private key consists of the n + 1 weak inverse finite automata of them [4]. The second target application is in natural language processing, to handle tasks such as translation between two languages, possibly using one or multiple intermediate languages, as well as speech recognition. We also plan to extend and test this method in a multi-node cluster environment with GPU and OpenMP support to accelerate the composition algorithm for large-scale computing.

Algorithm 1: Map
  input : ⟨key, value⟩ pair, where key is an arbitrary instance identifier and value is the 4-tuple form of a transition t
  output: collection of ⟨k, t⟩ pairs, where k is a key associated with the transition t
  1  t ← getTransitionFrom(value)          // create the transition t from the input value
  2  setOfKeys ← getTransitionSetKeys(t)   // generate the set of keys associated with t
  3  foreach k in setOfKeys do             // replicate the transition t
  4      Emit(⟨k, t⟩)

The Map function produces a set of key-value pairs from each input record. In other words, the input transition is replicated and associated with all the keys generated from it by the mapping method (line 2 of Algorithm 1). These intermediate outputs add a replication-rate factor to the cost of the MapReduce algorithm. The outputs of the Map function are fed into the Shuffle step, which, we recall, occurs automatically in our implementation.

Algorithm 2: Reduce
  input : ⟨key, values⟩ pair, where values is the set of transitions sharing the key key
  output: transitions of the resulting composed WFST
  1  foreach transition t in values do     // group transitions w.r.t. their WFST index
  2      add t to TransList[index(t)]
  3  JoinedTransList ← TransList[1] × ⋯ × TransList[n]   // Cartesian product of the sets
  4  foreach (t_1, …, t_n) in JoinedTransList do          // compute the composition
  5      t ← t_1 ∘ ⋯ ∘ t_n
  6      Emit(⟨null, t⟩)

Fig. 4 Execution times of the three methods for alphabet size K = 16
Fig. 5 Execution times of the three methods for alphabet size K = 32

Fig. 6 Execution times of the three methods for alphabet size K = 64
Fig. 7 Execution times of the three methods when the alphabet size equals the number of states

E ⊆ Q × A × B × K × Q is the finite set of transitions; λ : I → K is the initial weight function; and ρ : F → K is the final weight function mapping F to K. The size of T, denoted by |T|, is the number of its transitions. Given a transition t ∈ E, we denote by s[t] its origin or start state, d[t] its destination or next state, and w[t] its weight. A path π = t_1 … t_k is an element of E* with consecutive transitions: d[t_{i-1}] = s[t_i], i = 2, …, k. We extend s and d to paths by setting s[π] = s[t_1] and d[π] = d[t_k].