Dissimilarity space reinforced with manifold learning and latent space modeling for improved pattern classification

Dissimilarity representation plays a very important role in pattern recognition due to its ability to capture structural and relational information between samples. Dissimilarity space embedding is an approach in which each sample is represented as a vector based on its dissimilarity to some other samples called prototypes. However, lack of neighborhood-preserving, fixed and usually considerable prototype set for all training samples cause low classification accuracy and high computational complexity. To address these challenges, our proposed method creates dissimilarity space considering the neighbors of each data point on the manifold. For this purpose, Locally Linear Embedding (LLE) is used as an unsupervised manifold learning algorithm. The only goal of this step is to learn the global structure and the neighborhood of data on the manifold and mapping or dimension reduction is not performed. In order to create the dissimilarity space, each sample is compared only with its prototype set including its k-nearest neighbors on the manifold using the geodesic distance metric. Geodesic distance metric is used for the structure preserving and is computed using the weighted LLE neighborhood graph. Finally, Latent Space Model (LSM), is applied to reduce the dimensions of the Euclidean latent space so that the second challenge is resolved. To evaluate the resulted representation ad so called dissimilarity space, two common classifiers namely K Nearest Neighbor (KNN) and Support Vector Machine (SVM) are applied. Experiments on different datasets which included both Euclidean and non-Euclidean spaces, demonstrate that using the proposed approach, classifiers outperform the other basic dissimilarity spaces in both accuracy and runtime.

this method is a bridge between statistical and structural approaches in pattern recognition [3].
One of the most challenging points of this embedding is the prototype selection approach, which includes the selection strategy and the number of prototypes. It will have a very significant impact on reducing the computational complexity of measuring dissimilarity, the dimensions of the final embedding space and subsequently improving classifier's performance. Another significant point is the lack of attention to the data structure and preserving neighbors of samples while constructing and embedding to the dissimilarity space.
Our contribution in this paper is to provide a method for selecting prototypes considering the manifolds and data structure, then reducing dimensions and complexity of the proposed dissimilarity space. In the proposed method, prototype set P is not fixed for all samples and each sample has its own specific prototypes which are its neighbors on the manifold. So, one sample is not forced to measure its differences with all other members of prototype set. Also dimensions of the proposed dissimilarity space is reduced by Latent Space Model. In this way, set P is pruned and dimension of dissimilarity space is reduced so it is likely that the computational complexity will decrease while the performance of the classifiers in terms of accuracy and runtime can be improved.
The rest of the paper is organized as follows. "Related works" section describes the related works in this area. Then, in "Background knowledge" section, explanations are given about the LLE method, geodesic distance and the latent space model as the tools used in the proposed method. Then the proposed approach is fully explained step-bystep in "The proposed method" section. In "Experimental results and discussion" section, the evaluation datasets and the evaluation scenarios are discussed and the results are analyzed. Finally, "Conclusion" section summarizes the whole paper and proposes some guidance for the future works.

Related works
In 1985, Goldfrab proposed to compute the differences between structural data to represent them rather than a feature-based representation for the first time. Since it was not a useful approach, it could not be conspicuous enough for researchers. Pseudo-Euclidean space was also proposed by Goldfrab in the same year [3,4]. After 1995, Duin and Pȩkalska [1] began their research in this area, and eventually in 2005 introduced a distance-based method called the dissimilarity representation as a substitute for the feature-based representation. Dissimilarity space was first presented in 2008 by Pȩkalska [5] as a way to map structural data to vector space, so that traditional learning algorithms can be applied. Also, in 2012, the dissimilarity space was described as a bridge between structural and statistical approaches [3].
Nowadays deep learning models are pioneer in surveillance applications, but to achieve higher performance in reidentification persons, dissimilarity space is a better approach [22,23]. In [22] dissimilarity representation is used to identify differences occurred in source and target captured videos. Also a dissimilarity-based loss function which is optimized by gradient descent is suggested to apply in identification system systems.
Prototype selection has an important role in reducing dimensions of dissimilarity space. In the prototype selection approaches, the most diverse samples have to be selected as prototypes and among similar samples only one will be selected [24]. In 2006, Pȩkalska et al. [25] proposed some methods for pruning the prototype set (P) such as Random, RandomC, Kcenters, and EditCon. Also Calaña et al. [26] used a genetic algorithm as a method for prototype selection. In 2015, a method for generating and selecting prototypes for structural data is presented [12]. Principal Component Analysis (PCA) is used as a method to reduce the dimension of the dissimilarity space [18] and Genetic Algorithm (GA) applied in the approximate search method in dissimilarity space [27].
In 2021, Silva et al. proposed a prototype generation method and estimates optimal number of prototypes [28]. This algorithm in first step with a Self-Organizing Maps neural network generates a reduced training samples then in next step uses a classifier based on the nearest informative neighbors.
Recent approaches of prototype selection applied to multi-label data classification [29], click fraud detection [30] and generating business process model [31] are proposed in the last two years. In [30], first of all divides imbalanced dataset into four groups and applies under-sampling technique to balance dataset. At the second step, only relevant samples are chosen as prototypes in order to reduce size of training set.
In [48] a novel prototype learning approach is proposed which is a modified version of Zero-shot learning which learns prototypes from unseen classes via training a classification model. This approach is a simple and efficient approach to learn prototypes from the class-level semantic information.
In all previously studied works, the prototype set P is constant for all samples. That is, any sample must necessarily be compared to all the members of prototype set P, and then the dissimilarity space is created. Therefore, in this paper the proposed set P is defined dynamically and unevenly for all samples. In other words, a sample is not required to be compared with all the prototypes in set P to create the dissimilarity space, but also considering the data structure and neighborhood of each sample on the manifold, it is just compared with a limited set of members. The main goals of this new framework are to reduce the complexity and dimensions of space and also to achieve higher classification performance and lower runtime.
Since similar data which belong to the same class have little differences, so they are closer to each other in the dissimilarity space, and vice versa. As a result, it's likely that different class samples will be more distinct in this space, and so construction of classifiers will be more justified [25]. The most commonly used classifiers in the dissimilarity space are KNN and Linear SVM [3,7,14,18]. In spite of the existence of higher performance classifiers, KNN classifier is one of the most widely used classifiers in this space, because of its simplicity and good behavior at metric-based spaces. Therefore, in this paper the KNN and SVM classifiers are chosen to evaluate our proposed method.

Locally linear embedding (LLE)
This learning algorithm recognizes the fundamental structure of the manifold and reduces dimensions. LLE is nonlinear, based on Eigen vectors and preserves the local neighborhood of samples [32]. It has three main steps: Finding l neighbors for each x i sample with D dimensions. Reconstructing each sample based on the weighted linear combination of its l nearest neighbors and obtaining the optimal weights W ij by minimizing the following equation. In other words, a weighted neighborhood graph is constructed in this step and W ij =0 if x j is not neighbor of x i [33]: With the optimal weights W ij and minimizing the following equation, new embedded y i vectors with d dimensions (d < D) can be obtained [32].
Some of its advantages are simplicity of implementation, escaping from local minima, mapping high dimensional data into a single coordinate system of lower dimensions and faster rate in calculating the Eigenvalues of the matrix M = I − W T (I − W ) due to the fact that many weights are zero [32,34].

Geodesic distance
In graph theory, the sequence of edges between two vertices is called path. The length of the path between two vertices in an unweighted graph is the number of edges and in a weighted graph is the sum of the weights of the edges between them. The distance between two vertices, which is equal to the length of the shortest path between them, is called geodesic distance [35].
In this step, an undirected weighted graph is created according to the LLE manifold. Each point on the LLE manifold is assumed as a vertex ( y i ) and Euclidean distance of two adjacent vertices is the weight of their connected edge (Fig. 1).
In order to compute geodesic distance between two vertices y i andy j , if they are neighbors, the following equation is used: But if they are not neighbors, the length of shortest distance is approximately considered as geodesic distance [33].

Latent Space Model (LSM)
Latent Space Model (LSM) was proposed in 2002 by Hoff et al. [36] to estimate similarity between nodes and their connectivity in the social networks. In this model, for each node x i in the X N * N input network, there is an embedded position z i in the unseen latent spaceZ N * Q . This latent space has three characteristics: (1) Euclidean property: many traditional classification algorithms can be performed in this space.
(2) Its dimensions are lower than initial input network' dimensions (Q ≪ N ) so this model can be used as dimension reduction method. (3) The probability of the communication edge,Y ij , between the two nodes x i and x j in the network, depends on the Euclidean distance between their positions in the latent space, z i and z j [36]. The process of the model is as follows: • Initialization The latent space Z N * Q is initialized randomly with a normal distribution with zero mean and unit variance.
• Inference In this step, Maximum a Posteriori (MAP) method is used to estimate hidden space Z under the model. Equation 6 estimates the latent variable Z by the MAP method where X presents input network and Z presents the latent space.
In order to estimate the hidden variable Z by MAP, the gradient descent optimization method is used in a definite number of iterations until it converges to local optima [37].
• Estimate connected edges This step is optional unless you are looking to analyze latent space and to estimate connectivity patterns. As the probability of being the edge Y ij between the two nodes x i and x j in the network depends on the Euclidean distance between their positions in the latent space, z i and z j so z i − z j is calculated. Then the probability of the edge, Y ij is computed in terms of the Poisson probability distribution function.

The proposed method
The framework of the proposed method as illustrated in Fig. 2 consists of five steps. In the following, each step of the approach is described as an algorithm.
In Algorithm 1, we use the Locally Linear Embedding (LLE) algorithm, which is one of the neighborhood-preserving manifold learning algorithms to learn the data manifold. The input matrix X N * D contains training samples. N is the number of samples in training set, and D presents the number of the dimensions of each sample. The output matrix Y N * d where N is the number of samples and d shows the dimension of the embedded samples where d is lower than D ( d ≪ D) . The LLE algorithm consists of three steps, which is described more in details in "Locally linear embedding (LLE)" section.
Learn data manifold using the LLE algorithm Compute geodesic distance Define prototype set and create dissimilarity space Dimension reduction using Latent Space Model (LSM) Train the classifiers and evaluate the classification performance In Algorithm 2, first an undirected weighted graph is constructed according to the data manifold obtained by the LLE algorithm. Each new embedded point on the LLE manifold is assumed as a vertex ( y i ) that is connected to the l− nearest neighbors ( e ij ) and Euclidean distance between them is considered as the weight of their connected edge ( w ij ). Now geodesic distance matrix is computed.
First, the geodesic distance matrix is sorted ascending and the k-nearest neighbors of each sample on the manifold are selected to define the prototype set P i of each sample x i (Eq. 9). It should also be noted that the farthest neighbor (α i ) of each sample is also determined.
In step 2, dissimilarity space is created as a matrix DS N * N whose diagonal elements are zero.
As the default value for entries, the distance between each sample and its farthest neighbor (α i ) on the manifold is placed in the matrix DS (Eq. 11). Then, the Euclidean distance of each sample x i with its k-nearest neighbors on the manifold {p i1 . . . . .p ik } is computed and is placed in DS (Eq. 10). It should be noted that we prefer to use the initial samples ( X N * D ) before any dimension reduction, so the dissimilarity space is not constructed using manifold embedding Y N * d .
In algorithm 4, Latent Space Model (LSM) is used to simplify and to reduce the dimensions of dissimilarity space. First, the latent space with N*Q dimensions that N is the number of training samples and Q is lower than N, is initialized randomly with zero mean and unit variance. Then in a loop with the user-defined number of iterations, the Maximum a Posteriori (MAP) method estimates the optimal positions in the latent space Z N * Q . To implement this step, Edward library [37], which is based on Python programming language, is used. Final augmented dissimilarity space is obtained by inner product of DS N * N and Z N * Q as in the following equation (Eq. 12).
Finally, classifiers such as KNN and SVM are trained by the N-by-Q matrix of DS Augmented and classify new queries which are mapped into the augmented space. In order to evaluate the results, accuracy and runtime are considered as classification performance measurements.

Data sets
Since the proposed method is not limited to a particular problem, nine Euclidean and non-Euclidean data sets that are resulting in 10 dissimilarity spaces are used to evaluate the approach (more than one measure and different number of classes and samples are considered in these dissimilarity spaces). Datasets of classification problems in University of California Irvine (UCI) machine learning repository [38] such as iris, glass, wine, breast cancer, monk, ionosphere radar signals and 8 × 8 images of optical recognition of handwritten digits are used in this paper. USPS (United States Postal Services) [39] and MNIST (Mixed National Institute of Standards and Technology) [40] are two other standard data sets for the handwritten digit recognition problem and they are chosen to evaluate our approach. The images in USPS are 16 × 16 inches and they are shown as records with 256 features. Also, Euclidean and two-way tangent [41,42] measures are chosen to construct the dissimilarity spaces of USPS data set. MNIST includes 70,000 handwritten digits which are centered in images with 28 × 28 pixels. The general information about these datasets is indicated in Table 1.

Performance measurement
In this study, accuracy measure, that is the most instinctive performance measure, is used to evaluate the classification performance of the proposed approach and also is reported with the measure of mean and standard deviation in K-fold cross validation as an evaluation method. This measure in Eq. 13 is a ratio of correctly classified samples (t) includes true positives plus true negatives to the total samples (n) [43].
F1-score is another metric which used in this paper as a measure of test's accuracy and calculated based on precision and recall and is shown in Eq. 14. Tp,fp and fn are abbreviations of true positive, false positive and false negative identified results [44].
In addition, the efficiency of the proposed method is also evaluated in terms of runtime. The runtimes have been obtained on a laptop with an Intel ® Core ™ i5-3230 M CPU @ 2.8 GHz with 8 GB RAM.

Evaluation scenarios
Four scenarios are considered for the evaluation of the proposed approach: 1. Evaluating the whole proposed framework New dissimilarity space according to the proposed algorithm using predefined prototype number list for each dataset ( Table 2) is constructed and it will be embedded into the lower dimension latent space. Then classifiers such as KNN with different K values (i.e., 1, 3, 5, 7, 11), linear SVM and polynomial SVM (degree = 2, 3) are used to evaluate the method in terms of accuracy and runtime. 2. Comparing with basic dissimilarity spaces The proposed method is compared in both accuracy and runtime with basic dissimilarity spaces which are defined as follows: • DS_ALL: All samples in the training set are selected as the representatives. In other words, the representative set R is equal to the training set [3]. • DS_Random: K samples are randomly selected as the representatives [25]. • DS_RandomC: From each class C, k random samples are selected as the representatives (K = k × C) [25]. 3. Evaluating the LSM phase In this case the classifiers are trained on the dissimilarity space DS N * N without any dimension reduction and mapping into the latent space. Doing so, the impact of the LSM and dimension reduction on the performance of the classifiers is determined. 4. Comparing with the recent similar works The highest classification performance of the proposed method is compared with three similar recent publications in term   of accuracy on the same datasets. More details about these papers are described in "Comparing with the recent approaches" section. Table 2 shows the values of the required parameters for each dataset including the number of neighbors needed to learn the LLE Manifold (k_LLE), the latent space dimensions (lsm_dimension), the number of iterations of estimation in LSM (lsm_iteration), the number of folds in cross validation (K-fold) evaluation scheme and the number of neighbors of each sample on the manifold that should be considered as prototypes in proposed method (k_neighbor_manifold). It should be noted that in this study the optimal k-nearest neighbors of LLE (k_LLE) is defined with an approach introduced in paper [45] and other parameters are tuned by a trial and error method.

Evaluating the whole proposed framework
As described in previous sections, the Locally Linear Embedding (LLE) algorithm, is learned the data manifold. Figure 3 shows the LLE embedded neighborhood graph with optimal number of neighbors.
The LLE embedded neighborhood graph of MNIST data due to its huge samples is ambiguous so its manifold based on first two features is shown in Fig. 4.
To compute geodesic distance, the undirected weighted graph is constructed based on the obtained LLE manifold. In this graph, as suggested in paper [45] each point is connected to the small fixed number of neighbors ( l = 10). Tables 3 and 4 report the accuracy and runtime of the proposed method on Iris dataset. The numerical value in the columns shows the number of the nearest neighbors of each sample on the manifold which are determined as the prototypes for constructing dissimilarity space (k_neighber_manifold). The results of the Linear SVM shows that the maximum accuracy achieved by our proposed method with comparing each point to only its 20-nearest neighbors on the manifold, is 98% at 0.01s. This is while the accuracy of other classifiers in dissimilarity spaces considering with different number of prototypes are lower than 98. Figure 5 demonstrates the classification error curve of the proposed method versus the number of prototypes of each sample on the manifold, on 10 evaluation data sets and optimal prototype numbers for each dataset are shown in this figure.

Comparing with the basic dissimilarity spaces
In Table 5 the best and the average accuracy of the proposed method on each dataset and in Table 6 the best and the average F1-score are compared with other basic dissimilarity spaces and the initial dataset. The numerical values in the parenthesis demonstrate the optimal number of prototypes and the best classifier on each dataset. It should be noted that the number of random prototypes in DS_Random and DS_RandomC are equals with optimal number of prototypes in proposed method. Figures 6 and 7 demonstrate the comparison results of the proposed method and other dissimilarity spaces based on the best and the average accuracy, respectively.

Evaluating the LSM phase
In this section, the proposed method, is evaluated in terms of accuracy and runtime with and without incorporating the LSM approach. This makes it possible to determine the impact and importance of using the LSM on the classifier's efficiency. In Table 7, the best and average accuracy of the proposed method with and without LSM method are shown on all datasets and indicate the superiority of the proposed    5 Classification error of the proposed method versus the number of prototypes for the evaluation datasets method over the method without LSM in most cases. It is due to the effect of the LSM on reducing the size of the dissimilarity space. Figures 8 and 9 shows the comparison results of the proposed method with and without LSM based on the best and the average accuracy, respectively.
Also, in terms of runtime, as reported in Table 8, the proposed method requires less time for classification. In Figs. 10 and 11 comparison results between two cases are demonstrated more clearly. Since the classification runtime of MNIST dataset is not comparable to other datasets due its large scale, it is not shown in Figs. 10 and 11 and just is reported in Table 8.

Comparing with the recent approaches
In this section, the highest classification accuracy of the proposed method is compared with three latest papers on the same datasets. In 2015, Calvo et al. [12] proposed a twostage algorithm to generate prototypes on structural data using dissimilarity space. First the structural data is embedded into the dissimilarity space and then the common prototype generation methods are applied. In 2017 a measure of ultrametricity is introduced for creating dissimilarity space [36]. Then the performance of KNN classifier on the   transformed space was studied on Iris and Ionosphere datasets. Structured Sparse Dictionary Selection (SSDS) was proposed in 2017 by Wang et al. [24] as a novel prototype selection method. Table 9 shows that the proposed method outperforms the other three latest methods on the same datasets which besides the runtimes reported in Sect. 5.4.3 denotes the impact and effectiveness of the proposed dissimilarity space representation.

Conclusions
The dissimilarity space is a vector space which dimensions are determined by comparing differences between each sample in the training set and all members of the prototype set P. Therefore, prototype selection strategy and the size of the prototype set are so important and effective [1]. However, in previous works, the prototype set is identical for all samples [15,25,26], that is, all of them must necessarily be compared with all the members in the prototype set P. Also the neighboring of each sample and data structure on the manifold are not considered in prototype selection or mapping into dissimilarity space. Therefore, in this paper, a method is proposed to augment the dissimilarity space with manifold learning and increase the classification performance, considering the neighborhood of each sample on the manifold. In the proposed method, first the data manifold is learned by the Linear Locally Embedding algorithm (LLE) [32,34]. Then an undirected weighted graph is constructed according to the data manifold obtained by the LLE algorithm and the geodesic distance matrix is computed. The k-nearest neighbors of each sample on the manifold using the geodesic distance are selected as members of the prototype set P.  Fig. 9 The best classification accuracy of proposed method with and without LSM on evaluation datasets In the proposed process of constructing the dissimilarity space, it is enough to compare each sample with its k-nearest neighbors on the manifold, and not all the members of the prototype set P. This is an effective factor in reducing the computational complexity. In next step the Latent Space Model simplifies and reduces dimensions of  dissimilarity space more and embeds all samples to the Euclidean latent space which has a lower dimensionality than the constructed dissimilarity space [36]. Finally, the SVM and KNN classifiers are trained on the augmented dissimilarity space and the proposed method is evaluated in terms of classification accuracy and runtime. Advantages of the proposed method are as follows: • Creating the prototype set P dynamically and unevenly for all samples, so that each sample is not required to be compared with all the prototypes in the set P, but only compared with its k-nearest neighbors on the manifold and then the dissimilarity space is created. As a result, this has a great effect on reducing the computational complexity. • The proposed complete framework, in comparison with the method without LSM, requires less classification runtime, which indicates the positive effect of the LSM model on complexity reduction of the augmented dissimilarity space. • The proposed framework outperforms the accuracy of the latest similar approaches. • In most scenarios, the linear SVM classifier on the proposed method can achieve the highest accuracy.
The algorithm for estimating and embedding the dissimilarity space into the latent space is time-consuming. Therefore, using alternative algorithms as the future works can improve the runtime of the proposed method. Also, in the LSM method, determining the optimal number of latent space dimensions and the number of iterations for the estimation step will greatly influence the efficiency of the model and classifiers.