This section describes the proposed method in detail. First, the general concepts underlying the method are reviewed, and then the proposed semi-supervised feature selection method itself is introduced.
Background and notation
Before presenting the algorithm, let us review the definitions and concepts on which it is built.
Neighborhoods and pairwise constraints
Laplacian ranking is the basis of unsupervised feature selection methods, including feature selection with pairwise constraints; it selects the features that are strongest in terms of preserving the local structure of the data. The key assumption behind Laplacian feature selection is that data points belonging to the same class are close together and more similar to each other. The Laplacian ranking \( L_{r} \) of the r-th feature, which should be minimized, is expressed by Eq. (1):
$$ \left[ \begin{aligned} L_{\text{r}} &= \frac{{\mathop \sum \nolimits_{{{\text{i}},{\text{j}}}} \left( {{\text{f}}_{\text{ri}} - {\text{f}}_{\text{rj}} } \right)^{2} {\text{S}}_{\text{ij}} }}{{\mathop \sum \nolimits_{\text{i}} \left( {{\text{f}}_{\text{ri}} - \mu_{\text{r}} } \right)^{2} {\text{D}}_{\text{ii}} }} \hfill \\ {\text{D}}_{\text{ii}} &= \sum\nolimits_{j} {S_{ij} } \hfill \\ S_{ij} &= \left\{ \begin{aligned} & e^{{ - \frac{{\left\| {x_{i} - x_{j} } \right\|^{2} }}{t}}} ,\quad {\text{if}}\;x_{i} \;{\text{and}}\;x_{j} \;{\text{are neighbors}} \hfill \\ & 0,\quad {\text{otherwise}} \hfill \\ \end{aligned} \right. \hfill \\ \end{aligned} \right] $$
(1)
Here, \( S_{ij} \) expresses the neighborhood relationship between data points, and t is a fixed constant that is initialized in advance; being neighbors means that \( x_{i} \) reaches \( x_{j} \) through the k nearest neighbors, and a neighborhood can capture various notions, such as the similarity of data points to one another. The ranking expressed above is unsupervised and uses no information other than the data set itself. This article uses concepts such as Laplacian ranking and neighborhoods and, under the assumption that pairwise constraints are available as ML (must-link) and CL (cannot-link) sets, attempts to select and rank appropriate features. First, all ML and CL pairs are prepared together with the data set; then, using Eq. (2) and the notion of neighborhoods, the features are ranked.
$$ \left[ \begin{aligned} C_{\text{r}}^{1} &= \frac{{\mathop \sum \nolimits_{{ ( {\text{x}}_{\text{i }} ,{\text{x}}_{\text{j}} ) \in {\text{CL}}}} \left( {{\text{f}}_{\text{ri}} - {\text{f}}_{\text{rj}} } \right)^{2} }}{{\mathop \sum \nolimits_{{ ( {\text{x}}_{\text{i }} ,{\text{x}}_{\text{j}} ) \in {\text{ML}}}} \left( {{\text{f}}_{\text{ri}} - {\text{f}}_{\text{rj}} } \right)^{2} }} \hfill \\ C_{\text{r}}^{2} &= \left( {\mathop \sum \nolimits_{{\left( {{\text{x}}_{\text{i }} ,{\text{x}}_{\text{j}} } \right) \in {\text{CL}}}} \left( {{\text{f}}_{\text{ri}} - {\text{f}}_{\text{rj}} } \right)^{2} } \right) - \lambda \left( {\mathop \sum \nolimits_{{\left( {{\text{x}}_{\text{i }} ,{\text{x}}_{\text{j}} } \right) \in {\text{ML}}}} \left( {{\text{f}}_{\text{ri}} - {\text{f}}_{\text{rj}} } \right)^{2} } \right) \hfill \\ \end{aligned} \right] $$
(2)
Here, \( C_{\text{r}}^{1} \) and \( C_{\text{r}}^{2} \) represent two types of ranking based on the pairwise constraints. In effect, the features selected are those with the best ability to preserve the constraints: if two samples belong to the ML set, a relevant feature should take values that are close together for them, whereas if the two samples belong to the CL set, a relevant feature should take values that are far apart. In the following, both rankings are calculated for every feature, and feature selection is performed using the maximum of the two ranking values.
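For illustration, a minimal NumPy sketch of the two constraint-based scores in Eq. (2) is given below; the function name constraint_scores and the representation of ML and CL as lists of index pairs are illustrative assumptions rather than part of the proposed method.

```python
import numpy as np

def constraint_scores(X, ML, CL, lam=1.0):
    """Two pairwise-constraint rankings of Eq. (2) for every feature.

    X  : (n_samples, n_features) data matrix
    ML : list of (i, j) index pairs that must lie in the same cluster
    CL : list of (i, j) index pairs that cannot lie in the same cluster
    lam: trade-off coefficient (lambda in Eq. (2))
    """
    # Feature-wise sums of squared differences over the CL and ML pairs
    cl_diff = np.array([(X[i] - X[j]) ** 2 for i, j in CL]).sum(axis=0)
    ml_diff = np.array([(X[i] - X[j]) ** 2 for i, j in ML]).sum(axis=0)

    c1 = cl_diff / (ml_diff + 1e-12)   # ratio form C_r^1
    c2 = cl_diff - lam * ml_diff       # penalised difference form C_r^2
    return c1, c2

# As described in the text, features can then be ranked by the larger of the two scores:
# ranking = np.argsort(-np.maximum(c1, c2))
```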
In general, if \( \left\{ {x_{i} ,x_{j} ,x_{k} } \right\} \) are three data points of the data set, each pairwise relationship is drawn from \( \left\{ {{\text{ML}}, {\text{CL}}} \right\} \), and the clustering label is denoted by lab, then the relations in Eq. (3) must hold. Neighborhoods can be formed by taking the closure of the pairwise constraints.
$$ \left[ \begin{aligned} \left( {x_{i} ,x_{j} ,{\text{ML}}} \right) \wedge \left( {x_{i} ,x_{k} ,{\text{ML}}} \right) \Rightarrow \left( {x_{j} ,x_{k} ,{\text{ML}}} \right) \hfill \\ \left( {x_{i} ,x_{j} ,{\text{ML}}} \right) \wedge \left( {x_{i} ,x_{k} ,{\text{CL}}} \right) \Rightarrow \left( {x_{j} ,x_{k} ,{\text{CL}}} \right) \hfill \\ \left( {x_{i} ,x_{j} ,{\text{ML}}} \right) \Leftrightarrow lab_{i} = lab_{j} ,\;{\text{in the same cluster}} \hfill \\ \left( {x_{i} ,x_{j} ,{\text{CL}}} \right) \Leftrightarrow lab_{i} \ne lab_{j} ,\;{\text{not in the same cluster}} \hfill \\ \end{aligned} \right] $$
(3)
Neighborhoods form a collection of groups whose number is usually smaller than or equal to the number of clusters defined in the algorithm. Each neighborhood contains several samples that must belong to the same cluster. The basic premise is that data belonging to different clusters should be placed in different neighborhoods, and no two neighborhoods should contain data from the same cluster.
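A possible sketch of forming neighborhoods from the closure of the must-link constraints (first rule of Eq. (3)) is shown below; the union-find implementation and the function name build_neighborhoods are illustrative choices, not prescribed by the paper.

```python
def build_neighborhoods(n_samples, ML):
    """Group samples into neighborhoods: samples linked directly or transitively
    by must-link constraints end up in the same neighborhood."""
    parent = list(range(n_samples))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in ML:                   # every ML pair merges two neighborhoods
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n_samples):
        groups.setdefault(find(i), []).append(i)
    # keep only neighborhoods that are actually constrained (size > 1)
    return [members for members in groups.values() if len(members) > 1]
```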
Measuring the uncertainty of constraints
In the real world, constraints arise from domain or expert knowledge. Pairwise constraints therefore have relationships of varying strength, and the uncertainty of these relations differs from pair to pair. Hence, an uncertainty region needs to be identified; once this region is found, it can be exploited in the ranking to obtain better results in the reduced dimensions. To do this, the authors use a histogram thresholding method. This method effectively acts as a two-class classifier whose purpose is to reduce ambiguity in the range of similarity values. First, the similarity values of every pair in the \( S_{en} \) matrix are collected; these values are divided into intervals, and the mean of each interval is recorded as \( D_{i} \). Next, for each interval, the number of pairs falling in that range is counted as \( g(D_{i}) \). From these values, a weighted moving average with a window of five, \( f(D_{i}) \), is calculated by Eq. (4). The authors then scan the intervals from the beginning and find the first valley point of the modified histogram, \( f\left( {D_{v} } \right) \). Finally, the uncertainty region is calculated; a code sketch of these steps follows Eq. (7).
Step 1:
$$ f\left( {D_{i} } \right) = \frac{{g\left( {D_{i} } \right)}}{{\mathop \sum \nolimits_{e = 1}^{z - 1} g\left( {D_{e} } \right)}} \times \frac{{g\left( {D_{i - 2} } \right) + g\left( {D_{i - 1} } \right) + g\left( {D_{i} } \right) + g\left( {D_{i + 1} } \right) + g\left( {D_{i + 2} } \right)}}{5} ,\quad \forall i = 2,3, \ldots ,z - 3 $$
(4)
Step 2: find the first valley point \( D_{v} \) subject to:
$$ f\left( {D_{v - 1} } \right) > f\left( {D_{v} } \right)\; {\text{and}}\; f\left( {D_{v} } \right) < f\left( {D_{v + 1} } \right) $$
(5)
Step 3: find the boundaries of the uncertainty region:
$$ m_{d} = D_{v} \;{\text{and}}\;m_{c} = \hbox{max} (D_{i} ) - m_{d} $$
(6)
Step 4: find the pairs in the similarity matrix that have an uncertain relationship:
$$ {\text{Similarity matrix}}\;S_{enij} : \left\{ \begin{aligned} & {\text{uncertainty region}},&&\;{\text{if}}\;m_{d} \le S_{enij} \le m_{c} \hfill \\ & {\text{strong region}},&&\;{\text{otherwise}} \hfill \\ \end{aligned} \right.\quad \forall i,j $$
(7)
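The following is a hedged sketch of steps 1-4, assuming the pairwise similarities of the \( S_{en} \) matrix have already been collected into a flat array; the number of intervals, the handling of the normalizer in Eq. (4), and the variable names are illustrative assumptions.

```python
import numpy as np

def uncertainty_region(sen_values, n_bins=20):
    """Locate the uncertainty boundaries (m_d, m_c) from the histogram of
    pairwise similarity values, following steps 1-4 (Eqs. (4)-(7))."""
    counts, edges = np.histogram(sen_values, bins=n_bins)   # g(D_i) per interval
    centers = (edges[:-1] + edges[1:]) / 2.0                # interval means D_i

    # Step 1 (Eq. (4)): normalised counts smoothed with a five-point moving average
    total = counts.sum()
    f = np.array([counts[i] / total * counts[i - 2:i + 3].mean()
                  for i in range(2, n_bins - 2)])

    # Step 2 (Eq. (5)): first valley point of the smoothed histogram
    v = 1
    while v < len(f) - 1 and not (f[v - 1] > f[v] < f[v + 1]):
        v += 1

    # Step 3 (Eq. (6)): boundaries of the uncertainty region
    m_d = centers[v + 2]              # shift back to the original interval index
    m_c = centers.max() - m_d

    # Step 4 (Eq. (7)): a pair (i, j) is "uncertain" if m_d <= Sen[i, j] <= m_c
    return m_d, m_c
```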
Weights of the terms obtained
Since each feature has a certain weight and importance, and not all features may be required for the machine learning task, the first step is to determine the weight of each feature. For this purpose, the Laplacian Score (LS) is used. LS is an unsupervised univariate filter method based on the observation that if two data points are close to each other, they are likely to belong to the same class. The basic idea of LS is to evaluate the relevance of a feature according to its power of locality preservation. The LS of a feature \( A \) is determined using Eq. (8):
$$ LS\left( {S, A} \right) = \frac{{\sum\nolimits_{i,j} {\left( {A(i) - A(j)} \right)^{2} S_{ij} } }}{{\sum\nolimits_{i} {\left( {A(i) - \bar{A}} \right)^{2} D_{ii} } }} $$
(8)
where A(i) represents the value of the feature A for the \( i \)-th pattern, \( \bar{A} \) denotes the mean of the feature A, D is a diagonal matrix such that \( D_{ii} = \sum\nolimits_{j} {S_{ij} } \), and \( S_{ij} \) represents the neighborhood relation between patterns, calculated as in Eq. (9):
$$ S_{ij} = \left\{ \begin{aligned} &e^{{ - \frac{{\left\| {x_{i} - x_{j} } \right\|^{2} }}{t}}} , &&\quad if\; x_{i} \; {\text{and}}\;x_{j} \;{\text{are neighbors}} \hfill \\ &0, &&\quad otherwise \hfill \\ \end{aligned} \right. $$
(9)
where \( t \) is a suitable constant, \( x_{i} \) represents the i-th pattern, and \( x_{i} \) and \( x_{j} \) are neighbors if \( x_{i} \) is among the \( k \) nearest neighbors of \( x_{j} \) or \( x_{j} \) is among the \( k \) nearest neighbors of \( x_{i} \).
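For illustration, a minimal NumPy sketch of the Laplacian Score of Eqs. (8)-(9) is given below; the symmetric k-nearest-neighbour construction and the function name laplacian_score are implementation assumptions.

```python
import numpy as np

def laplacian_score(X, k=5, t=1.0):
    """Laplacian Score of every feature (Eqs. (8)-(9)): smaller scores indicate
    features that better preserve locality."""
    n, d = X.shape
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise ||x_i - x_j||^2

    # symmetric k-NN graph: i ~ j if either point is among the other's k nearest neighbours
    knn = np.argsort(sq_dist, axis=1)[:, 1:k + 1]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), knn.ravel()] = True
    adj |= adj.T

    S = np.where(adj, np.exp(-sq_dist / t), 0.0)               # Eq. (9)
    D = S.sum(axis=1)                                          # D_ii = sum_j S_ij

    scores = np.empty(d)
    for r in range(d):
        f = X[:, r]
        num = ((f[:, None] - f[None, :]) ** 2 * S).sum()       # sum_ij (A(i) - A(j))^2 S_ij
        den = ((f - f.mean()) ** 2 * D).sum()                  # sum_i  (A(i) - Abar)^2 D_ii
        scores[r] = num / den if den > 0 else np.inf
    return scores
```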
The proposed PCFS algorithm
In this section, a novel Pairwise Constraint Feature Selection method (PCFS) is proposed. This method builds on PCK-means, one of the soft-constraint clustering algorithms, with small but effective changes. By modifying the objective function, the proposed method combines the standard clustering objective with a penalty for violating constraints; these two parts together constitute the objective function, which is minimized locally. The proposed Dim-reduce() function is affected by the current clustering, and vice versa.
Briefly, the data set is embedded as a data-term matrix, and the remaining variables are initialized. The whole procedure is repeated in a loop until the clusters no longer change (or until a predefined number of iterations is reached). In each iteration, given the current clustering and the sets of ML and CL constraints, Dim-reduce() is applied to produce a reduced feature set (line 2). Next, neighborhoods are formed from the closure of the pairwise constraints, and the center of each neighborhood is calculated. If a neighborhood does not contain any data, a data point that is not a member of the other neighborhoods is randomly selected as the center of that cluster. Finally, the cluster centers are initialized with the neighborhood centers (lines 3-6). For assigning points to clusters and estimating (updating) the cluster centers, sections A and B are performed (lines 8-9). These two steps are repeated until convergence, as in PCK-means. After convergence, the whole procedure is repeated until the stopping conditions are met. The Dim-reduce() function is the core of PCFS and is summarized in Algorithm 2; in addition to the usual inputs of feature selection, it takes the pairwise constraints as input.
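As a small illustration of the center-initialization step described above (lines 3-6), the following sketch seeds the cluster centers from the neighborhood means and falls back to random points outside all neighborhoods; the fallback reflects one reading of the text, and the function name and arguments are hypothetical.

```python
import numpy as np

def init_centers_from_neighborhoods(X, neighborhoods, n_clusters, seed=None):
    """Seed cluster centers with the mean of each constraint neighborhood; if there are
    fewer usable neighborhoods than clusters, fill the remaining centers with randomly
    drawn points that do not belong to any neighborhood."""
    rng = np.random.default_rng(seed)
    centers = [X[members].mean(axis=0) for members in neighborhoods[:n_clusters]]

    used = {i for members in neighborhoods for i in members}
    free = [i for i in range(len(X)) if i not in used]
    while len(centers) < n_clusters and free:
        idx = free.pop(int(rng.integers(len(free))))   # random point outside all neighborhoods
        centers.append(X[idx])
    return np.vstack(centers)
```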
There are two main functions in this algorithm: Sen-func(), given in Algorithm 3, and Str-unc(), given in Algorithm 4. The first function extracts the matrix of similarities between data pairs; the second then calculates the uncertainty region and the strength of the relationship of each pair. After these two functions are evaluated within the iterative process, the authors rank the features by Eq. (10). The loop is repeated until the selected features no longer change.
$$ C_{b} = \frac{{\mathop \sum \nolimits_{{\left( {x_{i} ,x_{j} } \right) \in ML}} \left( {f_{bi} - f_{bj} } \right)^{2} \times S_{trij} + \frac{{\left( {1 - S_{enij} } \right)}}{{\mathop \sum \nolimits_{{\left( {x_{k} ,x_{z} } \right) \in ML}} (1 - S_{enkz} )}} \times \left( {1 - S_{trij} } \right)}}{{\mathop \sum \nolimits_{{\left( {x_{i} ,x_{j} } \right) \in CL}} \left( {f_{bi} - f_{bj} } \right)^{2} \times S_{trij} + \frac{{\left( {1 - S_{enij} } \right)}}{{\mathop \sum \nolimits_{{\left( {x_{k} ,x_{z} } \right) \in CL}} (1 - S_{enkz} )}} \times \left( {1 - S_{trij} } \right)}} $$
(10)
Here, Strij indicates the quality (strength) of the relationship of each data pair, and each element of this matrix is calculated from the uncertainty region. For ranking the features, the formula assumes that if the strength of a pair (in the set of pairwise constraints) is low, the similarity matrix is mostly used; otherwise (when the pairwise relationship is reliable and strong), the Minkowski distance is used. In effect, strength and quality are incorporated into the formula, and thereby better results can be obtained. The calculation of the similarity matrix is summarized in Algorithm 3. First, the cluster assignments are used as labels for the data set (lines 3-6). Then a classification model is trained on the data set with the labels produced by the clustering (line 8). Within the iterative process, a similarity matrix based on the labels predicted by the classification model is created; during the iterations, this similarity matrix is updated and normalized.
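To make Eq. (10) concrete, a sketch of computing the score \( C_{b} \) for a single feature b is given below, assuming the similarity matrix Sen and the strength matrix Str have already been computed over all data pairs; the small constants added to the denominators and the function name are illustrative assumptions.

```python
def constraint_rank(X, b, ML, CL, Sen, Str):
    """Score C_b of feature b per Eq. (10), given the similarity matrix Sen and the
    strength matrix Str (both indexed by sample pairs)."""
    def term(pairs):
        # normaliser over the whole constraint set: sum_{(k,z)} (1 - Sen[k, z])
        norm = sum(1.0 - Sen[k, z] for k, z in pairs) + 1e-12
        total = 0.0
        for i, j in pairs:
            dist_part = (X[i, b] - X[j, b]) ** 2 * Str[i, j]          # strong pairs: feature distance
            sim_part = (1.0 - Sen[i, j]) / norm * (1.0 - Str[i, j])   # uncertain pairs: similarity
            total += dist_part + sim_part
        return total

    return term(ML) / (term(CL) + 1e-12)
```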
Finally, the computation of the strength matrix Str and the uncertainty region is summarized in Algorithm 4. After the uncertainty region is found (line 3), the Str matrix is calculated: for data pairs that fall inside the uncertainty region, the relative strength is set to β, and outside this range it is set to 1 − β. The parameter β was chosen after several preliminary runs, and its value is empirically set to 0.3.
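A minimal sketch of this strength matrix, using the uncertainty boundaries of Eq. (6) and the empirically chosen β = 0.3, could look as follows; the function name strength_matrix is illustrative.

```python
import numpy as np

def strength_matrix(Sen, m_d, m_c, beta=0.3):
    """Str[i, j] = beta for pairs whose similarity lies inside the uncertainty region
    [m_d, m_c], and 1 - beta for pairs in the strong region (Algorithm 4 as described)."""
    uncertain = (Sen >= m_d) & (Sen <= m_c)
    return np.where(uncertain, beta, 1.0 - beta)
```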