Table 4 Dataset properties used in the analysis

From: The non-linear nature of the cost of comprehensibility

| Metafeature name | Description |
|---|---|
| AttrConc (mean) | Concentration coefficient of each pair of distinct attributes |
| AttrEnt (mean) | Shannon's entropy of each predictive attribute |
| AttrToInst | The ratio between the number of attributes and the number of instances |
| C1 | The entropy of class proportions |
| C2 | The imbalance ratio |
| CanCor (mean) | Canonical correlations of the data |
| CatToNum | The ratio between the number of categorical and numeric features |
| ClassConc (mean) | Concentration coefficient between each attribute and the class |
| ClassEnt | Shannon's entropy of the target attribute |
| ClsCoef | Clustering coefficient |
| Cor (mean) | The absolute value of the correlation of distinct dataset column pairs |
| Cov (mean) | The absolute value of the covariance of distinct dataset attribute pairs |
| Density | Average density of the network |
| Eigenvalues (mean) | Eigenvalues of the covariance matrix of the dataset |
| EqNumAttr | Number of attributes equivalent for a predictive task |
| F1 (mean) | Maximum Fisher's discriminant ratio |
| F1v (mean) | Directional-vector maximum Fisher's discriminant ratio |
| F2 (mean) | Volume of the overlapping region |
| F3 (mean) | Maximum individual feature efficiency |
| F4 (mean) | Collective feature efficiency |
| FreqClass (mean) | Relative frequency of each distinct class |
| Gmean (mean) | Geometric mean of each attribute |
| Gravity | Distance between the centers of mass of the minority and majority classes |
| Hmean (mean) | Harmonic mean of each attribute |
| Hubs (mean) | Hub score |
| InstToAttr | The ratio between the number of instances and attributes |
| IqRange (mean) | Interquartile range (IQR) of each attribute |
| JointEnt (mean) | Joint entropy between each attribute and the class |
| Kurtosis (mean) | Kurtosis of each attribute |
| L1 (mean) | Sum of the error distance by linear programming |
| L2 (mean) | Error rate of a linear classifier on one-vs-one (OVO) subsets |
| L3 (mean) | Non-linearity of a linear classifier |
| LhTrace | Lawley-Hotelling trace |
| Lsc | Local set average cardinality |
| Mad (mean) | Median absolute deviation (MAD) of each attribute, adjusted by a factor |
| Max (mean) | Maximum value of each attribute |
| Mean (mean) | Mean value of each attribute |
| Median (mean) | Median value of each attribute |
| Min (mean) | Minimum value of each attribute |
| MutInf (mean) | Mutual information between each attribute and the target |
| N1 | Fraction of borderline points |
| N2 (mean) | Ratio of intra- and extra-class nearest-neighbor distance |
| N3 (mean) | Error rate of the nearest-neighbor classifier |
| N4 (mean) | Non-linearity of the k-NN classifier |
| NrAttr | Total number of attributes |
| NrBin | Number of binary attributes |
| NrCat | Number of categorical attributes |
| NrClass | Number of distinct classes |
| NrCorAttr | Number of distinct highly correlated pairs of attributes |
| NrDisc | Number of canonical correlations between each attribute and the class |
| NrInst | Number of instances (rows) in the dataset |
| NrNorm | Number of attributes normally distributed according to a given method |
| NrNum | Number of numeric features |
| NrOutliers | Number of attributes with at least one outlier value |
| NsRatio | Noisiness of the attributes |
| NumToCat | The ratio between the number of numeric and categorical features |
| Ptrace | Pillai's trace |
| Range (mean) | Range (max - min) of each attribute |
| RoyRoot | Roy's largest root |
| Sd (mean) | Standard deviation of each attribute |
| SdRatio | Statistical test for the homogeneity of covariances |
| Skewness (mean) | Skewness of each attribute |
| Sparsity (mean) | (Possibly normalized) sparsity metric of each attribute |
| T1 (mean) | Fraction of hyperspheres covering the data |
| T2 | Average number of features per dimension |
| T3 | Average number of PCA dimensions per point |
| T4 | Ratio of the PCA dimension to the original dimension |
| TMean (mean) | Trimmed mean of each attribute |
| Var (mean) | Variance of each attribute |
| WLambda | Wilks' Lambda value |
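The metafeature names above match those produced by the pymfe (Python Meta-Feature Extractor) library; the table itself does not state which tool was used to compute them. As a minimal sketch only, assuming pymfe and scikit-learn's iris dataset as a stand-in for the datasets analysed in the paper, the listed metafeature groups could be extracted as follows (the "(mean)" suffix corresponds to summarizing per-attribute values by their mean):

```python
# Illustrative sketch (not the authors' code): extract dataset metafeatures
# with pymfe, summarizing per-attribute values by their mean.
import pandas as pd
from pymfe.mfe import MFE
from sklearn.datasets import load_iris

# Stand-in dataset; any (X, y) classification dataset works here.
X, y = load_iris(return_X_y=True)

# The four groups below cover the metafeatures listed in Table 4.
mfe = MFE(
    groups=["general", "statistical", "info-theory", "complexity"],
    summary=["mean"],
)
mfe.fit(X, y)
names, values = mfe.extract()  # two parallel lists: names and values

meta = pd.Series(values, index=names)
print(meta.head(10))
```

Note that pymfe reports names in lowercase with the summary as a suffix (e.g. attr_conc.mean for "AttrConc (mean)"), while scalar metafeatures such as nr_attr carry no suffix.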