Table 5 Sample metrics definition

From: Addressing big data variety using an automated approach for data characterization

| Occurrence | Classification | Metrics definition | Value/add-on contribution to confidence level |
|---|---|---|---|
| Credit Cards | Card | RegEx Identified | 40% |
| | | Linguistic boundary | 20% |
| | | No linguistic boundary | 10% |
| | | Luhn algorithm | 40% |
| | | Exists in institutional BINs | 5% |
| | | Sanitization method | Masking (first six and last three chars) |
| | | Confidence Level | 60% |
| PII | Lists | RegEx Identified | 40% |
| | | Linguistic boundary | 20% |
| | | No linguistic boundary | 10% |
| | | Proximity | 10% |
| | | Sanitization method | Hash |
| | | Confidence Level | 50% |
| PII | Absolute XML | RegEx Identified e.g. (< CIVIL_ID >) | 100% |
| | | Sanitization method | Truncate |
| | | Confidence Level | 50% |
| PII | Relative XML | RegEx Identified e.g. (< *ID* >) | 50% |
| | | Sanitization method | Truncate |
| | | Confidence Level | 50% |
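
To make the arithmetic of the "Credit Cards / Card" entry concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that the table does not state: that the contributions are additive, that "Linguistic boundary" and "No linguistic boundary" are mutually exclusive, that the total is capped at 100, that the "Confidence Level" value is a minimum score for treating a match as a card number, and that masking keeps the first six and last three characters visible. The names `CARD_RE`, `score_card`, `mask_card`, and `known_bins` are hypothetical. Only the percentage values and the Luhn check come from the table.

```python
import re

# Contribution values copied from the "Credit Cards / Card" entry of Table 5.
CARD_WEIGHTS = {
    "regex": 40,
    "linguistic_boundary": 20,
    "no_linguistic_boundary": 10,
    "luhn": 40,
    "institutional_bin": 5,
}
# "Confidence Level" value, read here as a minimum score (assumption).
CARD_THRESHOLD = 60

# Hypothetical pattern: a run of 13 to 19 consecutive digits.
CARD_RE = re.compile(r"\d{13,19}")


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def score_card(text: str, match: re.Match, known_bins: set[str]) -> int:
    """Sum the Table 5 contributions for one regex match (assumed additive)."""
    number = match.group()
    score = CARD_WEIGHTS["regex"]

    # Assumption: a "linguistic boundary" means the match is not glued
    # to a letter or digit on either side.
    before = text[match.start() - 1] if match.start() > 0 else " "
    after = text[match.end()] if match.end() < len(text) else " "
    if before.isalnum() or after.isalnum():
        score += CARD_WEIGHTS["no_linguistic_boundary"]
    else:
        score += CARD_WEIGHTS["linguistic_boundary"]

    if luhn_valid(number):
        score += CARD_WEIGHTS["luhn"]
    if number[:6] in known_bins:  # the first six digits are taken as the BIN
        score += CARD_WEIGHTS["institutional_bin"]
    return min(score, 100)  # cap at 100 (assumption)


def mask_card(number: str) -> str:
    """Keep the first six and last three characters, mask the rest (assumption)."""
    return number[:6] + "*" * (len(number) - 9) + number[-3:]


if __name__ == "__main__":
    sample = "card 4111111111111111 ok"
    for m in CARD_RE.finditer(sample):
        score = score_card(sample, m, known_bins=set())
        if score >= CARD_THRESHOLD:
            print(mask_card(m.group()), score)
```

Under these assumptions, a Luhn-valid number delimited by spaces scores 40 + 20 + 40 = 100, while a number that fails the Luhn check and is attached to a preceding letter scores 40 + 10 = 50 and falls below the 60 threshold.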