From: Addressing big data variety using an automated approach for data characterization
Occurrence | Classification | Metrics definition | Value/add-on contribution to confidence level |
---|---|---|---|
Credit Cards | Card | RegEx Identified | 40% |
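Credit Cards | Card | Linguistic boundary | 20% |
Credit Cards | Card | No linguistic boundary | 10% |
Credit Cards | Card | Luhn algorithm | 40% |
Credit Cards | Card | Exists in institutional BINs | 5% |
Credit Cards | Card | Sanitization method | Masking (first six and last three chars) |
Credit Cards | Card | Confidence Level | 60% |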
PII | Lists | RegEx Identified | 40% |
PII | Lists | Linguistic boundary | 20% |
PII | Lists | No linguistic boundary | 10% |
PII | Lists | Proximity | 10% |
PII | Lists | Sanitization method | Hash |
PII | Lists | Confidence Level | 50% |
PII | Absolute XML | RegEx Identified e.g. (< CIVIL_ID >) | 100% |
PII | Absolute XML | Sanitization method | Truncate |
PII | Absolute XML | Confidence Level | 50% |
PII | Relative XML | RegEx Identified e.g. (< *ID* >) | 50% |
PII | Relative XML | Sanitization method | Truncate |
PII | Relative XML | Confidence Level | 50% |
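
To make the credit-card rows of the table concrete, the following Python sketch shows one way the listed contributions might be combined into a score. It is an illustration under stated assumptions, not the authors' implementation: the table gives only the weights, so the additive combination, the cap at 100, the digit-run pattern `CARD_RE`, the boundary test, and the function names `luhn_valid`, `card_confidence`, and `mask` are all hypothetical. The Luhn check itself is the standard checksum.

```python
import re

# Contributions copied from the "Credit Cards / Card" rows of the table.
WEIGHTS = {
    "regex": 40,
    "linguistic_boundary": 20,
    "no_linguistic_boundary": 10,
    "luhn": 40,
    "institutional_bin": 5,
}
THRESHOLD = 60  # "Confidence Level" row for credit cards

# Hypothetical pattern (13 to 19 digits); the paper's actual RegEx is not shown.
CARD_RE = re.compile(r"\d{13,19}")


def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def card_confidence(text, known_bins=frozenset()):
    """Return a list of (candidate, score) for each digit run matching CARD_RE."""
    results = []
    for m in CARD_RE.finditer(text):
        cand = m.group()
        score = WEIGHTS["regex"]
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        # Assumption: a "linguistic boundary" means no letter or digit on either side.
        bounded = not before.isalnum() and not after.isalnum()
        score += WEIGHTS["linguistic_boundary" if bounded else "no_linguistic_boundary"]
        if luhn_valid(cand):
            score += WEIGHTS["luhn"]
        if cand[:6] in known_bins:  # first six digits taken as the BIN
            score += WEIGHTS["institutional_bin"]
        results.append((cand, min(score, 100)))  # assumption: scores are capped at 100
    return results


def mask(number: str) -> str:
    """Mask the first six and last three characters (one reading of the table)."""
    if len(number) <= 9:
        return "*" * len(number)
    return "*" * 6 + number[6:-3] + "*" * 3
```

Under these assumptions, a well-delimited, Luhn-valid number reaches 40 + 20 + 40 = 100 and would be sanitized, while a digit run embedded in other characters that fails the Luhn check scores 40 + 10 = 50 and stays below the 60% threshold.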