Classifying File Contents

Understanding Unstructured Data Classification

Identifying personal data in unstructured files presents unique challenges compared to structured databases. Files like PDFs, documents, and spreadsheets can contain sensitive information in various formats without the clear structure of database tables and columns.

Unstructured Discovery addresses this challenge by analyzing file contents with pattern matching and Named Entity Recognition to identify personal information. This helps you:

Discover personal data hiding in document repositories
Identify compliance risks in unstructured content
Map personal data across your entire information ecosystem
Support data minimization initiatives

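Classification Methods

Unstructured Discovery employs two primary methods to classify personal data in file contents: regular expression content matching and Named Entity Recognition (NER).

1. Regular Expression Content Matching

Regular expression content matching uses pattern matching to identify specific data formats within file contents. You can configure your own regular expressions to recognize data patterns specific to your organization: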

Navigate to Data Inventory > Data Categories
Select a data category
Add or edit regular expressions for that category

Figure: Adding custom regular expressions to help with classification

Best practice: Create specific patterns that avoid false positives. For example:

Good: /[A-Z0-9a-z._%+-]+@[A-Z0-9a-z.-]+\.[A-Za-z]{2,}/ for email addresses
Problematic: /\d+-\d+-\d+-\d+/ for phone numbers (too generic: it matches any four hyphen-separated groups of digits, so many non-phone numbers would be classified as phone numbers)
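
To see why the generic pattern is risky, both expressions can be tried against a short piece of text. The following is a minimal sketch using Python's `re` module; the sample string is made up for illustration, and the regex engine used by the product may differ in its details.

```python
import re

# Patterns taken from the examples above, written without any surrounding slashes.
EMAIL = re.compile(r"[A-Z0-9a-z._%+-]+@[A-Z0-9a-z.-]+\.[A-Za-z]{2,}")
GENERIC_PHONE = re.compile(r"\d+-\d+-\d+-\d+")

# Hypothetical sample text: one email address, one phone number, one part number.
sample = "Contact jane.doe@example.com or 1-555-010-9999. Part no. 2024-11-05-7."

emails = EMAIL.findall(sample)
phone_hits = GENERIC_PHONE.findall(sample)

print(emails)      # only the email address is matched
print(phone_hits)  # the part number is matched alongside the phone number
```

The email pattern matches only the address, while the generic pattern also picks up the part number, which is the kind of false positive a more specific pattern avoids.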

When the system finds content matching your defined patterns, it classifies that content according to the associated data category.
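
The mapping from patterns to data categories can be pictured roughly as follows. This is an illustrative sketch only, not the product's implementation: the category names, the `CATEGORY_PATTERNS` table, and the `classify` function are all hypothetical.

```python
import re

# Hypothetical mapping of data categories to their configured patterns.
CATEGORY_PATTERNS = {
    "EMAIL": [re.compile(r"[A-Z0-9a-z._%+-]+@[A-Z0-9a-z.-]+\.[A-Za-z]{2,}")],
    "US_SSN": [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")],
}


def classify(text):
    """Return the sorted categories that have at least one pattern matching the text."""
    return sorted(
        category
        for category, patterns in CATEGORY_PATTERNS.items()
        if any(pattern.search(text) for pattern in patterns)
    )


print(classify("SSN 123-45-6789, mail x@y.io"))
```

A file whose contents match patterns from several categories is reported under each of them; text that matches nothing yields an empty list.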

2. Named Entity Recognition (NER)

For more sophisticated identification of personal data, Unstructured Discovery can use Named Entity Recognition (NER):

Uses AI to identify personal data that may not follow strict patterns
Runs in your environment for maximum data security
Can identify names, addresses, IDs, and other personal information in context
Achieves approximately 70% precision and 71% recall in testing
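
For readers less familiar with these metrics: precision is the share of flagged items that really are personal data, and recall is the share of real personal data that was flagged. The sketch below shows the arithmetic; the counts are hypothetical, chosen only to land near the figures quoted above, and are not the actual test data.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from raw counts, guarding against division by zero."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: 71 correct detections, 30 false alarms, 29 missed items.
precision, recall = precision_recall(
    true_positives=71, false_positives=30, false_negatives=29
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In practical terms, figures at this level mean roughly three in ten flagged items are false alarms and roughly three in ten real items go undetected, so NER results benefit from review.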

This approach requires:

Classifier version 3.3.1 or later and Sombra™ version 7.261.0 or later
GPU-enabled infrastructure (recommended: AWS g2.5xlarge instance)
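
When checking a deployment against these minimum versions, compare the version components numerically rather than as strings, since "3.10.0" sorts before "3.3.1" as text. The helper below is a sketch under that assumption; the `installed` dictionary and its keys are hypothetical, and how you look up your installed versions is not covered here.

```python
# Minimum versions as stated in the requirements above.
MINIMUMS = {"classifier": "3.3.1", "sombra": "7.261.0"}


def parse_version(version):
    """Turn a dotted version string such as '7.261.0' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def meets_minimums(installed):
    """Return True if every component in MINIMUMS is at or above its minimum version."""
    return all(
        parse_version(installed[name]) >= parse_version(minimum)
        for name, minimum in MINIMUMS.items()
    )


# Hypothetical installed versions, for illustration only.
print(meets_minimums({"classifier": "3.10.0", "sombra": "7.261.0"}))
```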

For deployment details, see our Hosted LLM Classifier documentation.

Classification Results

After scanning and classifying your unstructured data, you'll see results organized by:

File location: Where the file containing personal data was found
File type: The format of the file (PDF, DOCX, etc.)
Data categories: What types of personal data were identified
Classification method: Which method identified the personal data

This information helps you understand where sensitive data exists in your unstructured content and take appropriate actions for privacy compliance.
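
If you work with these results outside the product, it can help to model each result as a small record and group the records by data category. The sketch below assumes a simple in-memory representation: the `ClassificationResult` type, its field names, the method labels, and the sample paths are all hypothetical and do not reflect an actual export format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationResult:
    file_location: str          # where the file was found
    file_type: str              # e.g. "PDF" or "DOCX"
    data_categories: tuple      # categories identified in the file
    classification_method: str  # assumed labels: "REGEX" or "NER"


def locations_by_category(results):
    """Group file locations under each data category found in them."""
    grouped = defaultdict(list)
    for result in results:
        for category in result.data_categories:
            grouped[category].append(result.file_location)
    return dict(grouped)


# Hypothetical results, for illustration only.
results = [
    ClassificationResult("hr/offer.pdf", "PDF", ("NAME", "EMAIL"), "NER"),
    ClassificationResult("sales/leads.docx", "DOCX", ("EMAIL",), "REGEX"),
]
grouped = locations_by_category(results)
print(grouped)
```

Grouping by category makes it straightforward to answer questions such as "which files contain email addresses?" when planning remediation or data minimization work.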
