Classifying File Contents
Identifying personal data in unstructured files presents unique challenges compared to structured databases. Files like PDFs, documents, and spreadsheets can contain sensitive information in various formats without the clear structure of database tables and columns.
Unstructured Discovery addresses this challenge by analyzing file contents using advanced classification techniques to identify personal information patterns. This helps you:
- Discover personal data hiding in document repositories
- Identify compliance risks in unstructured content
- Map personal data across your entire information ecosystem
- Support data minimization initiatives
Unstructured Discovery employs two primary methods to classify personal data in file contents:
This method uses pattern matching to identify specific data formats within file contents. You can configure your own regular expressions to recognize data patterns specific to your organization:
- Navigate to Data Inventory > Data Categories
- Select a data category
- Add or edit regular expressions for that category

Best practice: Create specific patterns that avoid false positives. For example:
- Good:
/[A-Z0-9a-z._%+-]+@[A-Z0-9a-z.-]+\.[A-Za-z]{2,}/
for email addresses - Problematic:
\d+-\d+-\d+-\d+
for phone numbers (too generic, would match many non-phone numbers)
When the system finds content matching your defined patterns, it classifies that content according to the associated data category.
For more sophisticated identification of personal data, Unstructured Discovery can use Named Entity Recognition:
- Uses AI to identify personal data that may not follow strict patterns
- Runs in your environment for maximum data security
- Can identify names, addresses, IDs, and other personal information in context
- Achieves approximately 70% precision and 71% recall in testing
This approach requires:
- Minimum Classifier version 3.3.1 and Sombra version 7.261.0
- GPU-enabled infrastructure (recommended: AWS
g2.5xlarge
instance)
For deployment details, see our Hosted LLM Classifier documentation.
After scanning and classifying your unstructured data, you'll see results organized by:
- File location: Where the file containing personal data was found
- File type: The format of the file (PDF, DOCX, etc.)
- Data categories: What types of personal data were identified
- Classification method: Which method identified the personal data
This information helps you understand where sensitive data exists in your unstructured content and take appropriate actions for privacy compliance.