Classifying File Contents

In the classification phase, Unstructured Discovery labels the data inside each file with data categories (the same categories found in your Data Inventory). The system can identify a wide range of personal data types, giving you a detailed overview of the personal data held in the data store.

We currently employ two methods for classifying unstructured data:

  1. Regular Expression Content Matching
  2. Named Entity Recognition (if available)

You can define custom regular expressions to help with classification. Navigate to the "Inventory" tab in the "Data Inventory" view, where you can add, edit, and delete custom regular expressions for each data category.

Adding custom regular expressions to help with classification

Regular expressions should be defined with a goal of high specificity. That is, a regular expression that catches some but not all cases is better than one that is overly broad and mislabels data. For example, /[A-Z0-9a-z._%+-]+@[A-Z0-9a-z.-]+\.[A-Za-z]{2,}/ is a good regular expression for email addresses: nearly anything it matches has the shape of an email address, and it is very unlikely to match anything that is not one. On the other hand, \d+-\d+-\d+-\d+ would not be a good regular expression for phone numbers: while it matches many phone numbers, it also matches many things that are not phone numbers, such as hyphen-separated order or serial numbers.
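To make the specificity point concrete, here is a minimal Python sketch that runs both patterns from the paragraph above over a few sample strings. The sample strings and the helper name `find_matches` are invented for illustration; they are not part of the product.

```python
import re

# Both patterns are copied verbatim from the text above.
EMAIL_RE = re.compile(r"[A-Z0-9a-z._%+-]+@[A-Z0-9a-z.-]+\.[A-Za-z]{2,}")
LOOSE_PHONE_RE = re.compile(r"\d+-\d+-\d+-\d+")

# Hypothetical sample values, made up for this illustration.
SAMPLES = [
    "Contact: jane.doe@example.com",
    "Call 1-555-867-5309",
    "Order 2024-01-15-0042 shipped",
    "Build 10-0-19045-3803",
]


def find_matches(pattern, samples):
    """Return every substring in `samples` that `pattern` matches."""
    return [m.group(0) for s in samples for m in pattern.finditer(s)]


email_hits = find_matches(EMAIL_RE, SAMPLES)
phone_hits = find_matches(LOOSE_PHONE_RE, SAMPLES)

print("email pattern matched:", email_hits)
print("loose phone pattern matched:", phone_hits)
```

On these samples the email pattern matches only the email address, while the loose phone pattern also matches the order number and the build string. That is the kind of mislabeling a high-specificity pattern avoids.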

The custom regular expressions you define here will be used to help with classification in Structured Discovery. If a column in your data silo matches a custom regular expression, it will be classified with the data category you have defined in the "Data Inventory" view.

The results appear in the "Datapoints" view in Structured Discovery, with the classification method labeled "Regex Matching".

Named entity recognition, or NER, uses the same pod as the Large Language Model for content classification of structured data. Like Sombra, this pod runs in the customer's environment, so, like the regex classifier, it can classify the content of the objects being scanned. This approach requires a minimum Classifier version of 3.3.1 and a minimum Sombra version of 7.261.0. Preliminary testing with this approach has yielded a precision of 70% and a recall of 71%.
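As a rough way to read those figures, the sketch below converts the quoted precision and recall into expected counts. The 1,000 true entities are a hypothetical workload chosen for illustration, not a measured dataset, and the variable names are invented here.

```python
# Figures quoted in the text above.
PRECISION = 0.70  # share of flagged entities that are correct
RECALL = 0.71     # share of real entities that get flagged

# Hypothetical workload: number of real personal-data entities in the scan.
true_entities = 1000

# recall = TP / (TP + FN), so TP = recall * true_entities
true_positives = round(RECALL * true_entities)
false_negatives = true_entities - true_positives

# precision = TP / (TP + FP), so total flagged = TP / precision
flagged = true_positives / PRECISION
false_positives = round(flagged - true_positives)

print(f"correctly flagged: {true_positives}")
print(f"missed:            {false_negatives}")
print(f"wrongly flagged:   {false_positives}")
```

Under that assumption, roughly 710 entities would be labeled correctly, about 290 missed, and about 304 labeled in error. These numbers only restate the preliminary precision and recall. They are not additional test results.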

The hosted classifier requires deploying our LLM image onto GPU nodes. Our recommended instance type on AWS is g5.2xlarge. As of this writing, the g5.2xlarge has an on-demand price of $1.212 per hour, which can be reduced substantially with reserved pricing. See Hosted LLM Classifier for more details.
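For budgeting, the hourly on-demand price above can be turned into an approximate monthly figure. This is a back-of-the-envelope sketch that assumes one node running continuously for a 30-day month. The function name and defaults are illustrative, and actual pricing varies by region and purchase option.

```python
# On-demand hourly price quoted in the text above (USD).
HOURLY_ON_DEMAND_USD = 1.212

HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a 30-day month of continuous uptime


def monthly_cost(hourly_usd, nodes=1, days=DAYS_PER_MONTH):
    """Estimate the monthly on-demand cost for `nodes` always-on GPU nodes."""
    return round(hourly_usd * HOURS_PER_DAY * days * nodes, 2)


one_node = monthly_cost(HOURLY_ON_DEMAND_USD)
print(f"one always-on node: ${one_node:.2f} per month")
```

With these assumptions, a single always-on node costs about $872.64 per month at the on-demand rate.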