Developing Classifications
Both Structured and Unstructured Discovery offer classification of the data obtained from their scans. For Unstructured Discovery, see Classifying File Contents.
We currently employ an ensemble of five methods for classifying structured data. They are resolved in the following order:
- Property Name Matching
- Regular Expression Content Matching
- Sombra Large Language Model (LLM) (if available)
- Transcend-Hosted Metadata-Only LLM
- Random Forest ML Model (deprecated)
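Conceptually, classification falls through this list: the first method that produces a guess for a data point determines its label. Below is a minimal sketch of that resolution order, using placeholder classifiers rather than the real (internal) implementations:

```python
from typing import Callable, Optional

DataPoint = dict  # e.g. {"table": "users", "column": "email", "samples": [...]}

# Placeholder classifiers standing in for the real implementations. Each takes a
# data point and returns a data category label, or None if it cannot make a guess.
def classify_by_property_name(dp: DataPoint) -> Optional[str]:
    return "EMAIL" if dp.get("column", "").lower() == "email" else None

def classify_by_regex(dp: DataPoint) -> Optional[str]:
    return None  # stand-in; the real method scans sampled contents

# The ensemble is resolved in order: the first method to return a guess wins.
CLASSIFIERS: list[tuple[str, Callable[[DataPoint], Optional[str]]]] = [
    ("Property Name Matching", classify_by_property_name),
    ("Regex Matching", classify_by_regex),
    # Followed by the Sombra LLM, the metadata-only LLM, and the deprecated
    # random forest model in the real pipeline.
]

def classify(dp: DataPoint) -> Optional[tuple[str, str]]:
    """Return (method, category) from the first method that yields a guess."""
    for method, classifier in CLASSIFIERS:
        category = classifier(dp)
        if category is not None:
            return method, category
    return None

print(classify({"table": "users", "column": "email"}))  # ('Property Name Matching', 'EMAIL')
```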
The first pass at classification simply checks the name of the field (i.e. column) against the list of data categories. For example, a field named `email` can reliably be classified as containing `EMAIL`. Common variations and permutations are also considered, such as `birth_date` matching `DATE_OF_BIRTH`. Common prefixes and suffixes, such as the `__c` affixed to custom properties in Salesforce, are also accounted for. This approach provides high specificity (few false positives) but low sensitivity (many misses). This method tends toward 72% accuracy.
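To illustrate the idea (a simplified sketch, not the production matcher), name-based matching amounts to normalizing the column name, stripping known affixes such as `__c`, and looking the result up in a synonym table:

```python
import re
from typing import Optional

# Illustrative synonym table; the real category list is far larger.
NAME_SYNONYMS = {
    "EMAIL": {"email", "email_address", "e_mail"},
    "DATE_OF_BIRTH": {"date_of_birth", "birth_date", "birthdate", "dob"},
}

def normalize(column_name: str) -> str:
    name = column_name.lower()
    name = re.sub(r"__c$", "", name)         # Salesforce custom-property suffix
    name = re.sub(r"[^a-z0-9]+", "_", name)  # collapse separators
    return name.strip("_")

def classify_by_name(column_name: str) -> Optional[str]:
    normalized = normalize(column_name)
    for category, synonyms in NAME_SYNONYMS.items():
        if normalized in synonyms:
            return category
    return None

print(classify_by_name("Birth_Date__c"))  # DATE_OF_BIRTH
print(classify_by_name("notes"))          # None -- a miss, reflecting low sensitivity
```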
We allow users to define custom regexes to help with classification. This can be done by navigating to the "Inventory" tab in the "Data Inventory" view. Here you can add, edit, and delete custom regexes for each data category.
Here again, regular expressions should be defined with a goal of high specificity. That is, it is better to have a regular expression that catches some but not all cases than one that is overly broad and mislabels data. For example, `/[A-Z0-9a-z._%+-]+@[A-Z0-9a-z.-]+\.[A-Za-z]{2,}/` is a good regular expression for email addresses, since anything it matches will be a valid email address and it is very unlikely to match anything that is not an email address. On the other hand, `\d+-\d+-\d+-\d+` would not be a very good regular expression for a phone number, because while it will match many phone numbers, it would also match many things that are not phone numbers.
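As a quick check of that trade-off, the snippet below (using Python's `re` module purely for illustration) runs both patterns against a few sample values:

```python
import re

email_re = re.compile(r"[A-Z0-9a-z._%+-]+@[A-Z0-9a-z.-]+\.[A-Za-z]{2,}")
loose_phone_re = re.compile(r"\d+-\d+-\d+-\d+")

samples = ["jane.doe@example.com", "1-800-555-0199", "2023-01-15-042", "not a match"]

for value in samples:
    print(
        value,
        "email" if email_re.search(value) else "-",
        "phone?" if loose_phone_re.search(value) else "-",
    )

# "2023-01-15-042" matches the loose phone pattern even though it is not a
# phone number, illustrating why overly broad regexes mislabel data.
```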
The custom regexes you define here will be used to help with classification in Structured Discovery. If a column in your data silo matches a custom regex, it will be classified as the data category you have defined in the "Data Inventory" view.
The results will appear in the "Datapoints" view in Structured Discovery with the method of classification labeled as "Regex Matching".
Data points that are not classified by either property name matching or regular expression content matching are classified using one of our machine learning classifiers. The best of these is our hosted large language model (LLM), which runs in the customer's environment. This means that, like the regex classifier, it is able to classify the content of the data points being scanned. This approach requires a minimum Classifier version of 3.0.6 and Sombra version of 7.226.3. Preliminary testing with this approach has yielded accuracy up to 82%.
The hosted classifier does require deployment of our LLM image onto GPU nodes. Our recommended instance type on AWS is `g5.2xlarge`. As of this writing, the `g5.2xlarge` has an on-demand price of $1.212 per hour; that can be reduced substantially with reserved pricing. See Hosted LLM Classifier for more details.
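For rough budgeting, here is a back-of-the-envelope estimate assuming a single node running 24/7 at the on-demand rate quoted above (reserved or spot pricing would reduce this):

```python
hourly_on_demand = 1.212  # USD per hour, as quoted above
hours_per_month = 730     # average hours in a month

monthly_cost = hourly_on_demand * hours_per_month
print(f"~${monthly_cost:,.0f} per month per node")  # ~$885 per month per node
```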
For customers not ready or willing to host their own LLM classifier, we are currently rolling out a new option that uses an LLM SaaS to classify data points. It cannot classify the actual data content; it is limited to metadata, principally the table and column names. Despite this limitation, we find it still does remarkably well under most circumstances, yielding accuracy up to 70%. However, it has been known to make errors based on the field name that would not have been made had sample data been available.
This will become the fallback classifier in case the hosted LLM is unavailable.
Prior to the development of either LLM, we relied on a simpler model built on a popular machine learning technique known as random forests. We are currently in the process of removing it, but you may still see predictions from this legacy approach.
Each method uses different information, sometimes in combination, to make its classification. For example, Property Name Matching uses the data point name (table and column), while Regular Expressions match against the data point contents. The LLMs are also able to use the descriptions of both the data categories and the data points.
| Method | Data Point Name (Metadata) | Data Point Description (Metadata) | Category Description | Data Point Contents (requires Sombra) |
| --- | --- | --- | --- | --- |
| Property Name Matching | YES | NO | NO | NO |
| Regular Expression Content Matching | NO | NO | NO | YES |
| Self-Hosted Large Language Model | YES | YES | YES | YES |
| Transcend-Hosted Metadata-Only LLM | YES | YES | YES | NO |
New and updated classification methods will automatically be applied to unconfirmed data points, that is, data points that do not yet have a confirmed data category label. By default, new guesses will be appended to the existing set of unconfirmed guesses.
It may become useful to clear out and reclassify a data point or data silo. If you would like to clear out the existing guesses and reclassify all of the unconfirmed data points in a data silo, navigate to the Integration, select Structured Discovery, and click Clean & Restart Classification.
If you would like to reclassify an individual column, you can do so by clicking the Resample and reclassify column icon that appears when you mouse over the column name.