Connecting Data Stores

You can use Unstructured Discovery to discover datapoints within your data stores.

To classify the personal data they contain, you first need to connect your data stores by following these steps:

Navigate to Integrations in the left side menu. To add a data silo for datapoint scanning, click "Add Integration".

Navigate to the specific vendor from Infrastructure > Integrations and follow the connection instructions. You may be prompted to connect with OAuth or another authorization protocol.

Connecting S3 data silo

Once connected, click on the Unstructured Discovery tab and turn on the "Unstructured Data Discovery" and "Unstructured Data Sampling" plugins.

Unstructured Discovery inside the Integration
  1. What is the potential cost of scanning, sampling and classification?

    • We enumerate the entire file system you choose to scan. The cost depends largely on how expensive it is to access that file system, plus Sombra hosting costs if you are using self-hosted Sombra.
    • Sampling and classification may incur further cost as the objects must be retrieved in order to be sampled for classification.
    • Classification requires the use of a hosted classifier service which, like Sombra, may incur its own hosting costs. See Hosted LLM Classifier for more details.
  2. What file types are supported for sampling and classification?

    • PDFs, TXTs, CSVs, JSONs, GDOCs, GSLIDEs, GSHEETs, GFORMs, DOCXs, XLSXs, PPTXs, ZIPs, BINs.
  3. Is it sampling all data?

    • It scans all files, but samples only up to the first 50 MB of any given object.
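To make the sampling limit concrete, here is a minimal Python sketch of reading only the leading portion of an object. It is not the product's implementation. The function name `sample_prefix`, the chunk size, and the reading of "50 MB" as 50 * 1024 * 1024 bytes are all assumptions made for illustration.

```python
import io

# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
# The documentation does not say which definition of MB applies.
SAMPLE_LIMIT_BYTES = 50 * 1024 * 1024


def sample_prefix(stream, limit=SAMPLE_LIMIT_BYTES, chunk_size=1024 * 1024):
    """Return at most `limit` bytes from the start of a binary stream.

    Hypothetical helper: reads in chunks so a large object is never
    loaded in full, and stops once the limit is reached.
    """
    chunks = []
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # end of the object reached before the limit
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Usage: an in-memory stream stands in for a retrieved object.
example = io.BytesIO(b"name,email\nAda,ada@example.com\n")
sample = sample_prefix(example, limit=10)
```

Under this sketch, an object smaller than the limit is returned whole, while a larger object contributes only its first `limit` bytes to the sample.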