S3 Parquet Integration

For companies that want to discover and classify Apache Parquet files stored in S3, we offer the S3 Parquet integration.

Like the AWS integration, the S3 Parquet integration is scoped to a single AWS account. If you have multiple AWS accounts to connect, you must add a separate data silo for each account.

  1. Add the integration.

    • You must first add the S3 Parquet integration to your Transcend account and visit the connection form.

  2. Add the ENABLE_FLASK_SERVER variable to Sombra's environment.

This step is only required if you are self-hosting Sombra. In that case, you must configure the Sombra Docker container with the environment variable ENABLE_FLASK_SERVER set to true. This enables a Flask server that runs as a sidecar in the Sombra container and is responsible for fetching the Parquet files from S3.

  3. Create a new role in the AWS IAM console.

    • In Select trusted entity, select AWS account as the trusted entity type. Here you will enter 829095311197 (Transcend's Account ID).
    • Select Require external ID and copy/paste the External ID provided on our integration connection page.
    • Click Next.

  4. Define the permissions for the role.

    • Either create or pick a policy containing the s3:ListBucket and s3:GetObject permissions. The former allows us to list the files within your buckets; the latter lets us read the files themselves.
    • If you want Transcend to scan all of your buckets in S3, you must also add the s3:ListAllMyBuckets permission. Otherwise, you can manually enter the bucket names when connecting the integration in the Admin Dashboard. For more info, see step 6 below.

  5. Name, review, and create the role.

    • On the next page, pick a name for your role (e.g., TranscendS3ParquetRole) and add a description.
    • Review the selected trusted entities and permissions, and create the role.

  6. Fill out the connection form.

    • Head back to the connection form in the Admin Dashboard.
    • Add the name of the created role, as well as your AWS Account ID and region.
    • If you didn't add the s3:ListAllMyBuckets permission in step 4 above, you may enter a comma-separated list of one or more buckets to be scanned. If you do so, Transcend will only scan the listed buckets.

  7. Connect the integration.
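For the self-hosted Sombra step above, the environment variable can be set wherever you configure the container. A docker-compose sketch (the image name and tag here are illustrative placeholders, not the actual registry path):

```yaml
# docker-compose.yml (fragment) — image reference is illustrative
services:
  sombra:
    image: your-registry/sombra:latest
    environment:
      # Enables the sidecar Flask server that fetches Parquet files from S3
      - ENABLE_FLASK_SERVER=true
```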
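If you prefer to script the role setup rather than use the IAM console, the trust and permissions policy documents described in the role-creation steps above can be generated as JSON. This is a minimal sketch: the external ID placeholder and bucket name are illustrative, and only Transcend's account ID (829095311197) comes from the steps above.

```python
import json

TRANSCEND_ACCOUNT_ID = "829095311197"  # Transcend's AWS Account ID (from the steps above)
EXTERNAL_ID = "YOUR_EXTERNAL_ID"  # placeholder: copy this from the connection form


def trust_policy(external_id: str) -> dict:
    """Trust policy letting Transcend's account assume the role, gated on the external ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{TRANSCEND_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }


def permissions_policy(bucket_names=None) -> dict:
    """S3 read permissions: scoped to the given buckets, or, if no buckets are
    given, all buckets plus s3:ListAllMyBuckets so Transcend can discover them."""
    if bucket_names:
        resources = []
        for bucket in bucket_names:
            # s3:ListBucket applies to the bucket ARN; s3:GetObject to the objects in it
            resources += [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
        statements = [{
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": resources,
        }]
    else:
        statements = [
            {"Effect": "Allow", "Action": ["s3:ListBucket", "s3:GetObject"], "Resource": "*"},
            {"Effect": "Allow", "Action": "s3:ListAllMyBuckets", "Resource": "*"},
        ]
    return {"Version": "2012-10-17", "Statement": statements}


# Print the documents, e.g. to paste into the IAM console's JSON policy editor
print(json.dumps(trust_policy(EXTERNAL_ID), indent=2))
print(json.dumps(permissions_policy(["my-parquet-bucket"]), indent=2))
```

The same documents can then be supplied to IAM however you normally manage roles (console JSON editor, CLI, or infrastructure-as-code).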

Once the integration is connected, enable the Datapoint schema discovery plugin to start scanning your S3 buckets.

Once the scan is complete, select Browse Data Silo Schema to review and approve the discovered data points.

After enabling schema discovery, you can also enable content classification. When this plugin runs, it reads samples of data from the discovered data points and suggests data categories that you can tag them with.