AWS Integration

Transcend can continuously discover and classify your data stores hosted on AWS, as well as the data inside each store. This guide shows you how to connect your Amazon Web Services account to continuously discover data stores hosted on AWS, how to enable Structured Discovery for each data store to discover and classify the data inside, and how to configure DSR Automation for each data store.

Transcend’s AWS integration automates the process of identifying data stores across your AWS cloud infrastructure, a process we call Silo Discovery. By programmatically scanning an AWS account for data storage services, we remove the manual work needed to identify databases, data warehouses, and object storage systems across AWS accounts. From there, the data itself can be surfaced and classified for each system.

The integration is built to surface commonly-used AWS data storage services, including S3, DynamoDB, and RDS databases. The sections below go into more detail with specifics about how each service is identified and surfaced.

The AWS integration uses the ListBuckets method to identify whether any S3 buckets exist. This serves to validate whether your company uses S3 for object storage.

If the integration discovers any buckets, an S3 data silo will be recommended for Data Inventory. Because the ListBuckets call is not scoped to a specific region, only a single S3 data silo is recommended per AWS account.

The AWS integration uses the ListTables method to identify DynamoDB tables in the AWS account. Because DynamoDB tables are region-specific, the integration scans every available AWS region for tables.

Once a DynamoDB table is encountered, Transcend will recommend the DynamoDB data silo in your Data Inventory. A silo will be recommended for each region that contains a DynamoDB table.

As an example, consider the diagram below: there are three DynamoDB tables in the AWS account across two regions. The AWS integration's Silo Discovery plugin will recommend a DynamoDB data silo for each region that contains a table.
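The per-region recommendation rule can be sketched in a few lines of Python. This is illustrative only; the region and table names are hypothetical, and the silo labels are not Transcend's actual naming:

```python
def recommend_dynamodb_silos(tables):
    """Given discovered tables as (region, table_name) pairs, recommend
    one DynamoDB data silo per region that contains at least one table."""
    regions = sorted({region for region, _ in tables})
    return [f"DynamoDB ({region})" for region in regions]

# Three tables across two regions -> two recommended silos.
discovered = [
    ("us-east-1", "Users"),
    ("us-east-1", "Orders"),
    ("eu-west-1", "Sessions"),
]
print(recommend_dynamodb_silos(discovered))
# ['DynamoDB (eu-west-1)', 'DynamoDB (us-east-1)']
```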

The integration scans AWS for all instances of RDS (Amazon Relational Database Service) to surface every database hosted in the account, using the DescribeDBInstances method. Like DynamoDB, RDS is a region-specific service, so the integration scans every region to ensure complete discovery.

For each database instance found, a database data silo will be recommended for Data Inventory. RDS supports several database engines, including MySQL, PostgreSQL, SQL Server, Oracle, Amazon Aurora, and MariaDB. Each recommended database data silo will correspond to the detected database engine. Let’s look at the example depicted below. There are three instances of RDS across two regions. In this case, the Silo Discovery plugin will recommend a database silo for each RDS instance encountered, regardless of what region it’s hosted in.
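That per-instance rule can be sketched as follows. The records here are a simplified stand-in for DescribeDBInstances output, not its real shape, and the engine and region values are hypothetical:

```python
def recommend_rds_silos(instances):
    """Given simplified DescribeDBInstances-style records, recommend one
    database data silo per instance, labeled with its engine and region."""
    return [f"{inst['Engine']} ({inst['Region']})" for inst in instances]

# Three RDS instances across two regions -> three recommended silos,
# regardless of region.
instances = [
    {"Engine": "postgres", "Region": "us-east-1"},
    {"Engine": "mysql", "Region": "us-east-1"},
    {"Engine": "mariadb", "Region": "eu-west-1"},
]
print(recommend_rds_silos(instances))
# ['postgres (us-east-1)', 'mysql (us-east-1)', 'mariadb (eu-west-1)']
```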

Once all of the AWS data stores & systems are identified and added to Data Inventory, the next step is understanding what data is stored in each and the purpose of that data.

In addition to the global AWS Silo Discovery integration, Transcend has separate integrations with each AWS service that gets discovered. Once the relevant data silos have been discovered, each one can be enabled for datapoint schema discovery & datapoint classification.

The S3 integration supports Structured Discovery to programmatically create datapoints in Transcend that represent pieces of data in S3. When enabled, the plugin scans the AWS account to identify all S3 buckets using the ListBuckets method. Each bucket found is recommended as a datapoint, and a new datapoint is surfaced when new buckets are created. Each discovered datapoint is classified by the type of data it represents, helping customers identify which datapoints may contain personal information and which don't.

Additionally, if you have S3 buckets holding Parquet or JSON files, you can use either the S3 Parquet or S3 JSONL integration to index the schemas of those files as if you were indexing a SQL or NoSQL database. When you configure and enable one S3 integration in Transcend, it scans all buckets the credentials you entered have access to, and counts as one integration.

The DynamoDB integration can be enabled for datapoint schema discovery as well. The integration scans for all tables using the ListTables method and recommends each table as a datapoint; it then fetches each table's attributes using DescribeTable and surfaces them as sub-datapoints. The discovered sub-datapoints are classified through Structured Discovery. Remember that if the DynamoDB silo was discovered through the AWS integration scan, the silo is scoped to a single AWS region, not a single DynamoDB table.
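A rough sketch of the table-to-datapoint mapping, assuming simplified DescribeTable-style responses. Note that in real DynamoDB, AttributeDefinitions lists only key attributes, since tables are otherwise schemaless; the table and attribute names below are hypothetical:

```python
def build_datapoints(describe_table_responses):
    """Map each table to a datapoint and each of its attributes to a
    sub-datapoint, mirroring ListTables + DescribeTable discovery."""
    datapoints = {}
    for table in describe_table_responses:
        name = table["TableName"]
        datapoints[name] = [a["AttributeName"] for a in table["AttributeDefinitions"]]
    return datapoints

tables = [{"TableName": "Conversations",
           "AttributeDefinitions": [{"AttributeName": "email", "AttributeType": "S"},
                                    {"AttributeName": "sentAt", "AttributeType": "N"}]}]
print(build_datapoints(tables))
# {'Conversations': ['email', 'sentAt']}
```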

Structured Discovery for the database integrations works similarly to the DynamoDB integration. Database data silos discovered & added during an AWS integration scan can be enabled for datapoint schema discovery and classification. Each table in the database will correspond to a recommended datapoint, and the columns on each table will correspond to sub-datapoints for the database silo. Discovered sub-datapoints are classified through Structured Discovery to help customers prioritize datapoints that may contain personal information.

More information about the database integration can be found here.

The integration is scoped to a single AWS account. If you have multiple AWS accounts to connect, please add a data silo for each account and follow the steps below.

Authenticating the AWS integration requires a new IAM role to be created in the AWS account being connected. One of the benefits of using an IAM role to integrate AWS is the ability to grant only the specific permissions Transcend needs and to define a custom trust policy. To create a new IAM role, log in to the AWS console and navigate to Roles → Create Role. More information about IAM roles can be found in the AWS documentation.

  1. Add the integration.

    • You must first add the integration to your Transcend account and visit the connection form.

  2. Create a new role in the AWS IAM console.

    • In Select trusted entity, select AWS account as the trusted entity type. Here you will put:

      • For Multi-tenant Sombra, include 829095311197 (Transcend's Account ID).
      • For Self-hosted Sombra, include the account ID of the AWS account where the self-hosted Sombra is hosted.
    • Select Require external ID and copy/paste the External ID provided on our integration connection page (see below).

    • Click Next.

    • Once the role is created, you can navigate to the Trust Relationships tab of the role and the trusted entity should generally look something like this:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::<transcend-or-self-hosted-sombra-account-id>:root"
            ]
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "5ab4defb0bef8e2bf4b9e636c3949f3edff73c5fb2eb0f56d84a913bd38b8445"
            }
          }
        }
      ]
    }
    
  3. [Self-hosted Sombra] Ensure the Sombra instance can assume the created role.

    • For self-hosted Sombra, make sure that the Sombra instance has the permission to also assume the role.
    • If you host Sombra via our Terraform Module, you may add this by updating the roles_to_assume variable.
    • Otherwise, adjust the IAM permissions attached to the Sombra instance to grant sts:AssumeRole on the IAM role you created.
      • If your Sombra instance is hosted on an ECS cluster, you also need to include the role attached to that cluster.
    • The policy should generally look something like this:
    {
      "Action": ["sts:AssumeRole"],
      "Resource": [
        "arn:aws:iam::<sombra-account-where-role-lives>:role/<name-of-role>",
        "<any other resources ...>"
      ],
      "Effect": "Allow"
    }
    
  4. Define the permissions for the role.

    • Select Create Policy, then choose the JSON tab.

    • Transcend has created a JSON policy with the permissions needed for the integration. Copy & paste this policy from Transcend's connection form under AWS IAM Role, or manually add the following permissions: dynamodb:ListTables, rds:DescribeDBInstances, s3:ListAllMyBuckets.

      • If your organization would like to use Structured Discovery for a DynamoDB database, please also include the additional permission dynamodb:DescribeTable.
      • If your organization would like to fulfill DynamoDB DSRs through custom PartiQL queries, please also include the relevant permissions from the following list, according to which PartiQL queries you use. The mapping from action type to required permissions is typically:
        • Access requests: dynamodb:PartiQLSelect
        • Erasure requests: dynamodb:PartiQLDelete
        • All other requests: dynamodb:PartiQLUpdate, and/or dynamodb:PartiQLInsert
    • In addition, if you want to control all the different AWS integrations using a single policy, then the policy should include all the required permissions from all the integrations and look like this:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:ListAllMyBuckets",
            "s3:GetBucketLocation",
            "s3:ListBucket",
            "dynamodb:PartiQLSelect",
            "dynamodb:ListTables",
            "dynamodb:DescribeTable",
            "dynamodb:GetItem",
            "dynamodb:Scan",
            "dynamodb:Query",
            "rds:DescribeDBInstances",
            "redshift:DescribeClusters"
          ],
          "Resource": "*"
        }
      ]
    }
    
    • Click Next.
  5. Add tags (Optional).

    • Adding tags is not required for the integration, but tags won’t interfere either if tagging roles is part of an internal business process.
    • Click Next.
  6. Name, review, and create the role.

    • On the next page, pick a name for your role (e.g., "TranscendS3Role") and add a description.
    • Review the selected trusted entities and permissions, and create the role.
  7. Finally, enter your Account ID to Transcend's connection form input named AWS Account ID.

  8. Connect the integration.
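For reference, the trust policy from step 2 can be assembled programmatically. This is an illustrative sketch, not part of the setup procedure; the external ID below is a placeholder for the value shown on Transcend's connection form:

```python
import json

def build_trust_policy(account_id, external_id):
    """Assemble the role trust policy from step 2 for a given trusted
    account ID and external ID (both placeholders here)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": [f"arn:aws:iam::{account_id}:root"]},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }

# Multi-tenant Sombra uses Transcend's account ID; for self-hosted Sombra,
# substitute the account ID where Sombra is hosted.
policy = build_trust_policy("829095311197", "<external-id-from-connection-form>")
print(json.dumps(policy, indent=2))
```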

Once the integration is connected, enable the Silo Discovery plugin to start scanning AWS.

Once the scan is complete, select View Data Inventory to review and approve the discovered AWS resources.

The AWS resources discovered by the plugin are available for review by selecting X Resources Found. From there, review each discovered AWS resource to decide if it should be approved.

Each discovered resource contains additional metadata from AWS. This includes information like the name, resource ID, and region of the resource, which can be helpful in understanding which discovered resource corresponds to a specific configured data store.

To approve a recommended data silo, select Add to Data Inventory.

Once a resource has been approved, it’s added to Data Inventory. Data silos that are discovered & added through the AWS integration will inherit the IAM role and account ID used to authenticate the AWS integration. This allows Transcend to connect data silos for additional discovered resources like S3 automatically.

Once a discovered data silo has been approved and added to Data Inventory, it can be configured to further scan the individual resource to identify and classify the information stored within. To enable Structured Discovery for a resource, navigate to the Structured Discovery tab of the desired integration and enable the Datapoint Schema Discovery plugin.

The plugin works by scanning a resource or system to identify the datapoints within it and classify them. The example below shows a scan of a PostgreSQL database discovered by the AWS plugin. In this case, the plugin scans the tables in the database, recommending a datapoint for each table and a sub-datapoint for every column.

With the Transcend DynamoDB integration, you can fulfill DSRs directly against a DynamoDB database by running custom PartiQL queries for the desired data actions on each datapoint.

The first step to setting up DSRs against a DynamoDB database is creating the datapoints in the data silo that should be queried. We typically recommend creating a datapoint for each table in the database that stores personal data (or any tables you want to action DSRs against). For example, let's say there is a table called Conversations that contains all the messages sent back and forth from a customer. You could create a datapoint for Conversations in the data silo and enable the specific data actions needed. If you're using Structured Discovery, you can enable the Datapoint Schema Discovery plugin to create the datapoints for you automatically.

For each data action enabled for a datapoint in the DynamoDB data silo, you can define a PartiQL query that executes a database operation. Using the previous Conversations example, let's say you want the datapoint to support access requests. With the “access” data action enabled, you can define a specific query that finds the Conversations records for a user in the database.

For example, assuming the Conversations table has an email attribute, a custom query could be

SELECT * from "Conversations" where email = ?

When fulfilling a DSR for a given user, Transcend will replace all ? characters in the query with the actual identifier of the user.
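The substitution can be illustrated with a small sketch. Transcend performs this step server-side; the quoting and escaping shown here are simplifying assumptions for illustration, not Transcend's actual implementation:

```python
def bind_identifier(query, identifier):
    """Replace each ? placeholder with the user's identifier, quoted as a
    PartiQL string literal (single quotes doubled for escaping)."""
    literal = "'" + identifier.replace("'", "''") + "'"
    return query.replace("?", literal)

query = 'SELECT * from "Conversations" where email = ?'
print(bind_identifier(query, "jane@example.com"))
# SELECT * from "Conversations" where email = 'jane@example.com'
```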

Important: as explained in the Connecting the integration section above, you must add the necessary permissions to the role you created, depending on which PartiQL operations you are executing.