Companies store a lot of internal data in cloud computing services. Identifying where all that data is stored and understanding what it is used for across systems can be tricky and time-consuming. The first step in classifying internal data is to identify the data storage systems where the data lives, a process known as silo discovery.
Transcend’s AWS integration automates the process of identifying data stores across customers’ AWS cloud infrastructure. By programmatically scanning an AWS account for data storage services, we remove the manual process needed to identify databases, data warehouses and object storage systems across AWS accounts. From there, the data itself can be surfaced and classified for each system.
The integration is built to surface commonly used AWS data storage services, including S3, DynamoDB and RDS databases. The sections below go into more detail with specifics about how each service is identified and surfaced.
The AWS integration uses the listBuckets method to identify whether any S3 buckets exist. This serves to validate whether your company uses S3 for object storage.
If the integration discovers any buckets, an S3 data silo will be recommended for Data Inventory. It’s worth noting that the integration can scan for buckets independently of a specified region. This means we will only recommend a single S3 data silo per AWS account.
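The bucket check described above can be sketched as follows. This is a minimal illustration in Python, assuming boto3 conventions (`list_buckets` is the boto3 spelling of `listBuckets`), and is not Transcend's implementation. The function name `recommend_s3_silo`, the returned dictionary shape and the `FakeS3` stand-in are all hypothetical.

```python
from typing import Any, Optional


def recommend_s3_silo(s3_client: Any, account_id: str) -> Optional[dict]:
    """Return one S3 silo recommendation for the account, or None if no buckets exist.

    `s3_client` is anything exposing boto3's `list_buckets()`; in real use
    this would be `boto3.client("s3")`.
    """
    response = s3_client.list_buckets()
    buckets = [bucket["Name"] for bucket in response.get("Buckets", [])]
    if not buckets:
        return None
    # One recommendation per account, however many buckets were found.
    return {"type": "s3", "account_id": account_id, "bucket_count": len(buckets)}


class FakeS3:
    """Stand-in client (hypothetical) so the sketch runs without AWS credentials."""

    def __init__(self, names):
        self._names = names

    def list_buckets(self):
        return {"Buckets": [{"Name": name} for name in self._names]}


print(recommend_s3_silo(FakeS3(["logs", "exports"]), "123456789012"))
print(recommend_s3_silo(FakeS3([]), "123456789012"))
```

The client is passed in as an argument so the same logic can be exercised against a stand-in object or a real boto3 client.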
The integration uses the listTables command to identify DynamoDB tables in the AWS account. Because DynamoDB tables are region-specific, the integration scans every available AWS region for tables. The integration was built this way to ensure complete discovery of all tables with as little user configuration as possible. Instead of requiring the user who connects the integration to look up, or ask an engineer for, a list of all the regions where resources are hosted, the integration programmatically searches across regions.
Once a table is discovered, a DynamoDB data silo will be recommended. It’s worth noting that a silo will be recommended for every region that contains a DynamoDB table.
As an example, consider the diagram below. There are three DynamoDB tables in the AWS account across two regions. The silo discovery integration will recommend one DynamoDB data silo for each region that contains a table, so two silos in total.
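The per-region scan can be sketched in Python as below, assuming boto3 conventions (`list_tables` with `ExclusiveStartTableName` / `LastEvaluatedTableName` pagination). The function names, the silo dictionary shape, the table names and the `FakeDynamoDB` stand-in are hypothetical; with real credentials, `client_for` might be `lambda region: boto3.client("dynamodb", region_name=region)`.

```python
from typing import Any, Callable, Dict, Iterable, List


def list_all_tables(dynamodb_client: Any) -> List[str]:
    """Collect every table name in one region, following pagination."""
    names: List[str] = []
    kwargs: Dict[str, str] = {}
    while True:
        page = dynamodb_client.list_tables(**kwargs)
        names.extend(page.get("TableNames", []))
        last = page.get("LastEvaluatedTableName")
        if not last:
            return names
        kwargs = {"ExclusiveStartTableName": last}


def recommend_dynamodb_silos(
    regions: Iterable[str], client_for: Callable[[str], Any]
) -> List[dict]:
    """Recommend one DynamoDB silo for each region that has at least one table."""
    silos = []
    for region in regions:
        tables = list_all_tables(client_for(region))
        if tables:
            silos.append({"type": "dynamodb", "region": region, "tables": tables})
    return silos


class FakeDynamoDB:
    """Stand-in client (hypothetical) that returns one table per page."""

    def __init__(self, tables):
        self._tables = tables

    def list_tables(self, ExclusiveStartTableName=None, **_):
        start = 0
        if ExclusiveStartTableName is not None:
            start = self._tables.index(ExclusiveStartTableName) + 1
        page = self._tables[start:start + 1]
        response = {"TableNames": page}
        if start + 1 < len(self._tables):
            response["LastEvaluatedTableName"] = page[-1]
        return response


# Three tables across two regions, plus one region with no tables.
TABLES_BY_REGION = {
    "us-east-1": ["users", "orders"],
    "eu-west-1": ["sessions"],
    "ap-south-1": [],
}

silos = recommend_dynamodb_silos(
    TABLES_BY_REGION, lambda region: FakeDynamoDB(TABLES_BY_REGION[region])
)
print([silo["region"] for silo in silos])  # ['us-east-1', 'eu-west-1']
```

The empty region produces no recommendation, which mirrors the rule that a silo is recommended only for regions that contain a table.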
The integration scans AWS for all instances of RDS (Amazon’s Relational Database Service) to surface all databases hosted in the account. It uses the describeDBInstances method to surface every database instance. Like DynamoDB, RDS is a region-specific service, so each RDS instance belongs to a single AWS region. The integration scans across every region to ensure complete discovery.
For each database instance found, a database data silo will be recommended for Data Inventory. RDS supports several database engines, including MySQL, PostgreSQL, SQL Server, Oracle, Amazon Aurora and MariaDB. Each recommended database silo will correspond to the correct database system. Let’s look at the example depicted below. There are three instances of RDS across two regions. In this case, the silo discovery plugin will recommend a database silo for each RDS instance encountered, regardless of what region it’s hosted in.
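A minimal Python sketch of this per-instance recommendation, assuming boto3 conventions (`describe_db_instances` with `Marker` pagination, and the `DBInstanceIdentifier` and `Engine` response fields). The prefix-to-label mapping, the function names, the instance names and the `FakeRDS` stand-in are hypothetical and only illustrate the idea of matching each instance to a database system.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical labels keyed by RDS engine-name prefix; checked in order.
ENGINE_PREFIXES = [
    ("aurora", "Amazon Aurora"),
    ("mysql", "MySQL"),
    ("postgres", "PostgreSQL"),
    ("mariadb", "MariaDB"),
    ("oracle", "Oracle"),
    ("sqlserver", "SQL Server"),
]


def silo_type_for(engine: str) -> str:
    """Map an RDS engine string such as 'sqlserver-se' to a database system label."""
    for prefix, label in ENGINE_PREFIXES:
        if engine.startswith(prefix):
            return label
    return "Unknown"


def describe_all_instances(rds_client: Any) -> List[dict]:
    """Collect every DB instance in one region, following Marker pagination."""
    instances: List[dict] = []
    kwargs: Dict[str, str] = {}
    while True:
        page = rds_client.describe_db_instances(**kwargs)
        instances.extend(page.get("DBInstances", []))
        marker = page.get("Marker")
        if not marker:
            return instances
        kwargs = {"Marker": marker}


def recommend_database_silos(
    regions: Iterable[str], client_for: Callable[[str], Any]
) -> List[dict]:
    """Recommend one database silo per RDS instance, whatever its region."""
    silos = []
    for region in regions:
        for instance in describe_all_instances(client_for(region)):
            silos.append({
                "name": instance["DBInstanceIdentifier"],
                "system": silo_type_for(instance["Engine"]),
                "region": region,
            })
    return silos


class FakeRDS:
    """Stand-in client (hypothetical) so the sketch runs without AWS credentials."""

    def __init__(self, instances):
        self._instances = instances

    def describe_db_instances(self, **_):
        return {"DBInstances": list(self._instances)}


# Three instances across two regions.
INSTANCES_BY_REGION = {
    "us-east-1": [
        {"DBInstanceIdentifier": "billing", "Engine": "postgres"},
        {"DBInstanceIdentifier": "catalog", "Engine": "mysql"},
    ],
    "eu-west-1": [{"DBInstanceIdentifier": "reports", "Engine": "sqlserver-se"}],
}

silos = recommend_database_silos(
    INSTANCES_BY_REGION, lambda region: FakeRDS(INSTANCES_BY_REGION[region])
)
for silo in silos:
    print(silo)
```

Unlike the DynamoDB case, the result here has one entry per instance rather than one per region, so three instances yield three recommended silos.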
Once all of the AWS data stores & systems are identified and added to Data Inventory, the next step is understanding what data is stored in each and the purpose of that data.
In addition to the global AWS silo discovery integration, Transcend has integrations with the individual AWS services that get discovered. Our database integrations, S3 integration and DynamoDB integration are part of the larger suite of AWS integrations Transcend offers. Once the relevant data silos have been discovered, each one can be enabled for datapoint discovery & content classification.
The S3 integration supports a datapoint discovery plugin to programmatically create datapoints in Transcend that represent pieces of data in S3. When enabled, the plugin scans the AWS account to identify all S3 buckets using the listBuckets command. Each bucket found is recommended as a datapoint, and a new datapoint is surfaced when new buckets are created. Each discovered datapoint is classified by the type of data it represents, to help customers identify which datapoints may contain personal information and which don’t.
The DynamoDB integration can be enabled for datapoint discovery as well. The integration scans for all tables using the listTables method and recommends each table as a datapoint. It then gets the attributes for each table using describeTable and surfaces them as sub-datapoints. The discovered sub-datapoints are classified through Content Classification. Remember that if the DynamoDB silo was discovered through the AWS integration scan, the silo is scoped to a single AWS region, not to a single DynamoDB table.
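The table-to-datapoint and attribute-to-sub-datapoint mapping can be sketched as follows, assuming boto3 conventions (`describe_table` returning a `Table` object with `AttributeDefinitions`). The function name, the output shape, the sample table and the `FakeDynamoDB` stand-in are hypothetical. Note that `AttributeDefinitions` covers only the attributes declared for the table's key schema and indexes.

```python
from typing import Any, Iterable, List


def discover_datapoints(dynamodb_client: Any, table_names: Iterable[str]) -> List[dict]:
    """Return one datapoint per table, with its declared attributes as sub-datapoints."""
    datapoints = []
    for name in table_names:
        table = dynamodb_client.describe_table(TableName=name)["Table"]
        # AttributeDefinitions lists key and index attributes only.
        attributes = [
            definition["AttributeName"]
            for definition in table.get("AttributeDefinitions", [])
        ]
        datapoints.append({"datapoint": name, "sub_datapoints": attributes})
    return datapoints


class FakeDynamoDB:
    """Stand-in client (hypothetical) so the sketch runs without AWS credentials."""

    def __init__(self, schemas):
        self._schemas = schemas

    def describe_table(self, TableName):
        definitions = [
            {"AttributeName": attr, "AttributeType": attr_type}
            for attr, attr_type in self._schemas[TableName]
        ]
        return {"Table": {"TableName": TableName, "AttributeDefinitions": definitions}}


client = FakeDynamoDB({"users": [("user_id", "S"), ("email", "S")]})
datapoints = discover_datapoints(client, ["users"])
print(datapoints)
```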
Content Classification for the database integrations works similarly to the DynamoDB integration. Database data silos discovered & added during an AWS integration scan can be enabled for datapoint discovery and classification. Each table in the database will correspond to a recommended datapoint, and the columns on each table will correspond to sub-datapoints for the database silo. Discovered sub-datapoints are classified through Content Classification to help customers prioritize datapoints that may contain personal information.
More information about the database integration can be found here.
The integration is scoped to a single AWS account. If you have multiple AWS accounts to connect, please add a data silo for each account and follow the steps below.
Authenticating the AWS integration requires a new IAM role to be created in the AWS account to be connected. One of the benefits of using an IAM role to integrate AWS is the ability to grant only the specific permissions Transcend needs and to define a custom trust policy. To create a new IAM role, log in to the AWS console and navigate to Roles → Create Role. More information about IAM roles can be found in the AWS documentation.
Create a new role in the AWS IAM console.
- Choose AWS account as the role type.
- Select This account or enter the Account ID for the AWS account that will be connecting to Transcend.
Select the role type and account.
- In Select trusted entity, choose AWS account as the role type and select This account or enter the Account ID for the AWS account to be connected to Transcend.
Add Transcend’s AWS Account ID to the role’s trust policy.
- Next, select Custom trust policy and enter Transcend’s AWS Account ID (829095311197).
Add the External ID to the trust policy.
- Select Require external ID.
Transcend auto-generates an external ID to be shared between Transcend and the customer. Copy/paste the External ID provided in the Transcend AWS data silo.
Including an External ID in the trust policy adds an additional level of security to the integration. It ensures that even with the correct IAM role, Transcend cannot access AWS resources without the external ID.
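For reference, a trust policy combining Transcend's account ID with an external ID typically has the shape built below. This is a sketch of the standard IAM cross-account pattern, not necessarily the exact document Transcend or the AWS console will generate; the helper name `build_trust_policy` and the placeholder `EXAMPLE-EXTERNAL-ID` are hypothetical, and the real external ID comes from the Transcend AWS data silo.

```python
import json

# Transcend's AWS Account ID, as given in the step above.
TRANSCEND_ACCOUNT_ID = "829095311197"


def build_trust_policy(external_id: str) -> dict:
    """Build a trust policy that lets the Transcend account assume the role,
    but only when the caller supplies the shared external ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{TRANSCEND_ACCOUNT_ID}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder value for illustration only.
policy = build_trust_policy("EXAMPLE-EXTERNAL-ID")
print(json.dumps(policy, indent=2))
```

The `Condition` block is what enforces the external ID: an `sts:AssumeRole` request that omits it or supplies a different value does not match the statement.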
Define the permissions for the role.
- Select Create Policy and select the JSON tab.
- Transcend has created a JSON policy with the permissions needed for the integration. Copy and paste this policy from Transcend's connection form under AWS IAM Role, or manually add the following permissions: dynamodb:ListTables, rds:DescribeDBInstances, s3:ListAllMyBuckets.
If your organization would like to use content classification for a DynamoDB database, please also include the additional permission dynamodb:DescribeTable.
- Click Next.
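If the permissions are added manually, the resulting policy might look like the document generated below. This is a sketch only: the policy provided in Transcend's connection form is the authoritative version, and the `"Resource": "*"` scope and the helper name `build_permissions_policy` are assumptions made for illustration.

```python
import json

# Permissions listed above for silo discovery.
DISCOVERY_ACTIONS = [
    "dynamodb:ListTables",
    "rds:DescribeDBInstances",
    "s3:ListAllMyBuckets",
]


def build_permissions_policy(include_dynamodb_classification: bool = False) -> dict:
    """Build an IAM policy document with the discovery permissions, optionally
    adding dynamodb:DescribeTable for DynamoDB content classification."""
    actions = list(DISCOVERY_ACTIONS)
    if include_dynamodb_classification:
        actions.append("dynamodb:DescribeTable")
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Resource "*" is an assumption; narrow it if your policies require.
            {"Effect": "Allow", "Action": actions, "Resource": "*"}
        ],
    }


print(json.dumps(build_permissions_policy(include_dynamodb_classification=True), indent=2))
```

The printed JSON can be pasted into the JSON tab of the policy editor.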
Add tags (Optional).
- Tags are not required for the integration, but adding them will not interfere with it if tagging roles is part of an internal business process.
- Click Next.
Review and name the policy.
- In Review policy, give the policy a distinct name (ex: AWSDataSiloDiscoveryPlugin or TranscendAWSIntegration).
- Copy this name to Transcend's connection form input named AWS IAM Role.
- Click Create Policy.
Finally, enter your Account ID into the input named AWS Account ID on Transcend's connection form.
Connect the integration.
Once the integration is connected, enable the silo discovery plugin to start scanning AWS.
Once the scan is complete, select View Data Inventory to review and approve the discovered AWS resources.
The AWS resources discovered by the plugin are available for review by selecting X Resources Found. From there, review each discovered AWS resource to decide if it should be approved.
Each discovered resource contains additional metadata from AWS. This includes information like name, resourceID and the region of the resource, which can be helpful in understanding which discovered resource corresponds to a specific configured data store.
To approve a recommended data silo, select Add to Data Inventory.
Once a resource has been approved, it’s added to Data Inventory. Data silos that are discovered & added through the AWS integration will inherit the IAM role and account ID used to authenticate the AWS integration. This allows Transcend to connect data silos for additional discovered resources like S3 automatically.
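To make the inheritance concrete, the sketch below shows how an account ID, role name and external ID are generally combined to obtain temporary credentials through AWS STS. It assumes boto3 conventions (`assume_role` with `RoleArn`, `RoleSessionName` and `ExternalId`) and a role created without a path; it is not a description of Transcend's internals. The function names, the session name, the placeholder values and the `FakeSTS` stand-in are hypothetical.

```python
from typing import Any


def role_arn(account_id: str, role_name: str) -> str:
    """Build the ARN of a role (without a path) in the connected account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def credentials_for_silo(
    sts_client: Any, account_id: str, role_name: str, external_id: str
) -> dict:
    """Assume the shared role and return its temporary credentials.

    `sts_client` is anything exposing boto3's `assume_role`; in real use
    this would be `boto3.client("sts")`.
    """
    response = sts_client.assume_role(
        RoleArn=role_arn(account_id, role_name),
        RoleSessionName="silo-discovery",  # hypothetical session name
        ExternalId=external_id,
    )
    return response["Credentials"]


class FakeSTS:
    """Stand-in client (hypothetical) that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def assume_role(self, **kwargs):
        self.calls.append(kwargs)
        return {
            "Credentials": {
                "AccessKeyId": "EXAMPLE",
                "SecretAccessKey": "EXAMPLE",
                "SessionToken": "EXAMPLE",
            }
        }


sts = FakeSTS()
# Every silo discovered under the same account reuses the same three values.
for silo in ("s3", "dynamodb"):
    credentials_for_silo(sts, "123456789012", "TranscendAWSIntegration", "EXAMPLE-EXTERNAL-ID")
print(sts.calls[0]["RoleArn"])
```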
Once a discovered data silo has been approved and added to Data Inventory, it can be configured to further scan the individual resources to identify and classify the information stored within. To enable content classification for a resource, navigate to the Configuration tab of the desired data silo and enable the Datapoint Discovery plugin.
The plugin works by scanning a resource or system to identify datapoints within the system and classify them. The example below shows a scan of a PostgreSQL database discovered by the AWS plugin. In this case the plugin scans the tables in the database and recommends a datapoint for each table, with a sub-datapoint for every column in that table.