Databricks Integration

Transcend maintains an integration for Databricks and Databricks Lakehouse databases that supports Structured Discovery and DSR Automation functionality, allowing you to:

  • Scan your database to identify datapoints that contain personal information
  • Programmatically classify the data category and storage purpose of datapoints
  • Define and execute DSRs directly against your database

The first step to connecting Databricks to Transcend is to add the Databricks integration through the Connect Integrations page.

Create a service principal by following these instructions and generate a Client ID and Client Secret. Be sure to assign the Account Admin role to the service principal.

Enter the Account ID, Account URL, Client ID, and Client Secret into the Integration Connection Form in Transcend and select Connect.
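
Before filling out the form, it can help to confirm that the Client ID and Secret actually work. The sketch below is one way to check, using the OAuth client-credentials flow against the account console; the host shown is for AWS-hosted accounts (Azure and GCP hosts differ), and the ACCOUNT_ID, CLIENT_ID, and CLIENT_SECRET values are placeholders.

```python
import requests

ACCOUNT_ID = "<databricks-account-id>"        # placeholder
CLIENT_ID = "<service-principal-client-id>"   # placeholder
CLIENT_SECRET = "<service-principal-secret>"  # placeholder

# OAuth machine-to-machine (client credentials) flow against the account console.
resp = requests.post(
    f"https://accounts.cloud.databricks.com/oidc/accounts/{ACCOUNT_ID}/v1/token",
    auth=(CLIENT_ID, CLIENT_SECRET),
    data={"grant_type": "client_credentials", "scope": "all-apis"},
    timeout=30,
)
resp.raise_for_status()
print("Token issued; expires in", resp.json().get("expires_in"), "seconds")
```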

The first step to connecting Databricks Lakehouse to Transcend is to add the Databricks integration through the Connect Integrations page.

Create a service principal by following these instructions and generate a Client ID and Client Secret. Be sure to grant the USE CATALOG, USE SCHEMA, SELECT, and MODIFY privileges to the service principal on each catalog that you want Transcend to have access to.
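
These privileges can be granted in the Catalog Explorer UI or with SQL GRANT statements. The sketch below applies them at the catalog level using the Databricks SQL connector for Python, run as an admin who can manage grants on the catalog; the catalog name, warehouse details, token, and service principal application ID are all placeholders.

```python
from databricks import sql  # pip install databricks-sql-connector

CATALOG = "my_catalog"                             # placeholder
PRINCIPAL = "<service-principal-application-id>"   # placeholder

grants = [
    f"GRANT USE CATALOG ON CATALOG {CATALOG} TO `{PRINCIPAL}`",
    f"GRANT USE SCHEMA ON CATALOG {CATALOG} TO `{PRINCIPAL}`",
    f"GRANT SELECT ON CATALOG {CATALOG} TO `{PRINCIPAL}`",
    f"GRANT MODIFY ON CATALOG {CATALOG} TO `{PRINCIPAL}`",
]

with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/<sql-warehouse-id>",       # placeholder
    access_token="<admin-token>",                             # placeholder
) as conn, conn.cursor() as cursor:
    for statement in grants:
        cursor.execute(statement)
```

Granting at the catalog level lets the schema- and table-level privileges be inherited, so new schemas and tables added later are covered automatically.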

Enter the SQL Warehouse ID, Warehouse URL, Client ID, and Client Secret into the Integration Connection Form in Transcend and select Connect.
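
If you want to double-check the warehouse details before submitting the form, a quick connectivity test like the sketch below can help. The hostname, warehouse ID, and token are placeholders; the HTTP path is derived from the SQL Warehouse ID shown on the warehouse's Connection details tab.

```python
from databricks import sql  # pip install databricks-sql-connector

SERVER_HOSTNAME = "<workspace-host>.cloud.databricks.com"  # hostname portion of the Warehouse URL
WAREHOUSE_ID = "<sql-warehouse-id>"                        # from the warehouse's Connection details tab

with sql.connect(
    server_hostname=SERVER_HOSTNAME,
    http_path=f"/sql/1.0/warehouses/{WAREHOUSE_ID}",
    access_token="<token-for-the-service-principal>",  # placeholder
) as conn, conn.cursor() as cursor:
    cursor.execute("SELECT current_user(), current_catalog()")
    print(cursor.fetchone())
```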

The Datapoint Schema Discovery plugin in the Databricks Lakehouse integration allows you to programmatically scan your database to identify the pieces of data it contains and pull them into Transcend as objects and properties. Once the data is in Transcend, it can be classified, labeled, and configured for DSRs.

The plugin operates by sampling the database and generating an object within Transcend for each identified collection. It also discovers embedded arrays within these collections, each of which is returned as an object prefixed with the name of the parent collection for clarity. The plugin then inspects the documents within each collection and tracks every property it encounters, including nested properties. This comprehensive scanning process ensures a thorough mapping of your database structure within Transcend.
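
For a rough sense of what sampling a Lakehouse catalog involves (this is an illustration only, not Transcend's implementation), a discovery pass amounts to enumerating columns from information_schema and reading a handful of values from each one; the catalog, warehouse, and token below are placeholders.

```python
from databricks import sql  # pip install databricks-sql-connector

CATALOG = "my_catalog"  # placeholder

with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/<sql-warehouse-id>",       # placeholder
    access_token="<token>",                                   # placeholder
) as conn, conn.cursor() as cursor:
    # Enumerate every column the service principal can see in the catalog.
    cursor.execute(
        f"""SELECT table_schema, table_name, column_name, data_type
            FROM {CATALOG}.information_schema.columns
            WHERE table_schema <> 'information_schema'"""
    )
    for schema, table, column, data_type in cursor.fetchall():
        # Pull a few values from each column as a classification sample.
        cursor.execute(f"SELECT `{column}` FROM {CATALOG}.`{schema}`.`{table}` LIMIT 5")
        sample = [row[0] for row in cursor.fetchall()]
        print(f"{schema}.{table}.{column} ({data_type}): {sample}")
```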

To enable the Datapoint Schema Discovery plugin, navigate to the Structured Discovery tab within the Databricks Lakehouse data silo and toggle the plugin on. From there, you'll be able to set the frequency at which the plugin runs to discover new objects and properties as they are added to the database. Note: We recommend scheduling the plugin to run at times when the load on the database is lightest.

Transcend's Structured Discovery tool automatically classifies the data discovered in your database. By leveraging machine learning techniques, we can categorize and recommend the processing purpose for each piece of data discovered. With Structured Discovery, Transcend helps you keep your Data Inventory up to date through inevitable database schema changes. Check out our full Structured Discovery guide for more information about how it works.

With the Transcend Databricks and Databricks Lakehouse integration, you can fulfill DSRs directly against a Databricks or Databricks Lakehouse database by running Databricks operations with our custom JSON payload for the desired data actions on each datapoint.

The first step to setting up DSRs against the Databricks Lakehouse database is creating the datapoints in the data silo that should be queried. We typically recommend creating a datapoint for each collection in the database that stores personal data (or any collections you want to action DSRs against). For example, let's say there is a collection called Chat History that contains all the messages sent back and forth from a customer. You could create a datapoint for Chat History in the data silo and enable the specific data actions needed. If you're using Structured Discovery, you can enable the Datapoint Schema Discovery plugin to create the datapoints for you automatically.
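For illustration only, the sketch below shows the kind of statements an access and an erasure request might translate to against a hypothetical chat_history table keyed by a customer email; Transcend issues the actual operations for you once the datapoint and its data actions are configured. The table name, identifier column, connection details, and token are placeholders, and the example assumes a connector version that supports native named query parameters.

```python
from databricks import sql  # pip install databricks-sql-connector

TABLE = "my_catalog.support.chat_history"  # hypothetical datapoint
EMAIL = "data.subject@example.com"         # identifier supplied by the DSR

with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/<sql-warehouse-id>",       # placeholder
    access_token="<token>",                                   # placeholder
) as conn, conn.cursor() as cursor:
    # ACCESS: export the data subject's chat messages.
    cursor.execute(f"SELECT * FROM {TABLE} WHERE customer_email = :email", {"email": EMAIL})
    records = cursor.fetchall()

    # ERASURE: delete those rows once the access copy has been delivered.
    cursor.execute(f"DELETE FROM {TABLE} WHERE customer_email = :email", {"email": EMAIL})
```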

Pro tip: Check out the Transcend Terraform Provider for options on managing data silos and data points in code.