Data Governance System Introduction

DataShyft’s Data Governance System (DGS) is built to keep track of data items processed by DataShyft integration and orchestration pipelines.  When a data item enters a pipeline, whether generated by a component or retrieved from an external system, it is assigned a unique tracking ID.  This tracking ID travels through the pipeline along with the data item.  Whenever a component transforms the data item, stores it in an external system, or transfers it from one DataShyft system to another, the DGS records that action.  Notably, the DGS records the details of any external system that data is retrieved from or written to for auditing purposes.

Access to data items can be revoked when their use is no longer allowed.  When this occurs, the Data Governance components in the pipeline stop passing revoked items down the pipeline and trigger the deletion portion of the pipeline to process any revoked items that have been stored in external systems.  The deletion process invokes the appropriate delete components to remove the data from the external systems and record the successful deletions in the DGS.

By tracking data items, the DGS has a complete history of each data item’s movements throughout a pipeline.  This history can be fed to existing reporting tools to create reports detailing the processing, storage, and deletion of data, providing an automated way to certify the deletion of data from external systems.

Core Principles

The Data Governance system within DataShyft tracks data as it is shared between systems, both internally and externally, to facilitate the automatic removal of shared data when it is no longer required or authorized for use by the external systems.  When data access expires, the Data Governance system triggers automatic removal processes to delete the data shared with the other systems.  Once removal is complete, the Data Governance records can be used to generate reports verifying the lifecycle of the data and certifying its removal.

The Data Governance capabilities of DataShyft are designed to track data as it flows through a data integration and orchestration pipeline.  The system tracks the retrieval of data from external systems, the generation of new data from existing data (i.e. transformations), the movement of data from one system to another within a deployed pipeline, and the delivery/storage of data into external systems.

Registering Data

DataShyft components that retrieve or generate data (e.g. Readers and Generators) will automatically register each of the data items they retrieve or generate with the data governance system.  This registration process both assigns the data item a unique tracking ID (its DGS ID) and records its retrieval or generation in the tracking database.  This tracking record contains information identifying the component that registered the data, the system the component was running on, the external source of the data (if appropriate), and the timestamp when the registration occurred.  The assigned DGS ID travels with the data item through the entire pipeline and is used in subsequent data governance operations to identify the data item being processed.
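
For illustration, a registration record might contain fields like the following (the field names and values are illustrative, not the actual schema):

{
 "dgsId": "3f6c9a12-8b1d-4e2a-9c47-5d0e1f2a3b4c",
 "status": "REGISTERED",
 "component": "SalesDbReader",
 "custodian": "pipeline-node-01",
 "source": "jdbc:postgresql://crm-db:5432/sales",
 "timestamp": "2024-06-01T12:00:00Z"
}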

Tracking Data Movement

When data moves from one system within a pipeline to another, the data governance system records this movement.  A tracking record is added to the tracking database that records the arrival of the data item on the new system and which system it came from. 
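
Sketched in the same illustrative form as above, a movement record captures the arriving item, the receiving system, and the system it came from:

{
 "dgsId": "3f6c9a12-8b1d-4e2a-9c47-5d0e1f2a3b4c",
 "status": "RECEIVED",
 "custodian": "pipeline-node-02",
 "sourceCustodian": "pipeline-node-01",
 "timestamp": "2024-06-01T12:00:05Z"
}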

Deriving Data

Using Transformation components, a pipeline can use input data items to derive new data items.  The new data items can be as simple as filtered versions of the input data, or as complex as multiple outputs generated from a single input (e.g. breaking up a CSV file into multiple records).  When new data is returned from a transform, the new data items are automatically registered with the data governance system as derivatives of the original input item.  In this way, a provenance chain is created that allows the data governance system to know the parentage of all the data items it contains.  This is important for managing the data lifecycle and revoking access to data.

Storing Data

DataShyft components that send data to external systems (e.g. Writers) record this storage in the data governance system.  A tracking record is created for the DGS ID of the stored item that contains the ID of the component that wrote the data out, the system on which the storage component was running, the external system to which the data was written, the timestamp when the storage occurred, and a record containing information on how the data item was stored and how to remove it from the system (e.g. unique keys, paths, entity IDs).  Recording this information provides the automatic cleanup process with the information it needs to remove each data item from the external system when required.
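
As an illustrative sketch (the component and field names are not the actual schema), a storage record for an item written to an S3-style object store might look like:

{
 "dgsId": "3f6c9a12-8b1d-4e2a-9c47-5d0e1f2a3b4c",
 "status": "STORED",
 "component": "ObjectStoreWriter",
 "custodian": "pipeline-node-02",
 "destination": "s3://archive-bucket",
 "timestamp": "2024-06-01T12:00:09Z",
 "removalInfo": {
  "bucket": "archive-bucket",
  "key": "exports/2024/06/01/record-000123.json"
 }
}

The removalInfo object is the piece the cleanup process later uses to delete the item from the external system.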

DGS Statuses

All data items tracked by the DataShyft DGS have a status assigned to them.  Many of these statuses are assigned automatically by DataShyft components.  Some can be explicitly assigned using a DGS Status Updater component.  There are six statuses that a tracked data item may have within the DataShyft DGS.

REGISTERED — When a tracked data item is first registered, it is set to the REGISTERED status.  This status may also be accompanied by information on the resource the newly registered data item was retrieved from.

RECEIVED — The RECEIVED status is used to indicate that a tracked data item has arrived on another system in the deployment. This status is set automatically when a data item moves from one system to another.

STORED — The STORED status is used to indicate that a tracked data item has been stored in an external system. It is accompanied by a JSON object containing the metadata (e.g., keys and values, paths, entity IDs, etc.) needed to remove the tracked data item from the external system.  This status is set automatically when a data item is output by any of DataShyft’s Writer components.  It may also be accompanied by information on the resource into which the data item was stored.

REVOKED — The REVOKED status is used to indicate that access to a tracked data item is no longer allowed. It indicates that the tracked data item should be removed from any external systems to which it has been added. This status can be set explicitly using the DGS Status Updater component, or automatically via the expiration process or cascade deletion logic.

DELETED — The DELETED status is used to indicate that a tracked data item has been removed from a storage location.  DataShyft’s various Deleter components automatically set this status on any data item that they delete from an external resource.  The Get Revoked Items component will automatically set data items to this state if they have been revoked but haven’t been stored anywhere.  An orchestration pipeline can explicitly set this state using the DGS Status Updater component if needed.

RETAINED — The RETAINED status is used to indicate that a tracked data item is a work product and should not be revoked either by the automatic expiration and revocation process or by a cascade revocation of a parent.  It can still be explicitly revoked using the Status Updater if required.
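
Taken together, these statuses trace a typical lifecycle.  For example, a data item read from an external source, forwarded to a second system, written out, and later revoked and cleaned up would pass through:

REGISTERED → RECEIVED → STORED → REVOKED → DELETED

RETAINED sits outside this flow: it exempts an item from automatic revocation until it is explicitly revoked.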

Data Provenance

The Data Governance System tracks the provenance of data within a deployment.  This allows the relationship between data items to be tracked and for the revocation of a parent item to automatically trigger the revocation of child items.  There are three ways for this parent-child relationship to be created.

First, when data is moved from one system to another, it will automatically be registered on the new system.  This will create a record indicating that the data item on the destination system was derived from the data item on the source system.

Second, if a component returns new data items in response to receiving an input data item (as the Database readers do), the retrieved items will be registered as being derived from the triggering input item.

Third, if a Transform component returns new data items, those data items are registered as having been derived from the input item. The transform can control how the derivation is recorded by attaching a dgsParents tag to the returned items that explicitly specifies the parent IDs. If the tag is present, the item is registered as normal, and a record is created for each listed parent indicating that the newly registered item was derived from that parent.  In this way, it is possible to have a data item that is derived from multiple parents.
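
For example, a transform that merges two inputs into a single output might tag the returned item as follows (the surrounding structure is illustrative; only the dgsParents tag is documented here):

{
 "data": "...merged content...",
 "dgsParents": [
  "3f6c9a12-8b1d-4e2a-9c47-5d0e1f2a3b4c",
  "7a1b2c3d-4e5f-6071-8293-a4b5c6d7e8f9"
 ]
}

The merged item would then be registered with two derivation records, one for each listed parent.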

Cascade Revocation

When a data item is revoked, either explicitly by a pipeline or implicitly by the DGS expiration process, DGS will execute a cascade revocation.  A cascade revocation revokes all data items derived from the revoked item: the system recursively finds all descendants of the revoked item and sets their status to REVOKED.  The cascade revocation process allows the data owner to ensure that any data items derived from their original item are revoked along with the original data, preventing proliferation and loss of control of their data.

Cascade revocation has a few limitations.  First, items that are already revoked or that have been marked as deleted are not updated.  Second, if a derived item has been marked as retained, it will not be updated to revoked, but the revocation process will continue to the children of the retained item.
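
A recursive SQL query illustrates the idea.  Assuming a hypothetical parentage table dataItemParents(dataId, parentId) that holds the derivation records described above (the table and column names are illustrative), the descendants of a revoked item could be collected with a recursive common table expression:

WITH RECURSIVE descendants AS (
  -- direct children of the revoked item
  SELECT dataId FROM dataItemParents WHERE parentId = '<revokedId>'
  UNION
  -- children of items already found
  SELECT p.dataId FROM dataItemParents AS p
  JOIN descendants AS d ON p.parentId = d.dataId
)
SELECT dataId FROM descendants;

In the actual process, descendants that are already REVOKED or DELETED are left unchanged, and RETAINED descendants keep their status while the walk continues to their children.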

Data Retention

The DataShyft Data Governance System allows a pipeline to specify that certain data items are being retained and should never be automatically revoked.  This retained status is intended for work products or other data items that are intended to remain once access to the original data is revoked.  When a data item is marked as retained, DGS will not automatically revoke it, even if it matches the criteria for automatic revocation, and will not revoke it when one of its ancestors is revoked.  A retained data item may still be explicitly revoked, however, by setting its status to REVOKED using a DGS Status Updater component.

DGS Context

DGS components operate within a context.  All components in a single deployment use the same context so that they can share Data Governance state with one another.  The default value for the context is the ID of the deployment.  This ensures that multiple deployments don’t interfere with each other’s operation.  

In some cases, however, it may be necessary to have multiple deployments that use the same context.  This can include situations where a deployment is restarted and needs to resume operations from where it was previously, or situations where multiple deployments are operating in concert with one another (e.g. when using Data Exchange components).  In these cases, the deployments’ context can be specified using the dgs.context blueprint setting.  The value given to this setting is used as the context for all deployments created from the blueprint.  If the same value is used in different blueprints, then deployments from each of those blueprints will operate in the same shared context.
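
For example, to have two blueprints share a context, give each the same setting value (shown here as a simple name/value pair for illustration; the setting is created in whatever form your blueprints use):

dgs.context = shared-exchange-context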

DGS Cleanup

The Data Governance database stores information about all the data items being tracked across all of a customer’s deployments.  This means that it can accumulate large quantities of data.  The Data Governance database has both an active and a historical part.  The active portion of the database has a limited clean-up process performed on it automatically.  Additional manual cleanup may be appropriate in cases where the automatic cleanup is insufficient.

Automatic Cleanup

The Data Governance System automatically considers cleaning up data items whose status has been set to DELETED.  This is done to prevent unnecessary bloat in the DGS storage layer and avoid performance degradation.  Any data item whose status is set to DELETED is evaluated to see whether it is a candidate for removal; an item is a candidate if it has no children.  Candidates are removed from the DGS storage.

The cleanup process affects only the active information managed by the DGS Storage.  Historical events are retained indefinitely.

The pruning process is invoked periodically.  It will only clean up data items that have been in the DELETED status for a minimum amount of time; by default, this minimum is 30 seconds.  This allows enough time for asynchronous processes in a deployment to link up to the data item if necessary (e.g. derived data items).  If it is necessary to change the minimum time before an item is cleaned up, create a blueprint setting called dgs.pruneDelay and set its value to the number of seconds before a deleted item becomes eligible for pruning.  The prune delay must be greater than 0; a value of 0 or less is ignored, and the default 30-second delay is used instead.
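
As a rough sketch of the eligibility check, modeled on the tables used in the queries later in this document (the dataItemParents table and the updatedAt column are hypothetical; PostgreSQL syntax shown), an item is prunable when something like the following query returns it:

SELECT DI.id FROM dataItems AS DI
JOIN dataItemStatuses AS DIS ON DI.id = DIS.dataId
LEFT JOIN dataItemParents AS P ON P.parentId = DI.id
WHERE DIS.status = 'DELETED'
  AND DIS.updatedAt < NOW() - INTERVAL '30 seconds'  -- the dgs.pruneDelay window
  AND P.dataId IS NULL;  -- the item has no children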

Manual Cleanup

Manual cleanup is performed by executing queries directly against the Data Governance database.  The Common Data Governance Queries section below includes example queries for purging both the historical and the active data for a deployment.


Filter Criterion

The DGS Filter component is used to apply filters to data items based on rules retrieved from the DGS Backing Store. These rules are referred to as Filter Criteria and are referenced by the ID of the filter they belong to. This allows multiple Filter Criteria to be assigned to the same filter to provide more thorough filtering of the data.  The Filter Criteria are applied in order of their cardinality.

Below is a JSON representation of a Filter Criterion, followed by a description of each of its fields.  Note that not all fields are required for all Filter Criteria.

{
 "filterID": "filter1",
 "cardinality": "1",
 "type": "com.datashyft.pipeline.dataobjects.DatabaseRowData",
 "source": "salary",
 "pattern": "+*",
 "replacement": "REDACTED",
 "min": 10,
 "max": 20,
 "equals": 30,
 "action": "REPLACE"
}

The fields in the Filter Criterion object have the following meanings:

  • filterID: (Required) The filter group to which this filter criterion belongs.
  • cardinality: (Required) The cardinality of this filter criterion.  Filter criteria are evaluated in cardinality order.
  • type: (Required) The fully qualified class name of the datatype this criterion will match.
  • source: The attribute/field that is being checked/modified by this rule.
  • pattern: The Regular Expression pattern that the source field’s value must match for this rule to apply. Often used in conjunction with the replacement attribute.
  • replacement: The value that should be substituted in place of the matched pattern in the source field. This attribute requires the pattern attribute to be specified.
  • min: The minimum value that the source field must have. This limit only works for source values that are either numeric or strings that contain numeric values.
  • max: The maximum value that the source field must have. This limit only works for source values that are either numeric or strings that contain numeric values.
  • equals: The exact value that the source field must have. This limit only works for source values that are either numeric or strings that contain numeric values.
  • action: (Required) The action to take if this rule matches. The possible actions are ALLOW, DISCARD, REPLACE, and REMOVE.

The ALLOW action tells the DGSFilter that data items that match the rule's attributes should be allowed through the filter. They will be passed downstream without any changes. Once a rule with an ALLOW action is matched, all further criteria processing stops for that data item.

The DISCARD action tells the DGSFilter that data items that match the rule's attributes should be discarded and not allowed through the filter. Once a rule with a DISCARD action is matched, all further criteria processing stops for that data item.

The REPLACE action tells the DGSFilter that the value of the source field of the data item should have any matching patterns in it replaced with the value of the replacement attribute. Filter criteria processing will continue.

The REMOVE action tells the DGSFilter that the source field’s value should be removed from the data item completely. Filter criteria processing will continue.

If a data item is not explicitly ALLOWed or DISCARDed by a matching rule, the data item is subject to the default action on the DGS Filter to determine its fate (either ALLOW or DISCARD).
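
To see how cardinality ordering interacts with the actions, consider a hypothetical pair of criteria in the same filter group: a REPLACE rule at cardinality 1 that redacts the salary field, followed by an ALLOW rule at cardinality 2:

[
 {
  "filterID": "filter1",
  "cardinality": "1",
  "type": "com.datashyft.pipeline.dataobjects.DatabaseRowData",
  "source": "salary",
  "pattern": ".*",
  "replacement": "REDACTED",
  "action": "REPLACE"
 },
 {
  "filterID": "filter1",
  "cardinality": "2",
  "type": "com.datashyft.pipeline.dataobjects.DatabaseRowData",
  "action": "ALLOW"
 }
]

Because REPLACE processing continues to later criteria, a matching item is first redacted and then explicitly allowed downstream.  If the cardinalities were swapped, the ALLOW rule would match first and stop processing before the redaction ever ran.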


Configuring Data Governance Resources in DataShyft

Once the database schema has been created in your database system, DataShyft needs to be configured to access it.  This section covers the steps required to configure DataShyft to access your Data Governance database.  To complete the configuration, you will need the following information:

  • Hostname and port of the database server containing the data governance database.
  • Name of the database containing the data governance schema.
  • The name of the Data Governance schema, typically DataGovernanceSystem.
  • The username and password to use when authenticating with the database server.  This account should have full access to the Data Governance schema.

The following steps walk you through the process of creating a Data Governance Storage resource in DataShyft.  Once created, this resource is used to configure pipelines and deployments and enable data governance.

  1. Log in to your DataShyft tenant.
  2. Navigate to the Resources tab.
  3. Click the Add Resource button to bring up the resource creation view.
  4. Under Resource Type, select Data Governance Storage.
  5. Under Sub Type, select the specific database server type (e.g. PostgreSQL or MS SQL).
  6. Provide a Display Name to identify this resource elsewhere in the UI, and optionally a description.
  7. Fill in the hostname of the database server and, if it isn’t using the default port, the port on which it is listening.
  8. Enter the name of the database where the Data Governance schema was created.
  9. Enter the name of the Data Governance schema, typically DataGovernanceSystem.
  10. Select the credentials to use to access the database.  If you need to add new credentials to access the database, click the provided link to create a new credential.  Once created, it can be selected from the drop-down.
  11. Click Save New Resource to complete the configuration of the Data Governance Storage resource.

Configuring a Blueprint to use Data Governance

Once the configuration of the Data Governance Storage resource is complete, you can configure blueprints to enable Data Governance features.  This is done on a per-blueprint basis.  Instructions on enabling Data Governance in your blueprint are shown below.

  1. Log in to your DataShyft tenant.
  2. Navigate to the Blueprint tab.
  3. Select the blueprint on which you wish to enable Data Governance.
  4. Select the Open button to open the blueprint editor.
  5. In the top right corner of the editor is a red label indicating that Data Governance is disabled for the blueprint.  Click on this label to access the blueprint settings.
  6. In the Data Governance dropdown at the top of the Blueprint Settings dialog, select the Data Governance Storage resource that defines where your Data Governance Database is located.
  7. Once selected, click the Update button to confirm the change.
  8. The red label in the top right corner of the editor will turn green, indicating that Data Governance is now enabled for this blueprint.

Common Data Governance Queries

This section covers common queries of the Data Governance database tables that return useful information regarding the status of tracked data items.  

Active Deployment Data Items Statuses

This query returns the number of items in each status by system (i.e., custodian).  It allows you to quickly determine where data items currently are and how the deployment is progressing.

SELECT DIS.status, DI.custodian, COUNT(*) FROM dataItems AS DI
LEFT JOIN dataItemStatuses AS DIS
ON DI.id = DIS.dataId
WHERE DI.context='<contextID>'
GROUP BY DIS.status, DI.custodian;

Deployment Historical Status

This query returns a summary count of all event statuses for all custodians in the deployment.

SELECT E.custodian, E.status, COUNT(*)
FROM events AS E
WHERE E.context = '<contextID>'
GROUP BY E.custodian, E.status;

Purging Historical Data

To purge data about a specific deployment (i.e., context), you can execute the following query.  This query deletes all historical data about the deployment.

DELETE FROM events
WHERE context = '<contextID>';

Purging Active Deployment Data

To purge data about a specific deployment (i.e., context), you can execute the following query.  This query deletes all data about the deployment from the active tables.  

DELETE FROM dataItems
WHERE context = '<contextID>';