Data Governance Fundamentals

DataShyft provides built-in data governance capabilities that track and monitor data as it flows through deployments. When Data Governance is enabled, a deployment registers every data item that it handles and tracks its progress through the pipeline. If the data is stored in an external system, such as a database or FTP server, that operation is recorded in the Data Governance System along with metadata about where the data was stored and how to remove it. A deployment can therefore trigger the automatic removal of data items by revoking access to them: revocation starts the data removal process defined by the deployment, which removes the data items from the places where they have been stored.

Data Governance is enabled within a blueprint by telling the blueprint what Data Governance Storage Resource to use for storing and tracking the data.

Components provided with DataShyft automatically perform the Data Governance tracking operations appropriate for that component. Details on the specific data governance actions taken by each component can be found in that component's description in this document.

Data Provenance

The Data Governance System tracks the provenance of data within a deployment. This allows relationships between data items to be tracked, so that revoking a parent item automatically triggers the revocation of its child items. There are two ways to create this parent/child relationship.

First, when data is moved from one system to another, it should be passed into a DGS Receiver component on the destination system. This will create a record indicating that the data item on the destination system was derived from the data item on the source system.

Second, if an unregistered data item is passed into a DGS Registrar component and it has a tag called dgsParents, the item is registered as normal, and a record is created for each listed parent indicating that the newly registered item was derived from that parent. In this way, a data item can be derived from multiple parents. For example, if a Transform converts one data object into another, this tag can be used to record the relationship between the transformed object and the original object.
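The registration behavior described above can be pictured with a small model. This is only an illustrative sketch in plain Python, not the DGS Registrar's actual implementation or API: the function register_item, the in-memory registry and provenance structures, and the assumption that the dgsParents tag holds a list of parent IDs are all hypothetical.

```python
def register_item(item_id, tags, registry, provenance):
    """Register an item and record one provenance link per listed parent.

    Hypothetical model: `registry` is a set of known item IDs and
    `provenance` is a list of {"child", "parent"} records.
    Returns True if the item was newly registered, False if already known.
    """
    if item_id in registry:
        # Assumption: only unregistered items are processed.
        return False
    registry.add(item_id)
    # Assumption: the dgsParents tag carries a list of parent item IDs.
    for parent_id in tags.get("dgsParents", []):
        provenance.append({"child": item_id, "parent": parent_id})
    return True


registry = {"raw-1", "raw-2"}
provenance = []
register_item("merged-1", {"dgsParents": ["raw-1", "raw-2"]}, registry, provenance)
```

After the call, the item merged-1 is registered and has two provenance records, one for each parent.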

The data provenance records are used to cascade the revocation of data when the status of a data item is set to REVOKED. The Status Updater will follow the parent/child relationships all the way down the tree and revoke any data item that is a descendant of the original item.
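As a rough illustration of the cascade, the sketch below walks a set of provenance records breadth-first and marks every descendant as REVOKED. It is a minimal model under stated assumptions, not the Status Updater's real code: the function cascade_revoke, the (child, parent) record shape, and the ACTIVE status used for untouched items are hypothetical.

```python
from collections import defaultdict, deque


def cascade_revoke(item_id, derived_from, status):
    """Mark item_id and every descendant as REVOKED.

    derived_from: list of (child_id, parent_id) records (hypothetical shape).
    status: dict of item ID -> status string, updated in place.
    Returns the revoked item IDs in the order they were visited.
    """
    children = defaultdict(list)
    for child, parent in derived_from:
        children[parent].append(child)

    revoked = []
    seen = {item_id}
    queue = deque([item_id])
    while queue:
        current = queue.popleft()
        status[current] = "REVOKED"
        revoked.append(current)
        for child in children[current]:
            # An item with several parents is only visited once.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return revoked


# "d" is derived from both "b" and "c"; "x" and "y" are unrelated.
records = [("b", "a"), ("c", "a"), ("d", "b"), ("d", "c"), ("y", "x")]
status = {item: "ACTIVE" for item in "abcdxy"}
revoked = cascade_revoke("a", records, status)
```

Revoking a reaches b, c, and d, while the unrelated items x and y keep their status.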

DGS Context

DGS components operate within a context. All components in a single deployment use the same context so that they can share state with each other. The default value for the context is the ID of the deployment. This ensures that multiple deployments don’t interfere with each other’s operation.

In some cases, however, it may be necessary to have multiple deployments that use the same context. Examples include a deployment that is restarted and needs to resume operations from where it left off, or multiple different deployments that operate in concert with one another. In these cases, the deployments' context can be specified using the dgs.context blueprint setting. The value given to this setting is used as the context for all deployments created from this blueprint. If the same value is used in different blueprints, then deployments from each of those blueprints will operate in the same shared context.
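The context rules above amount to a simple fallback, sketched here in plain Python. Only the dgs.context setting name comes from this document; the function resolve_context, the dictionary representation of blueprint settings, and the example IDs are hypothetical.

```python
def resolve_context(deployment_id, blueprint_settings):
    """Return the DGS context for a deployment.

    Uses the dgs.context blueprint setting when present and non-empty,
    otherwise falls back to the deployment ID.
    """
    context = blueprint_settings.get("dgs.context")
    return context if context else deployment_id


# Two blueprints that name the same context share DGS state.
ingest = resolve_context("dep-001", {"dgs.context": "orders"})
export = resolve_context("dep-002", {"dgs.context": "orders"})

# A blueprint without the setting stays isolated in its own context.
standalone = resolve_context("dep-003", {})
```

In this model, a restarted deployment that receives a new deployment ID would still resolve to the same context as long as its blueprint sets dgs.context.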

DGS Cleanup

The Data Governance System automatically considers data items whose status has been set to DELETED for cleanup. This prevents unnecessary bloat in the DGS storage layer and avoids performance degradation. Any data item whose status is set to DELETED is evaluated to see if it is a candidate for deletion. An item is a candidate for deletion if it has no children. Candidates are removed from the DGS storage.

The cleanup process affects only the active information managed by the DGS Storage. Historical events are retained indefinitely.

The cleanup (pruning) process is invoked periodically. It only cleans up data items that have been in the DELETED status for a minimum amount of time. By default, this minimum time is 30 seconds. This allows enough time for asynchronous processes in a deployment to still link up to the data item if necessary (e.g. derived data items). In some cases, it may be necessary to change the minimum time before an item is cleaned up. This can be configured by creating a blueprint setting called dgs.pruneDelay and setting its value to the number of seconds before a deleted item is eligible for pruning. The value should be greater than 0; if it is not, the default of 30 seconds is used.
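The eligibility rules for pruning can be summarized in a short sketch. This is an illustrative model only, not the product's implementation: the function names, the argument shapes, and the choice to treat a missing or non-numeric dgs.pruneDelay as the default are assumptions. The DELETED status, the no-children rule, and the 30 second default come from the text above.

```python
DEFAULT_PRUNE_DELAY = 30.0  # seconds, the documented default


def effective_prune_delay(settings):
    """Return the prune delay in seconds from blueprint settings.

    Falls back to the default when the value is not greater than 0.
    Assumption: missing or non-numeric values also use the default.
    """
    raw = settings.get("dgs.pruneDelay")
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return DEFAULT_PRUNE_DELAY
    return value if value > 0 else DEFAULT_PRUNE_DELAY


def is_prunable(status, deleted_at, child_count, now, delay):
    """Check whether an item may be removed from DGS storage.

    An item qualifies only if it is DELETED, has no children, and has
    been in the DELETED status for at least `delay` seconds.
    """
    if status != "DELETED" or child_count > 0:
        return False
    return (now - deleted_at) >= delay


delay = effective_prune_delay({"dgs.pruneDelay": "120"})
too_soon = is_prunable("DELETED", deleted_at=1000.0, child_count=0, now=1060.0, delay=delay)
eligible = is_prunable("DELETED", deleted_at=1000.0, child_count=0, now=1120.0, delay=delay)
```

With a 120 second delay, an item deleted at time 1000 is not yet eligible at time 1060 but becomes eligible at time 1120, provided it still has no children.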
