A contract-based data platform
Data contracts are powerful enough that you can build a data platform based on them: one that is more flexible and more effective than those we built before, and that promotes a more decentralised approach to data.
The key is the metadata contained within the data contract, which describes the data. We build on that metadata, not the data itself, ensuring that any data described in the same way will be compatible with our platform.
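To make that concrete, here is a minimal sketch of what such a contract might look like, expressed as a plain Python structure. Every field name here (owner, source, destination, alerting, governance) is my own illustrative assumption rather than a fixed standard, and the examples later in this post build on this same hypothetical contract:

```python
# A hypothetical data contract, expressed as a plain Python structure.
# Every field name below (owner, source, destination, alerting, governance)
# is an illustrative assumption, not a fixed standard.
order_events_contract = {
    "name": "order_events",
    "owner": "checkout-team",
    "schema": {
        "order_id": {"type": "string", "required": True},
        "amount": {"type": "decimal", "required": True},
        "customer_email": {"type": "string", "required": False, "personal_data": True},
    },
    "source": {"stream": "orders", "poll_interval_minutes": 5},
    "destination": {"warehouse": "bigquery", "dataset": "commerce", "table": "order_events"},
    "alerting": {"channel": "#checkout-data-alerts"},
    "governance": {"classification": "confidential", "retention_days": 365},
}
```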
To take a small example, we might have a service that takes data from a stream (Pub/Sub, Kafka, etc) and writes it to our data warehouse (BigQuery, Snowflake, etc). It’s easy to make that service specific to how this particular data is structured, adding logic that assumes that structure.
Then along comes another stream you want to archive to your data warehouse, or another team has the same need and would rather not build it themselves. But because you built the service around those assumptions, it is too specific to be reused for other types of data without adding logic for each of them.
Instead, you can build a more generic version of that service that works with any data defined by a data contract. The contract can ensure everything you need is present in the metadata (for example, where the data should be written, how often to pull from the stream, where alerts should be sent if data is delayed, and so on), and your service can use that metadata to parse the data and take the right actions on it.
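As a sketch of how such a generic service might look, assuming the hypothetical contract structure above: everything the loader needs comes from the contract’s metadata rather than from hard-coded assumptions, and the stream and warehouse clients are injected so the same code works with any of them.

```python
import json
import logging

logger = logging.getLogger("contract_ingest")


def ingest(contract: dict, pull_messages, write_rows, send_alert) -> None:
    """Load a stream into the warehouse, driven entirely by contract metadata.

    pull_messages, write_rows and send_alert are injected, so the same
    service works with any stream (Pub/Sub, Kafka, ...) and any
    warehouse (BigQuery, Snowflake, ...).
    """
    destination = contract["destination"]            # where the data should be written
    alert_channel = contract["alerting"]["channel"]  # where to report delays

    messages = pull_messages(contract["source"]["stream"])
    if not messages:
        send_alert(alert_channel, f"No data received for {contract['name']}")
        return

    # Validate each record against the contract's schema, rather than
    # against assumptions baked into the service itself.
    rows = []
    for message in messages:
        record = json.loads(message)
        missing = [
            field
            for field, spec in contract["schema"].items()
            if spec.get("required") and field not in record
        ]
        if missing:
            logger.warning("Dropping record missing fields: %s", missing)
            continue
        rows.append(record)

    write_rows(destination["dataset"], destination["table"], rows)
```

Because nothing in the service refers to order events specifically, onboarding another team’s stream is just a matter of writing another contract.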
That’s a small example, but it’s easy to see how it can be taken further: capture whatever else we need in the data contract, and building more advanced services becomes just as easy.
For example, you could categorise your data so you know which fields contain personal data, whether the data is for internal use, confidential, or secret, what the retention period is, and so on. Using that, it’s actually quite easy to build services that automate data governance tasks such as anonymisation, access controls, and the expiration of data.
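Continuing the sketch: the hypothetical contract above flags customer_email as personal data and sets a retention period under governance. With that in place, governance automation can be a small amount of generic code (again, the naming is illustrative only):

```python
import hashlib
from datetime import datetime, timedelta, timezone


def anonymise(record: dict, contract: dict) -> dict:
    """Hash every field the contract marks as personal data."""
    out = dict(record)
    for field, spec in contract["schema"].items():
        if spec.get("personal_data") and field in out:
            out[field] = hashlib.sha256(str(out[field]).encode()).hexdigest()
    return out


def is_expired(loaded_at: datetime, contract: dict) -> bool:
    """Report whether a row has outlived the contract's retention period."""
    retention = timedelta(days=contract["governance"]["retention_days"])
    return datetime.now(timezone.utc) - loaded_at > retention
```

The same two functions work for every dataset on the platform, because the decisions live in each contract rather than in the code.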
You might then ask yourself: should this data be managed centrally, or is it better managed by those who understand the data and the requirements they have for it? After all, if the data stopped being published to the data warehouse due to an issue, they would have the most context for investigating it, and would know how critical it is, who to notify, and so on.
And that’s how you start building a more self-serve data platform that promotes decentralisation and gives data generators autonomy over how they provide their data to the rest of the organisation.
For more on this, I have a chapter in my book that takes a data contract and deploys some infrastructure and an example service. Just a single chapter — about 25 pages in all, including all the instructions and explanations — so this really isn’t difficult to build 🙂
This post was originally published to my daily newsletter on December 1st 2023. Sign up here for posts about data engineering, data platforms and data quality, every day!