Almost all data platforms start with a change data capture (CDC) service to extract data from an organisation's transactional databases, the source of truth for its most valuable data. That data is then transformed, joined, and aggregated to drive analysis, modelling, and other downstream services.
However, this data has not been designed for these use cases. It has been designed to satisfy the needs and requirements of the source applications and their day-to-day use. It can therefore take significant effort to transform the data, and to deduce and derive what is missing from it, before it is useful downstream.
Furthermore, breaking changes and data migrations will be a regular part of the application's evolution and will be made without knowledge of how the data is used downstream, leading to breakages in key reports and data-driven products.
For downstream users to confidently build on this valuable data, they need to know the data they are using is accurate, complete, and future-proof. This is the data contract.
Introducing data contracts
When something happens in a source application that produces a change, we should consider how that change would be consumed by downstream users and produce an event (a record of that change). The structure of this event will be defined not by how that change is represented in our database, but by how we can make it as easy as possible to consume for downstream users, including those we know of, those we anticipate in future, and those we may never know of.
These events would have a strongly defined schema. This schema is a contract between us and our downstream users. We treat this contract with the same care as we treat our application's APIs (i.e. the contract with our customers). We document it, we version it, and we try our best to avoid any breaking changes.
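To make "we version it, and we try our best to avoid any breaking changes" concrete, here is a minimal sketch of a contract expressed as data, with a check that compares two versions. The `Field` and `Contract` classes, the field names, and the rule set (removing a field or changing its type is breaking, adding a field is not) are all assumptions for illustration, not a standard or an existing library.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str


@dataclass(frozen=True)
class Contract:
    name: str
    version: int
    fields: tuple[Field, ...]


def breaking_changes(old: Contract, new: Contract) -> list[str]:
    """List changes in `new` that could break a consumer written against `old`.

    Assumed rules: removing a field or changing its type is breaking;
    adding a new field is not.
    """
    problems = []
    new_fields = {f.name: f for f in new.fields}
    for f in old.fields:
        if f.name not in new_fields:
            problems.append(f"removed field: {f.name}")
        elif new_fields[f.name].type != f.type:
            problems.append(
                f"changed type of {f.name}: {f.type} -> {new_fields[f.name].type}"
            )
    return problems


# Hypothetical contract versions for an "order placed" event.
v1 = Contract("order_placed", 1, (Field("order_id", "string"), Field("total", "decimal")))
v2 = Contract("order_placed", 2, v1.fields + (Field("vat_amount", "decimal"),))
v3 = Contract("order_placed", 3, (Field("order_id", "string"),))

print(breaking_changes(v1, v2))  # adding a field is treated as compatible
print(breaking_changes(v1, v3))  # removing "total" is reported
```

A check like this could run in the producing service's build, so that a breaking change is caught before release rather than discovered later in a downstream report.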
As an example, imagine we have an e-commerce application. Using a CDC service you might get a feed of the orders table, payments, customers, order events, etc. All downstream consumers have to join those together using the same logic as the generating service. As the application evolves, maybe because we need to start charging VAT or we’ve expanded into different markets, the downstream logic, reports, and services all have to be kept in sync, or else they start failing (perhaps silently).
Instead, by adopting the concept of data contracts, an event will be generated when an order is placed and will contain all the fields that are associated with it (e.g. the customer and payment details). Further events will be generated as that order progresses. The structure of these events is not tied to the transactional database and should remain compatible as the transactional database evolves. Downstream users no longer need to maintain matching logic and data models and can confidently and quickly build upon this data.
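As a rough illustration of that coupling, the sketch below shows the kind of logic a consumer ends up reimplementing on top of raw CDC rows. The table shapes, column names, and the "captured" status rule are hypothetical stand-ins for the application's internal model.

```python
def order_total_from_cdc(order_id, orders, payments):
    """Rebuild an order summary from raw CDC rows (hypothetical tables).

    The consumer has to know internal rules of the source application,
    such as which payment statuses count as paid.
    """
    order = next(o for o in orders if o["id"] == order_id)
    paid = sum(
        p["amount_cents"]
        for p in payments
        if p["order_id"] == order_id and p["status"] == "captured"
    )
    return {
        "order_id": order["id"],
        "customer_id": order["customer_id"],
        "paid_cents": paid,
    }


orders = [{"id": 1, "customer_id": 7}]
payments = [
    {"order_id": 1, "amount_cents": 1000, "status": "captured"},
    {"order_id": 1, "amount_cents": 500, "status": "failed"},
]
summary = order_total_from_cdc(1, orders, payments)
```

If the application later renames the status value or splits the payments table, this function keeps running but returns the wrong total, which is the silent failure described above.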
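One way to picture such an event is a function in the source application that assembles it from the internal rows at the moment the order is placed. This is only a sketch: the function name, the event fields, and the shapes of the `order`, `customer`, and `payment` dictionaries are assumptions made for the example.

```python
import json


def build_order_placed_event(order: dict, customer: dict, payment: dict) -> dict:
    """Map internal rows onto a consumer-facing event (hypothetical schema).

    Only this mapping needs to change when the internal tables change;
    the event's shape stays the same for downstream users.
    """
    return {
        "event_type": "order_placed",
        "schema_version": 1,
        "occurred_at": order["created_at"],
        "order_id": str(order["id"]),
        "customer": {"id": str(customer["id"]), "email": customer["email"]},
        "payment": {
            "method": payment["method"],
            "amount_cents": payment["amount_cents"],
            "currency": payment["currency"],
        },
    }


event = build_order_placed_event(
    order={"id": 1, "created_at": "2022-01-01T10:00:00+00:00"},
    customer={"id": 7, "email": "jane@example.com"},
    payment={"method": "card", "amount_cents": 1000, "currency": "GBP"},
)
payload = json.dumps(event)  # serialised form that would be published to consumers
```

Because the join between order, customer, and payment happens once, inside the service that owns the data, consumers read a single self-contained record instead of repeating that logic.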
While the example is small, it highlights a common problem that almost all users of a data platform are struggling with right now. This is from Shopify’s recent article on their CDC setup:
One of the unfortunate aspects of CDC is the tight coupling of the external event consumers to the internal data model of the source. Breaking changes and data migrations can be a normal part of application evolution. By allowing external consumers to couple on these fields, the impact of breaking changes spreads beyond the single codebase into multiple consumer areas.
As they say, the impact of this problem can be massive, from misreported revenue, to poorly performing fraud models, to failures and incidents in downstream services.
We’ve been living with these problems for years, treating their symptoms by deploying services such as data cataloguing, lineage, and automated testing tools. While these tools still have a role to play, they shouldn’t be there solely to catch and alert on upstream data issues, as if those issues could never be addressed at the source.
If you’re in a position to do so, change your upstream processes to emit data as consumable events that downstream users can confidently build on. Treat data as a first-class output of your service, and guarantee it with a data contract.
Originally published at https://andrew-jones.com/blog/data-contracts/.