Data quality testing without monitoring pipelines is like running a kitchen without inspecting ingredients. No matter how skilled the chef, bad ingredients ruin the dish. #dataengineering #dataquality #analytics
That's far too clean. You're missing all the silos and the people trying to shoot holes in the infrastructure.
Also too many cooks spoil the broth
xkcd being accurate once again!
Yes, checks within your pipeline as monitoring are important, but if you have data quality issues, it’s not your pipeline’s problem, and associating pipelines with data quality often leads people to mix the two concepts. Yes, you should test your ingredients, but if you need to put time (and money) somewhere, it’s usually better to spend the effort checking your provider’s process rather than checking their products once they arrive in the kitchen, even if, ideally, you need to do both. Because if you have to cook for thousands of people and you realize all of your meat is in fact made of plastic, it will be a little late to order new ingredients. The reality is that we often ask data engineers to solve issues they didn’t create, to understand undocumented dynamics within data sources, and to imagine complex solutions to work around terrible decisions people made, just because it’s too late to change the provider’s process. Data quality is the responsibility of the data source: the same way people (hopefully) monitor code quality in their applications, they should monitor the quality of the data they produce. Data quality is part of the product exposed to your colleagues or to your business partners. POs should be incentivized on this metric too.
Not only does it ruin the dish, it ruins the pan too. Applying certain data quality checks at the source greatly reduces the impact on the dish as well as the pan.
The funny part is that most product and top management are oblivious to the detrimental effects of that ingestion pipeline. I would say it's also sad when they are caught off guard by how a small change in the data source schema blows the engine.
One huge python script pulling data from flat files and excel sheets
Constant demand from the business to build dashboards and do analytics makes it difficult to invest more time in data quality and observability. Data quality and observability should be more popular than they are now.
I don't like this visual representation.
There's only one person who understands the ingestion pipeline and won't allow others to make changes because the associated tech debt acts as job security.