From the course: Azure Data Engineer Associate (DP-203) Cert Prep: 1 Design and Implement Data Storage

Design for data pruning

- [Instructor] Let's walk through how to design for data pruning, an advanced technique that can save you a lot of time in big data workflows. On the Azure platform, pruning is on by default when you use Databricks Runtime 6.1 or above, and it's controlled by this configuration option, dynamicPartitionPruning. When that option is set to true, a query can skip files that aren't needed for your calculation, which can dramatically speed up big data queries. Let's take a look at what this looks like in a chart from Microsoft Learn. In turquoise, you can see the runs with DFP off, meaning the data pruning algorithm has been disabled. In orange, you can see how much less time it takes to query this data when it's on. So one of the big takeaways is that just by keeping the default in Databricks Runtime 6.1 or above, you get a very substantial amount of skipping of files that aren't necessary for that big data operation.
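
To make the idea of skipping unnecessary files concrete, here is a minimal, purely illustrative sketch in plain Python. It is not how Databricks implements pruning. The `DataFile` class, the `prune_files` and `total_amount` helpers, and the sample data are all hypothetical names invented for this example. The sketch assumes each file carries min/max statistics for a key column, so a query can rule a file out without reading it.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical data file with min/max statistics for one key column."""
    name: str
    min_key: int
    max_key: int
    rows: list  # list of (key, amount) tuples


def prune_files(files, wanted_keys):
    """Keep only files whose [min_key, max_key] range could hold a wanted key."""
    return [
        f for f in files
        if any(f.min_key <= k <= f.max_key for k in wanted_keys)
    ]


def total_amount(files, wanted_keys, pruning_enabled=True):
    """Sum amounts for the wanted keys; return (total, number_of_files_read)."""
    candidates = prune_files(files, wanted_keys) if pruning_enabled else files
    total = sum(
        amount
        for f in candidates
        for key, amount in f.rows
        if key in wanted_keys
    )
    return total, len(candidates)


# Illustrative data only: three files covering disjoint key ranges.
files = [
    DataFile("part-0", 1, 100, [(5, 10.0), (80, 2.5)]),
    DataFile("part-1", 101, 200, [(150, 7.0)]),
    DataFile("part-2", 201, 300, [(250, 1.0), (299, 4.0)]),
]
wanted = {150}  # keys left after filtering the small side of a join

print(total_amount(files, wanted, pruning_enabled=False))  # reads every file
print(total_amount(files, wanted, pruning_enabled=True))   # reads matching files only
```

Both calls return the same total, but the pruned run touches one file instead of three. That difference in files read is the effect the chart measures as query time.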
