From the course: Azure Data Engineer Associate (DP-203) Cert Prep: 1 Design and Implement Data Storage
Design for data pruning - Azure Tutorial
- [Instructor] Let's walk through how to design for data pruning, an advanced technique that can save you a lot of time in big data workflows. On the Azure platform, with Databricks Runtime 6.1 or above, pruning is on by default, and it's controlled by this configuration option, dynamicPartitionPruning. When this is set to true, a query can skip files that aren't necessary for its calculation, and that's what gives you the ability to dramatically speed up big data queries. So let's take a look at what this looks like in a chart from Microsoft Learn. With DFP off, shown in turquoise, the data pruning algorithm has been disabled. In orange, you can see how much less time it takes to query the same data when it's on. So one of the big takeaways here is that by keeping the default in Databricks Runtime 6.1 or above, you enable a very substantial amount of skipping of files that aren't necessary for that big data operation.
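To make the idea of skipping files concrete, here is a minimal sketch in plain Python. It is a toy model, not how Databricks implements pruning: the `FileStats` class, the `prune_files` function, the file names, and the `item_id` key are all hypothetical and chosen only for illustration. It assumes each data file records the minimum and maximum value of a key, so a file can be skipped when none of the keys the query needs fall inside that range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file metadata: min/max of the key stored in that file."""
    path: str
    min_key: int
    max_key: int


def prune_files(files, wanted_keys):
    """Keep only files whose [min_key, max_key] range could contain a wanted key."""
    wanted = set(wanted_keys)
    return [
        f for f in files
        if any(f.min_key <= k <= f.max_key for k in wanted)
    ]


# Hypothetical data files, each covering a range of item_id values.
fact_files = [
    FileStats("part-0000.parquet", 1, 100),
    FileStats("part-0001.parquet", 101, 200),
    FileStats("part-0002.parquet", 201, 300),
    FileStats("part-0003.parquet", 301, 400),
]

# Keys the query actually needs, known only once the query is running.
selected_item_ids = [150, 160, 320]

to_scan = prune_files(fact_files, selected_item_ids)
skipped = len(fact_files) - len(to_scan)
print([f.path for f in to_scan], "skipped:", skipped)
```

In this toy run only the files whose ranges cover the selected keys are read, and the rest are never opened. The more files a query can rule out this way, the larger the time saving.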