mega-boost

ETL pipelines that take BOOST-coded/raw microdata for each country, clean, harmonize, and store them in a single Delta table. Aggregated budget and expenditure data at various levels are also created for query, visualization, and reporting purposes.

Cross-Country Harmonization

BOOST data is country-specific — for instance, some ministries of finance have IT systems that record detailed expenditure data for every provider of public services, while others may only have data aggregated at the district levels. The harmonization work in this repo standardizes the labeling of common BOOST features on the fiscal data at the most disaggregated level available to maximize the flexibility for analysis. The resulting data is stored in a table named boost_gold. The column definitions are detailed at the bottom of this file.

Code Organization

Each country's ETL pipeline code resides in its own folder.
cross_country_aggregate_dlt.py vertically stacks all countries' micro-level data into the boost_gold table and creates aggregates at various levels.
/quality contains the ETL code for processing BOOST Cross Country Interface (CCI) Excel files' "Approved" and "Executed" sheets for all available countries. The resulting tables are used for both:
- Programmatic quality checks in cross_country_aggregate_dlt.py, and
- Manual review and resolution of discrepancies along critical dimensions (total, functional, economic, etc.) in the Quality Dashboard

How to Add a New Country

Create a new folder named using the country name.
Write and test the country-specific ETL code. You may organize the code as follows:

A notebook for extracting the BOOST/raw data from the Excel sheet(s) (e.g. ALB_extract_microdata_excel_to_csv). The resulting CSV files can then be loaded directly by a subsequent Delta Live Table (DLT) script (described in the next point). If the raw data is already in a format supported by DLT load, such as CSV, you may skip this step/notebook altogether. To test the code, run the notebook in Databricks with the assigned project compute cluster (e.g. DAP-PROD-CLUSTER).
A DLT notebook for cleaning and transforming the BOOST/raw data. The resulting data should have each line item tagged with a predefined set of tags in a consistent manner (e.g. ALB_transform_load_dlt). To validate the DLT code and verify the resulting table schema, run the notebook with the default cluster as you would a normal notebook. To populate the tables, create a DLT pipeline with boost_intermediate as the target schema name. Reference an existing country's DLT pipeline settings for other configurations.
Be sure to follow the naming convention (code folder & file names, table names, pipeline names, etc.) referencing existing countries. When a country code is needed, use the 3-letter ISO 3166 country codes.

Add the new country's ETL pipeline steps to the "BOOST Harmonize" job: Workflows > Jobs > BOOST Harmonize > Tasks > Add task

Add the extraction step using Type: Notebook, and Source: git. Use the default cluster as the compute.
Add the DLT step using Type: Delta Live Table pipeline, and select your DLT pipeline.
Check that the step dependencies are configured correctly.

Source the official data, code, and add the ETL pipeline for the country's admin1 level subnational population. Skip this step if the BOOST country data does not have subnational tagging.

Find official statistics and projections from the country's national statistics office if available. Otherwise, check if census.gov might provide subnational population statistics for the given country. Nigeria's subnational population estimates come from census.gov, for example.
Add the ETL code to the mega-indicator repo because it's a public good by itself. Make sure the region names in the subnational population dataset align with the BOOST subnational names for the given country, as the two datasets will be joined on the subnational region names.
Add the country's ETL script to the subnational population pipeline "Indicators on Demand": Workflow > Jobs > Indicators on Demand > Tasks > Add task.
Global Data Lab's subnational human development datasets are used as some outcome indicators. Align subnational region names with those of population and BOOST if necessary.

Add the country to the cross-country aggregation DLT pipeline

Update cross_country_aggregate_dlt.py to add the intermediate table (containing only the new country's cleaned microdata) as a source for stacking into boost_gold.
Execute the "BOOST Agg" DLT workflow to perform the stacking and aggregation: Workflows > Delta Live Tables > BOOST Agg > Start.
The BOOST Agg pipeline has built-in quality checks. If the job fails due to failed quality checks, investigate and fix them.

Check, investigate, and resolve discrepancies with the pre-existing BOOST data workflow

Visit the PFM BOOST Data Quality Dashboard.
Filter Countries to the newly added country. Check that the year coverage range and number of functional & economic category coverage are as expected.
Further filter by Discrepancy %, set min as 1% and 5%, and investigate discrepancies at various dimensions (e.g., func/econ/total). Resolution may involve an iterative process of updating the pre-existing BOOST Excel workflows and/or updating the ETL pipeline code.
Our current acceptable threshold of discrepancy is 5%. If there needs to be a fix in the BOOST CCI EXCEL sheets, docuemnt the content of the changes and leave the logs in the following document: BOOST CCI changes reconciliations.
Make sure that the EXCEL is updated at both BOOST's shared project folder and MEGA project folder for databricks processing

Check the PowerBI report to ensure the new country is reflected and its narratives are correctly presented.

boost_gold

Column Name	Data Type	Description	Possible Values	Possible Value Catalog Query
country_name	string	The name of the country for which the budget data is recorded	e.g., "Kenya", "Brazil". See wbgapi or WB website for a full list	`SELECT DISTINCT country_name FROM indicator.country`
year	int	The fiscal year for the budget data	e.g., 2023, 2024
admin0	string	Who spent the money at the highest administrative level	Either "Central" or "Regional"
admin1	string	Who spent the money at the first sub-national administrative level	If admin0 is "Regional," this is the state or province name; if admin0 is "Central," this is always "Central Scope." "Scope" is postfixed to avoid conflicting with a subnational region named "Central" in some countries	`SELECT DISTINCT admin1_region FROM indicator.admin1_boundaries_gold`
admin2	string	Who spent the money at the second sub-national administrative level	For "Central" spending, this is the government agency/ministry name; for "Regional" spending, this can be the district or municipality name or subnational government agency/department name
geo1	string	Geographically, at which first sub-national administrative level the money was spent	Same as admin1 possible values	`SELECT DISTINCT admin1_region FROM indicator.admin1_boundaries_gold`
func	string	Functional classification of the budget	e.g., Health, Education. See Classification of the Functions of Government (COFOG) for a full list	`SELECT DISTINCT func FROM boost_intermediate.quality_functional_silver`
func_sub	string	Sub-functional classification under the main COFOG function	e.g., "primary education", "secondary education"
econ	string	Economic classification of the budget	e.g., "Wage bill", "Goods and services"	`SELECT DISTINCT econ FROM boost_intermediate.quality_economic_silver`
econ_sub	string	Sub-economic classification under the main economic category	e.g., "basic wages", "allowances"
approved	double	Amount of budget in current local currency approved by the relevant authority	Numeric value representing local currency
revised	double	Revised budget in current local currency amount during the fiscal year	Numeric value representing local currency
executed	double	Actual amount spent in current local currency by the end of the fiscal year	Numeric value representing local currency

Name		Name	Last commit message	Last commit date
Latest commit History 258 Commits
Albania		Albania
Bangladesh		Bangladesh
Bhutan		Bhutan
Burkina_Faso		Burkina_Faso
Colombia		Colombia
Congo_DR		Congo_DR
Kenya		Kenya
Mozambique		Mozambique
Nigeria		Nigeria
Pakistan		Pakistan
Paraguay		Paraguay
Tunisia		Tunisia
Uruguay		Uruguay
auxiliary_data		auxiliary_data
quality		quality
README.md		README.md
cross_country_aggregate_dlt.py		cross_country_aggregate_dlt.py
publish.sql		publish.sql
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mega-boost

Cross-Country Harmonization

Code Organization

How to Add a New Country

boost_gold

About

Releases

Packages

Contributors 4

Languages

dime-worldbank/mega-boost

Folders and files

Latest commit

History

Repository files navigation

mega-boost

Cross-Country Harmonization

Code Organization

How to Add a New Country

boost_gold

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages