FastETL framework: modern, versatile, does almost everything.
This text is also available in Portuguese: 🇧🇷 LEIAME.md.
FastETL is a package of plugins for Apache Airflow for building data pipelines in a number of common scenarios.
Main features:
- Full or incremental replication of tables in SQL Server, Postgres and MySQL databases
- Loading data from GSheets and from spreadsheets on Samba/Windows networks
- Extracting CSV from SQL
- Cleaning data using custom data patching tasks (e.g. for messy geographical coordinates, mapping canonical values for columns, etc.)
- Using an Open Source Routing Machine (OSRM) service to calculate route distances
- Using the CKAN or dados.gov.br API to update dataset metadata
- Using Frictionless Tabular Data Packages to write data dictionaries in OpenDocument Text format
This framework is maintained by a network of developers from many teams at the Ministry of Management and Innovation in Public Services, and it is the cumulative result of using Apache Airflow, a free and open-source tool, since 2019.
For government: FastETL is widely used to replicate data queried via Quartzo (DaaS) from Serpro.
FastETL implements the standards for Airflow plugins. To install it, simply add the `apache-airflow-providers-fastetl` package to the Python dependencies of your Airflow environment.

Or install it with:

```bash
pip install apache-airflow-providers-fastetl
```
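If you want a quick sanity check after installing, you can confirm that the package is importable from the same Python environment that runs Airflow. The import path below is the one used by the `DbToDbOperator` example later in this README:

```bash
# Optional check: confirm the FastETL provider is importable.
python -c "from fastetl.operators.db_to_db_operator import DbToDbOperator; print('ok')"
```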
To see an example of an Apache Airflow container that uses FastETL, check out the airflow2-docker repository.
To ensure appropriate results, please make sure to install the `msodbcsql17` and `unixodbc-dev` libraries on your Apache Airflow workers.
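For reference, here is a minimal sketch of how these libraries could be installed on a Debian-based Airflow worker image, following Microsoft's standard repository setup for `msodbcsql17`. This is an illustration only (the repository path assumes Debian 11; adjust it to your base image), not a step prescribed by FastETL itself:

```bash
# Add Microsoft's package repository, then install the ODBC driver and headers.
# Run as root, e.g. in the Dockerfile of your Airflow worker image.
curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl -fsSL https://packages.microsoft.com/config/debian/11/prod.list \
    > /etc/apt/sources.list.d/mssql-release.list
apt-get update
ACCEPT_EULA=Y apt-get install -y msodbcsql17 unixodbc-dev
```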
The test suite uses Docker containers to simulate a complete use environment, including Airflow and the databases. For that reason, to execute the tests, you first need to install Docker and docker-compose.
For instructions on how to do this, see the official Docker documentation.
To build the containers:

```bash
make setup
```
To run the tests, use:

```bash
make setup && make tests
```
To shut down the environment, use:

```bash
make down
```
The main FastETL feature is the `DbToDbOperator` operator. It copies data between postgres and mssql databases. MySQL is also supported as a source.

Here is an example:
```python
from datetime import datetime

from airflow import DAG

from fastetl.operators.db_to_db_operator import DbToDbOperator

# Placeholder values -- replace them with your own Airflow connection IDs,
# schemas and table name.
airflow_source_conn_id = "my_source_conn_id"
source_schema = "source_schema"
airflow_dest_conn_id = "my_dest_conn_id"
dest_schema = "dest_schema"
table_name = "my_table"

default_args = {
    "start_date": datetime(2023, 4, 1),
}

dag = DAG(
    "copy_db_to_db_example",
    default_args=default_args,
    schedule_interval=None,
)

t0 = DbToDbOperator(
    task_id="copy_data",
    source={
        "conn_id": airflow_source_conn_id,
        "schema": source_schema,
        "table": table_name,
    },
    destination={
        "conn_id": airflow_dest_conn_id,
        "schema": dest_schema,
        "table": table_name,
    },
    destination_truncate=True,
    copy_table_comments=True,
    chunksize=10000,
    dag=dag,
)
```
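Since `schedule_interval` is `None`, the example DAG only runs when triggered manually. As a usage sketch (the DAG id comes from the example above; the command is the standard Apache Airflow 2 CLI, not something specific to FastETL), once the file is in your `dags/` folder you can start a run with:

```bash
# Trigger a manual run of the example DAG defined above.
airflow dags trigger copy_db_to_db_example
```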
More detail about the parameters and the inner workings of `DbToDbOperator` can be seen in the following files:

To be written in the CONTRIBUTING.md document (issue #4).