data-in-the-cloud

GitHub repository for the Cloud computing course of the Executive Master Statistiques et Big Data

Link to the course slides

Airflow Lab Session

The idea is to wrap an existing model in a PythonOperator that takes both the path of the downloaded data file and the model output location as parameters.

Instructions

Create a DAG file that does the following (a skeleton sketch follows this list):

  • Download the dataset from S3 to a known path: https://sbd-data-in-the-cloud.s3.eu-west-3.amazonaws.com/petrol_consumption.csv.
  • Then trigger a PythonOperator that trains a regression model with the Random Forest algorithm and pickles it: here is the code to adapt. You should be able to pass the path to the dataset as an argument to the PythonOperator, in addition to the desired location of the model pickle.
  • Then upload the model pickle to S3 into a timestamped folder (a folder named after the execution date of the pipeline).
  • Finally, delete the dataset and the model pickle from local storage.
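
Put together, the pipeline could look roughly like the skeleton below. This is only a sketch: the DAG id, task ids, local paths and the curl-based download are illustrative choices (the download could also go through S3Hook, as sketched further down), and the two callables are placeholders for your own code.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

# Illustrative local paths; any directory writable by the Airflow worker works.
DATASET_FILEPATH = "/tmp/petrol_consumption.csv"
MODEL_FILEPATH = "/tmp/finalized_model.sav"
DATASET_URL = "https://sbd-data-in-the-cloud.s3.eu-west-3.amazonaws.com/petrol_consumption.csv"

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

dag = DAG("petrol_consumption_pipeline", default_args=default_args, schedule_interval=None)


def generate_model(dataset_filepath, model_output_filepath):
    """Placeholder for the adapted Random Forest training code (see the pickle note below)."""


def upload_model(model_filepath, **context):
    """Placeholder for the S3 upload into a folder named after the execution date."""


download_dataset = BashOperator(
    task_id="download_dataset",
    bash_command="curl -sSf -o {} {}".format(DATASET_FILEPATH, DATASET_URL),
    dag=dag,
)

train_model = PythonOperator(
    task_id="train_model",
    python_callable=generate_model,
    op_kwargs={
        "dataset_filepath": DATASET_FILEPATH,
        "model_output_filepath": MODEL_FILEPATH,
    },
    dag=dag,
)

upload_model_pickle = PythonOperator(
    task_id="upload_model_pickle",
    python_callable=upload_model,
    op_kwargs={"model_filepath": MODEL_FILEPATH},
    provide_context=True,  # Airflow 1.10 style: passes execution-date context to the callable
    dag=dag,
)

delete_local_files = BashOperator(
    task_id="delete_local_files",
    bash_command="rm -f {} {}".format(DATASET_FILEPATH, MODEL_FILEPATH),
    dag=dag,
)

download_dataset >> train_model >> upload_model_pickle >> delete_local_files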

To do so, you'll need:

  1. To modify the Astronomer image by adding pandas and scikit-learn to your requirements.txt (instructions here).
  2. S3Hook, to communicate with S3 (download, upload); a sketch of its usage follows this list.
  3. To add a connection in Airflow so it can store objects in my personal S3 bucket: set the SECRET_KEY and ACCESS_KEY (I'll send them to you by DM) in the Airflow web interface under Admin > Connections, to give Airflow the permissions to manage AWS services on your behalf.
  4. PythonOperator, which will wrap the model-training code.
  5. Airflow macros, handy for accessing variables tied to the DAG run. Useful for writing the output to a folder prefixed with the execution date of the pipeline run.
  6. astro dev start, to run Airflow locally and test your pipeline before deploying to production (more info here).
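
For the S3Hook itself, here is a hedged sketch of how it could be used for the download and the upload. The connection id my_s3_conn is just an example (use whatever id you give the connection under Admin > Connections), the bucket name is read off the dataset URL, and the import path differs slightly between Airflow versions.

from airflow.hooks.S3_hook import S3Hook  # Airflow 2: from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET_NAME = "sbd-data-in-the-cloud"  # bucket behind the dataset URL
AWS_CONN_ID = "my_s3_conn"             # example id for the connection you create in the UI


def download_dataset_from_s3(local_filepath):
    """Fetch petrol_consumption.csv from the bucket to a known local path."""
    hook = S3Hook(aws_conn_id=AWS_CONN_ID)
    # get_key returns a boto3 Object; download_file writes it to local disk.
    hook.get_key("petrol_consumption.csv", bucket_name=BUCKET_NAME).download_file(local_filepath)


def upload_model(model_filepath, **context):
    """Upload the model pickle into a folder named after the execution date."""
    hook = S3Hook(aws_conn_id=AWS_CONN_ID)
    # context["ds"] is the execution date (YYYY-MM-DD), the same value as the {{ ds }} macro.
    key = "{}/finalized_model.sav".format(context["ds"])
    hook.load_file(filename=model_filepath, key=key, bucket_name=BUCKET_NAME, replace=True)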

Add the connection in the Airflow UI

Here is an example of macros in use: we delete a dynamically created file (whose name contains the execution date of the pipeline) using a BashOperator.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

OUTPUT_CSV_FILEPATH = '/PATH/TO/MY_FILE.csv'

# Minimal default_args so the example runs; adjust the start_date to your needs.
default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

dag = DAG(
    "my_dag",
    default_args=default_args,
    max_active_runs=1,
    concurrency=10,
    schedule_interval="0 12 * * *",
)

# {{ ds }} is rendered by Airflow at runtime into the execution date (YYYY-MM-DD),
# so this task deletes e.g. /PATH/TO/MY_FILE2021-01-01.csv.
delete_csv = BashOperator(
    task_id="delete_csv",
    bash_command="rm {}".format(OUTPUT_CSV_FILEPATH.replace(".csv", "{{ ds }}.csv")),
    dag=dag,
)
Note: pickle

Saving a model to disk is as simple as:

import pickle

# save the model to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

Symmetrically, loading a model from disk:

# load the model from disk
filename = 'finalized_model.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
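
Adapted to this lab, the callable passed to the PythonOperator could therefore train the Random Forest and pickle it in one go. The sketch below assumes the consumption target is the last column of petrol_consumption.csv and skips the train/test split of the original script:

import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def generate_model(dataset_filepath, model_output_filepath):
    """Train a Random Forest regressor on the downloaded CSV and pickle it to disk."""
    df = pd.read_csv(dataset_filepath)
    # Assumption: the target (petrol consumption) is the last column, the rest are features.
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, y)
    with open(model_output_filepath, "wb") as f:
        pickle.dump(model, f)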
