🚀 I am thrilled to share that I completed a stellar data engineering bootcamp at DataExpert.io with a successful capstone project and published my first-ever Medium article, which I am beyond excited to share with you all!
Here are some highlights of the journey and how I accomplished it:
⚖ The Dream Team: Through the amazing community of fellow data nerds at the bootcamp, I met my fantastic project partner, Aayushi Beniwal. She shares my passion for data and technology and, as a fellow nature lover, cares about making a positive impact on the planet. So we teamed up to architect an end-to-end data pipeline on NYC bike-share data, a project perfectly aligned with our goals. We outlined tasks and timelines and proactively shared updates to clear blockers and stay on top of our plan.
🎯 Objectives
The project aims to provide near real-time updates on NYC Citibike’s station status and bike locations to operational teams.
Additionally, the project analyzes historical trends using over 10 years (~2.2 GB) of trip data, focusing on seasonality, peak times, and the stations with the highest bike usage.
🔗 GitHub repo: https://bit.ly/3XZpvGe
🛠 Design and Execution
Part I - Streaming Data Pipeline (https://bit.ly/3W49MmI)
Set up Kafka topics to ingest data in real time from two NYC bike-share JSON API feeds, with ~289 MB of incoming data per day.
Processed the incoming stream using Spark Streaming and PySpark in Databricks.
Utilized a micro-batch architecture to stream transformed data to the analytics layer, reducing the load on executors and significantly cutting processing costs; a minimal sketch of the flow is below.
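Here is a minimal PySpark sketch of that flow. The topic name, broker address, schema, and table name are illustrative placeholders, not our actual configuration (which ran on Databricks):

```python
# Minimal sketch of the streaming flow: Kafka topic -> parsed JSON -> micro-batch writes.
# Topic name, broker address, schema, and table names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

spark = SparkSession.builder.appName("citibike-station-status").getOrCreate()

# Simplified schema for a station_status-style JSON payload
status_schema = StructType([
    StructField("station_id", StringType()),
    StructField("num_bikes_available", IntegerType()),
    StructField("num_docks_available", IntegerType()),
    StructField("last_reported", LongType()),
])

# Read raw JSON events from the Kafka topic
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "station_status")
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), status_schema).alias("data"))
          .select("data.*"))

# Write each micro-batch to the analytics layer as a single batched write
def write_batch(batch_df, batch_id):
    batch_df.write.mode("append").saveAsTable("analytics.station_status")

query = (parsed.writeStream
         .foreachBatch(write_batch)
         .trigger(processingTime="1 minute")
         .option("checkpointLocation", "/tmp/checkpoints/station_status")
         .start())
```

Writing with foreachBatch lets each micro-batch land as one batched write instead of many per-record writes, which is where the savings on executor load and processing cost come from.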
Part II - Historical Trend Analysis (https://bit.ly/4ctSdnb)
Created a detailed dimensional model and processed ~93 MM (2.2 GB) of raw data into Snowflake using dbt transformations: built models such as dim_stations and fact_trips, added macros to calculate the cost of each trip taken, and created snapshot (SCD) tables to track price changes over time.
Orchestrated the dbt transformations using Dagster for dependency monitoring and error handling (sketch below).
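For a feel of the orchestration, here is a minimal sketch using the dagster-dbt integration; the project directory and asset names are hypothetical, and our actual setup may differ:

```python
# Minimal dagster-dbt sketch; paths and names are illustrative, not the
# project's actual configuration.
from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = "citibike_dbt"  # hypothetical dbt project directory

@dbt_assets(manifest=f"{DBT_PROJECT_DIR}/target/manifest.json")
def citibike_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Run `dbt build` (models, tests, snapshots) and stream events back to
    # Dagster so each model and test surfaces with its own status and lineage.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[citibike_dbt_models],
    resources={"dbt": DbtCliResource(project_dir=DBT_PROJECT_DIR)},
)
```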
Data quality was the highest priority for both pipelines: we used chispa (a Python library) for column- and DataFrame-level integration testing in Spark, and dbt_expectations with custom and generic tests for the analytics pipeline (example below).
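As an example of the kind of check chispa enables, here is a hypothetical pytest-style test of a simple trip-duration transformation; the function and column names are made up for illustration and are not taken from the project:

```python
# Hypothetical chispa test; the transformation and column names are illustrative.
from chispa import assert_df_equality
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-tests").getOrCreate()

def add_trip_minutes(df):
    # Example transformation under test: derive trip duration in minutes
    return df.withColumn("trip_minutes", (df.ended_at - df.started_at) / 60)

def test_add_trip_minutes():
    source = spark.createDataFrame(
        [(0, 600), (100, 400)], ["started_at", "ended_at"]
    )
    expected = spark.createDataFrame(
        [(0, 600, 10.0), (100, 400, 5.0)],
        ["started_at", "ended_at", "trip_minutes"],
    )
    # Fails with a readable column-by-column diff if the DataFrames differ
    assert_df_equality(add_trip_minutes(source), expected)
```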
Data integrity was especially critical in the real-time pipeline; we handled it with Spark Streaming checkpoints for failure recovery, watermarks for late-arriving events, and a schema registry for strict data contracts in Kafka (see the sketch below).
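To illustrate the late-event handling, here is a sketch continuing from the parsed stream in the Part I example above; the watermark threshold, window size, and field names are illustrative, and the schema-registry piece is not shown:

```python
# Illustrative watermarking sketch (not the project's exact logic): bound how
# late events may arrive and drop duplicate station updates, so aggregations
# stay correct when the stream is replayed from a checkpoint.
from pyspark.sql.functions import col, window

deduped = (parsed  # `parsed` comes from the Part I sketch above
           .withColumn("event_time", col("last_reported").cast("timestamp"))
           .withWatermark("event_time", "10 minutes")
           .dropDuplicates(["station_id", "event_time"]))

# Example aggregation that tolerates late data up to the watermark
availability = (deduped
                .groupBy(window(col("event_time"), "5 minutes"), col("station_id"))
                .agg({"num_bikes_available": "avg"}))
```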
A huge shout-out to everyone at DataExpert.io who contributed to the success of this bootcamp, and especially to Zach Wilson for always inspiring us to bring out the best in ourselves.