This is a data engineering project that collects, stores, processes, and analyzes YouTube channel and video data. It combines PostgreSQL, ClickHouse, DBT, Apache Spark, Apache Airflow, and Kafka into a data warehouse that provides insights into YouTube data.
I used the Trending YouTube Video Statistics dataset from Kaggle: https://www.kaggle.com/datasets/datasnaek/youtube-new
- Configured Data Models for OLTP Database (PostgreSQL): Designed and set up relational data models to efficiently handle transactional operations (a model sketch follows this list)
- Developed APIs for Parsing YouTube Video and Channel Information: Implemented RESTful APIs to retrieve and process data related to YouTube videos and channels
- Created a DAG for Downloading the Top 200 YouTube Videos (a scheduling sketch appears under Pipeline Automation below)
- Implemented an asynchronous mechanism for recording transactions into the database (see the asyncpg sketch after this list)
- Configured Kafka: Set up a Kafka producer to regularly trigger the task of parsing the top 200 popular videos and a Kafka consumer to process that data (see the Kafka sketch after this list)
- Created a docker-compose.yaml file for deployment
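To make the OLTP layer concrete, here is a minimal SQLAlchemy sketch of what the relational models could look like; the `channels`/`videos` tables and their columns are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical OLTP models; table and column names are assumptions.
from sqlalchemy import BigInteger, Column, DateTime, ForeignKey, String, func
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Channel(Base):
    __tablename__ = "channels"

    id = Column(String, primary_key=True)  # YouTube channel ID
    title = Column(String, nullable=False)
    created_at = Column(DateTime, server_default=func.now())

    videos = relationship("Video", back_populates="channel")

class Video(Base):
    __tablename__ = "videos"

    id = Column(String, primary_key=True)  # YouTube video ID
    channel_id = Column(String, ForeignKey("channels.id"), nullable=False)
    title = Column(String, nullable=False)
    views = Column(BigInteger, default=0)
    likes = Column(BigInteger, default=0)
    fetched_at = Column(DateTime, server_default=func.now())

    channel = relationship("Channel", back_populates="videos")
```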
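The asynchronous write path could be sketched with asyncpg roughly as follows; the DSN, table, and upsert statement are assumptions.

```python
# Hypothetical sketch of asynchronous inserts with asyncpg; DSN and schema are assumptions.
import asyncio

import asyncpg

async def record_videos(rows):
    # A connection pool amortizes connection setup across concurrent writers.
    pool = await asyncpg.create_pool(dsn="postgresql://user:pass@localhost:5432/youtube")
    async with pool.acquire() as conn:
        async with conn.transaction():
            # executemany sends the whole batch inside one transaction.
            await conn.executemany(
                "INSERT INTO videos (id, channel_id, title, views, likes) "
                "VALUES ($1, $2, $3, $4, $5) "
                "ON CONFLICT (id) DO UPDATE SET views = EXCLUDED.views, likes = EXCLUDED.likes",
                rows,
            )
    await pool.close()

if __name__ == "__main__":
    asyncio.run(record_videos([("vid1", "chan1", "Example video", 100, 10)]))
```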
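The Kafka wiring could look like the sketch below (using kafka-python); the topic name, payload shape, and broker address are assumptions.

```python
# Hypothetical Kafka producer/consumer pair; topic, payload, and broker address are assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer

TOPIC = "trending-videos"  # assumed topic name

def trigger_parse():
    # The producer periodically publishes a "parse now" message.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"task": "parse_top_200"})
    producer.flush()

def consume_and_process():
    # The consumer picks up the trigger and runs the parsing task.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        group_id="video-parsers",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        if message.value.get("task") == "parse_top_200":
            print("Parsing the top 200 trending videos...")  # stand-in for the real task
```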
Data Collection:
- Extract data from the YouTube API.
- Store raw data in PostgreSQL.
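As a sketch of this step, the `videos.list` endpoint of the YouTube Data API v3 with `chart=mostPopular` returns the trending chart; the API key, region, and `raw_videos` staging table below are assumptions.

```python
# Hypothetical sketch of extracting trending videos and storing the raw rows;
# the API key, region, and raw_videos table are assumptions.
import psycopg2
import requests
from psycopg2.extras import Json

API_URL = "https://www.googleapis.com/youtube/v3/videos"

def fetch_trending(api_key, region="US"):
    # videos.list with chart=mostPopular returns the trending chart for a region.
    resp = requests.get(API_URL, params={
        "part": "snippet,statistics",
        "chart": "mostPopular",
        "regionCode": region,
        "maxResults": 50,  # the API caps a page at 50; paginate for the top 200
        "key": api_key,
    })
    resp.raise_for_status()
    return resp.json()["items"]

def store_raw(items):
    # Raw JSON is kept as-is in a staging table for later processing by Spark.
    conn = psycopg2.connect("dbname=youtube user=postgres")
    with conn, conn.cursor() as cur:
        for item in items:
            cur.execute(
                "INSERT INTO raw_videos (video_id, payload) VALUES (%s, %s) "
                "ON CONFLICT (video_id) DO NOTHING",
                (item["id"], Json(item)),
            )
    conn.close()
```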
Data Processing with Spark:
- Perform ETL operations on raw data.
- Load processed data into ClickHouse.
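This step could be sketched as a PySpark job that reads from PostgreSQL over JDBC, aggregates, and writes to ClickHouse over JDBC; the URLs, credentials, and table/column names are assumptions, and the PostgreSQL and ClickHouse JDBC driver JARs must be on the Spark classpath.

```python
# Hypothetical PySpark ETL sketch; JDBC URLs, credentials, and table/column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("youtube-etl").getOrCreate()

# Extract: raw video rows from PostgreSQL.
raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/youtube")
    .option("dbtable", "videos")
    .option("user", "postgres")
    .option("password", "postgres")
    .load()
)

# Transform: a simple per-channel aggregation as an example.
stats = raw.groupBy("channel_id").agg(
    F.sum("views").alias("total_views"),
    F.count("*").alias("video_count"),
)

# Load: write the aggregates into ClickHouse over JDBC.
(
    stats.write.format("jdbc")
    .option("url", "jdbc:clickhouse://localhost:8123/default")
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
    .option("dbtable", "channel_stats")
    .mode("append")
    .save()
)
```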
Data Transformation with DBT:
- Create staging, integration, and data warehouse layers.
- Define and manage data models.
Pipeline Automation with Airflow:
- Schedule and manage ETL pipelines.
- Ensure timely updates of data.
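A minimal Airflow sketch of the scheduling, assuming hypothetical task callables and an hourly cadence (neither is confirmed by the project):

```python
# Hypothetical Airflow DAG sketch; the callables and schedule are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_top_videos():
    print("Fetching the trending chart from the YouTube API...")  # stand-in task

def run_spark_etl():
    print("Submitting the Spark job that loads ClickHouse...")  # stand-in task

with DAG(
    dag_id="youtube_trending_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # assumed refresh cadence
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_top_videos", python_callable=extract_top_videos)
    process = PythonOperator(task_id="run_spark_etl", python_callable=run_spark_etl)

    # The Spark ETL runs only after extraction succeeds.
    extract >> process
```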
Data Visualization:
- Use Metabase and Apache Superset to create dashboards and reports.
Requirements:
- Docker (recommended for easier setup of services)
- Python 3.7
- PostgreSQL
- ClickHouse
- Apache Spark
- Apache Airflow
- DBT
- Metabase
- Apache Superset