Skip to content

ElNonito/tp_dev_04_2022

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Practices - Data engineering

Have a stackoverflow account : https://stackoverflow.com/

Have a github account : https://github.com/

And a github repo to push your code.

Fork the repo on your own Github account

Docker and Compose

Take time to read and install

https://docs.docker.com/get-started/overview/

docker --version
Docker version 20.10.14

https://docs.docker.com/compose/

docker-compose --version
docker-compose version 1.29.2

TP2 - Functional programming for data engineering

You"re the new data engineer of a scientific team in charge of monitoring CO2 levels in atmosphere, which are at their highest in 800,000 years..

You have to give your best estimate of CO2 levels for 2050.

Your engineering team is famous for taking a great care of the developer experience: using Type, small functions (using .map, .filter, .reduce), tests and logs.

Your goal is to map, parse, filter CO2 concentration levels in the atmosphere coming from an observatory in Hawaii from 1950 to 2022.

CO2 concentration level has been inserted inside utils/ClimateService.

How to write a Scala application ?

Could you install SBT on your machine ? If yes

sbt run

Should give you an implementation is missing error :

(...)
2022-05-16 18:33:48 [run-main-0] INFO  com.github.polomarcus.main.Main$ - Starting the app
[error] (run-main-0) scala.NotImplementedError: an implementation is missing

Same for sbt test

You couldn"t install SBT on your machine

Tips: having trouble to install Idea, SBT or scala? You can use Docker and Docker Compose to run this code and use your default IDE to code or a web IDE https://scastie.scala-lang.org/:

docker-compose build my-scala-app
docker-compose run my-scala-app bash # connect to your container to acces to SBT
> sbt test
# or 
> sbt run

Continuous build and test

Pro Tips : https://www.scala-sbt.org/1.x/docs/Running.html#Continuous+build+and+test

Make a command run when one or more source files change by prefixing the command with ~. For example, in sbt shell try:

sbt
> ~ testQuick

Test Driven Development (TDD) - Write a function and its tests that detect climate related sentence

  1. Look at and update "isClimateRelated" to add one more test test/scala/ClimateService
  2. Look at and update "isClimateRelated" function inside main/scala/com/github/polomarcus/utils/ClimateService

Write a function that use Option[T] to handle CO2 Record

With data coming from Hawaii about CO2 concentration in the atmosphere, iterate over it and find the difference between the max and the min value.

  1. Look at and update "parseRawData" to add one more test test/scala/ClimateService
  2. Look at and update "parseRawData" function inside main/scala/com/github/polomarcus/utils/ClimateService
  3. Create your own function to find the min, max value. Write unit tests and run sbt test Tips:
  1. Create your own function to find the min, max value for a specific year. Write unit tests Tips:
  • Re use getMinMax to create this function :
  • def getMinMaxByYear(year: Int) : (Int, Int)
  1. Create your own function to difference between the max and the min. Write unit tests

Tips:

Iteration - filter

  1. Remove all data from december (12), winter makes data unreliable there, values with filterDecemberData inside main/scala/com/github/polomarcus/utils/ClimateService

Iteration - map

  1. implement showCO2Data inside main/scala/com/github/polomarcus/utils/ClimateService
  2. Make your Main program works using sbt run

Bonus

Estimate CO2 levels for 2050 based on past data.

How would you do if a continuous stream of data come ?

Tips: Batch processing / Stream processing ?

Continuous Integration (CI)

If it works on your machine, congrats !

Test it on a remote servers now thanks to a Continuous Integration (CI) system such as GitHub Actions :

  1. Have a look to the .github/workflows folder and files
  2. Something weird ? Have a look to their documentation : https://github.com/features/actions
  3. Ready to run a CI job ? Go on your Github"s Fork/Clone of this and find the "Action" tab
  4. Find your CI job running
  5. Create a CI workflows using Docker to run the sbt test command (inspiration : https://github.com/polomarcus/television-news-analyser/blob/main/.github/workflows/docker-compose.yml#L7-L17)

Tools

Communication problems

Why Kafka ?

https://kafka.apache.org/documentation/#introduction

Answer these questions with what you can find on the documentation :

  • What problems does Kafka solve ?

  • Which use cases ?

  • What is a producer ?

  • What is a consumer ?

  • What are consumer groups ?

  • What is a offset ?

  • Why using partitions ?

  • Why using replication ?

  • What are In-Sync Replicas (ISR) ?

Try to install Kafka without docker

https://kafka.apache.org/documentation/#gettingStarted

Use kafka with docker

Start multiples kakfa servers (called brokers) by downloading a docker compose recipe :

Check on the docker hub the image used :

Verify

docker ps
CONTAINER ID   IMAGE                             COMMAND                  CREATED          STATUS         PORTS                                                                                  NAMES
b015e1d06372   confluentinc/cp-kafka:7.0.1       "/etc/confluent/dock…"   10 seconds ago   Up 9 seconds   0.0.0.0:9092->9092/tcp, :::9092->9092/tcp, 0.0.0.0:9999->9999/tcp, :::9999->9999/tcp   kafka1
(...)

Getting started with Kafka

  1. Connect to your kafka cluster with 2 command-line-interface (CLI)

Using Docker exec

docker exec -ti my_kafka_container_name bash
> pwd
> kafka-topics 
# will give you help to use this command
> kafka-topics --describe --bootstrap-server localhost:9092 
# will give you an error

Read this blog article to fix Broker may not be available. error : https://rmoff.net/2018/08/02/kafka-listeners-explained/

Pay attention to the KAFKA_ADVERTISED_LISTENERS config from the docker-compose file.

  1. Create a "mailbox" - a topic with the default config : https://kafka.apache.org/documentation/#quickstart_createtopic
  2. Check on which Kafka broker the topic is located using --describe
  3. Send events to a topic on one terminal : https://kafka.apache.org/documentation/#quickstart_send
  4. Keep reading events from a topic from one terminal : https://kafka.apache.org/documentation/#quickstart_consume
  • try the default config
  • what does the --from-beginning config do ?
  • what about using the --group option for your producer ?
  1. stop reading
  2. Keep sending some messages to the topic

Partition

  1. Check consumer group with kafka-console-consumer : https://kafka.apache.org/documentation/#basic_ops_consumer_group
  • notice if there is lag for your group
  1. read from a new group, what happened ?
  2. read from a already existing group, what happened ?
  3. Recheck consumer group

Replication - High Availability

  1. Increase replication in case one of your broker goes down : https://kafka.apache.org/documentation/#topicconfigs
  2. Stop one of your brokers with docker
  3. Describe your topic, check the ISR (in-sync replica) config : https://kafka.apache.org/documentation/#design_ha
  4. Restart your stopped broker
  5. Check again your topic

TP 3 - Kafka Streams to read and write to Kafka

Continuous Integration (CI)

If it works on your machine, congrats. Test it on a remote servers now thanks to a Continuous Integration (CI) system such as GitHub Actions :

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Scala 99.7%
  • Dockerfile 0.3%