Here in our sixth exercise we will step it up a notch and start to use some more common Big Data tools, in this case Spark and PySpark.
- Change directories at the command line to be inside the `Exercise-6` folder: `cd Exercises/Exercise-6`
- Run `docker build --tag=exercise-6 .` to build the `Docker` image.
- There is a file called `main.py` in the `Exercise-6` directory; this is where your `Python` code to complete the exercise should go.
- Once you have finished the project or want to test run your code, run `docker-compose up run` from inside the `Exercises/Exercise-6` directory.
There is a folder called `data` in this current directory, `Exercises/Exercise-6`. Inside this folder there are two `.zip`'d `csv` files; they should remain zipped for the duration of this exercise.
Generally the files look like this ...

```csv
trip_id,start_time,end_time,bikeid,tripduration,from_station_id,from_station_name,to_station_id,to_station_name,usertype,gender,birthyear
25223640,2019-10-01 00:01:39,2019-10-01 00:17:20,2215,940.0,20,Sheffield Ave & Kingsbury St,309,Leavitt St & Armitage Ave,Subscriber,Male,1987
25223641,2019-10-01 00:02:16,2019-10-01 00:06:34,6328,258.0,19,Throop (Loomis) St & Taylor St,241,Morgan St & Polk St,Subscriber,Male,1998
```
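Spark's CSV reader can decompress `gzip` files on its own, but not `.zip` archives, so keeping the files zipped takes a small workaround. Below is a minimal sketch, assuming each archive holds a single CSV file: it unpacks the archive in memory with Python's `zipfile` module and hands the rows to Spark. The function name and structure are illustrative, not prescribed by the exercise.

```python
# Illustrative sketch: load a single-CSV zip archive into a DataFrame
# without unzipping anything on disk. Assumes the archive's first entry
# is the CSV file itself.
import zipfile

from pyspark.sql import DataFrame, SparkSession


def read_zipped_csv(spark: SparkSession, zip_path: str) -> DataFrame:
    with zipfile.ZipFile(zip_path) as archive:
        csv_name = archive.namelist()[0]
        with archive.open(csv_name) as handle:
            lines = handle.read().decode("utf-8").splitlines()
    # spark.read.csv also accepts an RDD of CSV-formatted strings, so we
    # can parallelize the raw lines and infer the schema from the header.
    rdd = spark.sparkContext.parallelize(lines)
    return spark.read.csv(rdd, header=True, inferSchema=True)
```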
Your job is to read these files with `PySpark` and answer the following questions. Each answer should be output as a report in `.csv` format in a `reports` folder.
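How you lay out the `reports` folder is up to you. One simple approach is a small helper that writes each answer to its own sub-folder; this is a minimal sketch, and `report_name` is an illustrative parameter:

```python
from pyspark.sql import DataFrame


def write_report(df: DataFrame, report_name: str) -> None:
    # coalesce(1) collapses the answer into a single CSV part file,
    # which is reasonable at this data size.
    df.coalesce(1).write.csv(
        f"reports/{report_name}", header=True, mode="overwrite"
    )
```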
- What is the `average` trip duration per day? (a sketch for this one follows the list)
- How many trips were taken each day?
- What was the most popular starting trip station for each month?
- What were the top 3 trip stations each day for the last two weeks?
- Do `Male`s or `Female`s take longer trips on average?
- What is the top 10 ages of those that take the longest trips, and the shortest?
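As an example of the shape these answers can take, here is a minimal sketch for the first question. It assumes the column names from the sample rows above (`start_time`, `tripduration`); everything else is illustrative.

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def average_trip_duration_per_day(trips: DataFrame) -> DataFrame:
    # Truncate each trip's start timestamp to a date, then average
    # the trip duration within each date.
    return (
        trips.withColumn("date", F.to_date("start_time"))
        .groupBy("date")
        .agg(F.avg("tripduration").alias("avg_trip_duration"))
        .orderBy("date")
    )
```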
Note: Your `PySpark` code should be encapsulated inside functions or methods; a skeleton layout for `main.py` is sketched below.
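For instance, `main.py` might tie the hypothetical helpers sketched above together like this; the zip file name below is a placeholder, so substitute the actual archive names in `data/`:

```python
from pyspark.sql import SparkSession


def main() -> None:
    spark = SparkSession.builder.appName("Exercise6").getOrCreate()
    # "data/your_trips_file.zip" is a placeholder for one of the real archives.
    trips = read_zipped_csv(spark, "data/your_trips_file.zip")
    write_report(
        average_trip_duration_per_day(trips), "average_trip_duration_per_day"
    )
    spark.stop()


if __name__ == "__main__":
    main()
```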
Extra Credit: Unit test your PySpark code.
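One way to approach the extra credit is `pytest` with a local `SparkSession`. A minimal sketch, assuming the hypothetical `average_trip_duration_per_day` function from above lives in `main.py`:

```python
import pytest
from pyspark.sql import SparkSession

from main import average_trip_duration_per_day


@pytest.fixture(scope="session")
def spark():
    # local[1] keeps the test session small and deterministic.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_average_trip_duration_per_day(spark):
    trips = spark.createDataFrame(
        [("2019-10-01 00:01:39", 940.0), ("2019-10-01 00:02:16", 258.0)],
        ["start_time", "tripduration"],
    )
    rows = average_trip_duration_per_day(trips).collect()
    assert len(rows) == 1
    assert rows[0]["avg_trip_duration"] == pytest.approx(599.0)
```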