This project consists of tasks that aim to test approaches to data wrangling and cleaning. Each task has its own analysis document which answers various questions using cleaned data obtained by programmatically processing the raw data.
Two tasks were chosen:
- Task 1 - Decathlon Data
- Task 4 - Halloween Candy Data
The code is written in R, and both tasks contain RStudio `.Rproj` files.
Both tasks require the cleaning script to be run before the analysis. The cleaning script is at `data_cleaning_scripts/cleaning.R`.
Open the RStudio `.Rproj` and run `cleaning.R`. This script generates new clean CSV data files in the `clean_data` folder. Once this step has completed successfully, open the relevant task analysis `.Rmd` file in the `documentation_and_analysis` folder.
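
The steps above can also be run from the R console instead of the RStudio interface. The following is a minimal sketch in base R, assuming the folder layout described in this README. `run_cleaning()` and its `task_dir` argument are hypothetical names and not part of the project. If the cleaning script resolves paths in a different way, the working-directory handling here may need adjusting.

```r
# Hypothetical helper (not part of the project): run the cleaning script for
# one task and return the clean CSV files it produced.
run_cleaning <- function(task_dir = ".") {
  script <- file.path(task_dir, "data_cleaning_scripts", "cleaning.R")
  if (!file.exists(script)) {
    stop("Cleaning script not found: ", script)
  }

  # Work from the task folder, as RStudio does when the .Rproj is opened,
  # and restore the previous working directory afterwards.
  old_wd <- setwd(task_dir)
  on.exit(setwd(old_wd), add = TRUE)

  # Run the script in its own environment to keep the caller's workspace clean.
  source(file.path("data_cleaning_scripts", "cleaning.R"), local = new.env())

  # List whatever CSV files now exist in clean_data.
  list.files("clean_data", pattern = "\\.csv$", full.names = TRUE)
}

# Example usage, assuming the current directory holds a task folder:
# clean_files <- run_cleaning(".")
# print(clean_files)
```

If the returned vector is empty, the script ran but wrote no CSV files, which usually points to missing raw data.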
CodeClan provided test data to students but, due to file size concerns, the original source data for task 4 is not included in this repository. Similarly, the clean data is not uploaded, but it can be regenerated by running the cleaning script.
CodeClan staff can find the source data files in the CodeClan repository at `dr22_classnotes/week_03/day_5/dirty_data_project_raw_data/candy_ranking_data`.
For those outside CodeClan, the data can be obtained from the following sources:
- decathlon: Performance in decathlon (data). Department of statistics and computer science, Agrocampus Rennes.
  N.B. A copy of this file is in `task1/raw_data/decathlon.rds`.
- So Much Candy Data, Seriously. University of British Columbia.
  N.B. Before attempting to run any cleaning/analysis scripts, the three source data files required should be copied to the folder `task4/raw_data`. Full details are in the analysis document.
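
Before running the task 4 cleaning script, it can help to confirm that the raw data has been copied into place. The sketch below only counts files, because the file names are given in the analysis document rather than here. `check_raw_data()` and its arguments are hypothetical names introduced for illustration.

```r
# Hypothetical check (not part of the project): confirm that the raw data
# folder holds at least the expected number of files before cleaning.
check_raw_data <- function(raw_dir = file.path("task4", "raw_data"),
                           expected = 3) {
  # list.files() returns character(0) if the folder is missing or empty.
  files <- list.files(raw_dir)
  if (length(files) < expected) {
    warning(
      "Found ", length(files), " file(s) in ", raw_dir,
      " but expected at least ", expected, "."
    )
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Example usage from the repository root:
# if (check_raw_data()) message("Raw data is in place.")
```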
The following R packages are required to run the code. The version numbers used at the time of the original project are shown.
Package | Version used for analysis |
---|---|
janitor | "2.2.0" |
tidyverse | "2.0.0" |

Package | Version used for analysis |
---|---|
assertr | "3.0.0" |
here | "1.0.1" |
readxl | "1.4.3" |
tidyverse | "2.0.0" |
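
The installed package versions can be compared with the ones listed above. This is a minimal sketch using base R and `utils::packageVersion()`. `check_packages()` and the `required` vector are hypothetical names, and a different installed version does not necessarily mean the code will fail.

```r
# Versions listed in this README (both tables combined).
required <- c(
  janitor   = "2.2.0",
  tidyverse = "2.0.0",
  assertr   = "3.0.0",
  here      = "1.0.1",
  readxl    = "1.4.3"
)

# Hypothetical helper (not part of the project): report, for each package,
# whether it is installed and whether the version matches the one listed.
check_packages <- function(required) {
  status <- vapply(names(required), function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      return("not installed")
    }
    installed <- as.character(utils::packageVersion(pkg))
    if (installed == required[[pkg]]) {
      "same version"
    } else {
      paste("installed", installed)
    }
  }, character(1))

  data.frame(
    package      = names(required),
    version_used = unname(required),
    status       = unname(status)
  )
}

print(check_packages(required))
```

Any package reported as "not installed" can be added with `install.packages()` before running the cleaning script.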