messy

When teaching with R, instructors often use clean datasets, but these aren't very realistic and aren't what students will later encounter in the real world. Real datasets have typos, missing values encoded in strange ways, and stray whitespace. The {messy} R package takes a clean dataset and randomly adds these problems, giving students the opportunity to practice their data cleaning and wrangling skills without you having to change all of your examples.

Installation

Install from GitHub using:

remotes::install_github("nrennie/messy")

Usage

messy()

set.seed(1234)
messy(ToothGrowth[1:10,])
     len supp dose
1    4.2   vc  0.5
2   11.5   VC  0.5
3    7.3   VC  0.5
4    5.8   VC 0.5 
5    6.4   VC  0.5
6     10   VC  0.5
7  11.2    VC  0.5
8   11.2   VC  0.5
9    5.2   VC  0.5
10     7 <NA> <NA>

Increase how messy the data is with the messiness argument:

set.seed(1234)
messy(ToothGrowth[1:10,], messiness = 0.7)
    len supp dose
1  <NA> <NA> 0.5 
2  <NA> <NA> <NA>
3  <NA> <NA> <NA>
4  <NA> <NA> <NA>
5  <NA> <NA> <NA>
6   10  <NA>  0.5
7  <NA> <NA> <NA>
8  <NA> <NA>  0.5
9  5.2   VC   0.5
10   7  <NA> <NA>
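
When handing messy data to students, it can help to summarise how much damage was done. The sketch below is one possible way to do that in base R. The function messiness_report() and the data frame example_df are hypothetical names introduced here for illustration; neither is part of {messy}, and the values are typed by hand rather than produced by messy().

```r
# Hypothetical helper (not part of {messy}): counts, per column, the NA
# values and the values with leading or trailing whitespace.
messiness_report <- function(df) {
  data.frame(
    column = names(df),
    n_missing = unname(vapply(df, function(x) sum(is.na(x)), integer(1))),
    n_whitespace = unname(vapply(
      df,
      function(x) sum(grepl("^[[:space:]]|[[:space:]]$", as.character(x))),
      integer(1)
    )),
    row.names = NULL
  )
}

# Hand-typed example data, standing in for the output of messy()
example_df <- data.frame(
  len = c("4.2", NA, "7.3 ", "10"),
  supp = c("vc", "VC", NA, " VC"),
  stringsAsFactors = FALSE
)

report <- messiness_report(example_df)
print(report)
```

Because the helper only relies on is.na() and grepl(), it should work on any data frame, whichever functions made it messy.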

add_whitespace()

Randomly adds whitespace to the ends of some values, which means numeric columns may be converted to character:

set.seed(1234)
add_whitespace(ToothGrowth[1:10,])
     len supp dose
1    4.2   VC  0.5
2   11.5   VC  0.5
3    7.3   VC  0.5
4    5.8   VC 0.5 
5    6.4   VC  0.5
6     10   VC  0.5
7  11.2    VC  0.5
8   11.2   VC  0.5
9    5.2   VC  0.5
10     7   VC 0.5 

Apply to only some columns:

set.seed(1234)
add_whitespace(ToothGrowth[1:10,], cols = "supp")
    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2  VC   0.5
8  11.2   VC  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5
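
As a rough illustration of the cleaning task this sets up for students, the sketch below undoes stray whitespace with base R's trimws() and converts a column back to numeric. The data frame messy_df is a hand-typed, hypothetical stand-in and not actual output from add_whitespace().

```r
# Hypothetical stand-in for data after add_whitespace():
# values are typed by hand here, not generated by {messy}.
messy_df <- data.frame(
  len = c("4.2", "11.5 ", "7.3", " 5.8"),
  supp = c("VC", "VC ", "VC", "VC"),
  stringsAsFactors = FALSE
)

# The numeric column is stored as character because of the added spaces
class(messy_df$len)  # "character"

# Strip leading and trailing whitespace, then convert len back to numeric
clean_df <- messy_df
clean_df$len <- as.numeric(trimws(clean_df$len))
clean_df$supp <- trimws(clean_df$supp)

class(clean_df$len)  # "numeric"
unique(clean_df$supp)
```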

change_case()

Randomly switches values in character or factor columns to upper case or lower case, or leaves them unchanged:

set.seed(1234)
change_case(ToothGrowth[1:10,], messiness = 0.5)
    len supp dose
1   4.2   vc  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2   vc  0.5
8  11.2   vc  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5
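
Mixed case matters because R treats "vc" and "VC" as different values. The sketch below shows one way a student might spot and repair this in base R; the vector supp is a hypothetical, hand-typed stand-in for a column after change_case(), not real package output.

```r
# Hypothetical stand-in for a column after change_case()
supp <- c("vc", "VC", "VC", "vc", "VC")

# Mixed case shows up as extra distinct values
length(unique(supp))  # 2

# Standardise to upper case; as.character() also makes this work
# if the column is stored as a factor
supp_clean <- toupper(as.character(supp))
length(unique(supp_clean))  # 1
```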

make_missing()

Randomly replaces some values with NA:

set.seed(1234)
make_missing(ToothGrowth[1:10,])
    len supp dose
1   4.2   VC  0.5
2  11.5   VC   NA
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7    NA   VC  0.5
8  11.2   VC   NA
9   5.2   VC  0.5
10  7.0   VC  0.5

Use a different missing value representation for selected columns:

set.seed(1234)
make_missing(ToothGrowth[1:10,], cols = "supp", missing = "999")
    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2  999  0.5
8  11.2   VC  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5
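
Custom missing codes such as "999" are not detected by is.na(), which is part of what makes them good practice material. The sketch below shows one possible base R fix; the vector supp is a hypothetical, hand-typed stand-in for a column after make_missing() with missing = "999".

```r
# Hypothetical stand-in for a column after
# make_missing(..., missing = "999")
supp <- c("VC", "VC", "999", "VC")

# is.na() does not see the custom code
sum(is.na(supp))  # 0

# Recode the custom value to a real NA
supp_clean <- supp
supp_clean[supp_clean == "999"] <- NA
sum(is.na(supp_clean))  # 1
```

When the data comes from a file instead, the na.strings argument of read.csv() can do the same recoding at import time.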

Combining functions

You can pipe together multiple functions to create custom messy transformations:

set.seed(1234)
ToothGrowth[1:10,] |> 
  make_missing(cols = "supp", missing = " ") |> 
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |> 
  add_whitespace(cols = "supp", messiness = 0.5)
    len supp dose
1   4.2   VC  0.5
2  11.5  VC    NA
3   7.3   VC  0.5
4   5.8  VC   0.5
5   6.4  VC   0.5
6  10.0   VC  0.5
7  11.2       0.5
8  11.2   VC   NA
9   5.2   VC  0.5
10  7.0  VC   0.5
