CleanR

Hell is other people's data

Because I often need to unify and compile large datasets coming from different people, I want to share the lessons learned and some self-made code to help with the process. Here are some heuristics.

  • Start by creating a template of how your data should look. Define the accepted categories for factors and the possible values for numeric variables. Next, check whether your dirty dataset uses only the accepted categories and whether its values fall within the expected ranges.

  • Continue by harmonizing column names, order and units. R has several convenient functions to transform latitudes and longitudes, as well as dates.

  • Finally, check species names for errors and misspellings. The fastest way for me is to:

  1. Have a reference list of accepted names: species_tesaurus.csv
  2. Start by trimming whitespace, e.g. data$Genus <- trimws(data$Genus), then apply other heuristics (e.g. for "sp.").
  3. Subset only mismatches with the reference list. e.g.
    mis <- data$Genus_species[which(!data$Genus_species %in% species_tesaurus$Genus_species)]
    mismatches <- unique(mis)
  4. Always save both the original and the new fixed names in different columns.
  5. You can try fuzzy matching to fix mismatches, and taxize to check for synonyms, etc.
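
The heuristics above can be sketched in base R as follows. This is a minimal illustration, not the CleanR functions themselves: the toy data, the `habitat` and `latitude` columns, the accepted categories, and the edit-distance threshold of 2 are all assumptions made for the example. The reference list is built inline as a stand-in for reading species_tesaurus.csv, and fuzzy matching is done with `utils::adist()` rather than with any package-specific helper.

```r
# Toy "dirty" dataset; every column name and value here is an illustrative assumption.
data <- data.frame(
  Genus_species = c(" Bombus terrestris", "Apis melifera",
                    "Bombus pascuorum", "Apis mellifera "),
  habitat       = c("forest", "Forest", "meadow", "urban"),
  latitude      = c(41.4, 95.0, 37.2, 40.1),
  stringsAsFactors = FALSE
)

# Stand-in for the reference list; in practice it would be read from species_tesaurus.csv.
species_tesaurus <- data.frame(
  Genus_species = c("Bombus terrestris", "Apis mellifera", "Bombus pascuorum"),
  stringsAsFactors = FALSE
)

# Template: accepted categories for a factor and expected range for a numeric variable.
accepted_habitats <- c("forest", "meadow", "urban")
latitude_range    <- c(-90, 90)

# Rows that do not fulfil the template.
bad_habitat  <- which(!data$habitat %in% accepted_habitats)
bad_latitude <- which(data$latitude < latitude_range[1] |
                      data$latitude > latitude_range[2])

# Keep the original names untouched and fix a copy in a separate column.
data$Genus_species_fixed <- trimws(data$Genus_species)

# Subset only the names that do not match the reference list.
mis <- data$Genus_species_fixed[!data$Genus_species_fixed %in%
                                  species_tesaurus$Genus_species]
mismatches <- unique(mis)

# Fuzzy matching: replace a mismatch with the closest reference name,
# but only when the edit distance is small (threshold of 2 is an assumption).
max_distance <- 2
for (name in mismatches) {
  d    <- utils::adist(name, species_tesaurus$Genus_species)
  best <- which.min(d)
  if (d[best] <= max_distance) {
    data$Genus_species_fixed[data$Genus_species_fixed == name] <-
      species_tesaurus$Genus_species[best]
  }
}
```

Names that remain unmatched after this step are the ones worth checking by hand, or against synonym lists with taxize. A conservative threshold keeps the fuzzy step from silently merging two valid species with similar names.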

A worked example in the vignette:

I coded a bunch of functions to make this process efficient. In most cases they are wrappers around existing functions that help my workflows. I use base R (no tidyverse here) because this is what I am most familiar with. I know there are better functions out there for working with, e.g., dates, but I would need to re-learn those.

A guided example is available at https://htmlpreview.github.io/?https://github.com/RadicalCommEcol/CleanR/blob/main/doc/example.html (the help files are not developed yet, as this is a work in progress).

The package can be installed with:

devtools::install_github("RadicalCommEcol/CleanR", build_vignettes = TRUE)