
R package to aid cleaning/checking/formatting data using Codebooks/Data Dictionaries #46

Open
petebaker opened this issue Apr 20, 2016 · 6 comments

Comments

@petebaker

Subject: R package to aid cleaning/checking/formatting data using Codebooks/Data Dictionaries

Statisticians and scientists producing primary data often have different needs from those scraping secondary and tertiary data off the web.

Often in medical, epidemiological, social, psychological and other scientific studies, researchers use codebooks to document data formats, variable labels, factor labels, ranges for continuous variables, details of measuring instruments, etc. Sometimes statisticians get a photocopied codebook or a PDF, but my preference (and that of aware researchers) is a spreadsheet, so that this metadata can actually be used.

For small data sets it's probably OK to manually set up factor labels, check for undefined factor levels and identify out-of-range values for continuous variables. For data sets with hundreds of variables, or when there are many data files with a similar structure, it is probably better to automate these procedures.
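As a very rough sketch of the kind of automation I have in mind (assuming a hypothetical codebook.csv with columns variable, type, levels, labels, min and max; none of this is an existing package), the codebook itself could drive both the factor labelling and the range checks:

codebook <- read.csv("codebook.csv", stringsAsFactors = FALSE)

apply_codebook <- function(data, codebook) {
  for (i in seq_len(nrow(codebook))) {
    v <- codebook$variable[i]
    if (!v %in% names(data)) next
    if (codebook$type[i] == "factor") {
      # recode using the defined levels/labels; undefined codes become NA
      lev <- strsplit(codebook$levels[i], ";")[[1]]
      lab <- strsplit(codebook$labels[i], ";")[[1]]
      data[[v]] <- factor(data[[v]], levels = lev, labels = lab)
    } else if (codebook$type[i] == "numeric") {
      # flag out-of-range values rather than silently changing them
      bad <- sum(data[[v]] < codebook$min[i] | data[[v]] > codebook$max[i],
                 na.rm = TRUE)
      if (bad > 0)
        warning(sprintf("%s: %d value(s) outside [%s, %s]", v, bad,
                        codebook$min[i], codebook$max[i]))
    }
  }
  data
}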

A package for extracting information from codebooks and using that metadata to assist labelling and data cleaning would prove useful to statisticians and scientists at the coal face of producing primary data.

@ivanhanigan
Contributor

Hi Peter, this is a direct hit on my main problem/use case for a new R tool! Do you think #11 and #18 also cover similar territory?

I think a 'metadata' subgroup at the AU unconf would be wonderful. I am attending remotely from Canberra, and have a serious interest in spending the next couple of days working on 'metadata-driven' tools to aid cleaning/checking/formatting/documenting our data.


@petebaker
Author

petebaker commented Apr 21, 2016

Hi Ivan,

Sorry I didn't look at this until now.

I think we are looking at opposite ends of the same problem. As I see it, the generic core is how to keep and use codebook info (metadata) with a data frame, df_table or other object. This is great! My suggestion is at the other end: reading in codebooks to obtain that info. Every project I work with (or at least say 90 of the last hundred - so I work with a lot of projects ;-)) has a codebook, and they are pretty similar in structure - a certain, if somewhat variable, layout. For instance, I just looked at the missingDataGUI manual and an example data codebook is in RTF - much better if it were in .csv: http://ftp.cdc.gov/pub/data/brfss/codebook_09.rtf at http://www.cdc.gov/brfss/data_documentation/index.htm
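For instance (just a sketch with made-up file and column names, not an existing function), the codebook rows read in from a .csv could simply travel with the data as attributes, so that later cleaning and labelling steps can consult them:

codebook <- read.csv("codebook.csv", stringsAsFactors = FALSE)
dat <- read.csv("study_data.csv", stringsAsFactors = FALSE)

# attach the matching codebook row to each column as a "codebook" attribute
for (v in intersect(names(dat), codebook$variable)) {
  attr(dat[[v]], "codebook") <- codebook[codebook$variable == v, ]
}

# any later step can then look the metadata up directly, e.g.
attr(dat$age, "codebook")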
A holistic approach to metadata sounds great. Pity you are in Canberra, but that's the beauty of the web.
Suggestions welcome.
Cheers, Peter

@ivanhanigan
Contributor

Yep. Did you see this?
http://jason.bryer.org/posts/2013-01-10/Function_for_Reading_Codebooks_in_R.html
I think the example codebook is broken and I had to fix it. I can attach it to a repo somehow if you want to explore?
I still think / hope EML can do everything!

require(devtools)
source_gist(4497585)
## Sourcing https://gist.githubusercontent.com/jbryer/4497585/raw/ed0497d0283807b72aedbe3bcf752cd41dfd5ba0/parse.codebook.r

## SHA-1 hash of file is 39ed33b5c570e7c10e15e39842931e08ce11d89d
parse.codebook
## function(file, 
##                         var.names, 
##                         level.names,
##                         level.indent=c(' ','\t'),
##                         var.sep, 
##                         level.sep, 
##                         var.widths,
##                         level.widths,
##                         var.name = var.names[1],
##                         ...) {
##  stopifnot(var.name %in% var.names)
##  
##  stripWhite <- function(x) {
##      sub('([[:space:]]+$)', '', sub("(^[[:space:]]+)", '', x, perl=TRUE), perl=TRUE)
##  }
##  
##  codebook.raw <- readLines(file)
##  
##  #Remove blank lines
##  blanklines <- which(nchar(stripWhite(codebook.raw)) == 0)
##  linenums <- which(!(substr(codebook.raw, 1, 1) %in% level.indent))
##  linenums <- linenums[!linenums %in% blanklines]
##  linenums.levels <- which(substr(codebook.raw, 1, 1) %in% level.indent)
##  linenums.levels <- linenums.levels[!linenums.levels %in% blanklines]
##  if(length(blanklines) > 0) {
##      codebook.raw <- codebook.raw[-blanklines]       
##  }
##  
##  rows <- which(!(substr(codebook.raw, 1, 1) %in% level.indent))
##  rows.levels <- which(substr(codebook.raw, 1, 1) %in% level.indent)
##  rowmapping <- data.frame(pre=linenums, post=rows)
##  rowmapping.levels <- data.frame(pre=linenums.levels, post=rows.levels)
##  codebook <- codebook.raw[rows]
##  
##  if(!missing(var.sep)) { #Fields are delimited
##      split <- strsplit(codebook, var.sep, fixed=TRUE)
##      badrows <- codebook[sapply(split, length) != length(var.names)]
##      if(length(badrows) > 0) {
##          stop(paste("The following rows do not have ", length(var.names), 
##                     " values: ", paste(badrows, collapse=', '), sep=''))
##      }
##      codebook <- as.data.frame(matrix(stripWhite( 
##                      unlist(strsplit(codebook, var.sep, fixed=TRUE))),
##                      ncol=length(var.names), byrow=TRUE, 
##                      dimnames=list(1:length(rows), var.names)), 
##                      stringsAsFactors=FALSE)
##      codebook$linenum <- rows
##  } else if(!missing(var.widths)) { #Fields are fixed width
##      stopifnot(length(var.names) == length(var.widths))
##      left <- 1
##      cb <- data.frame(linenum=linenums)
##      for(i in seq_along(var.widths)) {
##          cb[,var.names[i]] <- sapply(
##              codebook, FUN=function(x) {
##                  stripWhite(substr(x, left, min(nchar(x), (left + var.widths[i]))))
##              })
##          left <- left + var.widths[i]
##      }
##      codebook <- cb
##      rm(cb)
##  } else {
##      stop("Must specify either var.sep or var.widths")
##  }
##  
##  varsWithFactors <- which(sapply(1:(length(rows)-1), 
##                      FUN=function(x) { rows[x] + 1 != rows[x+1] }))
##  varlevels <- list()
##  for(i in seq_along(rows[varsWithFactors])) {
##      start <- rows[varsWithFactors][i]
##      end <- rows[which(rows == start) + 1]
##      levels.raw <- codebook.raw[ (start + 1):(end - 1) ]
##      if(!missing(level.widths)) { #Fixed width levels
##          levels.raw <- lapply(levels.raw, FUN=function(x) {
##              left <- 1
##              lc <- character()
##              for(i in seq_along(level.widths)) {
##                  lc <- c(lc, 
##                          substr(x, left, min(nchar(x), (left + level.widths[[i]])))
##                  )
##                  left <- left + level.widths[i]
##              }
##              return(lc)
##          })
##      } else if(!missing(level.sep)) { #Delimited levels
##          levels.raw <- strsplit(sub('(^[[:space:]]+)', '', levels.raw, perl=TRUE), 
##                                 level.sep, fixed=TRUE)
##      } else {
##          stop('Must specify either level.sep or level.widths')
##      }
##      levels.df <- data.frame(linenum=rowmapping.levels[rowmapping.levels$post > start &
##                                             rowmapping.levels$post < end, 'pre'])
##      for(i in seq_along(level.names)) {
##          levels.df[,level.names[i]] <- sapply(levels.raw, FUN=function(x) { stripWhite(x[i]) })
##      }
##      var <- codebook[codebook$linenum == rowmapping[start == rowmapping$post,'pre'], var.name]
##      varlevels[[var]] <- levels.df
##  }
##  
##  codebook$isfactor <- codebook$var %in% names(varlevels)
##  
##  attr(codebook, 'levels') <- varlevels
##  class(codebook) <- c('codebook', 'data.frame')
##  return(codebook)
## }
"
http://jason.bryer.org/codebooks/ccdCodebook.txt
The codebook provided is in plain text and required two modifications: One, general file information at the top of the file was deleted, and two, any descriptions that spanned lines need to be modified so the are on only one line.
"
## [1] "\nhttp://jason.bryer.org/codebooks/ccdCodebook.txt\nThe codebook provided is in plain text and required two modifications: One, general file information at the top of the file was deleted, and two, any descriptions that spanned lines need to be modified so the are on only one line.\n"
setwd("~/tools/disentangle/codebooks_jbryer")
ccd.codebook <- parse.codebook(
  file = 'ccdCodebook.txt',
  var.names = c('variable', 'order', 'type', 'description'),
  level.names = c('level', 'label'),
  level.sep = '=',
  var.widths = c(13, 7, 7, Inf)
)

head(ccd.codebook)
##   linenum variable order type
## 1       1 SURVYEAR     1   AN
## 2       3  NCESSCH     2   AN
## 3       5    FIPST     3   AN
## 4      17     IVAN     4   AN
##                                                                                      description
## 1                                                           Year corresponding to survey record.
## 2 Unique NCES public school ID (7-digit NCES agency ID (LEAID) + 5-digit NCES school ID (SCHNO).
## 3                                      American National Standards Institute (ANSI) state code..
## 4                                                                                           Name
##   isfactor
## 1    FALSE
## 2    FALSE
## 3     TRUE
## 4    FALSE
str(ccd.codebook)
## Classes 'codebook' and 'data.frame': 4 obs. of  6 variables:
##  $ linenum    : int  1 3 5 17
##  $ variable   : chr  "SURVYEAR" "NCESSCH" "FIPST" "IVAN"
##  $ order      : chr  "1" "2" "3" "4"
##  $ type       : chr  "AN" "AN" "AN" "AN"
##  $ description: chr  "Year corresponding to survey record." "Unique NCES public school ID (7-digit NCES agency ID (LEAID) + 5-digit NCES school ID (SCHNO)." "American National Standards Institute (ANSI) state code.." "Name"
##  $ isfactor   : logi  FALSE FALSE TRUE FALSE
##  - attr(*, "levels")=List of 1
##   ..$ FIPST:'data.frame':    9 obs. of  3 variables:
##   .. ..$ linenum: int  7 8 9 10 11 12 13 14 15
##   .. ..$ level  : chr  "01" "02" "04" "05" ...
##   .. ..$ label  : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
ccd.var.levels <- attr(ccd.codebook, 'levels')
names(ccd.var.levels)
## [1] "FIPST"
ccd.var.levels[['FIPST']]
##   linenum level                label
## 1       7    01              Alabama
## 2       8    02               Alaska
## 3       9    04              Arizona
## 4      10    05             Arkansas
## 5      11    06           California
## 6      12    08             Colorado
## 7      13    09          Connecticut
## 8      14    10             Delaware
## 9      15    11 District of Columbia
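Those extracted levels could then be used to label the raw codes in the matching data file - a sketch only, where 'ccd' stands for a hypothetical data frame read from the CCD data itself:

fipst.levels <- ccd.var.levels[['FIPST']]
ccd$FIPST <- factor(ccd$FIPST,
                    levels = fipst.levels$level,
                    labels = fipst.levels$label)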

@petebaker
Author

Will explore after lunch. I've been exploring my codebook/dryworkflow stuff on GitHub, which I haven't looked at for a year - too much on, unfortunately, but this is the perfect excuse. Cheers

@TanyaMurphy

Hi,
Very excited to find this thread. I had thrown together a note for myself a few days ago, which I've included below. I will look at your posts more closely in a couple of weeks and try to make connections. In the meantime, you may find these resources interesting (if you haven't already seen them).

Regarding my Excel reference: I'm not a fan of it for my solo work, but it's popular with research coordinators and serves as a common tool for many research teams.


Could there be convergence between the data-cleaning systems of

In particular:

  • facilitate using and enhancing an Excel-based data dictionary/codebook that contains (obviously) variable names, labels, descriptions, units (if applicable), flagging/cleaning rules, and factoring/categorization rules for individual variables, plus flagging/cleaning and categorization rules for joint distributions -- ideally used as inputs to emerging functions that replace cumbersome nested ifelse statements (see the sketch just below).
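As a sketch of that last point (variable and cut-off names made up for illustration), a categorization rule stated once in the codebook might be translated into a single dplyr::case_when() call instead of nested ifelse():

# nested ifelse version
dat$bmi_cat <- ifelse(dat$bmi < 18.5, "underweight",
                 ifelse(dat$bmi < 25, "normal",
                   ifelse(dat$bmi < 30, "overweight", "obese")))

# the same rule with dplyr::case_when()
library(dplyr)
dat <- mutate(dat, bmi_cat = case_when(
  bmi < 18.5 ~ "underweight",
  bmi < 25   ~ "normal",
  bmi < 30   ~ "overweight",
  TRUE       ~ "obese"
))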

Rules-based programming - SE

tidyverse/tidyr#183

tidyverse/dplyr#631 (incidentally, Edwin de Jonge started that thread but did not comment again - are they taking a different path with their deducorrect and data-cleaning packages?)

Is this relevant?
ropenscilabs/assertr: Assertive programming for R analysis pipelines
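If it is, codebook-style range and level rules could map onto assertr checks along these lines (a sketch; the column names and bounds are illustrative only):

library(assertr)
library(dplyr)

dat %>%
  assert(within_bounds(0, 120), age) %>%   # range rule from the codebook
  assert(in_set("M", "F"), sex)            # allowed factor codes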


I look forward to following the developments on this topic. Cheers!

@ivanhanigan
Contributor

ivanhanigan commented Jan 19, 2017 via email
