
R package to aid cleaning/checking/formatting data using Codebooks/Data Dictionaries #46

Open
petebaker opened this issue Apr 20, 2016 · 6 comments

Comments

@petebaker

Subject: R package to aid cleaning/checking/formatting data using Codebooks/Data Dictionaries

Statisticians and scientists producing primary data often have different needs from those scraping secondary and tertiary data off the web.

Often in medical, epidemiological, social, psychological and other scientific studies, researchers use codebooks to document data formats, variable labels, factor labels, ranges for continuous variables, details of measuring instruments, etc. Sometimes statisticians get a photocopied codebook or a PDF, but my preference (and that of aware researchers) is a spreadsheet, so that this metadata can actually be used.

For small data sets it's probably OK to manually set up factor labels, check for undefined factor levels and identify out-of-range values for continuous variables. For data sets with hundreds of variables, or when there are many data files with a similar structure, it is probably better to automate these procedures.
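As a very rough sketch of the kind of automation I have in mind (assuming a hypothetical codebook.csv with columns variable, type, levels, labels, min and max; none of this is an existing package), the codebook itself could drive both the factor labelling and the range checks:

codebook <- read.csv("codebook.csv", stringsAsFactors = FALSE)

apply_codebook <- function(data, codebook) {
  for (i in seq_len(nrow(codebook))) {
    v <- codebook$variable[i]
    if (!v %in% names(data)) next
    if (codebook$type[i] == "factor") {
      # recode using the defined levels/labels; undefined codes become NA
      lev <- strsplit(codebook$levels[i], ";")[[1]]
      lab <- strsplit(codebook$labels[i], ";")[[1]]
      data[[v]] <- factor(data[[v]], levels = lev, labels = lab)
    } else if (codebook$type[i] == "numeric") {
      # flag out-of-range values rather than silently changing them
      bad <- sum(data[[v]] < codebook$min[i] | data[[v]] > codebook$max[i],
                 na.rm = TRUE)
      if (bad > 0)
        warning(sprintf("%s: %d value(s) outside [%s, %s]", v, bad,
                        codebook$min[i], codebook$max[i]))
    }
  }
  data
}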

A package for extracting information from codebooks and using that metadata to assist labelling and data cleaning would prove useful to statisticians and scientists at the coal face of producing primary data.

@ivanhanigan
Contributor

Hi Peter, this is a direct hit on my main problem/use case for a new R tool! Do you think #11 and #18 also cover similar territory?

I think a 'metadata' subgroup at the AU unconf would be wonderful. I am attending remotely from Canberra, and have a serious interest in spending the next couple of days working on 'metadata-driven' tools to aid cleaning/checking/formatting/documenting our data.


@petebaker
Author

petebaker commented Apr 21, 2016

Hi Ivan,

Sorry I didn't look at this until now.

I think we are looking at opposite ends of the same problem. As I see it, the generic core is how to keep and use codebook info (metadata) with a data frame, df_table or other object. This is great! My suggestion is at the other end: reading in codebooks to obtain that info. Every project I work with (or at least say 90 of the last hundred - so I work with a lot of projects ;-)) has a codebook, and they are pretty similar in structure - a certain, if somewhat variable, layout. For instance, I just looked at the missingDataGUI manual and an example data codebook is in RTF - much better if it were in .csv: http://ftp.cdc.gov/pub/data/brfss/codebook_09.rtf at http://www.cdc.gov/brfss/data_documentation/index.htm
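For instance (just a sketch with made-up file and column names, not an existing function), the codebook rows read in from a .csv could simply travel with the data as attributes, so that later cleaning and labelling steps can consult them:

codebook <- read.csv("codebook.csv", stringsAsFactors = FALSE)
dat <- read.csv("study_data.csv", stringsAsFactors = FALSE)

# attach the matching codebook row to each column as a "codebook" attribute
for (v in intersect(names(dat), codebook$variable)) {
  attr(dat[[v]], "codebook") <- codebook[codebook$variable == v, ]
}

# any later step can then look the metadata up directly, e.g.
attr(dat$age, "codebook")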
A holistic approach to metadata sounds great. Pity you are in Canberra, but that's the beauty of the web.
Suggestions welcome.
Cheers, Peter

@ivanhanigan
Contributor

Yep. Did you see this?
http://jason.bryer.org/posts/2013-01-10/Function_for_Reading_Codebooks_in_R.html
I think the example codebook is broken and I had to fix it. I can attach it to a repo somehow if you want to explore?
I still think / hope EML can do everything!

require(devtools)
source_gist(4497585)
## Sourcing https://gist.githubusercontent.com/jbryer/4497585/raw/ed0497d0283807b72aedbe3bcf752cd41dfd5ba0/parse.codebook.r

## SHA-1 hash of file is 39ed33b5c570e7c10e15e39842931e08ce11d89d
parse.codebook
## function(file, 
##                         var.names, 
##                         level.names,
##                         level.indent=c(' ','\t'),
##                         var.sep, 
##                         level.sep, 
##                         var.widths,
##                         level.widths,
##                         var.name = var.names[1],
##                         ...) {
##  stopifnot(var.name %in% var.names)
##  
##  stripWhite <- function(x) {
##      sub('([[:space:]]+$)', '', sub("(^[[:space:]]+)", '', x, perl=TRUE), perl=TRUE)
##  }
##  
##  codebook.raw <- readLines(file)
##  
##  #Remove blank lines
##  blanklines <- which(nchar(stripWhite(codebook.raw)) == 0)
##  linenums <- which(!(substr(codebook.raw, 1, 1) %in% level.indent))
##  linenums <- linenums[!linenums %in% blanklines]
##  linenums.levels <- which(substr(codebook.raw, 1, 1) %in% level.indent)
##  linenums.levels <- linenums.levels[!linenums.levels %in% blanklines]
##  if(length(blanklines) > 0) {
##      codebook.raw <- codebook.raw[-blanklines]       
##  }
##  
##  rows <- which(!(substr(codebook.raw, 1, 1) %in% level.indent))
##  rows.levels <- which(substr(codebook.raw, 1, 1) %in% level.indent)
##  rowmapping <- data.frame(pre=linenums, post=rows)
##  rowmapping.levels <- data.frame(pre=linenums.levels, post=rows.levels)
##  codebook <- codebook.raw[rows]
##  
##  if(!missing(var.sep)) { #Fields are delimited
##      split <- strsplit(codebook, var.sep, fixed=TRUE)
##      badrows <- codebook[sapply(split, length) != length(var.names)]
##      if(length(badrows) > 0) {
##          stop(paste("The following rows do not have ", length(var.names), 
##                     " values: ", paste(badrows, collapse=', '), sep=''))
##      }
##      codebook <- as.data.frame(matrix(stripWhite( 
##                      unlist(strsplit(codebook, var.sep, fixed=TRUE))),
##                      ncol=length(var.names), byrow=TRUE, 
##                      dimnames=list(1:length(rows), var.names)), 
##                      stringsAsFactors=FALSE)
##      codebook$linenum <- rows
##  } else if(!missing(var.widths)) { #Fields are fixed width
##      stopifnot(length(var.names) == length(var.widths))
##      left <- 1
##      cb <- data.frame(linenum=linenums)
##      for(i in seq_along(var.widths)) {
##          cb[,var.names[i]] <- sapply(
##              codebook, FUN=function(x) {
##                  stripWhite(substr(x, left, min(nchar(x), (left + var.widths[i]))))
##              })
##          left <- left + var.widths[i]
##      }
##      codebook <- cb
##      rm(cb)
##  } else {
##      stop("Must specify either var.sep or var.widths")
##  }
##  
##  varsWithFactors <- which(sapply(1:(length(rows)-1), 
##                      FUN=function(x) { rows[x] + 1 != rows[x+1] }))
##  varlevels <- list()
##  for(i in seq_along(rows[varsWithFactors])) {
##      start <- rows[varsWithFactors][i]
##      end <- rows[which(rows == start) + 1]
##      levels.raw <- codebook.raw[ (start + 1):(end - 1) ]
##      if(!missing(level.widths)) { #Fixed width levels
##          levels.raw <- lapply(levels.raw, FUN=function(x) {
##              left <- 1
##              lc <- character()
##              for(i in seq_along(level.widths)) {
##                  lc <- c(lc, 
##                          substr(x, left, min(nchar(x), (left + level.widths[[i]])))
##                  )
##                  left <- left + level.widths[i]
##              }
##              return(lc)
##          })
##      } else if(!missing(level.sep)) { #Delimited levels
##          levels.raw <- strsplit(sub('(^[[:space:]]+)', '', levels.raw, perl=TRUE), 
##                                 level.sep, fixed=TRUE)
##      } else {
##          stop('Must specify either level.sep or level.widths')
##      }
##      levels.df <- data.frame(linenum=rowmapping.levels[rowmapping.levels$post > start &
##                                             rowmapping.levels$post < end, 'pre'])
##      for(i in seq_along(level.names)) {
##          levels.df[,level.names[i]] <- sapply(levels.raw, FUN=function(x) { stripWhite(x[i]) })
##      }
##      var <- codebook[codebook$linenum == rowmapping[start == rowmapping$post,'pre'], var.name]
##      varlevels[[var]] <- levels.df
##  }
##  
##  codebook$isfactor <- codebook$var %in% names(varlevels)
##  
##  attr(codebook, 'levels') <- varlevels
##  class(codebook) <- c('codebook', 'data.frame')
##  return(codebook)
## }
"
http://jason.bryer.org/codebooks/ccdCodebook.txt
The codebook provided is in plain text and required two modifications: One, general file information at the top of the file was deleted, and two, any descriptions that spanned lines need to be modified so the are on only one line.
"
## [1] "\nhttp://jason.bryer.org/codebooks/ccdCodebook.txt\nThe codebook provided is in plain text and required two modifications: One, general file information at the top of the file was deleted, and two, any descriptions that spanned lines need to be modified so the are on only one line.\n"
setwd("~/tools/disentangle/codebooks_jbryer")
ccd.codebook <- parse.codebook(
  file = 'ccdCodebook.txt',
  var.names = c('variable', 'order', 'type', 'description'),
  level.names = c('level', 'label'),
  level.sep = '=',
  var.widths = c(13, 7, 7, Inf)
)

head(ccd.codebook)
##   linenum variable order type
## 1       1 SURVYEAR     1   AN
## 2       3  NCESSCH     2   AN
## 3       5    FIPST     3   AN
## 4      17     IVAN     4   AN
##                                                                                      description
## 1                                                           Year corresponding to survey record.
## 2 Unique NCES public school ID (7-digit NCES agency ID (LEAID) + 5-digit NCES school ID (SCHNO).
## 3                                      American National Standards Institute (ANSI) state code..
## 4                                                                                           Name
##   isfactor
## 1    FALSE
## 2    FALSE
## 3     TRUE
## 4    FALSE
str(ccd.codebook)
## Classes 'codebook' and 'data.frame': 4 obs. of  6 variables:
##  $ linenum    : int  1 3 5 17
##  $ variable   : chr  "SURVYEAR" "NCESSCH" "FIPST" "IVAN"
##  $ order      : chr  "1" "2" "3" "4"
##  $ type       : chr  "AN" "AN" "AN" "AN"
##  $ description: chr  "Year corresponding to survey record." "Unique NCES public school ID (7-digit NCES agency ID (LEAID) + 5-digit NCES school ID (SCHNO)." "American National Standards Institute (ANSI) state code.." "Name"
##  $ isfactor   : logi  FALSE FALSE TRUE FALSE
##  - attr(*, "levels")=List of 1
##   ..$ FIPST:'data.frame':    9 obs. of  3 variables:
##   .. ..$ linenum: int  7 8 9 10 11 12 13 14 15
##   .. ..$ level  : chr  "01" "02" "04" "05" ...
##   .. ..$ label  : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
ccd.var.levels <- attr(ccd.codebook, 'levels')
names(ccd.var.levels)
## [1] "FIPST"
ccd.var.levels[['FIPST']]
##   linenum level                label
## 1       7    01              Alabama
## 2       8    02               Alaska
## 3       9    04              Arizona
## 4      10    05             Arkansas
## 5      11    06           California
## 6      12    08             Colorado
## 7      13    09          Connecticut
## 8      14    10             Delaware
## 9      15    11 District of Columbia
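Those extracted levels could then be used to label the raw codes in the matching data file - a sketch only, where 'ccd' stands for a hypothetical data frame read from the CCD data itself:

fipst.levels <- ccd.var.levels[['FIPST']]
ccd$FIPST <- factor(ccd$FIPST,
                    levels = fipst.levels$level,
                    labels = fipst.levels$label)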

@petebaker
Author

Will explore after lunch. I've been exploring my codebook/dryworkflow stuff on GitHub, which I haven't looked at for a year - too much on, unfortunately, but this is the perfect excuse. Cheers

@TanyaMurphy

Hi,
Very excited to find this thread. I had thrown together a note for myself a few days ago, which I've included below. I will look at your posts more closely in a couple of weeks and try to make connections. In the meantime, you may find these resources interesting (if you haven't already seen them).

Regarding my Excel reference: I'm not a fan of it for my solo work, but it's popular with research coordinators and serves as a common tool for many research teams.


Could there be convergence between the data-cleaning systems of

In particular:

  • facilitate using and enhancing an Excel-based data dictionary/codebook that contains (obviously) variable names, labels, descriptions, units (if applicable), flagging/cleaning rules, and factoring/categorization rules for individual variables, plus flagging/cleaning and categorization rules for joint distributions -- ideally used as inputs to emerging functions that replace cumbersome nested ifelse statements (see the sketch just below).
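As a sketch of that last point (variable and cut-off names made up for illustration), a categorization rule stated once in the codebook might be translated into a single dplyr::case_when() call instead of nested ifelse():

# nested ifelse version
dat$bmi_cat <- ifelse(dat$bmi < 18.5, "underweight",
                 ifelse(dat$bmi < 25, "normal",
                   ifelse(dat$bmi < 30, "overweight", "obese")))

# the same rule with dplyr::case_when()
library(dplyr)
dat <- mutate(dat, bmi_cat = case_when(
  bmi < 18.5 ~ "underweight",
  bmi < 25   ~ "normal",
  bmi < 30   ~ "overweight",
  TRUE       ~ "obese"
))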

Rules-based programming - SE

tidyverse/tidyr#183

tidyverse/dplyr#631 (incidentally, Edwin de Jonge started that thread but did not comment again - are they taking a different path with their deducorrect and data-cleaning packages?)

Is this relevant?
ropenscilabs/assertr: Assertive programming for R analysis pipelines
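If it is, codebook-style range and level rules could map onto assertr checks along these lines (a sketch; the column names and bounds are illustrative only):

library(assertr)
library(dplyr)

dat %>%
  assert(within_bounds(0, 120), age) %>%   # range rule from the codebook
  assert(in_set("M", "F"), sex)            # allowed factor codes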


I look forward to following the developments on this topic. Cheers!

@ivanhanigan
Contributor

ivanhanigan commented Jan 19, 2017 via email
