Read any type of "structured" data and output in a tabular format.

tteigman/gido

Welcome to the Garbage in Data Out Toolkit

FAQ

What is this?

Read any type of "structured" data and output in a tabular format.

Who is this for?

  • Organizations that need to transform client reports for input into an application database.
  • Organizations that are migrating from spreadsheets and similar files into an application (e.g. sales data into a CRM).

How does that even work?

When this is done manually, someone looks over the data (commonly in Excel) and manipulates it until it fits a standard tabular layout. Depending on the input data, this can be quite laborious. I'm proposing to use statistical techniques to determine header information and pattern recognition to classify the data.
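
As a rough illustration of the "pattern recognition to classify the data" idea, here is a minimal sketch that labels each cell by matching it against a few regular expressions. The pattern table, the labels, and the names classify_cell and classify_row are assumptions made up for this example; they are not part of gido.

```python
import re

# Hypothetical pattern table: labels and regexes are illustrative assumptions,
# not something the toolkit defines. Order matters: first match wins.
PATTERNS = [
    ("empty", re.compile(r"^\s*$")),
    ("integer", re.compile(r"^[+-]?\d+$")),
    ("decimal", re.compile(r"^[+-]?\d*\.\d+$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),  # ISO dates only
]


def classify_cell(value: str) -> str:
    """Return the label of the first pattern that matches, else "text"."""
    text = value.strip()
    for label, pattern in PATTERNS:
        if pattern.match(text):
            return label
    return "text"


def classify_row(row):
    """Classify every cell in a row (a list of strings)."""
    return [classify_cell(cell) for cell in row]


if __name__ == "__main__":
    print(classify_row(["Acme", "2021-03-04", "12", "3.50", ""]))
```

A real input would likely need many more patterns (currencies, percentages, other date formats), but even a small table like this turns raw cells into labels that later steps can compare row by row.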

How can this be accomplished if this problem requires very high if not exact accuracy?

Consider a jigsaw puzzle: you may not get it right, but it will be obvious when you're wrong. Also, as more pieces are put into place, the puzzle becomes easier. The assumption is that while the input data is not structured for input into a database, there is structure to it, and rows or sections of cells will fit together like the pieces of a puzzle.

Techniques

Header guessing

Have you ever accidentally sorted your headers into your data in Excel? Besides the table no longer having headers, it's pretty obvious that the header row doesn't fit with the rest of the data. Even in a dataset that has duplicated headers, we can determine which rows don't fit by extracting features that differentiate the header(s) from the rest of the data.
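
One way this could be sketched, assuming a very simple feature (whether a cell is a number, text, or empty): find the most common kind in each column, then score each row by the fraction of its cells that disagree with that majority. The function names and the 0.5 threshold are hypothetical choices for this example, not the method gido implements.

```python
from collections import Counter


def cell_kind(value: str) -> str:
    """Coarse feature: "empty", "number", or "text"."""
    text = value.strip()
    if not text:
        return "empty"
    try:
        float(text)
        return "number"
    except ValueError:
        return "text"


def header_scores(rows):
    """Fraction of cells per row whose kind differs from the column majority.

    Assumes all rows have the same number of cells.
    """
    kinds = [[cell_kind(cell) for cell in row] for row in rows]
    columns = list(zip(*kinds))
    majority = [Counter(col).most_common(1)[0][0] for col in columns]
    return [
        sum(1 for kind, major in zip(row_kinds, majority) if kind != major)
        / len(majority)
        for row_kinds in kinds
    ]


def guess_header_rows(rows, threshold=0.5):
    """Indices of rows that look unlike the data (threshold is an assumption)."""
    return [i for i, score in enumerate(header_scores(rows)) if score >= threshold]


if __name__ == "__main__":
    table = [
        ["name", "qty", "price"],
        ["apple", "3", "1.20"],
        ["pear", "5", "0.80"],
        ["name", "qty", "price"],  # duplicated header sorted into the data
        ["plum", "2", "2.10"],
    ]
    print(guess_header_rows(table))
```

This only works when most columns hold non-text data; a column that is all text, like the first one above, gives no signal, so a fuller version would need richer features such as string length, capitalization, or repeated values.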
