Real-world data is almost never perfect. Minor raggedness in a CSV can have any number of causes, ranging from missing quotes around a string that contains the delimiter to simple typos. One of R's greatest strengths is how well it deals with situations like this. For example, the trivial solution used to be defining placeholder columns in `col_names` (or equivalent). This let you read the data and then clean it inside R.
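A minimal sketch of that placeholder-column approach, using base R's `read.csv` as a stand-in rather than any particular readr version. The data and the `extra` column name are made up for illustration:

```r
# Hypothetical ragged CSV: row 2 has an unquoted comma, so it has 5 fields, not 4.
raw <- "id,name,city,score
1,Ann,Paris,10
2,Bob,Lyon, France,20
3,Cy,Nice,30"

# Supply one more column name than the header has; `extra` is a placeholder.
# With fill = TRUE, short rows are padded, so four-field rows get NA in `extra`.
df <- read.csv(
  text = raw, header = FALSE, skip = 1,
  col.names = c("id", "name", "city", "score", "extra"),
  fill = TRUE, strip.white = TRUE, stringsAsFactors = FALSE
)

# Clean inside R: rows that spilled into `extra` had a split `city` field.
bad <- !is.na(df$extra)
df$city[bad] <- paste(df$city[bad], df$score[bad], sep = ", ")
df$score[bad] <- df$extra[bad]
df$score <- as.integer(df$score)
df$extra <- NULL
```

The point is only that the supplied names decide how many columns come back, so the malformed row survives the read and can be repaired afterwards.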
In vroom, and now in new versions of readr, this is impossible. Even with `col_names` explicitly defined, there is no way to force readr/vroom to do the right thing. There are three very important issues here.
The first is that a de facto monopoly in the R ecosystem has once again made a very user-hostile breaking change without any announcement, warning, or even documentation.
The second is the documentation. Not only is this behavior not documented, the documentation that does exist explicitly leads users to believe the opposite will happen:
col_names
Either TRUE, FALSE or a character vector of column names.
If TRUE, the first row of the input will be used as the column names, and will not be included in the data frame. If FALSE, column names will be generated automatically: X1, X2, X3 etc.
If col_names is a character vector, the values will be used as the names of the columns, and the first row of the input will be read into the first row of the output data frame.
Missing (NA) column names will generate a warning, and be filled in with dummy names ...1, ...2 etc. Duplicate column names will generate a warning and be made unique, see name_repair to control how this is done.
And the third is the behavior itself. It's a severe antipattern to have an argument like `col_names` and then silently ignore the user's input, leaving them wondering why they provided 5 column names and the function is giving errors about expecting 4 columns.
The ideal solution is obviously that user input should be authoritative. If a user supplies 5 column names, vroom should return 5 columns, with NAs where appropriate. But at an absolute minimum, the documentation should be changed to state explicitly that `col_names` is only a suggestion and will be ignored based on what vroom decides under the hood.