Potential bug in reading SAS files with CHAR (RLE) compression and many repeated characters #31243
I always read SAS files with
Thanks, but unfortunately no. Before submitting the issue I ran a script to check whether the problem persists with any other encoding for reading the data. (It does.) I'm not sure whether the same happens if another encoding is used when writing the original sas7bdat file.
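As a rough sketch of that check, assuming a file name and encoding list of my own choosing:

```python
import pandas as pd

# Re-read the file under several candidate encodings; the corrupted
# values show up regardless, so the read-side encoding is not the cause.
for enc in ("latin1", "utf-8", "cp1252", "iso-8859-15"):
    df = pd.read_sas("example.sas7bdat", encoding=enc)
    print(enc, df.iloc[0].tolist())
```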
thanks for that info :)
Hi, I've tested a bit and found that with preceding fields, a field with 18 consecutive zeros is decompressed OK, but with preceding fields and 19 zeros it is NOT decompressed correctly. Stepping through the code for sas7bdat, for the case that works the control byte for the 18 zeros is hex C, whereas for 19 zeros it is hex 4. The meaning of C is documented in https://cran.r-project.org/web/packages/sas7bdat/vignettes/sas7bdat.pdf but hex 4 is not.

sas7bdat (pure Python):

```python
elif control_byte == 0x40:
    # not documented
    copy_counter = (
        end_of_first_byte * 16 +
        (b(page[offset + i + 1]) & 0xFF)
    )
    for _ in xrange(copy_counter + 18):
        result.append(c(page[offset + i + 2]))
        current_result_array_index += 1
    i += 2
```

pandas (Cython):

```cython
elif control_byte == 0x40:
    # not documented
    nbytes = end_of_first_byte * 16
    nbytes += <int>(inbuff[ipos])
    ipos += 1
    for _ in range(nbytes):
        result[rpos] = inbuff[ipos]
        rpos += 1
    ipos += 1
```

Looking around at other R/C implementations for reading SAS files, the ReadStat C library by Evan Miller, https://github.com/WizardMac/ReadStat/blob/master/src/sas/readstat_sas_rle.c, could be a useful source of info and appears to be actively maintained. It also has code for control byte 0x50 (missing from both sas7bdat and pandas):

```c
#define SAS_RLE_COMMAND_INSERT_BYTE18   4
#define SAS_RLE_COMMAND_INSERT_AT17     5
...
case SAS_RLE_COMMAND_INSERT_BYTE18:
    insert_len = (*input++) + 18 + length * 256;
    insert_byte = *input++;
    break;
case SAS_RLE_COMMAND_INSERT_AT17:
    insert_len = (*input++) + 17 + length * 256;
    insert_byte = '@';
    break;
```

There is also a Python wrapper, https://github.com/Roche/pyreadstat, around Evan Miller's library.
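Translating those two cases into plain Python may make the intended semantics clearer. This is only a sketch of my reading of the ReadStat code above (the function and variable names are made up, not pandas or ReadStat internals); the low nibble of the control byte supplies the high part of the run length:

```python
def rle_insert_commands(control_byte: int, src: bytes, pos: int, out: bytearray) -> int:
    """Sketch of the 0x40/0x50 RLE commands per ReadStat's semantics.

    Returns the updated read position in ``src``.
    """
    length = control_byte & 0x0F  # low nibble: high bits of the run length

    if control_byte & 0xF0 == 0x40:  # SAS_RLE_COMMAND_INSERT_BYTE18
        insert_len = src[pos] + 18 + length * 256
        insert_byte = src[pos + 1]
        out.extend(bytes([insert_byte]) * insert_len)
        return pos + 2
    if control_byte & 0xF0 == 0x50:  # SAS_RLE_COMMAND_INSERT_AT17
        insert_len = src[pos] + 17 + length * 256
        out.extend(b"@" * insert_len)
        return pos + 1
    raise ValueError(f"unhandled control byte {control_byte:#x}")
```

Note that the two reference implementations disagree on the multiplier for the low nibble (sas7bdat uses * 16, ReadStat * 256) but agree on the + 18 offset, which is exactly what the pandas branch quoted above lacks.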
I can try to put together a PR to alter the decompression for control code 0x40 (and add code for 0x50), or (since we have other issues related to SAS files) should we look to use Evan Miller's library or pyreadstat rather than duplicating others' efforts within pandas? @jbrockmendel, @mroeschke - what do you think?
cc @bashtage
@ofajardo I agree; pyreadstat is also significantly faster than pandas' Cython implementation.
It sounds like a no-brainer to move to pyreadstat given it is already a soft dependency. The only question is what the path will look like. It would probably require adding optional support, deprecating pandas' built-in reader, and then eventually removing the native reader.
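In the meantime, pyreadstat can already be used directly as a cross-check; a minimal usage sketch (the file name is a placeholder):

```python
import pyreadstat

# Decompress the CHAR (RLE) file with ReadStat's decoder rather than
# pandas' native reader; returns the data frame plus file metadata.
df, meta = pyreadstat.read_sas7bdat("example.sas7bdat")
print(df.head())
print(meta.column_names)
```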
Sounds like a plan!
Hi,
I think I ran into a bug in the RLE decompression implementation.
Short description:
String fields with more than 32 repeated consecutive characters are cropped at 32, and the next fields spill over, corrupting the whole dataframe.

Example:
example.csv with fields of length 50
Create a CHAR compressed sas7bdat file (system encoding is set to latin1)
This is what you get:
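As a rough sketch of the read step, assuming a file produced as above (the file name is a placeholder):

```python
import pandas as pd

# Read the CHAR (RLE) compressed file with the encoding it was written in.
df = pd.read_sas("example.sas7bdat", encoding="latin1")

# Fields with more than 32 repeated consecutive characters come back
# truncated, and subsequent fields spill into the wrong columns.
print(df.head())
```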
There are a couple of interesting points:
Reading the same file with the sas7bdat package works fine.
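For comparison, a minimal sketch of the same read via the pure-Python sas7bdat package (file name again a placeholder):

```python
from sas7bdat import SAS7BDAT

# sas7bdat applies the +18 offset for control byte 0x40, so the
# repeated-character fields decompress correctly here.
with SAS7BDAT("example.sas7bdat") as reader:
    df = reader.to_data_frame()
print(df.head())
```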
DISCLAIMER
I would never use any of the things above of my own free will. Sadly, this is an actual case I keep running into when having to deal with SAS... 😢
Output of pd.show_versions()