
Potential bug in reading SAS files with CHAR (RLE) compression and many repeated characters #31243

Closed
peterzsohar opened this issue Jan 23, 2020 · 10 comments · Fixed by #47113
Labels: Bug, IO SAS (SAS: read_sas)

@peterzsohar commented Jan 23, 2020

Hi,

I think I ran into a bug in the RLE decompression implementation.

Short description:
String fields with more than 32 repeated consecutive characters are cropped at 32, and the following fields spill over, corrupting the whole dataframe.

Example:

example.csv with fields of length 50

long_string_field1,long_string_field2,long_string_field3
"00000000000000000000000000000000000000000000000000","11111111111111111111111111111111111111111111111111","aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

Create a CHAR compressed sas7bdat file (system encoding is set to latin1)

options compress=char;
proc import datafile="path\example.csv"
        out=your_lib.example
        dbms=csv
        replace;
        getnames=yes;
run;

Read the file back with pandas:

import pandas as pd
example = pd.read_sas("./example.sas7bdat", encoding="latin1")

This is what you get:

>>> example['long_string_field1'].values[0]
'00000000000000000000000000000000  111111111111111111'
>>> example['long_string_field2'].values[0]
'11111111111111  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> example['long_string_field3'].values[0]
nan

There are a couple of interesting points:

  1. Exactly 32 characters are read/written from each field; the originals had 50.
  2. The dataframe fields are filled up to the 50-character limit with spillover from the following fields (note the two extra spaces between the content coming from consecutive source fields).
  3. This only happens when the CHAR (RLE) compression is used.
  4. This only happens if you have repeated consecutive characters.
  5. The sas7bdat package works fine.

DISCLAIMER

I would never use any of the things above of my own free will. Sadly, this is an actual case I keep running into when having to deal with SAS... 😢

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.9.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : None.None

pandas           : 0.25.3
numpy            : 1.17.5
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 45.1.0.post20200119
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : 2.10.3
IPython          : 7.11.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.15.1
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : 1.3.12
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
@Eric-Sommer

I always read SAS files with encoding='iso-8859-1' (see here). Does that help in your case?

@peterzsohar (Author)

> I always read SAS files with encoding='iso-8859-1' (see here). Does that help in your case?

Thanks, but unfortunately no.
latin1 is the encoding that was used for the original sas7bdat, so it is in fact the correct encoding to use when reading the file into pandas.
(also, latin1 is just an alias for iso-8859-1 😉 )

Nevertheless, before submitting the issue I ran a script to check whether the issue persists when using any other encoding to read the data. (It does.)

Not sure whether the same happens if another encoding is used when writing the original sas7bdat file.

@Eric-Sommer commented Jan 24, 2020

> (also, latin1 is just an alias for iso-8859-1 😉 )

thanks for that info :)

@jbrockmendel jbrockmendel added the IO SAS SAS: read_sas label Jan 26, 2020
@mroeschke mroeschke added the Bug label Apr 5, 2020
@paul-lilley (Contributor) commented Sep 3, 2020

Hi,
I can confirm this is a problem with decoding the RLE compression, not related to encoding.
Looking at the docs for sas7bdat (actually the docs for the R package that sas7bdat is based on), SAS uses an interesting compression scheme with differing (and overlapping) options for how single and consecutive bytes are compressed. It looks like pandas correctly decompresses some of the options, but not all.

I've tested a bit and found that, with preceding fields present, a field of 18 consecutive zeros is decompressed OK, but a field of 19 consecutive zeros is NOT.

Stepping through the sas7bdat code: for the case that works, the control byte for the 18 zeros is hex C, whereas for 19 zeros it is hex 4. The meaning of hex C is documented at https://cran.r-project.org/web/packages/sas7bdat/vignettes/sas7bdat.pdf, but hex 4 is not.
Control byte hex 4 is implemented in both sas7bdat and pandas, but the implementations appear to differ.
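
For context on the nibble terminology in the snippets below: each control byte packs the command in its high nibble and a length argument in its low nibble (the end_of_first_byte variable), so "hex C" and "hex 4" refer to the high nibble. A minimal sketch of the split (the function name is mine, for illustration only):

    def split_control_byte(byte_value):
        # high nibble selects the command, e.g. 0x40 or 0xC0
        control_byte = byte_value & 0xF0
        # low nibble is a length argument for the command
        end_of_first_byte = byte_value & 0x0F
        return control_byte, end_of_first_byte

    # 0x4C -> command 0x40 with length nibble 12
    assert split_control_byte(0x4C) == (0x40, 0x0C)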

sas7bdat (pure python):

            elif control_byte == 0x40:
                copy_counter = (
                    end_of_first_byte * 16 +
                    (b(page[offset + i + 1]) & 0xFF)
                )
                for _ in xrange(copy_counter + 18):
                    result.append(c(page[offset + i + 2]))
                    current_result_array_index += 1
                i += 2

pandas (cython):

        elif control_byte == 0x40:
            # not documented
            nbytes = end_of_first_byte * 16
            nbytes += <int>(inbuff[ipos])
            ipos += 1
            for _ in range(nbytes):
                result[rpos] = inbuff[ipos]
                rpos += 1
            ipos += 1

Looking around at other R/C implementations for reading SAS files, the Readstat C library by Evan Miller https://github.com/WizardMac/ReadStat/blob/master/src/sas/readstat_sas_rle.c could be a useful source of info and appears to be actively maintained. This also has code for control byte 0x50 (missing from both sas7bdat and pandas).

#define SAS_RLE_COMMAND_INSERT_BYTE18   4
#define SAS_RLE_COMMAND_INSERT_AT17     5
...
            case SAS_RLE_COMMAND_INSERT_BYTE18:
                insert_len = (*input++) + 18 + length * 256;
                insert_byte = *input++;
                break;
            case SAS_RLE_COMMAND_INSERT_AT17:
                insert_len = (*input++) + 17 + length * 256;
                insert_byte = '@';
                break;

There is also a python wrapper https://github.com/Roche/pyreadstat around Evan Miller's library.
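
Transplanting ReadStat's formulas into plain Python gives a sense of what correct handling of commands 0x4 and 0x5 could look like. This is a hedged sketch of ReadStat's semantics only (the helper name is mine, and the exact offset handling would need verifying against real sas7bdat pages), not a drop-in fix for the Cython code:

    def decompress_insert_runs(inbuff):
        """Sketch of SAS RLE commands 0x4 and 0x5, following ReadStat."""
        result = bytearray()
        ipos = 0
        while ipos < len(inbuff):
            control_byte = inbuff[ipos] & 0xF0
            length_nibble = inbuff[ipos] & 0x0F
            ipos += 1
            if control_byte == 0x40:
                # insert one byte, repeated (argument byte) + 18 + 256 * nibble times
                nbytes = inbuff[ipos] + 18 + length_nibble * 256
                ipos += 1
                result.extend(bytes([inbuff[ipos]]) * nbytes)
                ipos += 1
            elif control_byte == 0x50:
                # insert '@' (0x40), repeated (argument byte) + 17 + 256 * nibble times
                nbytes = inbuff[ipos] + 17 + length_nibble * 256
                ipos += 1
                result.extend(b"@" * nbytes)
            else:
                raise NotImplementedError(f"control byte {control_byte:#x}")
        return bytes(result)

    # An argument byte of 0x01 yields a 19-byte run of '0' (0x30) -- exactly
    # the boundary where the current pandas code starts to fail above.
    assert decompress_insert_runs(b"\x40\x01\x30") == b"0" * 19

The +18/+17 offsets (and the 256 multiplier on the length nibble) are the parts missing from the pandas branch shown above.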

@paul-lilley (Contributor)

I can try to put together a PR to alter the decompression for control code 0x40 (and add code for 0x50), or (since we have other issues related to SAS files) should we look to use Evan Miller's library or pyreadstat rather than duplicating others' efforts within pandas? @jbrockmendel, @mroeschke - what do you think?

@mroeschke (Member)

cc @bashtage

@ofajardo commented Dec 9, 2020

pyreadstat (a Python wrapper around Evan Miller's ReadStat) is already used in pandas.read_spss, so I guess it would be easy to also provide it as a backend option for read_sas (see #5768, #26537).

This move would also solve other issues such as #37088, #35545, #22720, #18198
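
For anyone blocked by this in the meantime, reading the file directly with pyreadstat should work around the bug, since the decompression is done by ReadStat. A minimal example (the field name is taken from the report above):

    import pyreadstat

    # read_sas7bdat returns a (DataFrame, metadata) tuple
    df, meta = pyreadstat.read_sas7bdat("./example.sas7bdat", encoding="latin1")
    print(df["long_string_field1"].values[0])  # expect all 50 characters intact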

@paul-lilley (Contributor) commented Dec 12, 2020

@ofajardo I agree. pyreadstat is also significantly faster than pandas' Cython implementation.

@bashtage (Contributor)

It sounds like a no-brainer to move to pyreadstat given it is already a soft dependency. The only question is what the path will look like. It would probably require adding optional support, deprecating pandas' built-in reader, and then eventually removing the native one.

@ofajardo

Sounds like a plan!
