
Potential bug in reading SAS files with CHAR (RLE) compression and many repeated characters #31243

Closed
peterzsohar opened this issue Jan 23, 2020 · 10 comments · Fixed by #47113
Labels: Bug, IO SAS (SAS: read_sas)

@peterzsohar commented Jan 23, 2020

Hi,

I think I ran into a bug in the RLE decompression implementation.

Short description:
String fields with more than 32 repeated consecutive characters are cropped at 32, and the following fields spill over, corrupting the whole dataframe.

Example:

example.csv with fields of length 50

long_string_field1,long_string_field2,long_string_field3
"00000000000000000000000000000000000000000000000000","11111111111111111111111111111111111111111111111111","aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

Create a CHAR compressed sas7bdat file (system encoding is set to latin1)

options compress=char;
proc import datafile="path\example.csv"
        out=your_lib.example
        dbms=csv
        replace;
        getnames=yes;
run;

Read the file back with pandas:

import pandas as pd
example = pd.read_sas("./example.sas7bdat", encoding="latin1")

This is what you get:

>>> example['long_string_field1'].values[0]
'00000000000000000000000000000000  111111111111111111'
>>> example['long_string_field2'].values[0]
'11111111111111  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> example['long_string_field3'].values[0]
nan

There are a couple of interesting points:

  1. Exactly 32 characters are read/written from each field; the originals had 50.
  2. The dataframe fields are filled up to the 50-character limit with spillover from the following fields (note the two extra spaces between the content coming from consecutive source fields).
  3. This only happens when the CHAR (RLE) compression is used.
  4. This only happens if you have repeated consecutive characters.
  5. The sas7bdat package works fine.

DISCLAIMER

I would never use any of the things above of my own free will. Sadly, this is an actual case I keep running into when having to deal with SAS... 😢

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.9.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : None.None

pandas           : 0.25.3
numpy            : 1.17.5
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 45.1.0.post20200119
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : 2.10.3
IPython          : 7.11.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.15.1
pytables         : None
s3fs             : None
scipy            : None
sqlalchemy       : 1.3.12
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
@Eric-Sommer

I always read SAS files with encoding='iso-8859-1' (see here). Does that help in your case?

@peterzsohar (Author)

> I always read SAS files with encoding='iso-8859-1' (see here). Does that help in your case?

Thanks, but unfortunately no.
latin1 is the encoding that was used for the original sas7bdat, so it is in fact the correct encoding to use when reading the file into pandas.
(also, latin1 is just an alias for iso-8859-1 😉 )

Nevertheless, before submitting the issue I ran a script to check whether the issue persists when using any other encoding to read the data. (It does.)

Not sure whether the same happens if another encoding is used when writing the original sas7bdat file.

@Eric-Sommer commented Jan 24, 2020

> (also, latin1 is just an alias for iso-8859-1 😉 )

thanks for that info :)

@jbrockmendel jbrockmendel added the IO SAS SAS: read_sas label Jan 26, 2020
@mroeschke mroeschke added the Bug label Apr 5, 2020
@paul-lilley (Contributor) commented Sep 3, 2020

Hi,
I can confirm this is a problem with decoding the RLE compression, not related to encoding.
Looking at the docs for sas7bdat (actually the docs for the R package that sas7bdat is based on), SAS uses an interesting compression scheme with differing (and overlapping) options for how single and consecutive bytes are compressed. It looks like pandas correctly decompresses some of the options, but not all.

I've tested a bit and found that, with preceding fields present, a field of 18 consecutive zeros is decompressed OK, but a field of 19 consecutive zeros is NOT.

Stepping through the sas7bdat code: for the case that works, the control byte for the 18 zeros is hex C, whereas for 19 zeros it is hex 4. The meaning of hex C is documented at https://cran.r-project.org/web/packages/sas7bdat/vignettes/sas7bdat.pdf, but hex 4 is not.
Control byte hex 4 is implemented in both sas7bdat and pandas, but the implementations appear to differ.
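
For context on the nibble terminology in the snippets below: each control byte packs the command in its high nibble and a length argument in its low nibble (the end_of_first_byte variable), so "hex C" and "hex 4" refer to the high nibble. A minimal sketch of the split (the function name is mine, for illustration only):

    def split_control_byte(byte_value):
        # high nibble selects the command, e.g. 0x40 or 0xC0
        control_byte = byte_value & 0xF0
        # low nibble is a length argument for the command
        end_of_first_byte = byte_value & 0x0F
        return control_byte, end_of_first_byte

    # 0x4C -> command 0x40 with length nibble 12
    assert split_control_byte(0x4C) == (0x40, 0x0C)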

sas7bdat (pure python):

            elif control_byte == 0x40:
                copy_counter = (
                    end_of_first_byte * 16 +
                    (b(page[offset + i + 1]) & 0xFF)
                )
                for _ in xrange(copy_counter + 18):
                    result.append(c(page[offset + i + 2]))
                    current_result_array_index += 1
                i += 2

pandas (cython):

        elif control_byte == 0x40:
            # not documented
            nbytes = end_of_first_byte * 16
            nbytes += <int>(inbuff[ipos])
            ipos += 1
            for _ in range(nbytes):
                result[rpos] = inbuff[ipos]
                rpos += 1
            ipos += 1

Looking around at other R/C implementations for reading SAS files, the Readstat C library by Evan Miller https://github.com/WizardMac/ReadStat/blob/master/src/sas/readstat_sas_rle.c could be a useful source of info and appears to be actively maintained. This also has code for control byte 0x50 (missing from both sas7bdat and pandas).

#define SAS_RLE_COMMAND_INSERT_BYTE18   4
#define SAS_RLE_COMMAND_INSERT_AT17     5
...
            case SAS_RLE_COMMAND_INSERT_BYTE18:
                insert_len = (*input++) + 18 + length * 256;
                insert_byte = *input++;
                break;
            case SAS_RLE_COMMAND_INSERT_AT17:
                insert_len = (*input++) + 17 + length * 256;
                insert_byte = '@';
                break;

There is also a python wrapper https://github.com/Roche/pyreadstat around Evan Miller's library.
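
Transplanting ReadStat's formulas into plain Python gives a sense of what correct handling of commands 0x4 and 0x5 could look like. This is a hedged sketch of ReadStat's semantics only (the helper name is mine, and the exact offset handling would need verifying against real sas7bdat pages), not a drop-in fix for the Cython code:

    def decompress_insert_runs(inbuff):
        """Sketch of SAS RLE commands 0x4 and 0x5, following ReadStat."""
        result = bytearray()
        ipos = 0
        while ipos < len(inbuff):
            control_byte = inbuff[ipos] & 0xF0
            length_nibble = inbuff[ipos] & 0x0F
            ipos += 1
            if control_byte == 0x40:
                # insert one byte, repeated (argument byte) + 18 + 256 * nibble times
                nbytes = inbuff[ipos] + 18 + length_nibble * 256
                ipos += 1
                result.extend(bytes([inbuff[ipos]]) * nbytes)
                ipos += 1
            elif control_byte == 0x50:
                # insert '@' (0x40), repeated (argument byte) + 17 + 256 * nibble times
                nbytes = inbuff[ipos] + 17 + length_nibble * 256
                ipos += 1
                result.extend(b"@" * nbytes)
            else:
                raise NotImplementedError(f"control byte {control_byte:#x}")
        return bytes(result)

    # An argument byte of 0x01 yields a 19-byte run of '0' (0x30) -- exactly
    # the boundary where the current pandas code starts to fail above.
    assert decompress_insert_runs(b"\x40\x01\x30") == b"0" * 19

The +18/+17 offsets (and the 256 multiplier on the length nibble) are the parts missing from the pandas branch shown above.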

@paul-lilley (Contributor)

I can try to put together a PR to alter the decompression for control code 0x40 (and add code for 0x50), or (since we have other issues related to SAS files) should we look to use Evan Miller's library or pyreadstat rather than duplicating others' efforts within pandas? @jbrockmendel, @mroeschke - what do you think?

@mroeschke (Member)

cc @bashtage

@ofajardo commented Dec 9, 2020

pyreadstat (a Python wrapper around Evan Miller's ReadStat) is already used in pandas.read_spss, so I guess it would be easy to also provide it as a backend option for read_sas (see #5768, #26537).

This move would also solve other issues such as #37088, #35545, #22720, #18198
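
For anyone blocked by this in the meantime, reading the file directly with pyreadstat should work around the bug, since the decompression is done by ReadStat. A minimal example (the field name is taken from the report above):

    import pyreadstat

    # read_sas7bdat returns a (DataFrame, metadata) tuple
    df, meta = pyreadstat.read_sas7bdat("./example.sas7bdat", encoding="latin1")
    print(df["long_string_field1"].values[0])  # expect all 50 characters intact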

@paul-lilley (Contributor) commented Dec 12, 2020

@ofajardo I agree. pyreadstat is also significantly faster than pandas' Cython implementation.

@bashtage (Contributor)

It sounds like a no-brainer to move to pyreadstat given it is already a soft dependency. The only question is what the path will look like. It would probably require adding optional support, deprecating pandas' built-in reader, and then eventually removing the native one.

@ofajardo

Sounds like a plan!
