Improved guess_sep() function to detect ';' and use less mem #842

duzun · 2017-10-16T11:56:04Z

I have some CSV docs that use ';' as a separator, but I don't have control of the source generating them, thus it is out of my control which separator they use.
I need the library to correctly guess it.

The previous version of guess_sep() doesn't ever detect ';'.

Plus I improved memory consumption of the function. It was counting all chars from the sample (max. 1024), but now it just counts the number of ',', '\t' and ';'.

SheetJSDev · 2017-10-16T14:30:45Z

Even though this doesn't align with Excel US behavior (see below), this change makes perfect sense. A few nits:

Can you unstage the change to the dist/xlsx.js file? Those are only changed when we bump the version.
Out of curiosity, is the memory usage from counting every character significant? The guesser only checks the first 1024 characters and the count object isn't leaking as a global.
The sort function can be reduced to:

		cc.sort(function(a, b) { return a[0] - b[0] || guess_sep_weights[a[1]] - guess_sep_weights[b[1]]; });

Excel Behavior

As usual, CSV is a mess :/ Excel's "CSV" output delimiter is governed by the Windows "Line Separator". This screenshot is from Windows 7 "Region and Language" for German (Germany):

Excel on the read side assumes that the list separator is the computer setting, so if you flip back to US the semicolons are not interpreted:

This difference needs to be documented somewhere in the README (after this PR is merged)

duzun · 2017-10-16T18:42:23Z

Out of curiosity, is the memory usage from counting every character significant? The guesser only checks the first 1024 characters and the count object isn't leaking as a global.

In the worst case scenario (eg. Chinese document), there would be 1024 different characters, which is (I guess) 1024 integers (4 or 8 bytes each) for keys and another 1024 Numbers (8 bytes each course it's double), which is about 12-16Kb of RAM (just guessing the minimum, not counting memory for Object). That could be an entire document!
Now imagine processing multiple Chinese documents on a heavily loaded server inside one process, so that there is hardly time for GC (garbage collector), 10 documents per second - in a minute it would be 7.2-9.6Mb. I think this is not significant, but I see no reason to consume extra resources for no reason :-).
On the other hand, what if someone wants to use not just first 1024 characters for the guess, but 10x (some futuristic guess)?

The sort function can be reduced to ...

I like it :-)! But does it improve readability? For me it does.

(MS)Excel on the read side assumes that the list separator is the computer setting...

This is one of the main reasons I don't use MS Excel for CSV. I don't use Windows either. And the code I write to process CSV is run mostly on servers, far away from Windows OS.
Programs generating CSV aren't necessarily (MS)Excel, nor the ones reading CSV.
Is there a particular reason why this library should align 100% to (MS)Excel's behavior?

This difference needs to be documented somewhere in the README (after this PR is merged)

True!

SheetJSDev · 2017-10-16T19:08:49Z

I see no reason to consume extra resources for no reason :-)

Good point!

But does it improve readability? For me it does.

Wouldn't have mentioned it if I didn't prefer it :)

Is there a particular reason why this library should align 100% to (MS)Excel's behavior?

When dealing with ambiguities we'd prefer to stick to Excel's behavior, but sensible deviations are acceptable so long as they are documented. I agree that semicolons should be treated as a possible delimiter. We'll update the README notes after the merge -- I think a comment would be needed in the Guessing File Type section under "Implementation Details".

After changing the sort function, can you squash down to one commit with the message:

DSV detect semicolon ';' delimiter

duzun · 2017-10-16T20:07:19Z

Not sure about the comment, as I'm not aware of other implementation details of this lib.

SheetJSDev · 2017-10-16T20:11:22Z

Thanks a lot! We're aiming to release the next version later today

duzun · 2017-10-16T20:42:54Z

Cool!

DSV detect semicolon ';' delimiter

51b4751

duzun force-pushed the master branch from f6c97b6 to 51b4751 Compare October 16, 2017 19:47

SheetJSDev merged commit 51b4751 into SheetJS:master Oct 16, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved guess_sep() function to detect ';' and use less mem #842

Improved guess_sep() function to detect ';' and use less mem #842

duzun commented Oct 16, 2017 •

edited

Loading

SheetJSDev commented Oct 16, 2017

duzun commented Oct 16, 2017

SheetJSDev commented Oct 16, 2017

duzun commented Oct 16, 2017

SheetJSDev commented Oct 16, 2017

duzun commented Oct 16, 2017

Improved guess_sep() function to detect ';' and use less mem #842

Improved guess_sep() function to detect ';' and use less mem #842

Conversation

duzun commented Oct 16, 2017 • edited Loading

SheetJSDev commented Oct 16, 2017

Excel Behavior

duzun commented Oct 16, 2017

SheetJSDev commented Oct 16, 2017

duzun commented Oct 16, 2017

SheetJSDev commented Oct 16, 2017

duzun commented Oct 16, 2017

duzun commented Oct 16, 2017 •

edited

Loading