Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved guess_sep() function to detect ';' and use less mem #842

Merged
merged 1 commit into from
Oct 16, 2017

Conversation

duzun
Copy link
Contributor

@duzun duzun commented Oct 16, 2017

I have some CSV docs that use ';' as a separator, but I don't have control of the source generating them, thus it is out of my control which separator they use.
I need the library to correctly guess it.

The previous version of guess_sep() doesn't ever detect ';'.

Plus I improved memory consumption of the function. It was counting all chars from the sample (max. 1024), but now it just counts the number of ',', '\t' and ';'.

@SheetJSDev
Copy link
Contributor

Even though this doesn't align with Excel US behavior (see below), this change makes perfect sense. A few nits:

  • Can you unstage the change to the dist/xlsx.js file? Those are only changed when we bump the version.

  • Out of curiosity, is the memory usage from counting every character significant? The guesser only checks the first 1024 characters and the count object isn't leaking as a global.

  • The sort function can be reduced to:

		cc.sort(function(a, b) { return a[0] - b[0] || guess_sep_weights[a[1]] - guess_sep_weights[b[1]]; });

Excel Behavior

As usual, CSV is a mess :/ Excel's "CSV" output delimiter is governed by the Windows "Line Separator". This screenshot is from Windows 7 "Region and Language" for German (Germany):

Excel on the read side assumes that the list separator is the computer setting, so if you flip back to US the semicolons are not interpreted:

This difference needs to be documented somewhere in the README (after this PR is merged)

@duzun
Copy link
Contributor Author

duzun commented Oct 16, 2017

Out of curiosity, is the memory usage from counting every character significant? The guesser only checks the first 1024 characters and the count object isn't leaking as a global.

In the worst case scenario (eg. Chinese document), there would be 1024 different characters, which is (I guess) 1024 integers (4 or 8 bytes each) for keys and another 1024 Numbers (8 bytes each course it's double), which is about 12-16Kb of RAM (just guessing the minimum, not counting memory for Object). That could be an entire document!
Now imagine processing multiple Chinese documents on a heavily loaded server inside one process, so that there is hardly time for GC (garbage collector), 10 documents per second - in a minute it would be 7.2-9.6Mb. I think this is not significant, but I see no reason to consume extra resources for no reason :-).
On the other hand, what if someone wants to use not just first 1024 characters for the guess, but 10x (some futuristic guess)?

The sort function can be reduced to ...

I like it :-)! But does it improve readability? For me it does.

(MS)Excel on the read side assumes that the list separator is the computer setting...

This is one of the main reasons I don't use MS Excel for CSV. I don't use Windows either. And the code I write to process CSV is run mostly on servers, far away from Windows OS.
Programs generating CSV aren't necessarily (MS)Excel, nor the ones reading CSV.
Is there a particular reason why this library should align 100% to (MS)Excel's behavior?

This difference needs to be documented somewhere in the README (after this PR is merged)

True!

@SheetJSDev
Copy link
Contributor

I see no reason to consume extra resources for no reason :-)

Good point!

But does it improve readability? For me it does.

Wouldn't have mentioned it if I didn't prefer it :)

Is there a particular reason why this library should align 100% to (MS)Excel's behavior?

When dealing with ambiguities we'd prefer to stick to Excel's behavior, but sensible deviations are acceptable so long as they are documented. I agree that semicolons should be treated as a possible delimiter. We'll update the README notes after the merge -- I think a comment would be needed in the Guessing File Type section under "Implementation Details".

After changing the sort function, can you squash down to one commit with the message:

DSV detect semicolon ';' delimiter

@duzun
Copy link
Contributor Author

duzun commented Oct 16, 2017

Not sure about the comment, as I'm not aware of other implementation details of this lib.

@SheetJSDev SheetJSDev merged commit 51b4751 into SheetJS:master Oct 16, 2017
@SheetJSDev
Copy link
Contributor

Thanks a lot! We're aiming to release the next version later today

@duzun
Copy link
Contributor Author

duzun commented Oct 16, 2017

Cool!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants