[BUG] b64decode does not handle whitespaces #3446

Open
martinvuyk opened this issue Sep 3, 2024 · 3 comments
Labels
bug (Something isn't working), mojo-repo (Tag all issues with this label)

Comments

@martinvuyk
Contributor

Bug description

Detected by @lemire in PR #3443

Currently, b64decode does not appear to handle whitespace characters. I would have expected the following to print 'Bonjour', but it does not:

from base64 import b64decode
def main():
    var data = b64decode("Qm9 uam91cg==")
    print(data)

output:

Bo������

Steps to reproduce

  • Include relevant code snippet or link to code that did not work as expected.
  • If applicable, add screenshots to help explain the problem.
  • If using the Playground, name the pre-existing notebook that failed and the steps that led to failure.
  • Include anything else that might help us debug the issue.

System information

- What OS did you install Mojo on?
- Provide version information for Mojo by pasting the output of `mojo -v`
`mojo 2024.9.105`
- Provide Modular CLI version by pasting the output of `modular -v`
martinvuyk added the bug and mojo-repo labels on Sep 3, 2024
@lemire

lemire commented Sep 3, 2024

Note that I did not report it as a bug because the documentation does not seem to imply that white space is handled.

A relevant specification is WHATWG Forgiving Base64 decoding:

https://infra.spec.whatwg.org/#forgiving-base64-decode

C#/.NET follows it, as does JavaScript's atob function. Other systems may follow it as well.
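To make the forgiving behaviour concrete, here is a minimal Python sketch of the decode steps as I read them in the WHATWG spec linked above: strip ASCII whitespace, drop up to two trailing `=` when the length is a multiple of 4, fail when the length leaves a remainder of 1, fail on any character outside the alphabet, then decode. The name `forgiving_b64decode` is hypothetical, and this is not how any particular library implements it.

```python
import base64
import re

# ASCII whitespace as defined by the WHATWG Infra spec: TAB, LF, FF, CR, SPACE.
_ASCII_WHITESPACE = {ord(c): None for c in "\t\n\f\r "}
_ALPHABET_ONLY = re.compile(r"[A-Za-z0-9+/]*")


def forgiving_b64decode(data: str) -> bytes:
    """Hypothetical helper sketching WHATWG forgiving-base64 decode."""
    # Step 1: remove all ASCII whitespace.
    data = data.translate(_ASCII_WHITESPACE)

    # Step 2: if the length is a multiple of 4, drop one or two trailing '='.
    if len(data) % 4 == 0:
        if data.endswith("=="):
            data = data[:-2]
        elif data.endswith("="):
            data = data[:-1]

    # Step 3: a remainder of 1 can never encode whole bytes.
    if len(data) % 4 == 1:
        raise ValueError("invalid base64 length")

    # Step 4: anything outside the base64 alphabet is an error.
    if not _ALPHABET_ONLY.fullmatch(data):
        raise ValueError("character outside the base64 alphabet")

    # Re-pad so the standard library decoder accepts the cleaned input.
    padded = data + "=" * (-len(data) % 4)
    return base64.b64decode(padded, validate=True)


print(forgiving_b64decode("Qm9 uam91cg=="))  # b'Bonjour'
```

Unlike a decoder that discards every non-alphabet character, this sketch tolerates whitespace and missing padding only; an input such as `"Qm9-uam91cg=="` raises `ValueError`.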

@martinvuyk
Contributor Author

I forgot to add the spec and what Python does, which is what we try to follow.

Python follows RFC 4648, Section 3.3:

Implementations MUST reject the encoded data if it contains
   characters outside the base alphabet when interpreting base-encoded
   data, unless the specification referring to this document explicitly
   states otherwise.  Such specifications may instead state, as MIME
   does, that characters outside the base encoding alphabet should
   simply be ignored when interpreting data

Python:

from base64 import b64decode
print(b64decode("Qm9 uam91cg=="))

output:

b'Bonjour'

From the Python docs:

If validate is False (the default), characters that are neither in the normal base-64 alphabet nor the alternative alphabet are discarded prior to the padding check. If validate is True, these non-alphabet characters in the input result in a [binascii.Error](https://docs.python.org/3/library/binascii.html#binascii.Error).
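A small sketch of the two modes described in that quote, using the same input as the bug report. It relies only on the standard library `base64.b64decode` and its `validate` parameter; the variable names are just for illustration.

```python
import base64
import binascii

encoded = "Qm9 uam91cg=="  # same input as the bug report, with an embedded space

# Default (validate=False): the space is discarded before decoding.
lenient = base64.b64decode(encoded)
print(lenient)  # b'Bonjour'

# validate=True: the space is a non-alphabet character, so decoding fails.
try:
    base64.b64decode(encoded, validate=True)
    strict_rejected = False
except binascii.Error:
    strict_rejected = True
print(strict_rejected)  # True
```

The strict mode corresponds to the "MUST reject" wording in RFC 4648, while the default corresponds to the MIME-style "ignore" behaviour.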

@lemire

lemire commented Sep 3, 2024

@martinvuyk Right. So the base64 algorithm in simdutf can solve this at high speed. It is already used in production. (It is part of WebKit/Safari and Node.js, Bun, etc.)


2 participants