[BUG] b64decode does not handle whitespaces #3446

martinvuyk · 2024-09-03T14:42:59Z

Bug description

Detected by @lemire in PR #3443

Currently, b64decode does not appear to handle whitespace characters. I would have expected the following to print 'Bonjour', but it does not:

from base64 import b64decode
def main():
    var data = b64decode("Qm9 uam91cg==")
    print(data)

output:

Bo������


System information

- What OS did you install Mojo on?
- Provide version information for Mojo by pasting the output of `mojo -v`: `mojo 2024.9.105`
- Provide Modular CLI version by pasting the output of `modular -v`


lemire · 2024-09-03T14:49:16Z

Note that I did not report it as a bug because the documentation does not seem to imply that white space is handled.

A relevant specification is WHATWG Forgiving Base64 decoding:

https://infra.spec.whatwg.org/#forgiving-base64-decode

C#/.NET follows it, as does JavaScript's atob function. Other systems may follow it as well.
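For reference, here is a rough Python sketch of the forgiving-base64 steps as I read them from the WHATWG spec linked above. The function name `forgiving_b64decode` is made up for illustration, and this is not how Mojo, .NET, or any browser implements it; it only shows the order of the checks (strip ASCII whitespace, drop trailing padding, reject bad lengths and non-alphabet characters, then decode).

```python
import base64
import re

# ASCII whitespace per the WHATWG Infra spec: tab, LF, FF, CR, space.
_ASCII_WHITESPACE = "\t\n\x0c\r "
_ALPHABET_ONLY = re.compile(r"[A-Za-z0-9+/]*")


def forgiving_b64decode(data: str) -> bytes:
    """Hypothetical helper sketching WHATWG forgiving-base64 decoding."""
    # 1. Remove all ASCII whitespace.
    data = data.translate({ord(c): None for c in _ASCII_WHITESPACE})

    # 2. If the length is a multiple of 4, drop one or two trailing '='.
    if len(data) % 4 == 0:
        if data.endswith("=="):
            data = data[:-2]
        elif data.endswith("="):
            data = data[:-1]

    # 3. A remainder of 1 can never encode whole bytes.
    if len(data) % 4 == 1:
        raise ValueError("invalid base64 length")

    # 4. Anything outside the base64 alphabet (including a stray '=') fails.
    if not _ALPHABET_ONLY.fullmatch(data):
        raise ValueError("non-alphabet character in input")

    # 5. Re-pad so the standard-library decoder accepts the input.
    padded = data + "=" * (-len(data) % 4)
    return base64.b64decode(padded)


print(forgiving_b64decode("Qm9 uam91cg=="))
```

Unlike a decoder that silently skips every non-alphabet character, this one skips only whitespace and still rejects things like `"Qm9u=am9"`.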

martinvuyk · 2024-09-03T15:01:16Z

I forgot to add the specs and what Python does, which is what we try to follow.

Python follows RFC 4648, Section 3.3:

Implementations MUST reject the encoded data if it contains
   characters outside the base alphabet when interpreting base-encoded
   data, unless the specification referring to this document explicitly
   states otherwise.  Such specifications may instead state, as MIME
   does, that characters outside the base encoding alphabet should
   simply be ignored when interpreting data

Python:

from base64 import b64decode
print(b64decode("Qm9 uam91cg=="))

output:

b'Bonjour'

From the Python docs:

If validate is False (the default), characters that are neither in the normal base-64 alphabet nor 
the alternative alphabet are discarded prior to the padding check. If validate is True, these 
non-alphabet characters in the input result in a [binascii.Error](https://docs.python.org/3/library/binascii.html#binascii.Error).
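To make the two modes concrete, here is a small Python check using the same input as in the report. It only uses the standard library `base64.b64decode` and its documented `validate` parameter; the variable names are just for illustration.

```python
import base64
import binascii

encoded = "Qm9 uam91cg=="  # same input as in the report, with a stray space

# Default (validate=False): non-alphabet characters are discarded first,
# so the space is skipped and the payload decodes normally.
lenient = base64.b64decode(encoded)
print(lenient)

# validate=True: the space is a non-alphabet character, so decoding fails.
try:
    base64.b64decode(encoded, validate=True)
    strict_error = None
except binascii.Error as exc:
    strict_error = exc
print(type(strict_error).__name__)
```

So the lenient default matches the "characters outside the alphabet are ignored" option in the RFC text, while `validate=True` corresponds to the "MUST reject" behaviour.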

lemire · 2024-09-03T15:12:08Z

@martinvuyk Right. So the base64 algorithm in simdutf can solve this at high speed. It is already used in production. (It is part of WebKit/Safari and Node.js, Bun, etc.)

martinvuyk added the bug and mojo-repo labels on Sep 3, 2024
