Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review all Encoding usage for BOM compatibility #1027

Open
1 task done
paulirwin opened this issue Nov 17, 2024 · 1 comment · May be fixed by #1075
Open
1 task done

Review all Encoding usage for BOM compatibility #1027

paulirwin opened this issue Nov 17, 2024 · 1 comment · May be fixed by #1075
Assignees
Labels
is:task A chore to be done pri:normal

Comments

@paulirwin
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Task description

Java's StandardCharsets.UTF_8 does not write a Byte-Order Mark (BOM), while .NET's System.Text.Encoding.UTF8 does include a BOM by default. We have ensured that the IOUtils.CHARSET_UTF_8 does not include a BOM to match Java, and as part of #1018 we've added an internal Support class to allow for using StandardCharsets.UTF_8, but we need to review all usage of System.Text.Encoding.UTF8 to determine if it should be replaced with StandardCharsets.UTF_8 or IOUtils.CHARSET_UTF_8 (whatever best matches the corresponding Java Lucene code) to avoid BOM issues.

@paulirwin paulirwin added the is:task A chore to be done label Nov 17, 2024
@paulirwin paulirwin added this to the 4.8.0-beta00018 milestone Nov 17, 2024
@paulirwin paulirwin mentioned this issue Nov 17, 2024
4 tasks
@paulirwin paulirwin self-assigned this Dec 24, 2024
paulirwin added a commit to paulirwin/lucene.net that referenced this issue Dec 24, 2024
@paulirwin
Copy link
Contributor Author

I have reviewed usages of Encoding (most commonly Encoding.UTF8), and determined that most usages do not need to be changed. The following cases do not result in a BOM being generated:

The following cases ignore a BOM if present, and do not fail if there is not a BOM, and thus do not need to be changed to a BOM-less Encoding:

  • Any TextReader use (such as StreamReader)
  • IOUtils.GetDecodingReader(...)
  • Encoding.UTF8.GetString(byte[])
  • FileStream with FileAccess.Read

So you'll see in the PR that the amount of changes to address BOM issues are not very many; that's because most fall into those buckets above.

@paulirwin paulirwin linked a pull request Dec 24, 2024 that will close this issue
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
is:task A chore to be done pri:normal
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant