Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

bin:encode-string - should the result have a BOM? #1751

Open
michaelhkay opened this issue Feb 2, 2025 · 4 comments · May be fixed by #1765
Open

bin:encode-string - should the result have a BOM? #1751

michaelhkay opened this issue Feb 2, 2025 · 4 comments · May be fixed by #1765
Labels
EXPath An issue related to the EXPath extension functions PR Pending A PR has been raised to resolve this issue

Comments

@michaelhkay
Copy link
Contributor

Test cases in the EXPath test suite using bin:encode-string with encoding=utf-16 include a BOM at the start of the output, but the spec says nothing about this. It's probably useful for some use case but a nuisance for others.

@michaelhkay
Copy link
Contributor Author

It's quite hard here to cater for all the possibilities. We should allow the user to specify byte order (utf-16be vs utf-16le) and we should allow them to control whether a BOM is added (it's wrong to add it if the binary value is then going to be appended to another with the same encoding). We need to think about backwards compatibility: is the test suite evidence that the expected 1.0 behaviour is to add a BOM? On decoding, similarly, we need to say what happens if a BOM is detected (but presumably it is only ever expected at offset 0, not at the offset we are reading from?). Should we actually use the BOM to detect the byte order? But that doesn't make sense unless we are reading from the start of the encoded data.

@ChristianGruen
Copy link
Contributor

ChristianGruen commented Feb 3, 2025

I was surprised to observe that Java’s CharsetEncoder.encode (on which I assume the initial implementation was built on) seems to add the BOM exlusively for UTF-16 and non-empty string input.

As the BOM inclusion is not part of the spec, I would drop it, and optionally make it available via explicit options.

@michaelhkay
Copy link
Contributor Author

I'm inclined to say:

(a) Encodings UTF-16BE and UTF-16LE should be recognised, and UTF-16 on its own should be assumed to mean UTF-16BE.

(b) On reading, a BOM if present is decoded and returned like any other character

(c) On writing, we never write a BOM unless included in the data to be written (it's easy enough to write char(0xFEFF)).

(d) We provide a function read-BOM() which examines the start of the input and if a BOM is present returns (as a map) (a) the inferred encoding of the data, and (b) the offset at which the real data starts (ie the length of the BOM in octets).

@ndw
Copy link
Contributor

ndw commented Feb 3, 2025

That seems reasonable to me. If I pass a string that explicitly begins with the BOM, I guess that's what I want encoded.

@michaelhkay michaelhkay linked a pull request Feb 5, 2025 that will close this issue
@michaelhkay michaelhkay added Blocked PR is blocked (has merge conflicts, doesn't format, etc.) PR Pending A PR has been raised to resolve this issue EXPath An issue related to the EXPath extension functions and removed Blocked PR is blocked (has merge conflicts, doesn't format, etc.) labels Feb 5, 2025
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
EXPath An issue related to the EXPath extension functions PR Pending A PR has been raised to resolve this issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants