bin:encode-string - should the result have a BOM? #1751

michaelhkay · 2025-02-02T00:28:22Z

Test cases in the EXPath test suite using bin:encode-string with encoding=utf-16 include a BOM at the start of the output, but the spec says nothing about this. It's probably useful for some use cases but a nuisance for others.

michaelhkay · 2025-02-03T09:36:54Z

It's quite hard here to cater for all the possibilities. We should allow the user to specify byte order (utf-16be vs utf-16le) and we should allow them to control whether a BOM is added (it's wrong to add it if the binary value is then going to be appended to another with the same encoding). We need to think about backwards compatibility: is the test suite evidence that the expected 1.0 behaviour is to add a BOM? On decoding, similarly, we need to say what happens if a BOM is detected (but presumably it is only ever expected at offset 0, not at the offset we are reading from?). Should we actually use the BOM to detect the byte order? But that doesn't make sense unless we are reading from the start of the encoded data.
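To illustrate the concatenation concern, here is a minimal Python sketch. It uses Python's own codecs rather than bin:encode-string, on the assumption that an implementation which always prepends a BOM would behave similarly: Python's generic "utf-16" codec writes a BOM, while "utf-16-be" does not.

```python
# Each call to the generic "utf-16" codec prepends its own BOM.
first = "ab".encode("utf-16")
second = "cd".encode("utf-16")

# Appending the two values leaves a second BOM in the middle of the data.
joined = first + second
decoded = joined.decode("utf-16")

# Only the leading BOM is consumed; the inner one survives as U+FEFF.
print(repr(decoded))

# With an explicit byte order no BOM is written, so appending is safe.
safe = "ab".encode("utf-16-be") + "cd".encode("utf-16-be")
print(repr(safe.decode("utf-16-be")))
```

The first result contains a stray U+FEFF between "ab" and "cd"; the second round-trips cleanly.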

ChristianGruen · 2025-02-03T10:07:27Z

I was surprised to observe that Java’s CharsetEncoder.encode (on which I assume the initial implementation was built) seems to add the BOM exclusively for UTF-16 and non-empty string input.

As the BOM inclusion is not part of the spec, I would drop it, and optionally make it available via explicit options.

michaelhkay · 2025-02-03T17:54:04Z

I'm inclined to say:

(a) Encodings UTF-16BE and UTF-16LE should be recognised, and UTF-16 on its own should be assumed to mean UTF-16BE.

(b) On reading, a BOM, if present, is decoded and returned like any other character.

(c) On writing, we never write a BOM unless included in the data to be written (it's easy enough to write char(0xFEFF)).

(d) We provide a function read-BOM() which examines the start of the input and, if a BOM is present, returns (as a map) (i) the inferred encoding of the data, and (ii) the offset at which the real data starts (i.e. the length of the BOM in octets).
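Rules (b) to (d) could be sketched in Python roughly as follows. This is an illustration under assumptions, not the proposed specification: the name `read_bom`, the map keys `encoding` and `offset`, the inclusion of the UTF-8 BOM, and returning `None` when no BOM is found are all hypothetical choices that the proposal above does not settle.

```python
import codecs

# Known BOM signatures mapped to the encoding they imply.
# Covering UTF-8 as well as UTF-16 is an assumption of this sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def read_bom(data: bytes):
    """Hypothetical analogue of read-BOM(): inspect the start of the input.

    Returns a dict with the inferred encoding and the offset (in octets)
    at which the real data starts, or None if no BOM is present.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return {"encoding": encoding, "offset": len(bom)}
    return None


# Rule (c): an explicit byte order never writes a BOM unless the
# string itself starts with U+FEFF.
with_bom = "\ufeffa".encode("utf-16-be")
without_bom = "a".encode("utf-16-be")

# Rule (b): on reading with an explicit byte order, a BOM is returned
# like any other character.
round_trip = with_bom.decode("utf-16-be")

# Rule (d): the caller inspects the BOM and skips it if desired.
info = read_bom(with_bom)
payload = with_bom[info["offset"]:].decode("utf-16-be")
```

Keeping BOM detection in a separate function means the encode and decode paths stay free of special cases, and callers reading from a non-zero offset are never affected by it.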

ndw · 2025-02-03T18:01:13Z

That seems reasonable to me. If I pass a string that explicitly begins with the BOM, I guess that's what I want encoded.

michaelhkay linked a pull request on Feb 5, 2025 that will close this issue: "1751 Clarify BOM handling" #1765 (Open)

michaelhkay added the Blocked, PR Pending, and EXPath labels and removed the Blocked label on Feb 5, 2025
