Skip to content

onDataString can mangle multi-byte characters #37

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
nwolverson opened this issue Nov 25, 2021 · 1 comment · Fixed by #51
Closed

onDataString can mangle multi-byte characters #37

nwolverson opened this issue Nov 25, 2021 · 1 comment · Fixed by #51
Labels
type: documentation Improvements or additions to documentation.

Comments

@nwolverson
Copy link
Contributor

nwolverson commented Nov 25, 2021

onDataString is easy to use but is unfortunately a subtle foot-gun.

This function calls the underlying node API returning a buffer, and then calls Buffer.toString on the result.

But if a character would span two data events, each Buffer.toString will replace the initial/trailling code units with the replacement character.

As it stands, onDataString should be documented with a clear warning, it is not suitable for general purpose use but only streams that are either guaranteed to have single-byte encoded characters/known short lengths.

The comment on setEncoding unfortunately recommends this:

"Where possible, you should try to use onDataString instead of this function."

I think if this recommended onData, it might be more clear to the user that something fishy is going on.

nwolverson added a commit to nwolverson/purescript-language-server that referenced this issue Nov 26, 2021
A character that is multiple UTF-8 code units that got split across multiple 'data' events
was converted into replacement characters due to Buffer.toString being called on the chunks
individually. Instead collect all chunks and call toString after joining them.

purescript-node/purescript-node-streams#37
@JordanMartinez JordanMartinez added the type: documentation Improvements or additions to documentation. label Dec 7, 2021
@jamesdbrock
Copy link
Member

If the stream has an end, you can read all the stream data and then convert to a String with https://pursuit.purescript.org/packages/purescript-node-streams-aff/1.1.0/docs/Node.Stream.Aff#v:readAll

If you know how many bytes long the string will be before you read it, then you can wait until all the bytes come in with https://pursuit.purescript.org/packages/purescript-node-streams-aff/1.1.0/docs/Node.Stream.Aff#v:readN

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
type: documentation Improvements or additions to documentation.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants