What's the unit of character in Point #21
Comments
We’re sticking to JS here. What’s confusing? Is it why column is offset + 1 when we’re dealing with one line? That’s because column is 1-based whereas offset is 0-based.
Well, I know that. I'm not talking about this.
I think if you are sticking to JS, at least you could mention that 'char' refers to a UTF-16 code unit in the unist spec. And I'm not sticking to JS; I'm writing a reimplementation of remark in Rust because I'm not satisfied with its performance, so I need to know more precisely.
@titansnow I don’t know what you know. I’m spending my free fun time helping out people I’ve never met online. You mentioned it + being confused, so I put one and one together.
Great idea. Could you create a PR for that?
Again, I have no clue what you’re up to. It sounds like you feel I wronged you somehow by creating free software?
Sorry, I didn't mean to irritate you.
@titansnow Thanks, I’m sorry I got irritated. I’m sure you meant no harm! I’m interested in tackling this, let me know if you have any ideas on how to spec this properly!
Well, I found some examples in other specs. e.g. CommonMark Spec:
e.g. HTML Spec:
So in unist spec, I think it can be like:
Done!
The Point section says:
What's the unit of 'character' and 'column'? Is it UTF-16 code unit (used in JavaScript) or Unicode code point? See Wikipedia:
I tried using remark to parse this markdown piece:
Here, 𠮷 is one Unicode code point that cannot be encoded in a single UTF-16 code unit. In JavaScript, because String uses UTF-16, its length is counted in code units, so:
But in other languages like Python:
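The difference between the two units can be sketched as follows. This is a minimal illustration, not the snippet originally posted in the issue; it assumes only Python's built-in `len` and `str.encode`, and derives the UTF-16 count from the encoded byte length.

```python
# Compare the two candidate units for a character outside the
# Basic Multilingual Plane.
text = "𠮷"  # U+20BB7

# Python's len() counts Unicode code points.
code_points = len(text)

# Every UTF-16 code unit is 2 bytes, so the encoded byte length divided
# by 2 is the number of code units (what JavaScript's String length reports).
utf16_units = len(text.encode("utf-16-le")) // 2

print(code_points)  # 1
print(utf16_units)  # 2
```

The same character is therefore one "character" when counted in code points and two when counted in UTF-16 code units.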
As for remark, the above markdown piece is parsed into:
The column of end is 5, while the offset of end is 4. That means remark treats this text as four 'chars' long, measured in UTF-16 code units.
So what's the unit of character? It's confusing.