What's the unit of character in Point #21

Closed
ghost opened this issue Nov 8, 2018 · 7 comments
Labels
📚 area/docs This affects documentation 💪 phase/solved Post is done 🦋 type/enhancement This is great to have

Comments

@ghost

ghost commented Nov 8, 2018

The Point section says:

The line field (1-indexed integer) represents a line in a source file. The column field (1-indexed integer) represents a column in a source file. The offset field (0-indexed integer) represents a character in a source file.

What's the unit of 'character' and 'column'? Is it a UTF-16 code unit (as used in JavaScript) or a Unicode code point? See Wikipedia:

[UTF-16] encoding is variable-length, as code points are encoded with one or two 16-bit code units

I tried using remark to parse this markdown piece:

a𠮷b

Here, 𠮷 is a single Unicode code point that cannot be encoded in a single UTF-16 code unit. Because a JavaScript String is a sequence of UTF-16 code units:

'a𠮷b'.length
//=> 4

But in other languages like Python:

len('a𠮷b')
#=> 3
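To make the difference concrete, here is a minimal sketch that measures the same string in three units. The helper name `lengths` is made up for illustration and is not part of unist or remark.

```python
def lengths(text: str) -> dict:
    """Measure one string in three different units."""
    return {
        # Python's len() counts Unicode code points.
        "code_points": len(text),
        # UTF-16 uses 2 bytes per code unit; "-le" avoids a byte order mark.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(lengths("a𠮷b"))
# {'code_points': 3, 'utf16_code_units': 4, 'utf8_bytes': 6}
```

The UTF-16 figure matches what JavaScript reports for `.length`, and the code point figure matches Python's `len`.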

As for remark, the above markdown piece is parsed into:

{
  "type": "text",
  "value": "a𠮷b",
  "position": {
    "start": {
      "line": 1,
      "column": 1,
      "offset": 0
    },
    "end": {
      "line": 1,
      "column": 5,
      "offset": 4
    },
    "indent": []
  }
}

The end column is 5 and the end offset is 4, which means remark treats this text as four 'chars' long, measured in UTF-16 code units.
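For an implementation in a language whose strings are not UTF-16, the same end point can be reproduced by counting code units explicitly. This is only a sketch under my own assumptions: `utf16_len` and `end_point` are hypothetical helpers, lines are split on "\n" only, and the input contains no lone surrogates.

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


def end_point(text: str) -> dict:
    """Point just past the last character, counted in UTF-16 code units."""
    lines = text.split("\n")
    return {
        "line": len(lines),                  # 1-indexed
        "column": utf16_len(lines[-1]) + 1,  # 1-indexed, within the last line
        "offset": utf16_len(text),           # 0-indexed, over the whole input
    }


print(end_point("a𠮷b"))
# {'line': 1, 'column': 5, 'offset': 4}
```

This agrees with the end position in the remark output above.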

So what's the unit of a character? It's confusing.

@wooorm
Member

wooorm commented Nov 8, 2018

We’re sticking to JS here. What’s confusing?

The reason for column being offset + 1, if we’re dealing with one line, is because column is 1-based whereas offset is 0-based?

@ghost
Author

ghost commented Nov 8, 2018

The reason for column being offset + 1, if we’re dealing with one line, is because column is 1-based whereas offset is 0-based?

Well, I know that. I'm not talking about this.

We’re sticking to JS here. What’s confusing?

I think if you are sticking to JS, at least you could mention that 'char' refers to UTF-16 code unit in the unist spec. And I'm not sticking to JS. I'm writing a reimplementation of remark in Rust because I'm not satisfied with its performance, so I need to know this more precisely.

@wooorm
Member

wooorm commented Nov 8, 2018

@titansnow I don’t know what you know. I’m spending my free fun time helping out people I’ve never met online. You mentioned it + being confused, so I put one and one together.

I think if you are sticking to JS, at least you could mention that 'char' refers to UTF-16 code unit in the unist spec

Great idea. Could you create a PR for that?

And I'm not sticking to JS. I'm writing a reimplementation of remark in Rust because I'm not satisfied with its performance, so I need to know this more precisely.

Again, I have no clue what you’re up to. It sounds like you feel I wronged you somehow by creating free software?
Feel free to use the remark library as inspiration. The license says you can do so. But do note the last paragraph:

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

@ghost
Author

ghost commented Nov 9, 2018

Sorry, I didn't mean to irritate you.

@wooorm
Member

wooorm commented Nov 10, 2018

@titansnow Thanks, I’m sorry I got irritated. I’m sure you meant no harm!

I’m interested in tackling this, let me know if you have any ideas on how to spec this properly!

@ghost
Author

ghost commented Nov 11, 2018

Well, I found some examples in other specs.

e.g. CommonMark Spec:

A character is a Unicode code point. Although some code points (for example, combining accents) do not correspond to characters in an intuitive sense, all code points count as characters for purposes of this spec.

e.g. HTML Spec:

The term code unit is used as defined in the Web IDL specification: a 16 bit unsigned integer, the smallest atomic component of a DOMString. (This is a narrower definition than the one used in Unicode, and is not the same as a code point.)
The term Unicode code point means a Unicode scalar value where possible, and an isolated surrogate code point when not. When a conformance requirement is defined in terms of characters or Unicode code points, a pair of code units consisting of a high surrogate followed by a low surrogate must be treated as the single code point represented by the surrogate pair, but isolated surrogates must each be treated as the single code point with the value of the surrogate.
In this specification, the term character, when not qualified as Unicode character, is synonymous with the term Unicode code point.
The term Unicode character is used to mean a Unicode scalar value (i.e. any Unicode code point that is not a surrogate code point).
The code-unit length of a string is the number of code units in that string.
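The surrogate pair rule quoted above can be sketched as a small decoder: a high surrogate followed by a low surrogate becomes one code point, and an isolated surrogate is kept as a code point of its own. The function name `code_points` is hypothetical, and the input is assumed to be a list of 16-bit integers.

```python
def code_points(units: list) -> list:
    """Group UTF-16 code units into code points, keeping isolated surrogates."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the pair into one supplementary-plane code point.
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an isolated surrogate treated as itself.
            result.append(unit)
            i += 1
    return result


# "a𠮷b" as UTF-16 code units: 4 units in, 3 code points out.
print([hex(cp) for cp in code_points([0x61, 0xD842, 0xDFB7, 0x62])])
# ['0x61', '0x20bb7', '0x62']
```

The code-unit length is `len(units)` and the code point length is `len(code_points(units))`, which is exactly the 4 versus 3 discrepancy from the first comment.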

So for the unist spec, I think it could read something like:

The term character means a (UTF-16) code unit which is defined in the Web IDL specification.

@wooorm wooorm closed this as completed in 49032b9 Nov 18, 2018
@wooorm
Member

wooorm commented Nov 18, 2018

Done!

@wooorm wooorm added ⛵️ status/released 📚 area/docs This affects documentation 🦋 type/enhancement This is great to have labels Aug 12, 2019
@wooorm wooorm added the 💪 phase/solved Post is done label Apr 12, 2021