Character and string token definitions need updating. #626

Open
5 of 6 tasks
ehuss opened this issue Jun 26, 2019 · 4 comments
Labels
A-lexer Area: Lexical specification

Comments

@ehuss
Contributor

ehuss commented Jun 26, 2019

There are multiple issues here. Some of this has changed in 1.37 via rust-lang/rust#60793.

  • RAW_BYTE_STRING_LITERAL no longer allows bare CR (new in 1.37). Input format #1459

  • The "Raw string" and "raw byte string" descriptions need to be updated to say that CRLF is converted to LF (new in 1.37). Input format #1459

  • Several tokens need to sync the English text with the "Lexer" definition.

    • STRING_LITERAL indicates several rules (for example, isolated CRs are not allowed), but the text does not mention any of those restrictions.
    • CHAR_LITERAL says "single Unicode character…except U+0027", which is incomplete.
    • RAW_STRING_LITERAL does not allow bare CRs.
    • BYTE_LITERAL escapes are not described.
    • BYTE_STRING_LITERAL restrictions are not described.
    • In general, just make sure they are all in sync!
  • Typo in the RAW_BYTE_STRING_CONTENT rule: it points to RAW_STRING_CONTENT when it should point to RAW_BYTE_STRING_CONTENT. Fixes minor errors #818

  • I cannot find anywhere that mentions CRLF in a string is converted to LF. Am I blind? Input format #1459

  • The description for string continuations says "\ immediately before U+000A", but the backslash can also come before CRLF. How should this be handled? I haven't looked at how it is implemented, but are all CRLFs translated everywhere? Should there just be a blanket statement somewhere about this, to avoid having to discuss it in every string literal definition? Input format #1459

I may be missing some things here. Everything needs a very thorough review to make sure it is correct and up to date with the changes from 60793.
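
The CRLF and string-continuation points above can be illustrated with a small sketch. The helpers `normalize_crlf` and `has_bare_cr` are hypothetical names made up for this example; they model the behaviour described in the list and are not rustc's implementation.

```rust
/// Hypothetical model of the normalization discussed above:
/// every CRLF pair is replaced by a single LF.
fn normalize_crlf(src: &str) -> String {
    src.replace("\r\n", "\n")
}

/// Returns true if `src` contains a CR that is not part of a CRLF pair
/// (a "bare" or "isolated" CR). Hypothetical helper for illustration.
fn has_bare_cr(src: &str) -> bool {
    normalize_crlf(src).contains('\r')
}

fn main() {
    // CRLF collapses to LF, so no CR is left behind.
    assert_eq!(normalize_crlf("a\r\nb"), "a\nb");
    assert!(!has_bare_cr("a\r\nb"));

    // A CR without a following LF survives normalization.
    assert!(has_bare_cr("a\rb"));

    // String continuation: a backslash at the end of the line drops the
    // line break and the leading whitespace of the next line.
    let joined = "foo\
                  bar";
    assert_eq!(joined, "foobar");
}
```

If normalization happens before tokenization, a single blanket statement would cover every literal kind, which is the option the last bullet asks about.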

@ehuss ehuss added the A-lexer Area: Lexical specification label Jun 26, 2019
@ehuss
Contributor Author

ehuss commented Jul 22, 2019

See also rust-lang/rust#62865

@mattheww
Contributor

rust-lang/rust#118699 (comment) should be helpful.

@mattheww
Contributor

mattheww commented Jan 22, 2024

The current description says that forms like 'a'b are acceptable as a CHAR_LITERAL with a suffix, but in fact they're rejected (to avoid confusion with two LIFETIME_LABEL tokens).

The current description says that forms like 'ab'c are acceptable as two LIFETIME_LABEL tokens, but in fact they're rejected ("character literal may only contain one codepoint"; the c is taken as a suffix).

Perhaps this could be documented via another reserved form.
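
For contrast, the well-formed neighbours of those forms can be checked with a token-counting macro. This is only a sketch: `count_tts` is a hypothetical helper written for this example, and the rejected forms ('a'b and 'ab'c) are left out because including them would stop the block from compiling.

```rust
/// Hypothetical helper: counts the token trees passed to it.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($tail:tt)*) => { 1usize + count_tts!($($tail)*) };
}

fn main() {
    // Two lifetime-or-label tokens separated by whitespace.
    assert_eq!(count_tts!('a 'b), 2);

    // Two character literals.
    assert_eq!(count_tts!('a' 'b'), 2);

    // A lifetime-or-label token followed by a character literal.
    assert_eq!(count_tts!('a 'b'), 2);
}
```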

@mattheww
Contributor

A form like b"\u{00a0}" is rejected at lexing time ("unicode escape in byte string").

But as it doesn't match either BYTE_STRING_LITERAL or RESERVED_TOKEN_DOUBLE_QUOTE, the current description says there's a valid tokenisation as the identifier b followed by "\u{00a0}".

So if we keep on with the current mechanism for documenting such rejected tokens, I think we'd need yet more reserved forms.

There are probably other similar cases. I think after rust-lang/rust#119172 a C string literal containing a NUL is one.
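
As a point of comparison, the accepted ways of writing those bytes look like this. A minimal sketch; the function names `nbsp_from_str` and `nbsp_from_byte_string` are hypothetical, and the rejected form b"\u{00a0}" is deliberately not included because it would not compile.

```rust
/// U+00A0 written with a Unicode escape in an ordinary string literal;
/// the resulting bytes are its UTF-8 encoding.
fn nbsp_from_str() -> &'static [u8] {
    "\u{00a0}".as_bytes()
}

/// The same bytes spelled out with byte escapes in a byte string literal,
/// where Unicode escapes are not accepted.
fn nbsp_from_byte_string() -> &'static [u8] {
    b"\xc2\xa0"
}

fn main() {
    // U+00A0 is encoded in UTF-8 as the two bytes 0xC2 0xA0.
    assert_eq!(nbsp_from_str(), &[0xC2u8, 0xA0][..]);
    assert_eq!(nbsp_from_str(), nbsp_from_byte_string());

    // A single raw byte 0xA0 is also fine in a byte string,
    // although it is not valid UTF-8 on its own.
    let single: &[u8] = b"\xa0";
    assert!(std::str::from_utf8(single).is_err());
}
```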

@mattheww mattheww mentioned this issue Jan 28, 2024