[WIP] Literal values #50

Open
LPeter1997 opened this issue Jun 26, 2022 · 4 comments
Labels
Design document This one came out from an idea but considers many cases and tries to prove the usability Syntax This issue is about syntax

Comments

@LPeter1997
Member

LPeter1997 commented Jun 26, 2022

Important: Parts of this proposal depend on what we end up deciding in the type inference issue (#42). If we decide that literals always have a fixed type, then we can introduce the usual suffixes for literals. I'm personally not a fan of those, so for now, this proposal assumes that we can agree on the type of a literal being determined during inference.

Integer literals

  • Decimal integers would match the regex [0-9]+. Examples: 0, 123, 9625
  • Hexadecimal integers would match the regex 0x[0-9a-fA-F]+. Examples: 0x0, 0xbadc0fee, 0x2f5a
  • Binary integers would match the regex 0b[01]+. Examples: 0b0, 0b011101

We could introduce a separator character for large constants to make them more readable. Some languages use _ for this. The only rule would be that _ can't take the place of the first digit (after the 0x or 0b prefix, if there is one). Examples: 12_000_000_000, 0xffff_0000, 0b1100_0000_0101_1110
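To make the integer rules concrete, here is a minimal Python sketch of a recognizer for the three forms with the _ separator. The names INT_PATTERNS and parse_int_literal are hypothetical, and the patterns are my reading of the "first digit can't be _" rule; whether a trailing _ should be rejected is not decided above, so this sketch allows it.

```python
import re

# Hypothetical patterns: the first position after the optional prefix must be
# a real digit, every later position may be a digit or the "_" separator.
INT_PATTERNS = [
    (re.compile(r"0x([0-9a-fA-F][0-9a-fA-F_]*)"), 16),
    (re.compile(r"0b([01][01_]*)"), 2),
    (re.compile(r"([0-9][0-9_]*)"), 10),
]

def parse_int_literal(text):
    """Return the integer value of `text`, or None if it is not a valid literal."""
    for pattern, base in INT_PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            # Separators carry no value, so drop them before converting.
            return int(match.group(1).replace("_", ""), base)
    return None
```

With these patterns, 12_000_000_000 and 0xffff_0000 are accepted, while _12 and 0x_ff are rejected.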

Boolean literals

The keywords true and false.

Floating-point literals

They would have two forms, the normal decimal-separated form and a scientific form.

  • Decimal-separated form would match the regex [0-9]+\.[0-9]+. Examples: 0.0, 0.123, 25.0, 62.73. Note that omitting either side of the decimal point is intentionally not allowed, so neither .5 nor 5. is a valid literal.
  • Scientific notation form would match the regex [0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+. Examples: 10E3, 0.1e+4, 123.345E-12
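
The two regexes above can be checked directly. This is a small Python sketch that uses them verbatim; the names DECIMAL_FLOAT, SCIENTIFIC_FLOAT and is_float_literal are made up for illustration.

```python
import re

# The two forms exactly as written in the proposal.
DECIMAL_FLOAT = re.compile(r"[0-9]+\.[0-9]+")
SCIENTIFIC_FLOAT = re.compile(r"[0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+")

def is_float_literal(text):
    """True if the whole of `text` matches one of the two floating-point forms."""
    return bool(DECIMAL_FLOAT.fullmatch(text) or SCIENTIFIC_FLOAT.fullmatch(text))
```

Under these patterns 0.0, 10E3 and 123.345E-12 are floating-point literals, while .5, 5. and a plain 10 are not.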

Escape sequences

Escaping would use the usual \. The escape sequences would be:

  • \': Just a '. It does not have to be escaped in a string literal, but simplifies code-generation for the users. Since it's otherwise meaningless, it's essentially no effort to allow it in string literals. (inspired by C#)
  • \": Just a ". It does not have to be escaped in a character literal, but simplifies code-generation for the users. Since it's otherwise meaningless, it's essentially no effort to allow it in character literals. (inspired by C#)
  • \\: Escapes the \ to literally mean a \.
  • \[0abfnrtv]: Same as in every C-like programming language (reference)
  • TODO: How do we want Unicode escape sequences?
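
As a rough illustration of the list above, here is a Python sketch that decodes the simple escapes in the text between the quotes of a literal. SIMPLE_ESCAPES and unescape are hypothetical names, Unicode escapes are left out because they are still a TODO, and raising an error on an unknown escape is my assumption, not something the proposal states.

```python
# Escape character (the one after the backslash) mapped to its meaning.
SIMPLE_ESCAPES = {
    "'": "'", '"': '"', "\\": "\\",
    "0": "\0", "a": "\a", "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}

def unescape(body):
    """Decode the simple escapes in `body`, the text between the quotes."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            # Ordinary character: copy it through unchanged.
            out.append(ch)
            i += 1
            continue
        # Assumption: a dangling backslash or an unknown escape is an error.
        if i + 1 >= len(body) or body[i + 1] not in SIMPLE_ESCAPES:
            raise ValueError(f"invalid escape sequence at index {i}")
        out.append(SIMPLE_ESCAPES[body[i + 1]])
        i += 2
    return "".join(out)
```

The same table serves both character and string literals, which is what makes allowing \' and \" in either kind of literal nearly free.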

Character literals

They are enclosed in single-quotes ('), like in C#. Any visible character can be inside (no control characters), or an escape sequence.

String literals

They are enclosed in double-quotes ("), like in C#. Any visible character can be inside (no control characters), or an escape sequence.

Verbatim strings and string interpolation are not yet specified; they will come in a later issue. For now, I believe the default strings should allow string interpolation, with no need for a separate annotation.

The issue for string interpolation is #53.

@WhiteBlackGoose
Member

WhiteBlackGoose commented Jul 3, 2022

Regarding Unicode escape sequences, here's what C# has (from that link):
[image]

So it uses \u for UTF-16, \U for UTF-32, and \x for variable-length escapes, which have ambiguity problems. Let us see how we can do it universally.

Prefix

I suggest \u8, \u16, \u32 for UTF-8, UTF-16, UTF-32 respectively.

Option 1: stick the numbers right after, e.g.

\u1699...

is UTF-16 of code 99...

Option 2: separate it with something

\u16_99...

It makes it more readable, but longer.

Avoiding ambiguity for encoding

Should it be variable length, or exactly 2, 4, 8 digits for those encodings?

Option 1: fixed lengths

\u8HH
\u16HHHH
\u32HHHHHHHH

Option 2: terminating symbol

E.g. u:

\u8H*u
\u16H*u
\u32H*u

Examples

Let's go over combinations

Opt 1 & Opt 1

\u870 = p
\u16AAAA = ꪪ
\u320001F47D = 👽

Opt 1 & Opt 2

\u870u = p
\u16AAAAu = ꪪ
\u321F47Du = 👽

Opt 2 & Opt 1

\u8_70 = p
\u16_AAAA = ꪪ
\u32_0001F47D = 👽

Opt 2 & Opt 2

\u8_70u = p
\u16_AAAAu = ꪪ
\u32_1F47Du = 👽

@LPeter1997
Member Author

My question is: do we want different escapes per encoding? Wouldn't a single Unicode code point escape suffice? C++ has 4-digit and 8-digit Unicode escapes (\uXXXX and \UXXXXXXXX), unrelated to any kind of encoding. With that in mind, a single escape like \u{[0-9a-fA-F]+} could work:

\u{70} = p
\u{AAAA} = ꪪ
\u{1F47D} = 👽
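
A minimal Python sketch of how such a code point escape could be decoded. The names UNICODE_ESCAPE and decode_unicode_escapes are hypothetical, and rejecting values above U+10FFFF is my assumption; what to do with surrogate code points is not decided in this thread, so the sketch does not handle them specially.

```python
import re

# Matches \u{H+} where H is a hexadecimal digit.
UNICODE_ESCAPE = re.compile(r"\\u\{([0-9a-fA-F]+)\}")

def decode_unicode_escapes(body):
    """Replace every \\u{...} escape in `body` with the code point it names."""
    def replace(match):
        code_point = int(match.group(1), 16)
        # Assumption: anything past the last Unicode code point is an error.
        if code_point > 0x10FFFF:
            raise ValueError(f"code point out of range: {match.group(0)}")
        return chr(code_point)
    return UNICODE_ESCAPE.sub(replace, body)
```

Because the braces delimit the digits, there is no need for a fixed digit count or a terminating symbol.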

@svick

svick commented Jul 3, 2022

How do escape sequences handle characters that require multiple code units in a given encoding?

For example, consider U+20AC Euro Sign. Could I specify it as both "\u16_20AC" and "\u8_E2\u8_82\u8_AC"? And would the value of "\u8_E2\u8_82\u8_AC".Length be 1? (Assuming that the natural type of a string literal is the UTF-16 System.String.)

@LPeter1997
Member Author

We have discussed that on the server (and we'll hopefully document the results soon). So far we have settled on the \u{...} idea so that we don't have to mess with encodings; we only specify code points there. The encoding will depend on what the escape sequence is embedded in. For string literals, it would depend on which encoding we use for strings.
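
To illustrate why the surrounding literal decides the encoding, here is a small Python sketch that counts how many code units one code point occupies in each encoding. The function name code_unit_counts is made up for this example, and which encoding strings will actually use is still open.

```python
def code_unit_counts(code_point):
    """Number of code units needed for `code_point` in UTF-8, UTF-16 and UTF-32."""
    ch = chr(code_point)
    return {
        "utf-8": len(ch.encode("utf-8")),           # 1-byte code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 2-byte code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 4-byte code units
    }
```

For the Euro sign from the question above, code_unit_counts(0x20AC) gives 3 for UTF-8 and 1 for UTF-16, so a string holding only \u{20AC} would have a length of 1 if strings end up being UTF-16.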
