From b01d508c3ee4548710f660415f5497450289e976 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Kat=20March=C3=A1n?= Date: Sun, 19 Jan 2025 14:30:15 -0800 Subject: [PATCH] =?UTF-8?q?Create=20spec-text=20for=20=E2=80=9Cclassic?= =?UTF-8?q?=E2=80=9D=20RFC=20experience?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/spec-text | 1512 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1512 insertions(+) create mode 100644 src/spec-text diff --git a/src/spec-text b/src/spec-text new file mode 100644 index 0000000..3709f66 --- /dev/null +++ b/src/spec-text @@ -0,0 +1,1512 @@ + + + + +KDL Community K. Marchán + Microsoft + KDL Contributors + 19 January 2025 + + + The KDL Document Language + draft-marchan-kdl2-latest + +Abstract + + KDL is a node-oriented document language. Its niche and purpose + overlap with XML, as do many of its semantics. You can use KDL + both as a configuration language and as a data exchange or storage + format, if you so choose. + + This is the formal specification for KDL, including the intended data + model and the grammar. + + This document describes KDL version 2.0.0. It was released on + 2024-12-21. It is the latest stable version of the language, and + will only be edited for minor copyedits or major errata. + +About This Document + + This note is to be removed before publishing as an RFC. + + Status information for this document may be found at + https://datatracker.ietf.org/doc/draft-marchan-kdl2/. + + Additional information can be found at https://kdl.dev/. + + Source for this draft and an issue tracker can be found at + https://github.com/kdl-org/kdl. + +License + + This work is licensed under Creative Commons Attribution-ShareAlike + 4.0 International. To view a copy of this license, visit + https://creativecommons.org/licenses/by-sa/4.0/ + +Table of Contents + + 1. Compatibility . . . . . . . . . . . . . . . . . . . . . . . . 3 + 2. Introduction . . . . . . . . . . . . . . . . . . . 
. . . . . 3 + 3. Components . . . . . . . . . . . . . . . . . . . . . . . . . 3 + 3.1. Document . . . . . . . . . . . . . . . . . . . . . . . . 3 + + + + +Marchán & KDL Contributors Experimental [Page 1] + + KDL January 2025 + + + 3.1.1. Example . . . . . . . . . . . . . . . . . . . . . . . 4 + 3.2. Node . . . . . . . . . . . . . . . . . . . . . . . . . . 4 + 3.2.1. Example . . . . . . . . . . . . . . . . . . . . . . . 5 + 3.3. Line Continuation . . . . . . . . . . . . . . . . . . . . 5 + 3.3.1. Example . . . . . . . . . . . . . . . . . . . . . . . 5 + 3.4. Property . . . . . . . . . . . . . . . . . . . . . . . . 5 + 3.5. Argument . . . . . . . . . . . . . . . . . . . . . . . . 6 + 3.5.1. Example . . . . . . . . . . . . . . . . . . . . . . . 6 + 3.6. Children Block . . . . . . . . . . . . . . . . . . . . . 6 + 3.6.1. Example . . . . . . . . . . . . . . . . . . . . . . . 6 + 3.7. Value . . . . . . . . . . . . . . . . . . . . . . . . . . 6 + 3.8. Type Annotation . . . . . . . . . . . . . . . . . . . . . 7 + 3.8.1. Reserved Type Annotations for Numbers Without + Decimals: . . . . . . . . . . . . . . . . . . . . . . 7 + 3.8.2. Reserved Type Annotations for Numbers With + Decimals: . . . . . . . . . . . . . . . . . . . . . . 8 + 3.8.3. Reserved Type Annotations for Strings: . . . . . . . 8 + 3.8.4. Examples . . . . . . . . . . . . . . . . . . . . . . 9 + 3.9. String . . . . . . . . . . . . . . . . . . . . . . . . . 9 + 3.10. Identifier String . . . . . . . . . . . . . . . . . . . . 10 + 3.10.1. Non-initial characters . . . . . . . . . . . . . . . 10 + 3.10.2. Non-identifier characters . . . . . . . . . . . . . 11 + 3.11. Quoted String . . . . . . . . . . . . . . . . . . . . . . 11 + 3.11.1. Escapes . . . . . . . . . . . . . . . . . . . . . . 11 + 3.12. Multi-line String . . . . . . . . . . . . . . . . . . . . 13 + 3.12.1. Newline Normalization . . . . . . . . . . . . . . . 14 + 3.12.2. Examples . . . . . . . . . . . . . . . . . . . . . . 14 + 3.12.3. 
Interaction with Whitespace Escapes . . . . . . . . 16 + 3.13. Raw String . . . . . . . . . . . . . . . . . . . . . . . 17 + 3.13.1. Example . . . . . . . . . . . . . . . . . . . . . . 17 + 3.14. Number . . . . . . . . . . . . . . . . . . . . . . . . . 18 + 3.14.1. Keyword Numbers . . . . . . . . . . . . . . . . . . 19 + 3.15. Boolean . . . . . . . . . . . . . . . . . . . . . . . . . 19 + 3.15.1. Example . . . . . . . . . . . . . . . . . . . . . . 19 + 3.16. Null . . . . . . . . . . . . . . . . . . . . . . . . . . 20 + 3.16.1. Example . . . . . . . . . . . . . . . . . . . . . . 20 + 3.17. Whitespace . . . . . . . . . . . . . . . . . . . . . . . 20 + 3.17.1. Single-line comments . . . . . . . . . . . . . . . . 21 + 3.17.2. Multi-line comments . . . . . . . . . . . . . . . . 22 + 3.17.3. Slashdash comments . . . . . . . . . . . . . . . . . 22 + 3.18. Newline . . . . . . . . . . . . . . . . . . . . . . . . . 22 + 3.19. Disallowed Literal Code Points . . . . . . . . . . . . . 23 + 4. Full Grammar . . . . . . . . . . . . . . . . . . . . . . . . 24 + 4.1. Grammar language . . . . . . . . . . . . . . . . . . . . 26 + Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 27 + + + + + + +Marchán & KDL Contributors Experimental [Page 2] + + KDL January 2025 + + +1. Compatibility + + KDL 2.0 is designed such that for any given KDL document written as + KDL 1.0 (./SPEC_v1.md) or KDL 2.0, the parse will either fail + completely, or, if the parse succeeds, the data represented by a v1 + or v2 parser will be identical. This means that it's safe to use a + fallback parsing strategy in order to support both v1 and v2 + simultaneously. For example, node "foo" is a valid node in both + versions, and should be represented identically by parsers. + + A version marker /- kdl-version 2 (or 1) _MAY_ be added to the + beginning of a KDL document, optionally preceded by the BOM, and + parsers _MAY_ use that as a hint as to which version to parse the + document as. + +2. 
Introduction + + KDL is a node-oriented document language. Its niche and purpose + overlap with XML, as do many of its semantics. You can use KDL + both as a configuration language and as a data exchange or storage + format, if you so choose. + + The bulk of this document is dedicated to a long-form description of + all Components (Section 3) of a KDL document. There is also a much + terser Grammar (Section 4) at the end of the document that covers + most of the rules, with some semantic exceptions involving the data + model. + + KDL is designed to be easy to read _and_ easy to implement. + + In this document, references to "left" or "right" refer to directions + in the _data stream_ towards the beginning or end, respectively; in + other words, the directions if the data stream were only ASCII text. + They do not refer to the writing direction of text, which can flow in + either direction, depending on the characters used. + +3. Components + +3.1. Document + + The toplevel concept of KDL is a Document. A Document is composed of + zero or more Nodes (Section 3.2), separated by newlines and + whitespace, and eventually terminated by an EOF. + + All KDL documents should be UTF-8 encoded and conform to the + specifications in this document. + + + + + +Marchán & KDL Contributors Experimental [Page 3] + + KDL January 2025 + + +3.1.1. Example + + The following is a document composed of two toplevel nodes: + + foo { + bar + } + baz + +3.2. Node + + Being a node-oriented language means that the real core component of + any KDL document is the "node". Every node must have a name, which + must be a String (Section 3.9). + + The name may be preceded by a Type Annotation (Section 3.8) to + further clarify its type, particularly in relation to its parent + node. (For example, clarifying that a particular date child node is + for the _publication_ date, rather than the last-modified date, with + (published)date.) 
+ + Following the name are zero or more Arguments (Section 3.5) or + Properties (Section 3.4), separated by either whitespace + (Section 3.17) or a slash-escaped line continuation (Section 3.3). + Arguments and Properties may be interspersed in any order, much like + is common with positional arguments vs options in command line tools. + Collectively, Arguments and Properties may be referred to as + "Entries". + + Children (Section 3.6) can be placed after the name and the optional + Entries, possibly separated by either whitespace or a slash-escaped + line continuation. + + Arguments are ordered relative to each other and that order must be + preserved in order to maintain the semantics. Properties between + Arguments do not affect Argument ordering. + + By contrast, Properties _SHOULD NOT_ be assumed to be presented in a + given order. Children (Section 3.6) should be used if an order- + sensitive key/value data structure must be represented in KDL. Cf. + JSON objects preserving key order. + + Nodes _MAY_ be prefixed with Slashdash (Section 3.17.3) to "comment + out" the entire node, including its properties, arguments, and + children, and make it act as plain whitespace, even if it spreads + across multiple lines. + + + + + +Marchán & KDL Contributors Experimental [Page 4] + + KDL January 2025 + + + Finally, a node is terminated by either a Newline (Section 3.18), a + semicolon (;), the end of a child block (}) or the end of the file/ + stream (an EOF). + +3.2.1. Example + + // `foo` will have an Argument value list like `[1, 3]`. + foo 1 key=val 3 { + bar + (role)baz 1 2 + } + +3.3. Line Continuation + + Line continuations allow Nodes (Section 3.2) to be spread across + multiple lines. + + A line continuation is a \ character followed by zero or more + whitespace items (including multiline comments) and an optional + single-line comment. It must be terminated by a Newline + (Section 3.18) (including the Newline that is part of single-line + comments). 
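As a rough illustration of the rule above, the following Python sketch recognizes a simplified line continuation. It is not part of the specification: the names `LINE_CONTINUATION` and `is_line_continuation` are hypothetical, only space and tab are treated as whitespace, only LF, CRLF, and CR are treated as newlines, and multi-line comments are assumed to be non-nested and confined to a single line.

```python
import re

# Simplified assumptions for this sketch (not the full KDL grammar):
# whitespace is space or tab only, newlines are LF, CRLF, or CR, and
# multi-line comments neither nest nor span lines.
LINE_CONTINUATION = re.compile(
    r"\\"                      # the backslash that starts the continuation
    r"(?:[ \t]|/\*.*?\*/)*"    # zero or more whitespace items
    r"(?://[^\r\n]*)?"         # an optional single-line comment
    r"(?:\r\n|\r|\n)"          # the terminating newline
)


def is_line_continuation(text: str) -> bool:
    """Return True if `text` is exactly one (simplified) line continuation."""
    return LINE_CONTINUATION.fullmatch(text) is not None
```

A real parser would use the complete whitespace and newline definitions from this document and would also handle nested multi-line comments.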
+ + Following a line continuation, processing of a Node can continue as + usual. + +3.3.1. Example + + my-node 1 2 \ // comments are ok after \ + 3 4 // This is the actual end of the Node. + +3.4. Property + + A Property is a key/value pair attached to a Node (Section 3.2). A + Property is composed of a String (Section 3.9), followed immediately + by an equals sign (=, U+003D), and then a Value (Section 3.7). + + Properties should be interpreted left-to-right, with rightmost + properties with identical names overriding earlier properties. That + is: + + node a=1 a=2 + + In this example, the node's a value must be 2, not 1. + + No other guarantees about order should be expected by implementers. + Deserialized representations may iterate over properties in any order + and still be spec-compliant. + + + +Marchán & KDL Contributors Experimental [Page 5] + + KDL January 2025 + + + Properties _MAY_ be prefixed with /- to "comment out" the entire + token and make it act as plain whitespace, even if it spreads across + multiple lines. + +3.5. Argument + + An Argument is a bare Value (Section 3.7) attached to a Node + (Section 3.2), with no associated key. It shares the same space as + Properties (Section 3.4), and may be interleaved with them. + + A Node may have any number of Arguments, which should be evaluated + left to right. KDL implementations _MUST_ preserve the order of + Arguments relative to each other (not counting Properties). + + Arguments _MAY_ be prefixed with /- to "comment out" the entire token + and make it act as plain whitespace, even if it spreads across + multiple lines. + +3.5.1. Example + + my-node 1 2 3 a b c + +3.6. Children Block + + A children block is a block of Nodes (Section 3.2), surrounded by { + and }. They are an optional part of nodes, and create a hierarchy of + KDL nodes. + + Regular node termination rules apply, which means multiple nodes can + be included in a single-line children block, as long as they're all + terminated by ;. 
+ +3.6.1. Example + + parent { + child1 + child2 + } + + parent { child1; child2; } + +3.7. Value + + A value is either: a String (Section 3.9), a Number (Section 3.14), a + Boolean (Section 3.15), or Null (Section 3.16). + + + + + + +Marchán & KDL Contributors Experimental [Page 6] + + KDL January 2025 + + + Values _MUST_ be either Arguments (Section 3.5) or values of + Properties (Section 3.4). Only String (Section 3.9) values may be + used as Node (Section 3.2) names or Property (Section 3.4) keys. + + Values (both as arguments and in properties) _MAY_ be prefixed by a + single Type Annotation (Section 3.8). + +3.8. Type Annotation + + A type annotation is a prefix to any Node Name (Section 3.2) or Value + (Section 3.7) that includes a _suggestion_ of what type the value is + _intended_ to be treated as, or as a _context-specific elaboration_ + of the more generic type the node name indicates. + + Type annotations are written as a set of ( and ) with a single String + (Section 3.9) in it. It may contain Whitespace after the ( and + before the ), and may be separated from its target by Whitespace. + + KDL does not specify any restrictions on what implementations might + do with these annotations. They are free to ignore them, or use them + to make decisions about how to interpret a value. + + Additionally, the following type annotations MAY be recognized by KDL + parsers and, if used, SHOULD interpret these types as follows: + +3.8.1. Reserved Type Annotations for Numbers Without Decimals: + + Signed integers of various sizes (the number is the bit size): + + * i8 + + * i16 + + * i32 + + * i64 + + * i128 + + Unsigned integers of various sizes (the number is the bit size): + + * u8 + + * u16 + + * u32 + + * u64 + + + +Marchán & KDL Contributors Experimental [Page 7] + + KDL January 2025 + + + * u128 + + Platform-dependent integer types, both signed and unsigned: + + * isize + + * usize + +3.8.2. 
Reserved Type Annotations for Numbers With Decimals: + + IEEE 754 floating point numbers, both single (32) and double (64) + precision: + + * f32 + + * f64 + + IEEE 754-2008 decimal floating point numbers + + * decimal64 + + * decimal128 + +3.8.3. Reserved Type Annotations for Strings: + + * date-time: ISO8601 date/time format. + + * time: "Time" section of ISO8601. + + * date: "Date" section of ISO8601. + + * duration: ISO8601 duration format. + + * decimal: IEEE 754-2008 decimal string format. + + * currency: ISO 4217 currency code. + + * country-2: ISO 3166-1 alpha-2 country code. + + * country-3: ISO 3166-1 alpha-3 country code. + + * country-subdivision: ISO 3166-2 country subdivision code. + + * email: RFC5322 email address. + + * idn-email: RFC6531 internationalized email address. + + * hostname: RFC1123 internet hostname (only ASCII segments) + + + +Marchán & KDL Contributors Experimental [Page 8] + + KDL January 2025 + + + * idn-hostname: RFC5890 internationalized internet hostname (only xn + ---prefixed ASCII "punycode" segments, or non-ASCII segments) + + * ipv4: RFC2673 dotted-quad IPv4 address. + + * ipv6: RFC2373 IPv6 address. + + * url: RFC3986 URI. + + * url-reference: RFC3986 URI Reference. + + * irl: RFC3987 Internationalized Resource Identifier. + + * irl-reference: RFC3987 Internationalized Resource Identifier + Reference. + + * url-template: RFC6570 URI Template. + + * uuid: RFC4122 UUID. + + * regex: Regular expression. Specific patterns may be + implementation-dependent. + + * base64: A Base64-encoded string, denoting arbitrary binary data. + +3.8.4. Examples + + node (u8)123 + node prop=(regex).* + (published)date "1970-01-01" + (contributor)person name="Foo McBar" + +3.9. String + + Strings in KDL represent textual UTF-8 Values (Section 3.7). A + String is either an Identifier String (Section 3.10) (like foo), a + Quoted String (Section 3.11) (like "foo") or a Multi-Line String + (Section 3.12). 
Both Quoted and Multi-Line Strings come in normal and + Raw String (Section 3.13) variants (like #"foo"#): + + * Identifier Strings let you write short, "single-word" strings with + a minimum of syntax. + + * Quoted Strings let you write strings "like normal", with + whitespace and escapes. + + * Multi-Line Strings let you write strings across multiple lines and + with indentation that's not part of the string value. + + + +Marchán & KDL Contributors Experimental [Page 9] + + KDL January 2025 + + + * Raw Strings don't allow any escapes, allowing you to not worry + about the string's content containing anything that might look + like an escape. + + Strings _MUST_ be represented as UTF-8 values. + + Strings _MUST NOT_ include any of the disallowed literal + code points (Section 3.19) directly. Quoted and Multi-Line Strings + may include these code points as _values_ by representing them with + their corresponding \u{...} escape. + +3.10. Identifier String + + An Identifier String (sometimes referred to as just an "identifier") + is composed of any Unicode Scalar Value (https://unicode.org/ + glossary/#unicode_scalar_value) other than non-initial characters + (Section 3.10.1), followed by any number of Unicode Scalar Values + other than non-identifier characters (Section 3.10.2). + + A handful of patterns are disallowed, to avoid confusion with other + values: + + * idents that appear to start with a Number (Section 3.14) (like + 1.0v2 or -1em) or the "almost a number" pattern of a decimal point + without a leading digit (like .1). + + * idents that are the language keywords (inf, -inf, nan, true, + false, and null) without their leading #. + + Identifiers that match these patterns _MUST_ be treated as a syntax + error; such values can only be written as quoted or raw strings. The + precise details of the identifier syntax are specified in the Full + Grammar in Section 4. + +3.10.1. 
Non-initial characters + + The following characters cannot be the first character in an + Identifier String (Section 3.10): + + * Any decimal digit (0-9) + + * Any non-identifier characters (Section 3.10.2) + + Additionally, the following initial characters impose limitations on + subsequent characters: + + + + + + +Marchán & KDL Contributors Experimental [Page 10] + + KDL January 2025 + + + * the + and - characters can only be used as an initial character if + the second character is _not_ a digit. If the second character is + ., then the third character must _not_ be a digit. + + * the . character can only be used as an initial character if the + second character is _not_ a digit. + + This allows identifiers to look like --this or .md, and removes the + ambiguity of having an identifier look like a number. + +3.10.2. Non-identifier characters + + The following characters cannot be used anywhere in an Identifier + String (Section 3.10): + + * Any of (){}[]/\"#;= + + * Any Whitespace (Section 3.17) or Newline (Section 3.18). + + * Any disallowed literal code points (Section 3.19) in KDL + documents. + +3.11. Quoted String + + A Quoted String is delimited by " on either side of any number of + literal string characters except unescaped " and \. + + Literal Newline (Section 3.18) characters can only be included if + they are Escaped Whitespace (Section 3.11.1.1), which discards them + from the string value. Actually including a newline in the value + requires using a newline escape sequence, like \n, or using a Multi- + Line String (Section 3.12), which is designed for strings + stretching across multiple lines. + + Like Identifier Strings, Quoted Strings _MUST NOT_ include any of the + disallowed literal code-points (Section 3.19) as code points in their + body. + + Quoted Strings have a Raw String (Section 3.13) variant, which + disallows escapes. + +3.11.1. 
Escapes + + In addition to literal code points, a number of "escapes" are + supported in Quoted Strings. "Escapes" are the character \ followed + by another character, and are interpreted as described in the + following table: + + + + +Marchán & KDL Contributors Experimental [Page 11] + + KDL January 2025 + + + +==============+=========+=========================================+ + | Name | Escape | Code Pt | + +==============+=========+=========================================+ + | Line Feed | \n | U+000A | + +--------------+---------+-----------------------------------------+ + | Carriage | \r | U+000D | + | Return | | | + +--------------+---------+-----------------------------------------+ + | Character | \t | U+0009 | + | Tabulation | | | + | (Tab) | | | + +--------------+---------+-----------------------------------------+ + | Reverse | \\ | U+005C | + | Solidus | | | + | (Backslash) | | | + +--------------+---------+-----------------------------------------+ + | Quotation | \" | U+0022 | + | Mark (Double | | | + | Quote) | | | + +--------------+---------+-----------------------------------------+ + | Backspace | \b | U+0008 | + +--------------+---------+-----------------------------------------+ + | Form Feed | \f | U+000C | + +--------------+---------+-----------------------------------------+ + | Space | \s | U+0020 | + +--------------+---------+-----------------------------------------+ + | Unicode | \u{(1-6 | Code point described by hex characters, | + | Escape | hex | as long as it represents a Unicode | + | | chars)} | Scalar Value (https://unicode.org/ | + | | | glossary/#unicode_scalar_value) | + +--------------+---------+-----------------------------------------+ + | Whitespace | See | N/A | + | Escape | below | | + +--------------+---------+-----------------------------------------+ + + Table 1 + +3.11.1.1. Escaped Whitespace + + In addition to escaping individual characters, \ can also escape + whitespace. 
When a \ is followed by one or more literal whitespace + characters, the \ and all of that whitespace are discarded. For + example, + + "Hello World" + + and + + + + +Marchán & KDL Contributors Experimental [Page 12] + + KDL January 2025 + + + "Hello \ World" + + are semantically identical. See whitespace (Section 3.17) and + newlines (Section 3.18) for how whitespace is defined. + + Note that only literal whitespace is escaped; whitespace escapes (\n + and such) are retained. For example, these strings are all + semantically identical: + + "Hello\ \nWorld" + + "Hello\n\ + World" + + "Hello\nWorld" + + """ + Hello + World + """ + +3.11.1.2. Invalid escapes + + Except as described in the escapes table, above, \ _MUST NOT_ precede + any other characters in a string. + +3.12. Multi-line String + + Multi-Line Strings support multiple lines with literal, non-escaped + Newlines. They must use a special multi-line syntax, and they + automatically "dedent" the string, allowing its value to be indented + to a visually matching level as desired. + + A Multi-Line String is opened and closed by _three_ double-quote + characters, like """. Its first line _MUST_ immediately start with a + Newline (Section 3.18) after its opening """. Its final line _MUST_ + contain only whitespace before the closing """. All in-between lines + that contain non-newline, non-whitespace characters _MUST_ start with + _at least_ the exact same whitespace as the final line (precisely + matching codepoints, not merely counting characters or "size"); they + may contain additional whitespace following this prefix. The lines + in between may contain unescaped " (but no unescaped """ as this + would close the string). + + The value of the Multi-Line String omits the first and last Newline, + the Whitespace of the last line, and the matching Whitespace prefix + on all intermediate lines. The first and last Newline can be the + same character (that is, empty multi-line strings are legal). 
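The dedent rule described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not part of the specification: the helper name `dedent_multiline` is hypothetical, it takes the raw text between the opening and closing `"""`, it treats only space and tab as whitespace and only LF, CRLF, and CR as newlines, and it ignores escapes entirely.

```python
import re

# Simplified assumptions for this sketch: only LF, CRLF, and CR count as
# newlines, and only space and tab count as whitespace.
NEWLINE = re.compile(r"\r\n|\r|\n")
WS = " \t"


def dedent_multiline(body: str) -> str:
    """Compute the value of a multi-line string from the raw text found
    between its opening and closing triple quotes (hypothetical helper)."""
    lines = NEWLINE.split(body)
    # The body must begin with a Newline, so the first piece is empty.
    if len(lines) < 2 or lines[0] != "":
        raise ValueError("multi-line string must start with a newline")
    # The final line may hold only whitespace; it defines the prefix.
    prefix = lines[-1]
    if prefix.strip(WS) != "":
        raise ValueError("closing quotes must be preceded only by whitespace")
    out = []
    for line in lines[1:-1]:
        if line.strip(WS) == "":
            # Whitespace-only lines need not carry the prefix; they are
            # represented as empty lines in the value.
            out.append("")
        elif line.startswith(prefix):
            # Remove exactly the prefix; extra indentation is kept.
            out.append(line[len(prefix):])
        else:
            raise ValueError("line does not start with the closing line's prefix")
    # Newlines inside the value are joined as a single LF.
    return "\n".join(out)
```

For instance, `dedent_multiline("\n")` returns the empty string, which corresponds to the case where the first and last Newline are the same character.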
+ + + +Marchán & KDL Contributors Experimental [Page 13] + + KDL January 2025 + + + In other words, the final line specifies the whitespace prefix that + will be removed from all other lines. + + Multi-line Strings that do not immediately start with a Newline, or + whose final """ is not preceded by a Newline followed only by optional + whitespace, are illegal. This also means that """ may not be used for a single- + line String (e.g. """foo"""). + +3.12.1. Newline Normalization + + Literal Newline sequences in Multi-line Strings must be normalized to + a single U+000A (LF) during deserialization. This means, for + example, that CR LF becomes a single LF during parsing. + + This normalization does not apply to non-literal Newlines entered + using escape sequences. That is: + + multi-line """ + \r\n[CRLF] + foo[CRLF] + """ + + becomes: + + single-line "\r\n\nfoo" + + For clarity: this normalization applies to each individual Newline + sequence. That is, the literal sequence CRLF CRLF becomes LF LF, not + LF. + +3.12.2. Examples + +3.12.2.1. Indented multi-line string + + multi-line """ + foo + This is the base indentation + bar + """ + + This example's string value will be: + + foo + This is the base indentation + bar + + which is equivalent to + + + + +Marchán & KDL Contributors Experimental [Page 14] + + KDL January 2025 + + + " foo\nThis is the base indentation\n bar" + + when written as a single-line string. + +3.12.2.2. Shorter last-line indent + + If the last line wasn't indented as far, it won't dedent the rest of + the lines as much: + + multi-line """ + foo + This is no longer on the left edge + bar + """ + + This example's string value will be: + + foo + This is no longer on the left edge + bar + + Equivalent to + + " foo\n This is no longer on the left edge\n bar" + +3.12.2.3. Empty lines + + Empty lines can contain any whitespace, or none at all, and will be + reflected as empty in the value: + + multi-line """ + Indented a bit. + + A second indented paragraph. 
+ """ + + This example's string value will be: + + Indented a bit. + + A second indented paragraph. + + Equivalent to + + "Indented a bit.\n\nA second indented paragraph." + + + + + + +Marchán & KDL Contributors Experimental [Page 15] + + KDL January 2025 + + +3.12.2.4. Syntax errors + + The following yield *syntax errors*: + + multi-line """can't be single line""" + + multi-line """ + closing quote with non-whitespace prefix""" + + multi-line """stuff + """ + + // Every line must share the exact same prefix as the closing line. + multi-line """[\n] + [tab]a[\n] + [space][space]b[\n] + [space][tab][\n] + [tab]""" + +3.12.3. Interaction with Whitespace Escapes + + Multi-line strings support the same mechanism for escaping whitespace + as Quoted Strings. + + When processing a Multi-line String, implementations MUST dedent the + string _after_ resolving all whitespace escapes, but _before_ + resolving other backslash escapes. This means a whitespace escape + that attempts to escape the final line's newline and/or whitespace + prefix can be invalid: if removing escaped whitespace places the + closing """ on a line with non-whitespace characters, this escape is + invalid. + + For example, the following example is illegal: + + """ + foo + bar\ + """ + + // equivalent to + """ + foo + bar""" + + while the following example is allowed + + + + + + +Marchán & KDL Contributors Experimental [Page 16] + + KDL January 2025 + + + """ + foo \ + bar + baz + \ """ + + // equivalent to + """ + foo bar + baz + """ + +3.13. Raw String + + Both Quoted (Section 3.11) and Multi-Line Strings (Section 3.12) have + Raw String variants, which are identical in syntax except they do not + support \-escapes. This includes line-continuation escapes (\ + ws + collapsing to nothing). They otherwise share the same properties as + far as literal Newline (Section 3.18) characters go, multi-line + rules, and the requirement of UTF-8 representation. 
+ + The Raw String variants are indicated by preceding the string's + opening quotes with one or more # characters. The string is then + closed by its normal closing quotes, followed by a _matching_ number + of # characters. This means that the string may contain any + combination of " and # characters other than its closing delimiter + (e.g., if a raw string starts with ##", it can contain " or "#, but + not "## or "###). + + Like other Strings, Raw Strings _MUST NOT_ include any of the + disallowed literal code-points (Section 3.19) as code points in their + body. Unlike with Quoted Strings, these cannot simply be escaped, + and are thus unrepresentable when using Raw Strings. + +3.13.1. Example + + just-escapes #"\n will be literal"# + + The string contains the literal characters \n will be literal. + + quotes-and-escapes ##"hello\n\r\asd"#world"## + + The string contains the literal characters hello\n\r\asd"#world + + + + + + + + +Marchán & KDL Contributors Experimental [Page 17] + + KDL January 2025 + + + raw-multi-line #""" + Here's a """ + multiline string + """ + without escapes. + """# + + The string contains the value + + Here's a """ + multiline string + """ + without escapes. + + or equivalently, + + "Here's a \"\"\"\n multiline string\n \"\"\"\nwithout escapes." + + as a Quoted String. + +3.14. Number + + Numbers in KDL represent numerical Values (Section 3.7). There is no + logical distinction in KDL between real numbers, integers, and + floating point numbers. It's up to individual implementations to + determine how to represent KDL numbers. + + There are five syntaxes for Numbers: Keywords, Decimal, Hexadecimal, + Octal, and Binary. + + * All non-Keyword (Section 3.14.1) numbers may optionally start with + one of - or +, which determines whether they'll be negative or + positive. + + * Binary numbers start with 0b and only allow 0 and 1 as digits, + which may be separated by _. They represent numbers in radix 2. 
+ + * Octal numbers start with 0o and only allow digits between 0 and 7, + which may be separated by _. They represent numbers in radix 8. + + * Hexadecimal numbers start with 0x and allow digits between 0 and + 9, as well as letters A through F, in either lower or upper case, + which may be separated by _. They represent numbers in radix 16. + + * Decimal numbers are a bit more special: + + - They have no radix prefix. + + + + +Marchán & KDL Contributors Experimental [Page 18] + + KDL January 2025 + + + - They use digits 0 through 9, which may be separated by _. + + - They may optionally include a decimal separator ., followed by + more digits, which may again be separated by _. + + - They may optionally be followed by E or e, an optional - or +, + and more digits, to represent an exponent value. + + Note that, similar to JSON and some other languages, numbers without + an integer digit (such as .1) are illegal. They must be written with + at least one integer digit, like 0.1. (These patterns are also + disallowed from Identifier Strings (Section 3.10), to avoid + confusion.) + +3.14.1. Keyword Numbers + + There are three special "keyword" numbers included in KDL to + accommodate the widespread use of IEEE 754 + (https://en.wikipedia.org/wiki/IEEE_754) floats: + + * #inf - floating point positive infinity. + + * #-inf - floating point negative infinity. + + * #nan - floating point NaN/Not a Number. + + To go along with this and prevent foot guns, the bare Identifier + Strings (Section 3.10) inf, -inf, and nan are considered illegal + identifiers and should yield a syntax error. + + The existence of these keywords does not imply that numbers must be + represented as IEEE 754 floats. These are simply for clarity and + convenience for any implementation that chooses to represent its + numbers in this way. + +3.15. Boolean + + A boolean Value (Section 3.7) is either the symbol #true or #false. 
+ These _SHOULD_ be represented by implementation as boolean logical + values, or some approximation thereof. + +3.15.1. Example + + my-node #true value=#false + + + + + + + +Marchán & KDL Contributors Experimental [Page 19] + + KDL January 2025 + + +3.16. Null + + The symbol #null represents a null Value (Section 3.7). It's up to + the implementation to decide how to represent this, but it generally + signals the "absence" of a value. + +3.16.1. Example + + my-node #null key=#null + +3.17. Whitespace + + The following characters should be treated as non-Newline + (Section 3.18) white space + (https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt): + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Marchán & KDL Contributors Experimental [Page 20] + + KDL January 2025 + + + +===========================+=========+ + | Name | Code Pt | + +===========================+=========+ + | Character Tabulation | U+0009 | + +---------------------------+---------+ + | Space | U+0020 | + +---------------------------+---------+ + | No-Break Space | U+00A0 | + +---------------------------+---------+ + | Ogham Space Mark | U+1680 | + +---------------------------+---------+ + | En Quad | U+2000 | + +---------------------------+---------+ + | Em Quad | U+2001 | + +---------------------------+---------+ + | En Space | U+2002 | + +---------------------------+---------+ + | Em Space | U+2003 | + +---------------------------+---------+ + | Three-Per-Em Space | U+2004 | + +---------------------------+---------+ + | Four-Per-Em Space | U+2005 | + +---------------------------+---------+ + | Six-Per-Em Space | U+2006 | + +---------------------------+---------+ + | Figure Space | U+2007 | + +---------------------------+---------+ + | Punctuation Space | U+2008 | + +---------------------------+---------+ + | Thin Space | U+2009 | + +---------------------------+---------+ + | Hair Space | U+200A | + +---------------------------+---------+ + | Narrow No-Break Space | 
U+202F | + +---------------------------+---------+ + | Medium Mathematical Space | U+205F | + +---------------------------+---------+ + | Ideographic Space | U+3000 | + +---------------------------+---------+ + + Table 2 + +3.17.1. Single-line comments + + Any text after //, until the next literal Newline (Section 3.18), is + "commented out", and is considered to be Whitespace (Section 3.17). + + + + + +Marchán & KDL Contributors Experimental [Page 21] + + KDL January 2025 + + +3.17.2. Multi-line comments + + In addition to single-line comments using //, comments can also be + started with /* and ended with */. These comments can span multiple + lines. They are allowed in all positions where Whitespace + (Section 3.17) is allowed and can be nested. + +3.17.3. Slashdash comments + + Finally, a special kind of comment called a "slashdash", denoted by + /-, can be used to comment out entire _components_ of a KDL document + logically, and have those elements not be included as part of the + parsed document data. + + Slashdash comments can be used before the following, including before + their type annotations, if present: + + * A Node (Section 3.2): the entire Node is treated as Whitespace, + including all props, args, and children. + + * An Argument (Section 3.5): the Argument value is treated as + Whitespace. + + * A Property (Section 3.4) key: the entire property, including both + key and value, is treated as Whitespace. A slashdash of just the + property value is not allowed. + + * A Children Block (Section 3.6): the entire block, including all + children within, is treated as Whitespace. Only other children + blocks, whether slashdashed or not, may follow a slashdashed + children block. + + A slashdash may be followed by any amount of whitespace, including + newlines and comments (other than other slashdashes), before the + element that it comments out. + +3.18. 
Newline + + The following character sequences should be treated as new lines + (https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter- + 5/#G41643): + + + + + + + + + + +Marchán & KDL Contributors Experimental [Page 22] + + KDL January 2025 + + + +=========+===============================+=================+ + | Acronym | Name | Code Pt | + +=========+===============================+=================+ + | CRLF | Carriage Return and Line Feed | U+000D + U+000A | + +---------+-------------------------------+-----------------+ + | CR | Carriage Return | U+000D | + +---------+-------------------------------+-----------------+ + | LF | Line Feed | U+000A | + +---------+-------------------------------+-----------------+ + | NEL | Next Line | U+0085 | + +---------+-------------------------------+-----------------+ + | VT | Vertical Tab | U+000B | + +---------+-------------------------------+-----------------+ + | FF | Form Feed | U+000C | + +---------+-------------------------------+-----------------+ + | LS | Line Separator | U+2028 | + +---------+-------------------------------+-----------------+ + | PS | Paragraph Separator | U+2029 | + +---------+-------------------------------+-----------------+ + + Table 3 + + Note that for the purpose of new lines, the specific sequence CRLF is + considered _a single newline_. + +3.19. Disallowed Literal Code Points + + The following code points may not appear literally anywhere in the + document. They may be represented in Strings (but not Raw Strings) + using Unicode Escapes (Section 3.11.1) (\u{...}), except for code + points that are not Unicode Scalar Values, which cannot be + represented even as escapes. + + * The codepoints U+0000-0008 and the codepoints U+000E-001F (various + control characters). + + * U+007F (the Delete control character). + + * Any codepoint that is not a Unicode Scalar Value + (https://unicode.org/glossary/#unicode_scalar_value) + (U+D800-DFFF). 
+ + * U+200E-200F, U+202A-202E, and U+2066-2069, the Unicode "direction + control" characters (https://www.w3.org/International/questions/ + qa-bidi-unicode-controls). + + * U+FEFF, aka Zero-width Non-breaking Space (ZWNBSP)/Byte Order Mark + (BOM), except as the first code point in a document. + + + + +Marchán & KDL Contributors Experimental [Page 23] + + KDL January 2025 + + +4. Full Grammar + + This is the full official grammar for KDL and should be considered + authoritative if something seems to disagree with the text above. + The grammar language syntax is defined in Section 4.1. + + document := bom? version? nodes + + // Nodes + nodes := (line-space* node)* line-space* + + base-node := slashdash? type? node-space* string + (node-space+ slashdash? node-prop-or-arg)* + // slashdashed node-children must always be after props and args. + (node-space+ slashdash node-children)* + (node-space+ node-children)? + (node-space+ slashdash node-children)* + node-space* + node := base-node node-terminator + final-node := base-node node-terminator? + + // Entries + node-prop-or-arg := prop | value + node-children := '{' nodes final-node? '}' + node-terminator := single-line-comment | newline | ';' | eof + + prop := string node-space* '=' node-space* value + value := type? node-space* (string | number | keyword) + type := '(' node-space* string node-space* ')' + + // Strings + string := identifier-string | quoted-string | raw-string ¶ + + identifier-string := unambiguous-ident | signed-ident | dotted-ident + unambiguous-ident := + ((identifier-char - digit - sign - '.') identifier-char*) + - disallowed-keyword-identifiers + signed-ident := + sign ((identifier-char - digit - '.') identifier-char*)? + dotted-ident := + sign? '.' ((identifier-char - digit) identifier-char*)? 
+ identifier-char := + unicode - unicode-space - newline - [\\/(){};\[\]"#=] + - disallowed-literal-code-points + disallowed-keyword-identifiers := + 'true' | 'false' | 'null' | 'inf' | '-inf' | 'nan' + + quoted-string := + + + +Marchán & KDL Contributors Experimental [Page 24] + + KDL January 2025 + + + '"' single-line-string-body '"' | + '"""' newline + (multi-line-string-body newline)? + (unicode-space | ws-escape)* '"""' + single-line-string-body := (string-character - newline)* + multi-line-string-body := (('"' | '""')? string-character)* + string-character := + '\\' (["\\bfnrts] | + 'u{' hex-unicode '}') | + ws-escape | + [^\\"] - disallowed-literal-code-points + ws-escape := '\\' (unicode-space | newline)+ + hex-digit := [0-9a-fA-F] + hex-unicode := hex-digit{1, 6} - surrogates + surrogates := [dD][8-9a-fA-F]hex-digit{2} + // U+D800-DFFF: D 8 00 + // D F FF + + raw-string := '#' raw-string-quotes '#' | '#' raw-string '#' + raw-string-quotes := + '"' single-line-raw-string-body '"' | + '"""' newline + (multi-line-raw-string-body newline)? + unicode-space* '"""' + single-line-raw-string-body := + '' | + (single-line-raw-string-char - '"') + single-line-raw-string-char*? | + '"' (single-line-raw-string-char - '"') + single-line-raw-string-char*? + single-line-raw-string-char := + unicode - newline - disallowed-literal-code-points + multi-line-raw-string-body := + (unicode - disallowed-literal-code-points)*? + + // Numbers + number := keyword-number | hex | octal | binary | decimal + + decimal := sign? integer ('.' integer)? exponent? + exponent := ('e' | 'E') sign? integer + integer := digit (digit | '_')* + digit := [0-9] + sign := '+' | '-' + + hex := sign? '0x' hex-digit (hex-digit | '_')* + octal := sign? '0o' [0-7] [0-7_]* + binary := sign? '0b' ('0' | '1') ('0' | '1' | '_')* + + + + +Marchán & KDL Contributors Experimental [Page 25] + + KDL January 2025 + + + // Keywords and booleans. 
+ keyword := boolean | '#null' + keyword-number := '#inf' | '#-inf' | '#nan' + boolean := '#true' | '#false' + + // Specific code points + bom := '\u{FEFF}' + disallowed-literal-code-points := + See Table (Disallowed Literal Code Points) + unicode := Any Unicode Scalar Value + unicode-space := See Table + (All White_Space unicode characters which are not `newline`) + + // Comments + single-line-comment := '//' ^newline* (newline | eof) + multi-line-comment := '/*' commented-block + commented-block := + '*/' | (multi-line-comment | '*' | '/' | [^*/]+) commented-block + slashdash := '/-' line-space* + + // Whitespace + ws := unicode-space | multi-line-comment + escline := '\\' ws* (single-line-comment | newline | eof) + newline := See Table (All Newline White_Space) + // Whitespace where newlines are allowed. + line-space := node-space | newline | single-line-comment + // Whitespace within nodes, + // where newline-ish things must be esclined. + node-space := ws* escline ws* | ws+ + + // Version marker + version := + '/-' unicode-space* 'kdl-version' unicode-space+ ('1' | '2') + unicode-space* newline + +4.1. Grammar language + + The grammar language syntax is a combination of ABNF with some regex + spice thrown in. Specifically: + + * Single quotes (') are used to denote literal text. \ within a + literal string is used for escaping other single-quotes, for + initiating unicode characters using hex values (\u{FEFF}), and for + escaping \ itself (\\). + + + + + + + +Marchán & KDL Contributors Experimental [Page 26] + + KDL January 2025 + + + * * is used for "zero or more", + is used for "one or more", and ? + is used for "zero or one". Per standard regex semantics, * and + + are _greedy_; they match as many instances as possible without + failing the match. + + * *? (used only in raw strings) indicates a _non-greedy_ match; it + matches as _few_ instances as possible without failing the match. + + * ¶ is a _cut point_. 
It always matches and consumes no characters, + but once matched, the parser is not allowed to backtrack past that + point in the source. If a parser would rewind past the cut point, + it must instead fail the overall parse, as if it had run out of + options. (This is only used with the raw-string production, to + ensure the first instance of the appropriate closing quote + sequence is guaranteed to be the end of the raw string, rather + than allowing it to potentially consume more of the document + unexpectedly.) + + * () can be used to group matches that must be matched together. + + * a | b means a or b, whichever matches first. If multiple items + are before a |, they are a single group. a b c | d is equivalent + to (a b c) | d. + + * [] are used for regex-style character matches, where any character + between the brackets will be a single match. \ is used to escape + \, [, and ]. They also support character ranges (0-9) and + negation (^). + + * - is used for "except for" or "minus" whatever follows it. For + example, a - 'x' means "any a, except something that matches the + literal 'x'". + + * The prefix ^ means "something that does not match" whatever + follows it. For example, ^foo means "must not match foo". + + * A single definition may be split over multiple lines. Newlines + are treated as spaces. + + * // followed by text on its own line is used as comment syntax. + +Authors' Addresses + + Katerina Zoé Marchán Salvá + Microsoft + + + The KDL Contributors + + + +Marchán & KDL Contributors Experimental [Page 27]