Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

VM handling of leading whitespace in multiline strings does not match the spec. #23020

Open
bwilkerson opened this issue Mar 27, 2015 · 23 comments
Assignees
Labels
area-front-end Use area-front-end for front end / CFE / kernel format related issues. P4 type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)

Comments

@bwilkerson
Copy link
Member

bwilkerson commented Mar 27, 2015

In section 16.5, the specification says:

If the first line of a multiline string consists solely of the whitespace characters defined by the production WHITESPACE 20.1, possibly prefixed by \, then that line is ignored, including the new line at its end.

The specification doesn't state this explicitly, but I believe that this means that leading whitespace should be removed before any other processing has been done, such as processing escape sequences (otherwise, the whitespace characters would no longer be prefixed by backslashes). The VM appears to be processing escape characters and even performing interpolation before removing leading whitespace.

I have a test case (language/multiline_strings_test) that I will commit after this issue is created (I need the issue number for the status files).

@bwilkerson
Copy link
Member Author

The test was added in https://codereview.chromium.org/1042523004/.

@DartBot
Copy link

DartBot commented Mar 27, 2015

This comment was originally written by @mhausner


Set owner to @mhausner.
Added Accepted label.

@DartBot
Copy link

DartBot commented Mar 31, 2015

This comment was originally written by @mhausner


I don't think this bug (and the test referenced in comment 1) is valid.

In Dart string literals, a backslash followed by a character that is not a recognized escape sequence is identical to the character, i.e. the backslash is swallowed. "\h" is the same as "h".

The multiline string """
bla"""

Is the same as """bla""", but """\
bla"""

is not the same. The first \ converts to a single , which is not a whitespace character. The single \ that results from this escape sequence is not used again to build escape sequences with the following characters.


cc @gbracha.

@bwilkerson
Copy link
Member Author

The first \ converts to a single , which is not a whitespace character. The single \ that results from
this escape sequence is not used again to build escape sequences with the following characters.

I'm not sure what you're referring to. Nowhere does the test assume that the result of applying an escape sequence would then be combined with the following character as another escape sequence.

@bwilkerson
Copy link
Member Author

It appears that we need you to weigh in on whether the added test correctly matches the intent of the specification. If it does, please re-assign this issue to hausner@google.com.


cc @mhausner.
Set owner to @gbracha.

@gbracha
Copy link
Contributor

gbracha commented Apr 1, 2015

I commented in the review. What we do next depends less on the spec than on existing behavior of both VM and dart2js.

@stereotype441
Copy link
Member

Gilad--

I visited https://codereview.chromium.org/1042523004/ but I don't see any comments from you. Is it possible that your comments are still in draft form and haven't been sent yet?

@gbracha
Copy link
Contributor

gbracha commented Apr 2, 2015

Yeah, I forgot to publish the comment. The more I think about it, the more I think that spec pedantry is not the point.

We need to do what ordinary people can understand. I think the VM is ok in this case, and the test is flawed. Re-assigning to analyzer.


Removed Area-VM label.
Added Area-Analyzer label.

@bwilkerson
Copy link
Member Author

bwilkerson commented Apr 3, 2015

You wrote:

The idea was (I believe) to ignore invisible stuff people may have typed after
the opening of the string. Maybe we should have left it at that, but someone
noted that say, tab and \ followed by tab were the same. But the argument would
apply to \t. It is also much easier to understand that way.

The wording should be clarified, but basically, if it any kind of whitespace,
you ignore it. So while Paul and Brian's interpretation is sensible vis-a-vis
the wording, I don't think an ordinary human would expect such subtlety.

I think the idea was (or should have been) to allow people to left-justify the first character of the string so that the alignment of characters in the resulting string would be obvious when looking at the source code. For example, consider the following:

  WindowManager.openModelDialog(title: "Error", message: """You obviously don't know what you're doing
and shouldn't be using a computer!""");

compared to:

  WindowManager.openModelDialog(title: "Error", message: """
You obviously don't know what you're doing
and shouldn't be using a computer!""");

The reason for allowing whitespace before the end-of-line marker to be removed is because users can't see it and therefore won't realize that it is there (and that the string they think they're getting isn't what they're actually getting).

It seems to me that it would be much cleaner if the spec said:

If the first line of a multiline string, before escape characters are processed and before interpolation occurs, consists solely of the whitespace characters defined by the production WHITESPACE, then that line is ignored, including the new line at its end.

Honestly, how many people are going to say to themselves "I want my string to contain a leading tab character that will be stripped out, but if I don't use an escape sequence nobody will be able to tell that it's there, so I'll explicitly type '\t'"? :-) Obviously that's a silly example, but what's the use case for wanting some escape sequence to be ignored? Similarly, what's the use case for wanting white space to be stripped out after interpolation?


Removed Area-Analyzer label.
Added Area-Language label.

@gbracha
Copy link
Contributor

gbracha commented Apr 3, 2015

As I said, maybe we should have left it at that. But we didn't because someone had an issue. We don't change the spec as easily as in the past. So the question is how to interpret it, not whether to change it. I've interpreted the existing spec in the way that is most obvious for users.


Removed Area-Language label.
Added Area-Analyzer label.

@bwilkerson
Copy link
Member Author

cc @gbracha.
Removed the owner.
Removed Priority-Unassigned label.
Added Priority-Medium, Triaged labels.

@lrhn
Copy link
Member

lrhn commented Apr 8, 2015

If the specification requires an external interpretation to decide what it means, then I do think the specification should be changed. Otherwise it's just ambiguous, which, for a spec, means it's broken.
Per the broken window theory, I don't think we should let even small broken stuff stay broken, no matter how insignificant it might seem on a grand scale.

And I'd vote for changing it to ignore only whitespace, no backslashes.

@bwilkerson
Copy link
Member Author

I've interpreted the existing spec in the way that is most obvious for users.

For the record, I strongly disagree. I don't think most users will expect character escapes and interpolation to be performed before leading whitespace is removed.

@gbracha
Copy link
Contributor

gbracha commented Apr 8, 2015

The spec will be clarified. It won't be changed. The bar for spec changes is higher than in the early wild west days.

@lrhn
Copy link
Member

lrhn commented May 4, 2015

If "clarifying" the spec will make one current implementation invalid, it's not just a clarification. The spec is (at best) ambiguous, and disambiguating in one direction isn't any worse than disambiguating it in the other direction.
Whatever the intent of the current wording, the intent is not in the spec.
And I, for one, do not find "escapes like \t are ignored" to be the most obvious interpretation, because there is no reason for preferring that interpretation.
The spec refers to WHITESPACE which is a tokenization concept, not a parser concept. I expect that to only be matched by actual WHITESPACE characters in the source.

So I'm arguing that the disambiguation should pick the sanest interpretation, which is to not allow \t escapes as part of the "invisible initial whitespace" of a multiline string. It'll still just be a clarification, it's just clarifying that this was the intent. Interpreting escapes first also opens the can on interpreting interpolations first: """${' '}""" - should that space be removed as well? (It's a single compile-time constant, so it is possible).

Also, if a multi-line string is on a single line, the spec also requires its content to be ignored if it's pure WHITESPACE, for example """ """. It's probably only intended to happen if the string literal actually contains a NEWLINE character (outside of interpolations).

And while I appreciate that this might not be the most important part of the Dart language, anything the spec says must be implemented by the implementations as well, and explained to users. I can't explain that """\t""" is an empty string :)

@lrhn
Copy link
Member

lrhn commented May 4, 2015

See also issue #14073 and issue #19514.
I accept that backslash-before-WHITESPACE is now spec, and it allows people with C-style habits to write:
  var s = """\
    long lines of long text,
    and so""";
(which is of dubious value, but it's there).

For the record, I'd consider CR LF a NEWLINE for the tokenizer, but BACKSLASH CR BACKSLASH NL is two individual NEWLINE characters, each prefixed by backslash. The latter backslash should be part of the multiline string.

@lrhn
Copy link
Member

lrhn commented May 6, 2015

And another curiosity.
The spec says "These can be represented directly, but since for most characters
prefixing by backslash is an identity, we allow those forms as well."
However, the wording applies to both raw and cooked strings, so the following:
  r"""\ \
raw string""";
will ignore backslashes (like they were identity escapes) in a raw string where backslashes are not escape characters!

I think the "prefix by " should be changed to at least only apply to cooked strings.
(I'd prefer to remove it completely, but I can wait for Dart 2.0 for that).

@bwilkerson bwilkerson added Type-Defect area-analyzer Use area-analyzer for Dart analyzer issues, including the analysis server and code completion. labels May 6, 2015
@bwilkerson bwilkerson removed the Triaged label Nov 4, 2015
@kevmoo kevmoo added P2 A bug or feature request we're likely to work on type-bug Incorrect behavior (everything from a crash to more subtle misbehavior) and removed Priority-Medium labels Mar 1, 2016
@natebosch
Copy link
Member

Where are we at with this? Is it still a useful issue to have open?

@srawlins srawlins closed this as completed Jun 8, 2020
@lrhn
Copy link
Member

lrhn commented Jun 9, 2020

It still not following the spec.
The spec says:

If the first line of a multiline string consists solely of the whitespace characters defined by the production **WHITESPACE** possibly prefixed by `\`, then that line is ignored, including the line break at its end.

The "line" is ended by the first occurrence of CR+LF, CR (not followed by LF) or LF.
If the string literal is

var x = """\
x""";

(That is: quotes-backslash-linefeed), then all characters before the line end are not whitespace potentially prefixed by backslash, because there is a backslash which is not followed by a whitespace on that line. Both analyzer and VM accepts this.

We can then argue whether backslash escaping should happen before or after line splitting. On the "before' case, the backslash-linefeed reduces to linefeed, then you break before that linefeed, but the \n should also become a linefeed, and we won't break at that.
In the after case, there is a backslash on the first line, prior to the first line break, which is not before a space.

I'd still prefer to completely disallow backslashes on the first line - we ignore the first line of a multiline string if it contains nothing by whitespace characters. It's a tokenization issue to determine when the string contents start. Having to do string escape interpretation in order to figure out where the string starts leaves us with questions like the above, and we'd be better off just not worrying.

@lrhn lrhn reopened this Jun 9, 2020
@srawlins srawlins added the analyzer-spec Issues with the analyzer's implementation of the language spec label Jun 17, 2020
@srawlins
Copy link
Member

Everything in the comments here makes me believe this is a bug for back-ends (or maybe CFE). I'm not sure why it's tagged as analyzer. Tentatively tagging with VM.

@srawlins srawlins added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. and removed area-analyzer Use area-analyzer for Dart analyzer issues, including the analysis server and code completion. analyzer-spec Issues with the analyzer's implementation of the language spec labels Apr 17, 2023
@alexmarkov
Copy link
Contributor

Back-ends don't parse Dart sources. This looks like a parser / CFE issue.
/cc @johnniwinther

@alexmarkov alexmarkov added area-front-end Use area-front-end for front end / CFE / kernel format related issues. and removed area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. labels Jun 14, 2023
@jensjoha
Copy link
Contributor

jensjoha commented Aug 3, 2023

So @lrhn what do we actually want here?

@lrhn
Copy link
Member

lrhn commented Aug 3, 2023

Honestly: I want to change the spec.
This is such a gnarly and idiosyncratic corner of the syntax, that I doubt anyone ever uses it. This issue documents that it's gnarly and not even implemented according to spec. And nobody noticed.
So it's mainly supporting evidence that changing the spec shouldn't matter

It's not really worth spending much time on being spec compliant for something that is never used, so I'd mark it P3 or P4 and carry on, unless you really want to work on it.
(Also because the current behavior is probably admissible by the spec, since it's written so vaguely. I don't think we allow interpolations any more, so it's just the question about whether to allow a backsplash before the newline. And then I really don't care.)

@jensjoha jensjoha added P4 and removed P2 A bug or feature request we're likely to work on labels Aug 3, 2023
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
area-front-end Use area-front-end for front end / CFE / kernel format related issues. P4 type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)
Projects
None yet
Development

No branches or pull requests

10 participants