Skip to content

Cleanup of eat_while() in lexer #77629

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 1 commit into from
Oct 10, 2020

Conversation

Julian-Wollersberger
Copy link
Contributor

@Julian-Wollersberger Julian-Wollersberger commented Oct 6, 2020

The size of a lexer Token was inflated by the largest TokenKind variants LiteralKind::RawStr and RawByteStr, because

  • it used usize although u32 is sufficient in rustc, since crates must be smaller than 4GB,
  • and it stored the 20 bytes big RawStrError enum for error reporting.

If a raw string is invalid, it now needs to be reparsed to get the RawStrError data, but that is a very cold code path.

Technically this breaks other tools that depend on rustc_lexer because they are now also restricted to a max file size of 4GB. But this shouldn't matter in practice, and rustc_lexer isn't stable anyway.

Can I also get a perf run?

Edit: This makes no difference in performance. The PR now only contains a small cleanup.

@rust-highfive
Copy link
Contributor

r? @varkor

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 6, 2020
@jonas-schievink
Copy link
Contributor

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Collaborator

bors commented Oct 6, 2020

⌛ Trying commit 2cc461ba86901295d9b706d7e39a15e3d4efa6cf with merge 42c8db8af9dcb6f65f03cbbdf104da7ed2fc2018...

@camelid camelid added A-parser Area: The lexing & parsing of Rust source code to an AST I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. labels Oct 6, 2020
@bors
Copy link
Collaborator

bors commented Oct 7, 2020

💥 Test timed out

@Julian-Wollersberger
Copy link
Contributor Author

I fixed the test failure and added @pickfire's suggestion.

I don't understand why bors timed out. Was it spurious?
Can I get another perf run?

@Xanewok
Copy link
Member

Xanewok commented Oct 7, 2020

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@bors
Copy link
Collaborator

bors commented Oct 7, 2020

⌛ Trying commit 2663c35bf86871e1849400eca6417c1fed02990c with merge b468e924ec58d34e078526586335258b57deb9a8...

@Julian-Wollersberger
Copy link
Contributor Author

Julian-Wollersberger commented Oct 7, 2020

I noticed a potential bug in eat_while(): it doesn't account for the number of UTF8 bytes.
Fixed it by inlining it in the two places where the count is used and also simplified the logic there.
It might be a little perf improvement too.

@Xanewok can you restart the perf-run? EDIT: Nevermind. Sorry for bothering you.

@jyn514
Copy link
Member

jyn514 commented Oct 7, 2020

@Julian-Wollersberger bug fixes don't need a perf run IMO, unless they strongly affect perf.

@jyn514 jyn514 added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 7, 2020
@Julian-Wollersberger
Copy link
Contributor Author

I guess rust-timer got stuck. @Mark-Simulacrum do you know what happened here and can you restart the perf run?
Maybe rust-timer got confused because bors timed out?

@jyn514
Copy link
Member

jyn514 commented Oct 8, 2020

@rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion

@jyn514
Copy link
Member

jyn514 commented Oct 8, 2020

🤦 I think it got confused because you pushed more commits maybe?

@bors try

@bors
Copy link
Collaborator

bors commented Oct 8, 2020

⌛ Trying commit 8caa6072eef69d6f4098ed7a04f2e674c22dbe57 with merge 85f2a07ff3f2a86b9ef4ebe097dc23656933e80b...

@bors
Copy link
Collaborator

bors commented Oct 8, 2020

☀️ Try build successful - checks-actions, checks-azure
Build commit: 85f2a07ff3f2a86b9ef4ebe097dc23656933e80b (85f2a07ff3f2a86b9ef4ebe097dc23656933e80b)

@rust-timer
Copy link
Collaborator

Queued 85f2a07ff3f2a86b9ef4ebe097dc23656933e80b with parent f1dab24, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (85f2a07ff3f2a86b9ef4ebe097dc23656933e80b): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

…er of UTF8 bytes.

Fixed it by inlining it in the two places where the count is used and simplified the logic there.
@Julian-Wollersberger
Copy link
Contributor Author

The perf results couldn't be more neutral. Oh well.
I've removed the first two commits and only the bugfix remains. This can be rollup=always I guess.

Changing usize to u32 might still be worth it as a cleanup so rustc_lexer is more consistent with the rest of rustc. Opinions?
@varkor this is ready for review again.

@pickfire
Copy link
Contributor

pickfire commented Oct 9, 2020

Oh, there doesn't seemed to be much performance gain but I wonder if memory footprint is reduced? From what I see the memory footprint is not tracked in the benchmarks.

@Mark-Simulacrum
Copy link
Member

Please update the PR description and title if you significantly change the pull request, right now it is not clear that this is a bugfix. I would also expect to see a test case if there is a bug being fixed here.

@mati865
Copy link
Contributor

mati865 commented Oct 9, 2020

Oh, there doesn't seemed to be much performance gain but I wonder if memory footprint is reduced? From what I see the memory footprint is not tracked in the benchmarks.

It's noisy as always but in terms of the memory it's neutral/slightly worse.

@Julian-Wollersberger Julian-Wollersberger changed the title Reduce size of lexer::Token from 72 to 16 bytes Cleanup of eat_while() in lexer Oct 10, 2020
@Julian-Wollersberger
Copy link
Contributor Author

Julian-Wollersberger commented Oct 10, 2020

Please update the PR description and title if you significantly change the pull request

Ok, makes sense. Done.

I would also expect to see a test case if there is a bug being fixed here.

Sorry, I wasn't precise. It could have been a bug if the count returned by eat_while() was used in a place where non-ASCII chars can be, but luckily it was only used to count the '#' in raw strings, which are ASCII chars.
So it's not really a bugfix but a cleanup to prevent a potential bug.

@Julian-Wollersberger
Copy link
Contributor Author

From what I see the memory footprint is not tracked in the benchmarks.

@pickfire At the bottom of the page you can select "max-rss" and click "submit" at the right edge of the page.
(There is PR rust-lang/rustc-perf#782 to improve the design.)

@varkor
Copy link
Member

varkor commented Oct 10, 2020

@bors r+ rollup

@bors
Copy link
Collaborator

bors commented Oct 10, 2020

📌 Commit bd49ded has been approved by varkor

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 10, 2020
JohnTitor added a commit to JohnTitor/rust that referenced this pull request Oct 10, 2020
…rError, r=varkor

Cleanup of `eat_while()` in lexer

The size of a lexer Token was inflated by the largest `TokenKind` variants `LiteralKind::RawStr` and `RawByteStr`, because
* it used `usize` although `u32` is sufficient in rustc, since crates must be smaller than 4GB,
* and it stored the 20 bytes big `RawStrError` enum for error reporting.

If a raw string is invalid, it now needs to be reparsed to get the `RawStrError` data, but that is a very cold code path.

Technically this breaks other tools that depend on rustc_lexer because they are now also restricted to a max file size of 4GB. But this shouldn't matter in practice, and rustc_lexer isn't stable anyway.

Can I also get a perf run?

Edit: This makes no difference in performance. The PR now only contains a small cleanup.
bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 10, 2020
Rollup of 10 pull requests

Successful merges:

 - rust-lang#77195 (Link to documentation-specific guidelines.)
 - rust-lang#77629 (Cleanup of `eat_while()` in lexer)
 - rust-lang#77709 (Link Vec leak doc to Box)
 - rust-lang#77738 (fix __rust_alloc_error_handler comment)
 - rust-lang#77748 (Dead code cleanup in windows-gnu std)
 - rust-lang#77754 (Add TraitDef::find_map_relevant_impl)
 - rust-lang#77766 (Clarify the debug-related values should take boolean)
 - rust-lang#77777 (doc: disambiguate stat in MetadataExt::as_raw_stat)
 - rust-lang#77782 (Fix typo in error code description)
 - rust-lang#77787 (Update `changelog-seen` in config.toml.example)

Failed merges:

r? `@ghost`
@bors bors merged commit c14c9ba into rust-lang:master Oct 10, 2020
@rustbot rustbot added this to the 1.49.0 milestone Oct 10, 2020
nnethercote added a commit to nnethercote/rust that referenced this pull request Jul 31, 2022
From 72 bytes to 12 bytes (on x86-64).

There are two parts to this:
- Changing various source code offsets from 64-bit to 32-bit. This is
  not a problem because the rest of rustc also uses 32-bit source code
  offsets. This means `Token` is no longer `Copy` but this causes no
  problems.
- Removing the `RawStrError` from `LiteralKind`. Raw string literal
  invalidity is now indicated by a `None` value within
  `RawStr`/`RawByteStr`, and the new `validate_raw_str` function can be
  used to re-lex an invalid raw string literal to get the `RawStrError`.

There is one very small change in behaviour. Previously, if a raw string
literal matched both the `InvalidStarter` and `TooManyHashes` cases,
the latter would override the former. This has now changed, because
`raw_double_quoted_string` now uses `?` and so returns immediately upon
detecting the `InvalidStarter` case. I think this is a slight
improvement to report the earlier-detected error, and it explains the
change in the `test_too_many_hashes` test.

The commit also removes a couple of comments that refer to rust-lang#77629 and
say that the size of these types don't affect performance. These
comments are wrong, though the performance effect is small.
flip1995 pushed a commit to flip1995/rust that referenced this pull request Aug 11, 2022
From 72 bytes to 12 bytes (on x86-64).

There are two parts to this:
- Changing various source code offsets from 64-bit to 32-bit. This is
  not a problem because the rest of rustc also uses 32-bit source code
  offsets. This means `Token` is no longer `Copy` but this causes no
  problems.
- Removing the `RawStrError` from `LiteralKind`. Raw string literal
  invalidity is now indicated by a `None` value within
  `RawStr`/`RawByteStr`, and the new `validate_raw_str` function can be
  used to re-lex an invalid raw string literal to get the `RawStrError`.

There is one very small change in behaviour. Previously, if a raw string
literal matched both the `InvalidStarter` and `TooManyHashes` cases,
the latter would override the former. This has now changed, because
`raw_double_quoted_string` now uses `?` and so returns immediately upon
detecting the `InvalidStarter` case. I think this is a slight
improvement to report the earlier-detected error, and it explains the
change in the `test_too_many_hashes` test.

The commit also removes a couple of comments that refer to rust-lang#77629 and
say that the size of these types don't affect performance. These
comments are wrong, though the performance effect is small.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
A-parser Area: The lexing & parsing of Rust source code to an AST I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.