Cleanup of `eat_while()` in lexer #77629

Julian-Wollersberger · 2020-10-06T21:07:01Z

The size of a lexer `Token` was inflated by the largest `TokenKind` variants `LiteralKind::RawStr` and `RawByteStr`, because

* it used `usize` although `u32` is sufficient in rustc, since crates must be smaller than 4GB,
* and it stored the 20 bytes big `RawStrError` enum for error reporting.

If a raw string is invalid, it now needs to be reparsed to get the RawStrError data, but that is a very cold code path.

Technically this breaks other tools that depend on rustc_lexer because they are now also restricted to a max file size of 4GB. But this shouldn't matter in practice, and rustc_lexer isn't stable anyway.
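To illustrate the size argument above, here is a minimal sketch that compares two enum layouts with `std::mem::size_of`. The types `WideError`, `WideKind` and `NarrowKind` are hypothetical stand-ins invented for this example, not the actual `rustc_lexer` definitions, and the exact sizes printed depend on the target and compiler version.

```rust
use std::mem::size_of;

// Hypothetical error payload stored inline in the token kind (not the real RawStrError).
#[allow(dead_code)]
enum WideError {
    InvalidStarter { bad_char: char },
    NoTerminator { expected: usize, found: usize, possible_terminator_offset: Option<usize> },
    TooManyDelimiters { found: usize },
}

// Hypothetical "before" layout: pointer-sized counter plus the inline error.
#[allow(dead_code)]
enum WideKind {
    Ident,
    RawStr { n_hashes: usize, err: Option<WideError> },
}

// Hypothetical "after" layout: 32-bit counter, and `None` marks an invalid raw string.
#[allow(dead_code)]
enum NarrowKind {
    Ident,
    RawStr { n_hashes: Option<u32> },
}

fn main() {
    // An enum is at least as large as its largest variant,
    // so one bulky variant inflates every value of the type.
    println!("wide:   {} bytes", size_of::<WideKind>());
    println!("narrow: {} bytes", size_of::<NarrowKind>());
}
```

The narrow variant drops the error payload entirely, which is why an invalid raw string has to be looked at a second time to recover the error details.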

Can I also get a perf run?

Edit: This makes no difference in performance. The PR now only contains a small cleanup.

rust-highfive · 2020-10-06T21:07:04Z

r? @varkor

(rust_highfive has picked a reviewer for you, use r? to override)

jonas-schievink · 2020-10-06T21:08:31Z

@bors try @rust-timer queue

rust-timer · 2020-10-06T21:08:32Z

Awaiting bors try build completion

bors · 2020-10-06T21:08:43Z

⌛ Trying commit 2cc461ba86901295d9b706d7e39a15e3d4efa6cf with merge 42c8db8af9dcb6f65f03cbbdf104da7ed2fc2018...

bors · 2020-10-07T01:23:43Z

💥 Test timed out

Julian-Wollersberger · 2020-10-07T07:27:01Z

I fixed the test failure and added @pickfire's suggestion.

I don't understand why bors timed out. Was it spurious?
Can I get another perf run?

Xanewok · 2020-10-07T09:42:47Z

@bors try @rust-timer queue

rust-timer · 2020-10-07T09:42:49Z

Awaiting bors try build completion

bors · 2020-10-07T09:43:01Z

⌛ Trying commit 2663c35bf86871e1849400eca6417c1fed02990c with merge b468e924ec58d34e078526586335258b57deb9a8...

Julian-Wollersberger · 2020-10-07T11:21:43Z

I noticed a potential bug in `eat_while()`: the count it returns doesn't account for the number of UTF-8 bytes.
I fixed it by inlining it in the two places where the count is used and also simplified the logic there.
It might be a small perf improvement too.
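A minimal sketch of the mismatch, assuming the counter is incremented once per `char`: the hypothetical helper `eat_while_counts` below (not the real `rustc_lexer` function) tracks both the number of chars and the number of UTF-8 bytes consumed, and the two only agree for ASCII input.

```rust
/// Hypothetical helper: consumes the prefix of `input` matching `pred` and
/// returns `(chars_eaten, bytes_eaten)`.
fn eat_while_counts(input: &str, pred: impl Fn(char) -> bool) -> (usize, usize) {
    let mut n_chars = 0;
    let mut n_bytes = 0;
    for c in input.chars() {
        if !pred(c) {
            break;
        }
        n_chars += 1;
        // A char occupies 1 to 4 bytes when encoded as UTF-8.
        n_bytes += c.len_utf8();
    }
    (n_chars, n_bytes)
}

fn main() {
    // ASCII only: char count and byte count agree.
    assert_eq!(eat_while_counts("###\"", |c| c == '#'), (3, 3));
    // Three two-byte letters: using the char count as a byte offset would be wrong.
    assert_eq!(eat_while_counts("\u{e4}\u{f6}\u{fc}!", |c| c.is_alphabetic()), (3, 6));
}
```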

~~@Xanewok can you restart the perf-run?~~ EDIT: Nevermind. Sorry for bothering you.

jyn514 · 2020-10-07T14:01:10Z

@Julian-Wollersberger bug fixes don't need a perf run IMO, unless they strongly affect perf.

Julian-Wollersberger · 2020-10-08T13:12:05Z

I guess rust-timer got stuck. @Mark-Simulacrum do you know what happened here and can you restart the perf run?
Maybe rust-timer got confused because bors timed out?

jyn514 · 2020-10-08T13:17:16Z

@rust-timer queue

rust-timer · 2020-10-08T13:17:18Z

Awaiting bors try build completion

jyn514 · 2020-10-08T13:17:50Z

🤦 I think it got confused because you pushed more commits maybe?

@bors try

bors · 2020-10-08T13:18:11Z

⌛ Trying commit 8caa6072eef69d6f4098ed7a04f2e674c22dbe57 with merge 85f2a07ff3f2a86b9ef4ebe097dc23656933e80b...

bors · 2020-10-08T14:03:59Z

☀️ Try build successful - checks-actions, checks-azure
Build commit: 85f2a07ff3f2a86b9ef4ebe097dc23656933e80b (85f2a07ff3f2a86b9ef4ebe097dc23656933e80b)

rust-timer · 2020-10-08T14:04:01Z

Queued 85f2a07ff3f2a86b9ef4ebe097dc23656933e80b with parent f1dab24, future comparison URL.

rust-timer · 2020-10-08T16:57:55Z

Finished benchmarking try commit (85f2a07ff3f2a86b9ef4ebe097dc23656933e80b): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

Julian-Wollersberger · 2020-10-09T09:23:34Z

The perf results couldn't be more neutral. Oh well.
I've removed the first two commits and only the bugfix remains. This can be rollup=always I guess.

Changing usize to u32 might still be worth it as a cleanup so rustc_lexer is more consistent with the rest of rustc. Opinions?
@varkor this is ready for review again.

pickfire · 2020-10-09T15:20:17Z

Oh, there doesn't seem to be much performance gain, but I wonder if the memory footprint is reduced? From what I see the memory footprint is not tracked in the benchmarks.

Mark-Simulacrum · 2020-10-09T15:24:44Z

Please update the PR description and title if you significantly change the pull request, right now it is not clear that this is a bugfix. I would also expect to see a test case if there is a bug being fixed here.

mati865 · 2020-10-09T17:08:17Z

> Oh, there doesn't seem to be much performance gain, but I wonder if the memory footprint is reduced? From what I see the memory footprint is not tracked in the benchmarks.

It's noisy as always but in terms of the memory it's neutral/slightly worse.

Julian-Wollersberger · 2020-10-10T09:33:02Z

> Please update the PR description and title if you significantly change the pull request

Ok, makes sense. Done.

> I would also expect to see a test case if there is a bug being fixed here.

Sorry, I wasn't precise. It could have been a bug if the count returned by eat_while() was used in a place where non-ASCII chars can be, but luckily it was only used to count the '#' in raw strings, which are ASCII chars.
So it's not really a bugfix but a cleanup to prevent a potential bug.
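As a rough illustration of the inlined counting, here is a sketch with a toy `Cursor` type. The names `Cursor`, `first`, `bump` and `count_hashes` are assumptions made for this example and the real `rustc_lexer` cursor differs in its details. Because the loop only ever matches `'#'`, one increment per iteration is also one byte.

```rust
use std::str::Chars;

/// Toy cursor over a string (hypothetical, much simpler than rustc_lexer's).
struct Cursor<'a> {
    chars: Chars<'a>,
}

impl<'a> Cursor<'a> {
    fn new(input: &'a str) -> Self {
        Cursor { chars: input.chars() }
    }

    /// Peeks at the next char without consuming it.
    fn first(&self) -> Option<char> {
        self.chars.clone().next()
    }

    /// Consumes the next char.
    fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }
}

/// Counts the `#` delimiters at the cursor position, with the loop written
/// out at the use site instead of relying on a generic counting helper.
fn count_hashes(cursor: &mut Cursor<'_>) -> u32 {
    let mut n_hashes = 0;
    while cursor.first() == Some('#') {
        cursor.bump();
        n_hashes += 1;
    }
    n_hashes
}

fn main() {
    // The text that follows the `r` of a raw string literal.
    let mut cursor = Cursor::new("##\"text\"##");
    assert_eq!(count_hashes(&mut cursor), 2);
    // The cursor now sits on the opening quote.
    assert_eq!(cursor.first(), Some('"'));
}
```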

Julian-Wollersberger · 2020-10-10T09:39:19Z

> From what I see the memory footprint is not tracked in the benchmarks.

@pickfire At the bottom of the page you can select "max-rss" and click "submit" at the right edge of the page.
(There is PR rust-lang/rustc-perf#782 to improve the design.)

varkor · 2020-10-10T13:29:44Z

@bors r+ rollup

bors · 2020-10-10T13:29:45Z

📌 Commit bd49ded has been approved by varkor

Rollup of 10 pull requests

Successful merges:

- rust-lang#77195 (Link to documentation-specific guidelines.)
- rust-lang#77629 (Cleanup of `eat_while()` in lexer)
- rust-lang#77709 (Link Vec leak doc to Box)
- rust-lang#77738 (fix __rust_alloc_error_handler comment)
- rust-lang#77748 (Dead code cleanup in windows-gnu std)
- rust-lang#77754 (Add TraitDef::find_map_relevant_impl)
- rust-lang#77766 (Clarify the debug-related values should take boolean)
- rust-lang#77777 (doc: disambiguate stat in MetadataExt::as_raw_stat)
- rust-lang#77782 (Fix typo in error code description)
- rust-lang#77787 (Update `changelog-seen` in config.toml.example)

Failed merges:

r? `@ghost`

From 72 bytes to 12 bytes (on x86-64). There are two parts to this:

- Changing various source code offsets from 64-bit to 32-bit. This is not a problem because the rest of rustc also uses 32-bit source code offsets. This means `Token` is no longer `Copy` but this causes no problems.
- Removing the `RawStrError` from `LiteralKind`. Raw string literal invalidity is now indicated by a `None` value within `RawStr`/`RawByteStr`, and the new `validate_raw_str` function can be used to re-lex an invalid raw string literal to get the `RawStrError`.

There is one very small change in behaviour. Previously, if a raw string literal matched both the `InvalidStarter` and `TooManyHashes` cases, the latter would override the former. This has now changed, because `raw_double_quoted_string` now uses `?` and so returns immediately upon detecting the `InvalidStarter` case. I think this is a slight improvement to report the earlier-detected error, and it explains the change in the `test_too_many_hashes` test.

The commit also removes a couple of comments that refer to rust-lang#77629 and say that the size of these types don't affect performance. These comments are wrong, though the performance effect is small.
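The "lex cheaply, re-lex only on error" pattern described in that commit message could look roughly like the sketch below. Everything here is hypothetical: `RawStrProblem`, `lex_raw_str` and `explain_raw_str` are names invented for illustration, the input is assumed to be the text right after the leading `r`, and the logic is far simpler than the real `validate_raw_str`.

```rust
/// Hypothetical error type, only built on the cold path.
#[derive(Debug, PartialEq)]
enum RawStrProblem {
    InvalidStarter { bad_char: Option<char> },
    NoTerminator { expected: u32 },
}

/// Cold path: walks the literal again and reports what is wrong with it.
fn explain_raw_str(after_r: &str) -> Result<u32, RawStrProblem> {
    // '#' is a single byte, so the char count is also a valid byte offset.
    let n_hashes = after_r.chars().take_while(|&c| c == '#').count();
    let mut chars = after_r[n_hashes..].chars();
    match chars.next() {
        Some('"') => {}
        other => return Err(RawStrProblem::InvalidStarter { bad_char: other }),
    }
    // The literal ends at a quote followed by the same number of hashes.
    let terminator = format!("\"{}", "#".repeat(n_hashes));
    if chars.as_str().contains(terminator.as_str()) {
        Ok(n_hashes as u32)
    } else {
        Err(RawStrProblem::NoTerminator { expected: n_hashes as u32 })
    }
}

/// Hot path: the token only needs the hash count, with `None` meaning invalid.
fn lex_raw_str(after_r: &str) -> Option<u32> {
    explain_raw_str(after_r).ok()
}

fn main() {
    assert_eq!(lex_raw_str("#\"abc\"#"), Some(1));
    // An invalid literal yields `None`; the details are recovered by re-lexing.
    assert_eq!(lex_raw_str("#\"abc\""), None);
    assert_eq!(
        explain_raw_str("#\"abc\""),
        Err(RawStrProblem::NoTerminator { expected: 1 })
    );
}
```

In this toy version both paths share one function for brevity. The point is only that the token carries an `Option<u32>` and the richer error is computed on demand.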

rust-highfive assigned varkor Oct 6, 2020

rust-highfive added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) label Oct 6, 2020

camelid added the A-parser (Area: The lexing & parsing of Rust source code to an AST) and I-compilemem (Issue: Problems and improvements with respect to memory usage during compilation.) labels Oct 6, 2020