Skip to content

A case of compound x86_64 performance regression caused by LLVM 20 and #124810 #139730

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
MoSal opened this issue Apr 13, 2025 · 3 comments
Open
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such P-high High priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@MoSal
Copy link

MoSal commented Apr 13, 2025

Code

I tried this code:

use std::{iter::Peekable, str::CharIndices};
use std::sync::LazyLock as Lazy;

use memchr::memmem::Finder;
use itertools::Itertools;


struct Foo<'a> {
    txt: &'a str,
    ci: Peekable<CharIndices<'a>>,
}

impl<'a> Foo<'a> {
    fn new(txt: &'a str) -> Self {
        let ci = txt.char_indices().peekable();
        Self { txt, ci }
    }

    fn next(&mut self) -> Option<(usize, usize)> {
        static FI: Lazy<Finder> = Lazy::new(|| Finder::new(b"\n"));

        // XXX: switching between these two lines makes a difference
        let &(start, _) = self.ci.peek()?;
        //let start = self.ci.peek().map(|&(idx, _)| idx)?;
        
        let matches = FI
            .find_iter(&self.txt.as_bytes()[start..])
            .map(|idx| idx + start);

        for next_match in matches {
            if self.txt.is_char_boundary(next_match) {
                self.ci.by_ref()
                    //.take_while(|&(idx, _)| idx < next_match)
                    .peeking_take_while(|&(idx, _)| idx <= next_match)
                    .for_each(drop);
                return Some((start, next_match))
            }
        }

        self.ci.by_ref().for_each(drop);
        Some((start, self.txt.len()))
    }
}

fn main() {
    let s = "1".repeat(20_000) + "\n";
    let sn = s.repeat(200_000);
    let mut v = Foo::new(&sn);
    let mut p = Vec::with_capacity(200_000);
    while let Some(r) = v.next() {
        p.push(r);
    }
}

Cargo.toml

[package]
name = "tt-fluctuations"
version = "0.1.0"
edition = "2024"

[dependencies]
itertools = "0.14.0"
memchr = "2.7.4"

[profile.release]
codegen-units = 1
lto = true
strip = true
Cargo.lock

# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4

[[package]]
name = "either"
version = "1.15.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "48c757948c5ede0e46177b7add2e67155f70e33c07fea8284df6576da70b3719"

[[package]]
name = "itertools"
version = "0.14.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2b192c782037fadd9cfa75548310488aabdbf3d2da73885b31bd0abd03351285"
dependencies = [
 "either",
]

[[package]]
name = "memchr"
version = "2.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "78ca9ab1a0babb1e7d5695e3530886289c18cf2f87ec19a575a0abdce112e3a3"

[[package]]
name = "tt-fluctuations"
version = "0.1.0"
dependencies = [
 "itertools",
 "memchr",
]

Commits used to test:

Besides the two direct regressions. The fact that alternating between these two lines may cause
a big difference in performance is curious:

let &(start, _) = self.ci.peek()?; // called without_map in bench
let start = self.ci.peek().map(|&(idx, _)| idx)?; // with_map

But maybe that's a different issue.

Summary of hyperfine run:

Summary
  ./2162e9d_with_map ran
    1.06 ± 0.01 times faster than ./ce36a96_with_map
    1.12 ± 0.01 times faster than ./2162e9d_without_map
    1.38 ± 0.01 times faster than ./934880f586f_with_map
    1.66 ± 0.01 times faster than ./ce36a96_without_map
    1.91 ± 0.01 times faster than ./934880f586f_without_map

Full numbers

Benchmark 1: ./2162e9d_with_map
  Time (mean ± σ):      3.894 s ±  0.014 s    [User: 3.202 s, System: 0.682 s]
  Range (min … max):    3.876 s …  3.909 s    4 runs

Benchmark 2: ./2162e9d_without_map
  Time (mean ± σ):      4.380 s ±  0.034 s    [User: 3.686 s, System: 0.683 s]
  Range (min … max):    4.346 s …  4.419 s    4 runs

Benchmark 3: ./934880f586f_with_map
  Time (mean ± σ):      5.388 s ±  0.020 s    [User: 4.702 s, System: 0.671 s]
  Range (min … max):    5.369 s …  5.415 s    4 runs

Benchmark 4: ./934880f586f_without_map
  Time (mean ± σ):      7.448 s ±  0.018 s    [User: 6.758 s, System: 0.666 s]
  Range (min … max):    7.423 s …  7.465 s    4 runs

Benchmark 5: ./ce36a96_with_map
  Time (mean ± σ):      4.128 s ±  0.016 s    [User: 3.448 s, System: 0.666 s]
  Range (min … max):    4.111 s …  4.149 s    4 runs

Benchmark 6: ./ce36a96_without_map
  Time (mean ± σ):      6.448 s ±  0.007 s    [User: 5.734 s, System: 0.696 s]
  Range (min … max):    6.441 s …  6.457 s    4 runs

Summary
  ./2162e9d_with_map ran
    1.06 ± 0.01 times faster than ./ce36a96_with_map
    1.12 ± 0.01 times faster than ./2162e9d_without_map
    1.38 ± 0.01 times faster than ./934880f586f_with_map
    1.66 ± 0.01 times faster than ./ce36a96_without_map
    1.91 ± 0.01 times faster than ./934880f586f_without_map

So building with current nightly and

let &(start, _) = self.ci.peek()?;

is almost twice as slow than building with stable and using

let start = self.ci.peek().map(|&(idx, _)| idx)?;

Version it worked on

It most recently worked on: 2162e9d

Version with regression

First regression

rustc --version --verbose:

rustc 1.87.0-nightly (ce36a966c 2025-02-17)
binary: rustc
commit-hash: ce36a966c79e109dabeef7a47fe68e5294c6d71e
commit-date: 2025-02-17
host: x86_64-unknown-linux-gnu
release: 1.87.0-nightly
LLVM version: 20.1.0

Second regression

rustc --version --verbose:

rustc 1.88.0-nightly (934880f58 2025-04-09)
binary: rustc
commit-hash: 934880f586f6ac1f952c7090e2a943fcd7775e7b
commit-date: 2025-04-09
host: x86_64-unknown-linux-gnu
release: 1.88.0-nightly
LLVM version: 20.1.2
@MoSal MoSal added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Apr 13, 2025
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Apr 13, 2025
@jieyouxu jieyouxu added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such and removed C-bug Category: This is a bug. labels Apr 13, 2025
@apiraino
Copy link
Contributor

Assigning priority (discussion on Zulip).

Possibly related to #139370

@rustbot label -I-prioritize +P-high

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Apr 14, 2025
@jieyouxu jieyouxu removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Apr 14, 2025
@lincot
Copy link
Contributor

lincot commented Apr 19, 2025

A quick run on my x86_64 comparing nightly-04-09 (d4f880f) and nightly-04-10 (934880f) showed
no difference without using map, but curiously a 1.76x slowdown with map:

Summary
  ./stable_with_map ran
    1.00 ± 0.02 times faster than ./2025-04-09_with_map
    1.00 ± 0.03 times faster than ./stable_without_map
    1.21 ± 0.02 times faster than ./2025-04-10_without_map
    1.22 ± 0.02 times faster than ./2025-04-09_without_map
    1.77 ± 0.04 times faster than ./2025-04-10_with_map
Full results
Benchmark 1: ./stable_with_map
  Time (mean ± σ):      2.480 s ±  0.044 s    [User: 2.290 s, System: 0.182 s]
  Range (min … max):    2.430 s …  2.519 s    4 runs

Benchmark 2: ./stable_without_map
  Time (mean ± σ):      2.489 s ±  0.043 s    [User: 2.308 s, System: 0.173 s]
  Range (min … max):    2.437 s …  2.525 s    4 runs

Benchmark 3: ./2025-04-09_with_map
  Time (mean ± σ):      2.486 s ±  0.034 s    [User: 2.309 s, System: 0.170 s]
  Range (min … max):    2.442 s …  2.520 s    4 runs

Benchmark 4: ./2025-04-09_without_map
  Time (mean ± σ):      3.030 s ±  0.007 s    [User: 2.854 s, System: 0.167 s]
  Range (min … max):    3.019 s …  3.035 s    4 runs

Benchmark 5: ./2025-04-10_with_map
  Time (mean ± σ):      4.381 s ±  0.039 s    [User: 4.198 s, System: 0.169 s]
  Range (min … max):    4.335 s …  4.413 s    4 runs

Benchmark 6: ./2025-04-10_without_map
  Time (mean ± σ):      3.005 s ±  0.023 s    [User: 2.817 s, System: 0.178 s]
  Range (min … max):    2.981 s …  3.033 s    4 runs

I don't see how #124810 could cause this, though.

@MoSal
Copy link
Author

MoSal commented Apr 19, 2025

@lincot

Interesting. Maybe it's another case of phantom behavior from LLVM 20!

I just redid my original full hyperfine run which included 48f89e7:

  ./2162e9d_with_map ran
    1.04 ± 0.01 times faster than ./ce36a96_with_map
    1.05 ± 0.01 times faster than ./48f89e76596_with_map
    1.10 ± 0.01 times faster than ./2162e9d_without_map
    1.32 ± 0.01 times faster than ./934880f586f_with_map
    1.54 ± 0.01 times faster than ./ce36a96_without_map
    1.55 ± 0.01 times faster than ./48f89e76596_without_map
    1.76 ± 0.02 times faster than ./934880f586f_without_map

The difference between 48f89e7 and 934880f is what lead me to believe that #124810 is the (supposed) culprit.

For reference: Tests were run on a machine with i7-7700K CPU with frequency locked at 4GHz.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such P-high High priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants