Jemalloc performance on 64-bit ARM #34476
So, what happens if we run well-optimized …
Ouch! EDIT: …
What precisely are you running, and what do the three numbers represent? All I can find is https://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=rust&id=1, but the output is not similar to yours. (Regarding the armv7 case: it's actually not unheard of for a 32-bit version of a program to be faster than the 64-bit version on 64-bit hardware. The reason is that the pointers are smaller -> data structures are smaller -> more of them fit in cache. Obviously this is highly workload-dependent.)
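As a rough illustration of that pointer-size effect (a hypothetical sketch, not code from this thread), the per-node footprint of a binary_trees-style node can be checked directly:

```rust
// A node shaped like the ones the binary_trees benchmark allocates:
// two owning pointers and nothing else.
struct Node {
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn main() {
    // Option<Box<T>> is pointer-sized (null-pointer niche), so this prints
    // 16 on a 64-bit target and 8 on a 32-bit target.
    println!("Node size: {} bytes", std::mem::size_of::<Node>());
}
```

A 32-bit (armv7) build therefore fits roughly twice as many such nodes into the same cache footprint as a 64-bit (aarch64) build.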
On Sun, 26 Jun 2016 01:09:27 -0700:
Those were the timings.
Yes, but the relative difference, as I'd mentioned in the opening comment, was very small, which means there's also a factor of LLVM backend maturity.
What do the …
Nice trick! Now it's … Thanks to your tweak, the …
I'd be in favor of turning jemalloc off everywhere except where it's already proven to be a win. Or everywhere, period.
@brson Now that I've built rust on two different ARM architectures with …, the current disable switch makes it impossible to use jemalloc on a per-crate basis, like this:

    #![feature(alloc_jemalloc)]
    extern crate alloc_jemalloc;

Or more simply …
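For context, a complete crate using that per-crate opt-in might have looked like the following on a nightly of that period (a sketch assuming the old alloc_jemalloc feature gate; later Rust removed it in favour of the #[global_allocator] attribute, so this will not build on modern toolchains):

```rust
// Hypothetical minimal crate opting into jemalloc explicitly (2016-era nightly).
#![feature(alloc_jemalloc)]
extern crate alloc_jemalloc;

fn main() {
    // Heap allocations in this crate now go through jemalloc.
    let v: Vec<Box<u32>> = (0..1_000).map(Box::new).collect();
    println!("allocated {} boxed values", v.len());
}
```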
sgtm |
The following news makes this issue much less interesting. Who knows what effect DVFS (dynamic voltage and frequency scaling) has under different loads.
I've just run the binary_trees benchmark on an ARMv8 (Cortex-A53) processor, having converted an Android TV box to Linux. I'd found previously, on a much weaker (but more power-efficient) armv7 Cortex-A5, that the results were equal. On the new machine (using the latest official aarch64 rustc nightly), ./binary_trees 23 produces the following results:

    sysalloc   1m28s   5m10s   0m10s
    jemalloc   1m35s   5m10s   0m53s

which is palpably worse, actually, even though the Cortex-A53 is a much stronger core.

I'm beginning to think jemalloc only makes sense on Intel processors with heaps of L1/L2 cache. More benchmark ideas welcome, though.
Added retroactively:
To reproduce, unpack the attachment and run … inside the binary_trees directory. Uncomment the first 2 lines in main.rs to produce a sysalloc version.
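The attachment isn't included here, so this is an assumption: the two commented-out lines were presumably the system-allocator opt-in of that era, along these lines:

```rust
// Presumed first two lines of the attachment's main.rs (2016-era nightly syntax);
// leaving them uncommented selects the platform allocator instead of jemalloc.
#![feature(alloc_system)]
extern crate alloc_system;

fn main() {
    // The binary_trees benchmark body would follow here.
}
```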