-
Notifications
You must be signed in to change notification settings - Fork 11.4k
Implement get_num_physical_cores() for Windows #1278
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
base: master
Are you sure you want to change the base?
Conversation
physical_cores += info->Processor.GroupCount; | ||
|
||
for (WORD i = 0; i < info->Processor.GroupCount; ++i) { | ||
int core_count = static_cast<int>(__popcnt64(info->Processor.GroupMask[i].Mask)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will __popcnt64
work on other compilers such as mingw? Sadly, std::popcount
is C++20.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know about mingw. It sounds like it didn't work on older versions of it anyway.
__popcnt64
is used here for counting the logical cores, which I believe to be unnecessary. I can rip out all the logical core stuff if it's working on P/E systems since we only care about the physical cores count on modern CPUs. I have yet to see a single case where hyperthreading helps llama.cpp performance.
For the record: macOS code path already does that (take only performance cores into account when this info is available). |
As explained in this comment #842 (comment) The proper way to detect whether you are on the Intel P/E architecture is to use the corresponding machine instructions. The example function was posted in a comment in the same thread. // 1 = P core
// 0 = E core
// -1 = fail / not P/E arch
inline int is_thread_on_p_core(void) {
static unsigned const char a[] = {0x31,0xC9,0xB8,0x1A,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xE8,0x18,0x83,0xF8,0x40,0x0F,0x85,0x06,0x00,0x00,0x00,0xB8,0x01,0x00,0x00,0x00,0xC3,0x83,0xE8,0x20,0xF7,0xD8,0x19,0xC0,0xC3};
return ((int (*)(void)) (void*)((void*)a))();
}
// 1 = hybrid x86 cpu
inline int is_x86_hybrid_cpu(void) {
static unsigned const char a[] = {0x31,0xC9,0xB8,0x07,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xEA,0x0F,0x83,0xE2,0x01,0x89,0xD0,0xC3};
return ((int (*)(void)) (void*)((void*)a))();
}
// 1 = intel 12th/13th gen cpu
inline int is_intel_p_e_core_arch(void) {
return (is_x86_hybrid_cpu() && (is_thread_on_p_core() >= 0));
} So testing for When new processors are released and new features added along with new instructions, any compiler that came before it cannot possibly be aware of those instructions. So the cutting-edge stuff will always have spotty standard library, platform and compiler support. The beauty of machine code is that it is completely insert-anything-here-agnostic, as the instructions are set in stone. Yes, they will be expanded upon, but the previous instructions will keep working no matter what platform or planet you are on or the iteration of a compiler you are working with. It will simply just work™ , no matter what. By going directly to the source of the information, removing those layers of abstractions, you will also reduce your maintainability burden, as there is no longer any need for different codepaths for different platforms, finding out if the user is on a recent enough platform/kernel version, has the correct versions of the required standard libraries and also happens to have a recent compiler version to even have a support for them. The same goes for detecting the number of logical/physical cores, which can be tested by checking the SMT/HTT bits and the core bits / logical bits. Implement it correctly, in machine code, once, to never need to maintain it again or deal with any library/platform/compiler issues. Everything can be literally found in the good ol' RTFM. The relevant bits of info can be found right here: Intel ® Architecture Instruction Set Extensions and Future Features "Chapter 1.5 CPUID Instruction" AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions "Appendix D: Instruction Subsets and CPUID Feature Flags" Or you can probably just ask ChatGPT. |
@anzz1 Thanks for the information. You remind me of a friend in college who was big on bare metal programming. He was very passionate about getting every ounce out of his hardware, which was always the latest and greatest. I did notice those functions over at 842, but it seemed to me that they only tell you what our current thread is running on. Since we want the number of non-hyperthreaded, non-efficiency cores, I think we'd also have to figure out the mask for each of the non hyperthreaded cores and loop over it, setting our thread affinity to each (OS call?) and checking that we've switched over before calling As for this patch, I was just trying to help by filling in a |
For clarity of the code, I would very much prefer using the Windows API than executing a magic hex string. |
This. Plus |
You are correct. This is the stonewall I hit too while researching this, it's been a while now but I remember that a note buried somewhere in the docs ended with that you can only determine the P/E status of the currently running thread, with no further clarification. It doesn't help that a lot of the documentation is "old" regarding topology identification, like this one Intel® 64 Architecture Processor Topology Enumeration from January 2018. It certainly seems a dumb approach having to first figure out the amount of cores and their mask for locking the affinity and starting a thread for each one where you run the cpuid instruction to determine whether it's a P/E core and then count the total. But I could be wrong, as I couldn't find further information and there very well might be a way to determine them without using any such tricks. I stopped my research there as going by the documentation alone and not having the ability to experiment or validate anything without having the processor wasn't going to cut it. Definitely someone with the processor should test this code if it works, but I fear that the EfficiencyClass is not going to report the correct value and the P/E thread test mess is going to be needed.
Absolutely agree with you here, having to use a magic hex string is tremendously stupid. For that we can blame MSVC, as it doesn't support x86_64 assembly. Any half decent compiler like GCC obviously supports it. MSVC used to too, since VC6 back from 1998, on x86 it did and does support inline asm and there is even __declspec(naked) for writing straight assembly without any compiler decorations to mess with your stuff. For some inexplicable reason none of this was implemented for x86_64 (but was implemented for ARM). So the "magic hex string" is only there for MSVC. For every other compiler you could just write it in inline asm. As posted in the comment, such a magic string should obviously have the companying assembly written in a comment so it's not a magic string anymore. Another option is to use compiler-specific intrinsics like MSVC's __cpuid / __cpuidex or GCC's __get_cpuid. Those will translate to the same machine code but it's certainly easier to read and understand. Then again you'd have to write a case for each compiler separately. So yeah, a magic hex string is definitely stupid, obtuse and unreadable. But from the heap of garbage options, it seems to be the least moldy one.
Yeah this isn't strictly the same thing, the only reason I interjected was the Or maybe EfficiencyClass does actually work for determining P/E cores and this is solution works as-is, not platform-independent but certainly easier to read and understand than a magic string. One good thing WinAPI has going for it that if it works once it probably won't need any maintenance, as Microsoft is pretty great at supporting the api funcs for a long time. EDIT: On a second thought, someone probably already has solved this issue and there already exists x86_64 assembly code for both the physical/logical core count and P/E core count functions as a .S file which could be added to the makefiles, that would also be supported in MSVC via masm and GCC/Clang/others natively. |
This implements
get_num_physical_cores
on Windows. It usesGetLogicalProcessorInformationEx
to tease out the number of physical cores and the number of those that are efficiency class 0. This should also address #572 on Windows, but on Windows only. I think something similar will have to be done for the otherget_num_physical_cores
paths in other OSs.The only other thing this does it move the default param generation from
gpt_params_parse
togpt_print_usage
where it's used. Without thisgpt_params_parse
is called twice. Once for the instantiation inmain
and once for the instantiation ingpt_params_parse
. This isn't terrible but it also means get_num_physical_cores() is called twice no matter what.This works on all the Windows systems I tested it on, but none of mine have performance cores. According to the documentation, this should work in those cases as well but I would like to verify it with someone who has P/E cores.
This code also prints the number of cores found to stderr and that should be removed once we've verified it works with P/E core CPU systems. Preferably before we commit, but anything is fine.
I'll edit this to keep it up to date, this is the entirety of the logic with the extra debugging lines removed. No lines will need to be added or modified to remove that extra information, so it'll just be a bunch of
-
s.