
Experimental per-file cache locking #23854


Draft · wants to merge 5 commits into main

Conversation

dschuff (Member) commented Mar 6, 2025

Rather than locking the entire cache, use a lock for each cached file.
The lock is held while e.g. building the library, and only blocks other
processes that want the same library. (Other processes are not blocked
when they try to build separate libraries or variants).
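
For readers skimming the diff, here is a minimal, illustrative sketch of the idea (not the actual code in this PR): one inter-process lock per cached file, taken only around the creation of that file. It assumes a filelock-style FileLock class; cachedir, get_lock, and the example paths are placeholders.

from pathlib import Path
from filelock import FileLock  # assumption: a filelock-style inter-process lock is available

cachedir = Path.home() / '.emscripten_cache'   # illustrative cache root
cache_file_locks = {}                          # one lock object per cached file

def get_lock(cache_file):
  # cache_file is a path relative to the cache root,
  # e.g. 'sysroot/lib/wasm32-emscripten/libc.a'.
  if cache_file not in cache_file_locks:
    # Flatten the relative path into a single lock-file name under the cache root.
    key_name = '_'.join(Path(cache_file).parts) + '.lock'
    cache_file_locks[cache_file] = FileLock(str(cachedir / key_name))
  return cache_file_locks[cache_file]

def get(cache_file, creator):
  # Hold only the lock for this one library/variant; other processes building
  # different libraries or variants proceed in parallel.
  with get_lock(cache_file):
    full_path = cachedir / cache_file
    if not full_path.exists():
      full_path.parent.mkdir(parents=True, exist_ok=True)
      creator(full_path)   # e.g. compile and archive the system library
    return str(full_path)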

dschuff (Member, Author) commented Mar 6, 2025

Curious to see what you think of this. Experiments on my workstation and cloudtop show that it significantly speeds up running whole test suites (or even small numbers of tests, if multiple library compiles are required) when the cache isn't hot, on machines with lots of cores. I'd be interested in hearing whether this reflects how you all run tests locally.

It doesn't affect CircleCI times (since they run with frozen caches). One possible issue as currently written is that it doesn't respect EMCC_CORES, as it can spawn multiple system lib compiles, each of which can be highly parallel. This is desirable on local workstations with an excess of cores, but maybe not on e.g. Chromium CI where cores are more limited.

sbc100 (Collaborator) commented Mar 7, 2025

Awesome! If it goes faster that sounds great.

@@ -83,9 +86,9 @@ def ensure():

 def erase():
   ensure_setup()
-  with lock('erase'):
+  with lock('erase', global_cachelock):

Collaborator commented: This is the default so not needed?

@@ -185,16 +188,27 @@ def get(shortname, creator, what=None, force=False, quiet=False, deferred=False)
   return str(cachename)


+def setup_file(cache_file):

Collaborator commented: Maybe put this up top before its first use?

Collaborator commented: Should we call this setup_lock or something that suggests its function?

@@ -185,16 +188,27 @@ def get(shortname, creator, what=None, force=False, quiet=False, deferred=False)
   return str(cachename)


+def setup_file(cache_file):
+  global cachedir, cache_file_locks, acquired_count

Collaborator commented: None of these look like they get assigned in this function, so I think you can delete this line.

+  if cache_file not in cache_file_locks:
+    file_path = Path(cache_file)
+    assert not file_path.is_absolute()
+    key_name = '_'.join(file_path.parts) + '.lock'

Collaborator commented: Should we just have the lock live alongside the file? e.g. key_name = cache_file + '.lock'



-def release_cache_lock():
+def release_cache_lock(cachefile):
   global acquired_count

Collaborator commented: This line is no longer needed, I think.

 def setup():
-  global cachedir, cachelock, cachelock_name
+  global cachedir, global_cachelock

Collaborator commented: global_cachelock is not needed here, I think.

dschuff (Member, Author) commented Mar 7, 2025

To be clear, this isn't ready yet. In particular there's the issue I mentioned above (avoiding oversaturating the cores), and there are also a few cases where the "global" lock is acquired (e.g. the sanity check) that are now unordered with respect to library building, which causes problems. Mostly I'm wondering whether you all think this would save you time and whether it would be worth it.

kripken (Member) commented Mar 7, 2025

This looks very interesting but I'm not sure if we can fix the oversaturating issue..? It seems like we'd need to live with it and hope for the best. Hard to say how big a risk that is.

How big is the speedup here?

dschuff (Member, Author) commented Mar 7, 2025

Yeah, that seems like the trickiest issue. It seems most likely to affect CI machines or developer laptops, systems with small numbers of cores. I can think of a couple of possible ideas to at least mitigate the risk a bit.
The number of cores used would be the number of parallel compiles (which is in turn limited by either the test parallelism or the parallelism in the user's build), times the number of distinct libraries that are needed at once, times the number of files in the built library.
So one option to limit the potential damage for users would be to restrict this optimization to the test runner. Then it's basically just us, our CI, and maybe a few other people, and this seems most useful when running tests anyway, since we rebuild and test frequently, and it's mostly the test runner that does lots of concurrent links using different system libraries. If we limit it further by falling back to the global cache lock when there are fewer than, say, 64 cores in the system, that would basically just be our beefy developer machines. Or we could make it opt-in completely via an env var or something.
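
(A rough sketch of what that gating could look like, purely illustrative; the env var name EMCC_PER_FILE_CACHE_LOCKS and the 64-core threshold are made up for this example and are not part of the PR.)

import os

def use_per_file_locks(running_under_test_runner):
  # Hypothetical explicit opt-in/opt-out; an explicit setting wins.
  setting = os.environ.get('EMCC_PER_FILE_CACHE_LOCKS')
  if setting is not None:
    return setting != '0'
  # Otherwise only enable per-file locks for the test runner on big machines,
  # where oversubscribing the cores is unlikely to hurt.
  return running_under_test_runner and (os.cpu_count() or 1) >= 64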

On my cloudtop, I can run core0 with an empty cache in 8:22 without this patch and 3:33 with it. So for that (probably best-case) scenario the benefits are pretty substantial. Other tests run in 14:35 vs 8:57.

sbc100 (Collaborator) commented Mar 7, 2025

I don't normally find library builds to be a huge bottleneck on my machine.

Are you finding yourself in situations where you need to clear the whole cache a lot? I guess there are some situations where that really is necessary (e.g. when bisecting an LLVM codegen issue), but normally I would expect the cache to be warm, right?

If, for example, you are finding yourself running ./emcc --clear-cache when iterating on just a single library then maybe we should make it more ergonomic to clear just one library. When working on wasmfs recently I've been doing a lot of ./embuilder clear libwasmfs-debug rather than clearing the whole cache. Would improving ./embuilder clear help your common use cases?

dschuff (Member, Author) commented Mar 7, 2025

Yeah, I think the difference there is that I'm less often working on emscripten itself (where you're probably just changing one library, or making a change such that most or all of the cached archives don't need rebuilding), and more often working on LLVM, where any change might affect any or all of the codegen. The primary debug/testing cycle is more local, but once you have something that looks viable, you just want to test it on a whole test suite at once.
But that's not really a huge bottleneck for me most of the time either; I really just happened to notice the behavior, had an idea about why it might be slower than needed, and fell down the rabbit hole. Hence starting this conversation :)

kripken (Member) commented Mar 8, 2025

That's a nice speedup... given it's so large I feel it's worth the risk here.

If we land this we can document it and suggest e.g. EMCC_CORES as a way to limit the number of cores, if people run into issues with too many processes.

aheejin (Member) commented Mar 8, 2025

I'm not familiar with the code part, but functionality-wise, it sounds nice.

My workflow is also somewhat similar to Derek's; I often work on LLVM changes that can affect the library compilation itself, so I clear the cache very often (and I confess I do rm -rf cache instead of emcc --clear-cache). But I don't run the whole test suite that often either, so library build time hasn't been a big bottleneck for me personally.

Not sure how many people use EMCC_CORES, but if that's a problem, maybe we can make this an option that can be switched off, or one that is off by default?

dschuff (Member, Author) commented Mar 12, 2025

The real problem with this as written is that it actually sort of breaks EMCC_CORES (even beyond the limitations it already has). With a global lock, the one holder of the lock can build a library using N cores (where N is EMCC_CORES), and if no other process needs the lock, you could have N more concurrent processes (the test runner uses EMCC_CORES, or the build system is set up however the user wants it, which probably doesn't take library building into account). So you'd likely be limited to at most 2N processes. With per-file locks you could in theory have each of the N tests or user builds building a library at once, for N*N cores.
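
(To put illustrative numbers on it: with EMCC_CORES=8, the global lock bounds the worst case at roughly 8 cores for the one library build plus 8 other processes, i.e. about 2N = 16, while per-file locks allow, in the worst case, 8 concurrent tests or builds each driving an 8-way library compile, i.e. about N*N = 64.)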

In practice this is of course unlikely (especially for a user build, which likely only does one link at a time and probably doesn't need multiple variants of the larger libraries like libc). But it's at least a possibility. Given that limitation, maybe I'm over-thinking though?

sbc100 (Collaborator) commented Mar 12, 2025

What if we instead pushed a bit harder on embuilder as the way to build more than one library at a time, and tried to move away from auto-building libraries (at least for emscripten developers)? embuilder could then assume it has a global lock and build the world in parallel.

dschuff (Member, Author) commented Mar 12, 2025

Embuilder actually already does a pretty good job of building the world in parallel, if EMCC_USE_NINJA is set, because it accumulates all the libraries into one ninja build. Its main limitation right now is that ports don't work that way yet (I didn't include them when I initially set up the Ninja support for system libs because they are built differently; I don't remember whether there was something inherently more difficult about the ports system or whether I just stopped because it was good enough).
Ports, though, are actually a big part of the bottlenecking that this PR addresses, since they are often slower to build than system libs, so it might be worth it. I was actually just looking at that yesterday as part of the effort to make the build more efficient overall: currently, building ports is the biggest remaining source of inefficiency in the build-linux step, because each port is built sequentially rather than being merged into one build, so the bot spends a decent part of its time running on just one core. sqlite3 is a big part of that, because it's one file that takes 40s to build and we build it twice, with no parallelism. Building ports like sqlite3 and poppler can also bottleneck the test runner if a fast, small system lib build like compiler-rt gets blocked behind something like sqlite3; that's what this PR fixes.
