cache significantly slows down black due to pathlib #1950
Comments
here is a patch removing pathlib from the Cache

$ git diff --staged | cat
diff --git a/src/black/__init__.py b/src/black/__init__.py
index 9034bf6..e5736f0 100644
--- a/src/black/__init__.py
+++ b/src/black/__init__.py
@@ -93,7 +93,7 @@ Transformer = Callable[["Line", Collection["Feature"]], Iterator["Line"]]
 Timestamp = float
 FileSize = int
 CacheInfo = Tuple[Timestamp, FileSize]
-Cache = Dict[Path, CacheInfo]
+Cache = Dict[str, CacheInfo]
 out = partial(click.secho, bold=True, err=True)
 err = partial(click.secho, fg="red", err=True)
@@ -724,7 +724,8 @@ def reformat_one(
             if write_back not in (WriteBack.DIFF, WriteBack.COLOR_DIFF):
                 cache = read_cache(mode)
                 res_src = src.resolve()
-                if res_src in cache and cache[res_src] == get_cache_info(res_src):
+                res_src_s = str(res_src)
+                if res_src_s in cache and cache[res_src_s] == get_cache_info(res_src):
                     changed = Changed.CACHED
             if changed is not Changed.CACHED and format_file_in_place(
                 src, fast=fast, write_back=write_back, mode=mode
@@ -6781,8 +6782,8 @@ def filter_cached(cache: Cache, sources: Iterable[Path]) -> Tuple[Set[Path], Set
     """
     todo, done = set(), set()
     for src in sources:
-        src = src.resolve()
-        if cache.get(src) != get_cache_info(src):
+        res_src = src.resolve()
+        if cache.get(str(res_src)) != get_cache_info(res_src):
             todo.add(src)
         else:
             done.add(src)
@@ -6794,7 +6795,7 @@ def write_cache(cache: Cache, sources: Iterable[Path], mode: Mode) -> None:
     cache_file = get_cache_file(mode)
     try:
         CACHE_DIR.mkdir(parents=True, exist_ok=True)
-        new_cache = {**cache, **{src.resolve(): get_cache_info(src) for src in sources}}
+        new_cache = {**cache, **{str(src.resolve()): get_cache_info(src) for src in sources}}
         with tempfile.NamedTemporaryFile(dir=str(cache_file.parent), delete=False) as f:
             pickle.dump(new_cache, f, protocol=4)
         os.replace(f.name, cache_file)

and the associated profile from this
I see, so the issue is that we pickle Path objects instead of strings in the pickled cache? That does seem like useless overhead; I'd merge a PR making that change.
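For anyone who wants to measure this locally, here is a minimal sketch that round-trips two equivalent dictionaries through pickle, one keyed by Path and one keyed by str. It is not code from Black: `make_caches`, `round_trip`, the `/tmp/workspace` paths, and the entry count are all made up for illustration, and the numbers it prints will vary by Python version and machine.

```python
import pickle
import time
from pathlib import Path
from typing import Dict, Tuple

CacheInfo = Tuple[float, int]  # (mtime, size), mirroring CacheInfo in the patch


def make_caches(n: int) -> Tuple[Dict[Path, CacheInfo], Dict[str, CacheInfo]]:
    """Build two equivalent caches: one keyed by Path, one keyed by str."""
    # Hypothetical paths; nothing is read from or written to disk.
    paths = [Path("/tmp/workspace") / f"pkg{i % 100}" / f"mod{i}.py" for i in range(n)]
    by_path = {p: (1600000000.0 + i, 1024 + i) for i, p in enumerate(paths)}
    by_str = {str(p): info for p, info in by_path.items()}
    return by_path, by_str


def round_trip(cache: dict) -> Tuple[int, float, dict]:
    """Pickle and unpickle `cache`; return (payload bytes, seconds, loaded dict)."""
    start = time.perf_counter()
    payload = pickle.dumps(cache, protocol=4)  # same protocol as write_cache
    loaded = pickle.loads(payload)
    return len(payload), time.perf_counter() - start, loaded


if __name__ == "__main__":
    by_path, by_str = make_caches(50_000)
    for label, cache in (("Path keys", by_path), ("str keys", by_str)):
        size, seconds, _ = round_trip(cache)
        print(f"{label}: {size} bytes, {seconds:.3f}s")
```

Both dictionaries carry the same information, so any difference in the printed timings comes from how the keys are serialized and rebuilt.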
++ this would be a big help, thanks @asottile! We recently discovered that our black runs were very slow under Jenkins due to a ~200 MB cache pickle:
Running black on a single file was taking 30 seconds until we cleared this cache (70+ minutes total when run using pre-commit on a large repo, compared to ~2 minutes after the cache was removed).
@chriskuehl some related thoughts: The Black cache on my company's CI instances is <1 MB on a repo with millions of lines, so I was surprised to see a 200 MB cache. I guess it's because the
You could fix the unbounded growth on your side by configuring Black to use a fresh cache for every Jenkins run, or by just periodically deleting the cache. I'm trying to think of something that could be done in Black itself, but not sure there is much we can do.
@JelleZijlstra you're right, it's because our workspace names are unique and generally only used once. We're running black without the cache now on CI to work around this (setting
From our perspective, it would be nice if black cleaned up its cache once it reached a size where loading the cache takes 30+ seconds, but I'm not sure how easy that is to implement or if that will affect other use-cases.
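One way the cleanup idea could look, purely as a sketch: drop cache entries whose files no longer exist, which is the situation a one-off CI workspace leaves behind. `prune_missing` is a hypothetical helper, not something Black provides, and it assumes the str-keyed `Cache` type from the patch above.

```python
import os
from typing import Dict, Tuple

CacheInfo = Tuple[float, int]  # (mtime, size)
Cache = Dict[str, CacheInfo]  # str keys, as in the patch above


def prune_missing(cache: Cache) -> Cache:
    """Return a copy of `cache` without entries whose file no longer exists.

    Hypothetical helper for illustration; the input dict is left untouched.
    """
    return {path: info for path, info in cache.items() if os.path.exists(path)}
```

This costs one filesystem check per entry, so it would probably only be worth running once the cache grows past some size threshold rather than on every invocation.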
Describe the bug A clear and concise description of what the bug is.
To Reproduce Steps to reproduce the behavior:
With a sufficiently large number of runs of black, the cache accumulates pretty rapidly and causes significant slowdown compared to uncached runs:
uncached
cached
The cache in question isn't ~that large:
The slowdown appears to come from pathlib being slow -- I've attached a pstats and svg of this:
This was generated using this procedure
out.zip (contains svg and pstats file)
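For readers without the attachment, a comparable profile can be approximated with the standard library alone. The sketch below is not the procedure used for the attached files: it profiles unpickling of a synthetic Path-keyed dict with cProfile, and `profile_cache_load`, the `/tmp/ws…` paths, and the entry count are assumptions made for illustration.

```python
import cProfile
import io
import pickle
import pstats
from pathlib import Path


def profile_cache_load(n_entries: int = 20_000, top: int = 10) -> str:
    """Profile unpickling a Path-keyed dict and return the pstats report text."""
    # Synthetic cache: unique workspace-style paths, (mtime, size) values.
    cache = {Path(f"/tmp/ws{i}/src/mod{i}.py"): (float(i), i) for i in range(n_entries)}
    payload = pickle.dumps(cache, protocol=4)

    profiler = cProfile.Profile()
    profiler.enable()
    pickle.loads(payload)  # only the load step is profiled
    profiler.disable()

    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_cache_load())
```

The report lists the functions with the highest cumulative time during the load, which shows how much of it is spent inside pathlib on a given Python version.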
Expected behavior A clear and concise description of what you expected to happen.
The cache shouldn't make the execution significantly slower
Environment (please complete the following information):
Does this bug also happen on master? To answer this, you have two options:
yes, if I build up a new cache using the current default branch it reproduces as well
Additional context Add any other context about the problem here.
Removing pathlib from the Cache makes the cache serialization cost ~essentially zero -- will follow up with a demonstration patch