
Race condition in botocore's use of JSONFileCache for credential loading #3213

Open
mikelambert opened this issue Jun 27, 2024 · 7 comments
Labels
bug This issue is a confirmed bug. credentials needs-review This issue or pull request needs review from a core team member. p2 This is a standard priority issue

Comments

@mikelambert

Describe the bug

I am getting a rare crash when using botocore to load credentials.

It's a JSONDecodeError: Expecting value: line 1 column 1 (char 0) stemming from JSONFileCache.__getitem__, which seems to imply the JSON file is empty (no value found at char 0). Someone pointed out that it might be a race condition in JSONFileCache.__setitem__, which does:

            f.truncate()
            f.write(file_content)

We have multiple concurrent processes starting on a box, each using botocore, so this may simply be a race condition where one process reads the file after the truncate but before the write. I'm not sure whether a flock, a write-then-rename, or something else is the proper solution here.
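For illustration, here is a minimal sketch of the flock idea (my own code, not botocore's; the helper names are made up). The writer takes an exclusive lock around truncate-and-write and readers take a shared lock, so a reader can never observe the truncated file:

import fcntl
import json
import os


def locked_write(path, value):
    # Hold an exclusive flock while truncating and rewriting, so no reader
    # holding a shared lock can observe the empty, truncated file.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)
        f.seek(0)
        f.truncate()
        f.write(json.dumps(value))
        # The lock is released when the file is closed.


def locked_read(path):
    # Take a shared flock before reading so a concurrent writer cannot
    # truncate the file mid-read.
    with open(path) as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_SH)
        return json.load(f)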

Expected Behavior

It shouldn't crash

Current Behavior

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
  File "botocore/utils.py", line 3518, in __getitem__
    return json.load(f)
  File "__init__.py", line 293, in load
    return loads(fp.read(),
  File "__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None


  File "botocore/client.py", line 553, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "botocore/client.py", line 989, in _make_api_call
    http, parsed_response = self._make_request(
  File "botocore/client.py", line 1015, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "botocore/signers.py", line 186, in sign
    auth = self.get_auth_instance(**kwargs)
  File "botocore/signers.py", line 301, in get_auth_instance
    frozen_credentials = credentials.get_frozen_credentials()
  File "botocore/credentials.py", line 634, in get_frozen_credentials
    self._refresh()
  File "botocore/credentials.py", line 522, in _refresh
    self._protected_refresh(is_mandatory=is_mandatory_refresh)
  File "botocore/credentials.py", line 538, in _protected_refresh
    metadata = self._refresh_using()
  File "botocore/credentials.py", line 685, in fetch_credentials
    return self._get_cached_credentials()
  File "botocore/credentials.py", line 693, in _get_cached_credentials
    response = self._load_from_cache()
  File "botocore/credentials.py", line 711, in _load_from_cache
    creds = deepcopy(self._cache[self._cache_key])
  File "botocore/utils.py", line 3520, in __getitem__
    raise KeyError(cache_key)

Reproduction Steps

Hard to reproduce, and I haven't tried myself. I assume thousands of processes concurrently recreating botocore's cached credentials file would do it.
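As a rough, untested sketch of that idea (the process count, loop count, and key are arbitrary), something like the following should eventually hit the same truncate/read window:

from multiprocessing import Process

from botocore.utils import JSONFileCache


def worker():
    # Each process opens the same on-disk cache directory and hammers one key;
    # a half-written file surfaces as the JSONDecodeError / KeyError seen in
    # the traceback above, depending on the botocore version.
    cache = JSONFileCache()
    for _ in range(100000):
        cache["key"] = 10
        cache["key"]


if __name__ == "__main__":
    processes = [Process(target=worker) for _ in range(8)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()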

Possible Solution

Perhaps a flock, or a write-to-a-temp-file-then-rename-to-the-destination-path, instead of truncate-then-write?
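As an illustration of the second option (my own sketch, not a proposed patch), the write goes to a temporary file in the same directory and is then atomically renamed over the destination, so readers only ever see a complete file:

import json
import os
import tempfile


def atomic_write(path, value):
    # Write to a temp file in the same directory so os.replace() stays on the
    # same filesystem, then atomically swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(json.dumps(value))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise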

Additional Information/Context

No response

SDK version used

1.34.42

Environment details (OS name and version, etc.)

AWS instance running x86_64 GNU/Linux

@mikelambert mikelambert added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Jun 27, 2024
@Laurent-Dx

Laurent-Dx commented Jun 27, 2024

The following does reproduce the issue instantly:

from threading import Thread
from botocore.utils import JSONFileCache

cache = JSONFileCache()


def f():
    # Repeatedly write and read the same key; one thread can read the file in
    # the window between the other thread's truncate() and write().
    for i in range(100000):
        cache["key"] = 10
        cache["key"]


threads = []
for i in range(2):
    thread = Thread(target=f, name=f"Thread-{i}")
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

Adding a lock in each of __getitem__, __delitem__, __setitem__ as you suggested does appear to solve the issue.
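Roughly, a sketch of that (the subclass name is made up, and a threading.Lock only guards threads within one process, not across processes):

import threading

from botocore.utils import JSONFileCache


class LockedJSONFileCache(JSONFileCache):
    # Serialize all cache access within this process with a single lock.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._lock = threading.Lock()

    def __getitem__(self, cache_key):
        with self._lock:
            return super().__getitem__(cache_key)

    def __setitem__(self, cache_key, value):
        with self._lock:
            super().__setitem__(cache_key, value)

    def __delitem__(self, cache_key):
        with self._lock:
            super().__delitem__(cache_key)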

@tim-finnigan tim-finnigan self-assigned this Jun 28, 2024
@tim-finnigan
Contributor

Thanks for reaching out. The error you referenced was also reported here: #3106. I was advised that an error message like this could help clarify the behavior here: https://github.com/boto/botocore/pull/3183/files. Does deleting the cache file fix this for you?

@tim-finnigan tim-finnigan added response-requested Waiting on additional info and feedback. and removed needs-triage This issue or PR still needs to be triaged. labels Jun 28, 2024
@mikelambert
Author

Yeah, wrapping this in a retry in our app code does work (and is what I did in parallel to filing this bug). Apologies for not realizing it was previously reported. And thank you, Laurent, for the multi-threaded repro! (Our problematic setup is multi-process, which is why I didn't propose in-process locks.)
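A rough sketch of that kind of retry wrapper (the helper name, attempt count, and delay are arbitrary, and the exception to catch may differ by botocore version):

import json
import time

import boto3


def call_with_retry(fn, attempts=3, delay=0.1):
    # Retry the call if the shared credential cache file is read while another
    # process is rewriting it, which surfaces as a JSONDecodeError.
    for attempt in range(attempts):
        try:
            return fn()
        except json.JSONDecodeError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


s3 = boto3.client("s3")
buckets = call_with_retry(lambda: s3.list_buckets())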

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. label Jun 29, 2024
@tim-finnigan
Contributor

Thanks for following up and confirming. I'll go ahead and close this as a duplicate, and we'll continue tracking the issue in #3106.

@tim-finnigan tim-finnigan closed this as not planned Jul 2, 2024
@tim-finnigan tim-finnigan added the duplicate This issue is a duplicate. label Jul 2, 2024

github-actions bot commented Jul 2, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

@tim-finnigan tim-finnigan reopened this Jul 11, 2024
@tim-finnigan tim-finnigan removed their assignment Jul 11, 2024
@tim-finnigan tim-finnigan added credentials needs-review This issue or pull request needs review from a core team member. and removed duplicate This issue is a duplicate. labels Jul 11, 2024
@sfc-gh-abalik

I'm hitting this same issue. In my case I'm spawning 10 aws s3 cp processes concurrently from Go code using goroutines, and a few of them fail with the JSONDecodeError while the others succeed.

From what I can tell, this only happens when the cache file doesn't already exist, so I'm working around it by making a dummy AWS CLI call like aws sts get-caller-identity to force the cache file to be written to disk before I spawn the concurrent commands.
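Sketched in Python rather than Go, the pre-warm step amounts to something like this (the surrounding structure is illustrative):

import subprocess

# Pre-warm: one serial CLI call forces the credential cache file onto disk
# before any concurrent commands run against it.
subprocess.run(
    ["aws", "sts", "get-caller-identity"],
    check=True,
    capture_output=True,
)

# The concurrent "aws s3 cp" invocations can now read an already-written cache file.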

@tim-finnigan tim-finnigan added the p2 This is a standard priority issue label Jul 31, 2024