Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

ProcessInfo: save atomically #104

Merged
merged 1 commit into from
Jan 12, 2023
Merged

ProcessInfo: save atomically #104

merged 1 commit into from
Jan 12, 2023

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Jan 6, 2023

delete=False,
) as tmp:
json.dump(self.asdict(), tmp)
os.replace(tmp.name, filename)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How likely is it that this file gets overwritten? Is it only written once?

In Windows, os.replace may fail with PermissionError, if there are open file handles to destination, so for making it robust, we may need to retry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very often, in manager.py we will update the process info via __setitem__ if there are some status updates on queue tasks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karajan1001, can you show me where we overwrite this? And when? And what do you mean by "very often", like how frequently that happens?

Apologies for asking too many questions, I'm still trying to understand the codebase. :)

Copy link
Member Author

@skshetry skshetry Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although os.replace() can be fixed separately, we need a timeout based on how much load we are putting in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm sorry looking into it I found that I had mixed ProcessInfo with ExecutorInfo, here the ProcessInfo only dumps in

@codecov-commenter
Copy link

codecov-commenter commented Jan 6, 2023

Codecov Report

Base: 78.18% // Head: 78.39% // Increases project coverage by +0.21% 🎉

Coverage data is based on head (1296787) compared to base (59caee6).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #104      +/-   ##
==========================================
+ Coverage   78.18%   78.39%   +0.21%     
==========================================
  Files          21       21              
  Lines         935      935              
  Branches      149      148       -1     
==========================================
+ Hits          731      733       +2     
+ Misses        176      175       -1     
+ Partials       28       27       -1     
Impacted Files Coverage Δ
src/dvc_task/proc/process.py 84.00% <100.00%> (ø)
src/dvc_task/worker/temporary.py 83.60% <0.00%> (+3.27%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@@ -50,12 +51,17 @@ def asdict(self) -> Dict[str, Any]:

def dump(self, filename: str) -> None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In load() above, do we need to serialize loading with lock/unlock? Are there any side-effects from load()? Is it possible to get rid of it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only problem is that we might do some updates on this file. If the replacement progress is also atomic we can get rid of it.

@skshetry
Copy link
Member Author

skshetry commented Jan 6, 2023

I ran it 100 Windows jobs in CI, 9 failed (8 of them were test_apply_failed which is unrelated, one failure is due to PermissionError for which I have no idea why it happens yet).

See https://github.com/iterative/dvc/actions/runs/3854804740

@skshetry
Copy link
Member Author

ping @karajan1001 @pmrowla.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test failure: test_run_celery fails with JSONDecodeError
3 participants