Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

include, prov: Change util_av lock to genlock #10803

Merged
merged 1 commit into from
Feb 21, 2025

Conversation

sunkuamzn
Copy link
Contributor

The util_av lock will be held in the fast path in the EFA provider. With the correct threading and progress models
(FI_THREAD_DOMAIN/FI_THREAD_COMPLETION and FI_PROGRESS_CONTROL_UNIFIED), this lock should also be a no-op. Similar to the ep_list_lock.

This commit changes the mutex to a genlock that becomes a no-op for the correct threading and progress models and a mutex otherwise.

@sunkuamzn sunkuamzn requested review from j-xiong, aingerson and a team February 20, 2025 14:59
ep_list_lock_type = ofi_progress_lock_type(av->domain->threading,
lock_type = ofi_progress_lock_type(av->domain->threading,
av->domain->control_progress);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the alignment caused by the change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think the lock selection here is actually wrong. The progress lock type is more for the completion devices and the AV doesn't fall into that category. Here's the definition of the progress_lock_type:

static inline enum ofi_lock_type
ofi_progress_lock_type(enum fi_threading threading, enum fi_progress control)
{
	return (threading == FI_THREAD_DOMAIN || threading == FI_THREAD_COMPLETION) &&
		control == FI_PROGRESS_CONTROL_UNIFIED ? OFI_LOCK_NOOP : OFI_LOCK_MUTEX;
}

The AV still should get locked for thread completion. It can only be optimized out for thread domain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aingerson thanks, I misunderstood FI_THREAD_COMPLETION. I will fix

Is this the reason for Intel CI failure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard to say for sure but maybe. the failure is a hang on MPICH verbs-rxm on a multi threaded test. MPICH I think uses FI_THREAD_COMPLETION so definitely a chance there's a race because it optimizes the lock out even though it needs it

The util_av lock will be held in the fast path in the EFA provider. With the
correct threading and progress models (FI_THREAD_DOMAIN and
FI_PROGRESS_CONTROL_UNIFIED), this lock should also be a no-op. Similar to
the ep_list_lock.

This commit changes the mutex to a genlock that becomes a no-op for the
correct threading and progress models and a mutex otherwise.

Signed-off-by: Sai Sunku <sunkusa@amazon.com>
@@ -535,10 +534,21 @@ int ofi_av_init_lightweight(struct util_domain *domain, const struct fi_av_attr
*/
av->context = context;
av->domain = domain;
av->prov = domain->prov;

av_lock_type = domain->threading == FI_THREAD_DOMAIN &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is by clang-format 🤷

@sunkuamzn
Copy link
Contributor Author

@aingerson do you have more comments?

Copy link
Contributor

@aingerson aingerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@sunkuamzn sunkuamzn merged commit f1158e4 into ofiwg:main Feb 21, 2025
13 checks passed
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants