-
Notifications
You must be signed in to change notification settings - Fork 401
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
include, prov: Change util_av lock to genlock #10803
Conversation
prov/util/src/util_av.c
Outdated
ep_list_lock_type = ofi_progress_lock_type(av->domain->threading, | ||
lock_type = ofi_progress_lock_type(av->domain->threading, | ||
av->domain->control_progress); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix the alignment caused by the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think the lock selection here is actually wrong. The progress lock type is more for the completion devices and the AV doesn't fall into that category. Here's the definition of the progress_lock_type:
static inline enum ofi_lock_type
ofi_progress_lock_type(enum fi_threading threading, enum fi_progress control)
{
return (threading == FI_THREAD_DOMAIN || threading == FI_THREAD_COMPLETION) &&
control == FI_PROGRESS_CONTROL_UNIFIED ? OFI_LOCK_NOOP : OFI_LOCK_MUTEX;
}
The AV still should get locked for thread completion. It can only be optimized out for thread domain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aingerson thanks, I misunderstood FI_THREAD_COMPLETION. I will fix
Is this the reason for Intel CI failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hard to say for sure but maybe. the failure is a hang on MPICH verbs-rxm on a multi threaded test. MPICH I think uses FI_THREAD_COMPLETION so definitely a chance there's a race because it optimizes the lock out even though it needs it
The util_av lock will be held in the fast path in the EFA provider. With the correct threading and progress models (FI_THREAD_DOMAIN and FI_PROGRESS_CONTROL_UNIFIED), this lock should also be a no-op. Similar to the ep_list_lock. This commit changes the mutex to a genlock that becomes a no-op for the correct threading and progress models and a mutex otherwise. Signed-off-by: Sai Sunku <sunkusa@amazon.com>
08d885d
to
a2e80d6
Compare
@@ -535,10 +534,21 @@ int ofi_av_init_lightweight(struct util_domain *domain, const struct fi_av_attr | |||
*/ | |||
av->context = context; | |||
av->domain = domain; | |||
av->prov = domain->prov; | |||
|
|||
av_lock_type = domain->threading == FI_THREAD_DOMAIN && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is by clang-format 🤷
@aingerson do you have more comments? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
The util_av lock will be held in the fast path in the EFA provider. With the correct threading and progress models
(FI_THREAD_DOMAIN/FI_THREAD_COMPLETION and FI_PROGRESS_CONTROL_UNIFIED), this lock should also be a no-op. Similar to the ep_list_lock.
This commit changes the mutex to a genlock that becomes a no-op for the correct threading and progress models and a mutex otherwise.