sem-speed-test: deadlocks sometimes when pthread_cancel is called on the threads that are actively using semaphores #1165

stanislaw · 2021-09-22T18:52:37Z

Is your feature request related to a problem? Please describe.

This report is similar to #1160 and #1164: a resource, in this case a thread, is destroyed by pthread_cancel while another thread is about to sem_wait or sem_post on a semaphore that the destroyed thread had just waited on or posted to.

This is how the pthread_join called by OS_TaskDelete gets deadlocked on macOS:

(lldb) thread backtrace 
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff203b59ee libsystem_kernel.dylib`__ulock_wait + 10
    frame #1: 0x00007fff203eaf60 libsystem_pthread.dylib`_pthread_join + 362
    frame #2: 0x000000010614f389 sem-speed-test`OS_TaskDelete_Impl(token=0x00007ffee9ac08f0) at os-impl-tasks.c:694:15
    frame #3: 0x000000010614ad9d sem-speed-test`OS_TaskDelete(task_id=65537) at osapi-task.c:239:23
    frame #4: 0x000000010613fea1 sem-speed-test`SemRun at sem-speed-test.c:216:14
    frame #5: 0x0000000106142839 sem-speed-test`UtTest_Run at uttest.c:174:17
    frame #6: 0x0000000106141e29 sem-speed-test`OS_Application_Run at utbsp.c:232:5
    frame #7: 0x0000000106154fca sem-speed-test`main(argc=1, argv=0x00007ffee9ac0a08) at bsp_start.c:247:5
    frame #8: 0x00007fff20404f3d libdyld.dylib`start + 1
    frame #9: 0x00007fff20404f3d libdyld.dylib`start + 1

Is it undefined behavior when a thread gets pthread_cancelled while waiting or posting on a semaphore? This at least seems to be the case on macOS where the pthread_join deadlocks on a cancelled thread.

Describe the solution you'd like

With all due appreciation of the testing setup created in sem-speed-test, the thread loops of task 1 and task 2 could be told explicitly when to finish, so that pthread_cancel does not catch the threads while they are still operating on the semaphores.
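One way such an explicit shutdown could look, as a minimal sketch and not the actual sem-speed-test code: two workers hand a token back and forth, and the main thread requests a stop, posts each semaphore once to release a possibly blocked waiter, and then joins without ever calling pthread_cancel. All identifiers here (sem_a, sem_b, stop_requested, run_pingpong) are hypothetical. The sketch also assumes unnamed POSIX semaphores created with sem_init, which Linux provides but macOS does not implement, so a macOS build would need a different semaphore primitive.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; not the identifiers used by sem-speed-test. */
static sem_t        sem_a, sem_b;
static atomic_bool  stop_requested;
static atomic_ulong work_a, work_b;

static void *worker_a(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_requested)) {
        sem_wait(&sem_a);             /* wait for our turn */
        atomic_fetch_add(&work_a, 1);
        sem_post(&sem_b);             /* hand over to the other worker */
    }
    return NULL;
}

static void *worker_b(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_requested)) {
        sem_wait(&sem_b);
        atomic_fetch_add(&work_b, 1);
        sem_post(&sem_a);
    }
    return NULL;
}

/* Runs the ping-pong until worker B has done at least min_handoffs
 * iterations, then shuts both workers down cooperatively.
 * Returns 0 if both threads were joined. */
int run_pingpong(unsigned long min_handoffs)
{
    pthread_t ta, tb;
    int       rc;

    atomic_store(&stop_requested, false);
    atomic_store(&work_a, 0);
    atomic_store(&work_b, 0);

    /* Sketch only: abort on setup failure instead of recovering. */
    rc = sem_init(&sem_a, 0, 0);
    assert(rc == 0);
    rc = sem_init(&sem_b, 0, 0);
    assert(rc == 0);
    rc = pthread_create(&ta, NULL, worker_a, NULL);
    assert(rc == 0);
    rc = pthread_create(&tb, NULL, worker_b, NULL);
    assert(rc == 0);

    sem_post(&sem_a);                 /* initial give that starts the loop */

    while (atomic_load(&work_b) < min_handoffs)
        sched_yield();

    /* Request the stop first, then post each semaphore once so that a
     * worker already blocked in sem_wait wakes up and sees the flag. */
    atomic_store(&stop_requested, true);
    sem_post(&sem_a);
    sem_post(&sem_b);

    rc = pthread_join(ta, NULL);
    rc |= pthread_join(tb, NULL);

    sem_destroy(&sem_a);
    sem_destroy(&sem_b);
    return rc == 0 ? 0 : -1;
}
```

Calling run_pingpong(1000) from a test body would run at least 1000 hand-offs and return only after both workers have left their loops. Because the stop flag is set before the extra posts, neither join depends on cancelling a thread in the middle of a semaphore operation.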

Describe alternatives you've considered

For now, I have added a simple hack to task 1 and task 2: each thread loop now also depends on a global flag:

bool      task_1_done = false;
bool      task_2_done = false;

...
while (!task_1_done && task_1_work < SEMTEST_WORK_LIMIT) {
...
}

while (!task_2_done && task_2_work < SEMTEST_WORK_LIMIT) {
...
}

And then before actually deleting the tasks:

    /* Give the initial sem that starts the loop */
    SEMOP(Give)(sem_id_1);

    /* Time Limited Execution */
    OS_TaskDelay(5000);

    // Let the threads finish their job.
    task_1_done = true;
    task_2_done = true;
    OS_TaskDelay(1000);

    // TODO: Deleting task is sometimes OS_SUCCESS and sometimes OS_ERR_INVALID_ID
    status = OS_TaskDelete(task_1_id);
    // UtAssert_True(status == OS_ERR_INVALID_ID, "Task 1 delete Rc=%d", (int)status);

    status = OS_TaskDelete(task_2_id);
    // UtAssert_True(status == OS_ERR_INVALID_ID, "Task 2 delete Rc=%d", (int)status);

With this change, the pthread_cancel followed by pthread_join does not block on macOS.

Additional context

This behavior is 100% reproducible on macOS, on the branch from #1161.

I have also run Clang's ThreadSanitizer on this and other tests. It immediately reports possible data races caused by unprotected access to the global variables that the tests manage. This could become a separate ticket once the more trivial issues reported so far are resolved.
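For illustration, one possible way to make a done flag and a work counter race-free is to use C11 atomics instead of plain globals. This is a sketch under assumptions, not a patch for the tests: demo_done, demo_work, demo_task, demo_run and the value of DEMO_WORK_LIMIT are all hypothetical stand-ins, and the real tests may prefer a mutex or an OSAL primitive.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_WORK_LIMIT 100000000UL   /* hypothetical limit */

/* Atomic replacements for a plain "bool done" / "unsigned long work" pair. */
static atomic_bool  demo_done;
static atomic_ulong demo_work;

static void *demo_task(void *arg)
{
    (void)arg;
    /* Every access goes through an atomic operation, so the main thread
     * can read the counter and write the flag without a data race. */
    while (!atomic_load_explicit(&demo_done, memory_order_acquire) &&
           atomic_load_explicit(&demo_work, memory_order_relaxed) < DEMO_WORK_LIMIT) {
        atomic_fetch_add_explicit(&demo_work, 1, memory_order_relaxed);
    }
    return NULL;
}

/* Starts the task, lets it do at least min_work iterations (clamped to the
 * limit), asks it to finish, joins it, and returns the final work count. */
unsigned long demo_run(unsigned long min_work)
{
    pthread_t tid;
    int       rc;

    if (min_work > DEMO_WORK_LIMIT)
        min_work = DEMO_WORK_LIMIT;

    atomic_store(&demo_done, false);
    atomic_store(&demo_work, 0);

    rc = pthread_create(&tid, NULL, demo_task, NULL);
    assert(rc == 0);                  /* sketch only: abort on failure */

    while (atomic_load(&demo_work) < min_work)
        sched_yield();

    atomic_store_explicit(&demo_done, true, memory_order_release);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);

    return atomic_load(&demo_work);
}
```

Built with -fsanitize=thread, a loop of this shape should not trigger the unprotected-global reports, since no variable shared between the threads is accessed non-atomically.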

Requester Info

Stanislav Pankevich (Personal contribution).


stanislaw mentioned this issue Sep 22, 2021

os/posix: port of the posix implementation to macOS (take 2) #1161

Closed

5 tasks

stanislaw changed the title ~~sem-speed-test: deadlocks sometimes because pthread_cancel is called on one of the threads while the other is waiting on a semaphore~~ sem-speed-test: deadlocks sometimes when pthread_cancel is called on the threads that are actively using semaphores Sep 22, 2021

skliper added bug unit-test Tickets related to the OSAL unit testing (functional and/or coverage) labels Sep 28, 2021

jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022

Fix nasa#1165, remove configs about shells

1d2036a
