Skip to content

Fix race condition in WebSocketClient connection handling #781

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

sampepose
Copy link

@sampepose sampepose commented Apr 29, 2025

🔗 Issue Links

Provide all JIRA tickets and/or GitHub issues related to this PR, if applicable.

🎯 Goal

The current implementation uses shared mutable state (connected and timeout flags) that can be accessed concurrently from different threads, causing a data race when the WebSocket connects while the main task is checking connection status.

This can be seen from the logs:

WARNING: ThreadSanitizer: data race (pid=75749)
  Write of size 1 at 0x00010aad29f0 by thread T1:
    #0 closure #2 () -> () in StreamVideo.StreamVideo.(connectWebSocketClient in _57FBE95933FF925DEA7AF207F70F5588)() async throws -> () <null> (MyTestApp.debug.dylib:arm64+0x1554088)
    #1 partial apply forwarder for closure #2 () -> () in StreamVideo.StreamVideo.(connectWebSocketClient in _57FBE95933FF925DEA7AF207F70F5588)() async throws -> () <null> (MyTestApp.debug.dylib:arm64+0x15623e4)
    #2 StreamVideo.WebSocketClient.(handle in _EA224C36B0C3BE21B3EC405BB58853DD)(healthcheck: StreamVideo.WrappedEvent, info: StreamVideo.HealthCheckInfo) -> () <null> (MyTestApp.debug.dylib:arm64+0x18cff14)
    #3 StreamVideo.WebSocketClient.webSocketDidReceiveMessage(Foundation.Data) -> () <null> (MyTestApp.debug.dylib:arm64+0x18ceba8)
    #4 protocol witness for StreamVideo.WebSocketEngineDelegate.webSocketDidReceiveMessage(Foundation.Data) -> () in conformance StreamVideo.WebSocketClient : StreamVideo.WebSocketEngineDelegate in StreamVideo <null> (MyTestApp.debug.dylib:arm64+0x18d183c)
    #5 closure #2 @Sendable () -> () in closure #1 @Sendable (Swift.Result<(extension in Foundation):__C.NSURLSessionWebSocketTask.Message, Swift.Error>) -> () in StreamVideo.URLSessionWebSocketEngine.(doRead in _E102C83AE68006C0360AB6BB3CADE04A)() -> () <null> (MyTestApp.debug.dylib:arm64+0x18bfc14)
    #6 partial apply forwarder for closure #2 @Sendable () -> () in closure #1 @Sendable (Swift.Result<(extension in Foundation):__C.NSURLSessionWebSocketTask.Message, Swift.Error>) -> () in StreamVideo.URLSessionWebSocketEngine.(doRead in _E102C83AE68006C0360AB6BB3CADE04A)() -> () <null> (MyTestApp.debug.dylib:arm64+0x18c52a8)
    #7 reabstraction thunk helper from @escaping @callee_guaranteed @Sendable () -> () to @escaping @callee_unowned @convention(block) @Sendable () -> () <null> (MyTestApp.debug.dylib:arm64+0x10b740c)
    #8 __tsan::invoke_and_release_block(void*) <null> (libclang_rt.tsan_iossim_dynamic.dylib:arm64+0x7bccc)
    #9 _dispatch_client_callout <null> (libdispatch.dylib:arm64+0x5738)

  Previous read of size 1 at 0x00010aad29f0 by thread T5:
    #0 (3) suspend resume partial function for StreamVideo.StreamVideo.(connectWebSocketClient in _57FBE95933FF925DEA7AF207F70F5588)() async throws -> () <null> (MyTestApp.debug.dylib:arm64+0x15535cc)
    #1 swift::runJobInEstablishedExecutorContext(swift::Job*) <null> (libswift_Concurrency.dylib:arm64+0x38108)

  Location is heap block of size 17 at 0x00010aad29e0 allocated by thread T1:
    #0 malloc <null> (libclang_rt.tsan_iossim_dynamic.dylib:arm64+0x5397c)
    #1 _malloc_type_malloc_outlined <null> (libsystem_malloc.dylib:arm64+0xf3ec)
    #2 swift::runJobInEstablishedExecutorContext(swift::Job*) <null> (libswift_Concurrency.dylib:arm64+0x38108)

  Thread T1 (tid=35619471, running) is a GCD worker thread

  Thread T5 (tid=35619547, running) is a GCD worker thread

SUMMARY: ThreadSanitizer: data race (/Users/sampepose/Library/Developer/CoreSimulator/Devices/B6999A6E-7380-48B5-85B9-A4E835706191/data/Containers/Bundle/Application/C68BB6C7-D8BF-402F-927A-1187AA422791/MyTestApp.app/MyTestApp.debug.dylib:arm64+0x1554088) in closure #2 () -> () in StreamVideo.StreamVideo.(connectWebSocketClient in _57FBE95933FF925DEA7AF207F70F5588)() async throws -> ()+0x118

📝 Summary

I've refactored the connection handling to use Swift's modern concurrency features to eliminate the data race while maintaining the same timeout and connection logic behavior.

🛠 Implementation

  • Replaced the polling loop with a task group that properly coordinates timeouts and connections
  • Implemented a safe continuation pattern to prevent potential double-resume issues
  • Used a shared handler reference that can be nullified to ensure the continuation is resumed exactly once
  • Added proper weak self references to avoid potential retain cycles

🧪 Manual Testing Notes

  1. Join a call successfully
  2. Attempt to join a call, but time out (must wait the 30 seconds)

See no crashes or ThreadSanitizer complaints in either case.

☑️ Contributor Checklist

  • I have signed the Stream CLA (required)
  • This change follows zero ⚠️ policy (required)
  • This change should receive manual QA
  • Changelog is updated with client-facing changes
  • New code is covered by unit tests
  • Comparison screenshots added for visual changes
  • Affected documentation updated (tutorial, CMS)

@sampepose sampepose requested a review from a team as a code owner April 29, 2025 04:26
@sampepose
Copy link
Author

Hey, I didn't see a CONTRIBUTORS file...hope it's ok to send this bug fix in for consideration!

@sampepose
Copy link
Author

@martinmitrevski @ipavlidakis is there a process to get this reviewed?

@ipavlidakis
Copy link
Contributor

Hi @sampepose,

Thank you for this PR. I would agree with you that this part is racey and requires improvement. However, may I suggest something around the following:

log.debug("Listening for WS connection")
let subject = PassthroughSubject<Void, Error>()
webSocketClient?.onConnected = { subject.send(()) }
try await subject.nextValue(timeout: 30)
log.debug("WS connected")

This one has the benefit of appearing simpler, being more readable, reusing logic that already exists in the SDK and avoids having many small rigid Task.sleep.

Best regards,
Ilias

@sampepose
Copy link
Author

Good idea @ipavlidakis! I made those changes.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants