Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[FEATURE] Reduce support issues related to concurrent writes #957

Open
1 task done
hulkingshtick opened this issue Sep 2, 2024 · 3 comments
Open
1 task done

[FEATURE] Reduce support issues related to concurrent writes #957

hulkingshtick opened this issue Sep 2, 2024 · 3 comments

Comments

@hulkingshtick
Copy link

Is there an existing feature request for this?

  • I have searched the existing feature requests

Is your feature request related to a problem? Please describe.

The documentation says:

Applications are responsible for ensuring that no more than one goroutine calls the write methods (NextWriter, SetWriteDeadline, WriteMessage, WriteJSON, EnableWriteCompression, SetCompressionLevel) concurrently

To help developers detect concurrency errors in their application, the connection makes a best effort to detect concurrent calls to WriteMessage, WriteJSON and Write on the writer returned from NextWriter. When a concurrent calls is detected, the connection panics with the string concurrent write to websocket connection. This panic is better than the alternatives such as commingling messages on the underlying network connection or a nil pointer exception.

The concurrent write to websocket connection panic is the most common support issue for the package. The issue arises because developers incorrectly assume that concurrency is allowed or wrongly think they covered all the bases with regards to ensuring a single writer.

Describe the solution that you would like.

Here are options for improving the situation:

Improve the panic string

The terse panic string does not sufficiently describe the problem or a fix. Create a markdown file on this repository explaining the concurrency restriction and suggested fixes. Include a link to the markdown file in the panic string.

Improve the detector

The current detector is:

if c.isWriting {
	panic("concurrent write to websocket connection")
}
c.isWriting = true

err := c.write(w.frameType, c.writeDeadline, c.writeBuf[framePos:w.pos], extra)

if !c.isWriting {
	panic("concurrent write to websocket connection")
}
c.isWriting = false

The second concurrent writer panics, but not the first. The stack trace for the first concurrent writer may be the key to finding the bug in the application. Cause the first writer to panic using the following code. In this code, the field c.writing is an int and replaces the field c.isWriting.

if c.Writing != 0 {
            c.Writing = 2
	panic("concurrent write to websocket connection")
}
c.Writing = 1

err := c.write(w.frameType, c.writeDeadline, c.writeBuf[framePos:w.pos], extra)

if c.Writing != 1 {
	panic("concurrent write to websocket connection")
}
c.Writing = 0

(this code requires more review and testing).

Return errors instead of panicking

Return an error instead of panicking. Include the stack trace of the caller in the error string.

Describe alternatives you have considered.

I considered and rejected making the existing methods thread safe. It's possible to make the individual methods thread safe, but the SetWriteDeadline, EnableWriteCompression and SetCompressionLevel methods require synchronization at the application level to be useful. Because robust applications should call SetWriteDeadline, a robust application will implement synchronization at the application level. It follows that there's no point in making the individual methods thread safe.

I also considered changing the API to accept a WriteOptions struct for NextWriter, WriteMessage and WriteJSON. The options struct replaces the SetWriteDeadline, EnableWriteCompression and SetCompressionLevel methods. I rejected this idea because it's a breaking change to the API.

Anything else?

No response

@jaitaiwan
Copy link
Member

Thanks @hulkingshtick for this thorough write-up. Based on what you've written and the previous threads on this, I'd say we should probably perform a combo of Return errors instead of panicking as well as a dedicated MD Doc as you suggested in Improve the panic string to tackle this.

I'd prioritise any PRs done for this to be released asap.

@hulkingshtick
Copy link
Author

hulkingshtick commented Sep 2, 2024

The more I think about it, panic is better for developers than returning an error. Because many developers ignore errors, applications will silently fail when an error is returned. The panic makes the developers aware that there is a programming error in their application.


If the panicking goroutine does not recover from the panic, then the application exits with a dump of all goroutine stack traces. It should be easy for developers find the two concurrent writers from those stack traces.

I suspect that the panicking goroutine is often a goroutine started by the net/http server. These goroutines recover from panics and print the stack trace of the panicking goroutine only. It's difficult to find the other concurrent writer with one stack trace.

This simple change can cause the other goroutine to panic:

if c.isWriting {
	c.isWriting = false // <--- add this line
	panic("concurrent write to websocket connection")
}
c.isWriting = true

err := c.write(w.frameType, c.writeDeadline, c.writeBuf[framePos:w.pos], extra)

if !c.isWriting {
	panic("concurrent write to websocket connection")
}
c.isWriting = false

Here's a demonstration of this change: https://go.dev/play/p/ItZzymA0OEf. Notice how the application prints stack traces for example1 and example2. The application prints the stack trace for example2 only when the new c.isWriting = false line is deleted.


Because the concurrency detector is only executed from functions where concurrency is not allowed, there are two possible causes of the panic: the application called the write methods concurrently or there's a bug in the concurrency detector.

If there's a bug in the concurrency detector, then there's a code path where a single threaded program will panic. I examined all of the connection code and cannot concoct a code path where a single threaded program will panic. The field isWriting starts as false and set to false on return from c.flushFrame(). From there, we can conclude that c.isWriting is false on entry to c.flushFrame().

@hulkingshtick
Copy link
Author

Another challenge with making the connection write methods thread-safe:

NextWriter() returns an io.WriteCloser on the message and Close() on that io.WriteCloser finalizes the message. NextWriter() automatically finalizes a pending message before starting the next message. Applications in the wild may rely on message finalization in NextWriter().

There's not a good way to make the connection write methods thread-safe given the previous paragraph. If NextWriter() starts by locking a write mutex, applications that rely on message finalization in NextWriter will block forever. If NextWriter() barges through the lock, the methods is not thread-safe.

The connection write methods were designed for single threaded use and is easy to use the methods that way.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants