net/http: setting a timeout on http requests that use TLS can result in excessive requests to server #59017

mpoindexter · 2023-03-14T00:39:12Z

What version of Go are you using (`go version`)?

go version go1.20.2 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

GOARCH="amd64"
GOOS="darwin"

What did you do?

When running a large number of concurrent requests to Amazon S3 with a request timeout set (either in the http.Client instance or a context directly set on the request), we noticed an unexpectedly large amount of long running requests. When we removed the timeout, the number of long running requests dropped. The long running requests were not directly caused by the timeout being hit - all requests completed in under the timeout.

We created a standalone program to reproduce the problem and added logging via httptrace. In the output from httptrace we observed a large number of requests that were reported with the error context cancelled to the TLSHandshakeDone callback in our trace. These requests did not cause failed requests as reported by the http client.

Digging into the http Transport code, it appears that when a connection is not immediately available for use in the connection pool, the runtime starts a race between obtaining a connection returned to the pool and dialing a new connection. In our case, the "obtain connection returned to the pool" was generally winning the race. The behavior on the losing side of the race differed depending on whether the request used a timeout or not. On requests without a timeout, the losing leg of the connection continued through the TLS handshake, and was then placed into the pool as a valid connection. On requests with a timeout, the losing leg was aborted mid-TLS-handshake due to the cancellation of the request context as the request completed using the connection that was returned to the pool.

The net result of this behavior was that whenever a request legitimately required a new connection to be established, it was often queued up (probably at the server end) behind a large number of TLS handshakes that would be cancelled in flight. This manifested as excessive time to complete the request and noticeably lower throughput.

What did you expect to see?

Client does not produce a large volume of aborted TLS handshakes to server.

What did you see instead?

Slowness caused by excessive TLS handshakes to server

I created a gist that reproduces this at https://gist.github.com/mpoindexter/f6bc9dac16290343efba17129c49f9d5. If you uncomment the timeout on line 56 you can see the throughput of the test drop and stall periodically, but if the timeout remains commented throughput remains steady.

To test the analysis above, I implemented a custom DialTLSContext function that did the TLS dial using a new context if a deadline was set on the original context. This resolved the problem.

func ApplyFix(transport *http.Transport) {
	dialer := &tls.Dialer{
		Config: transport.TLSClientConfig,
	}

	transport.DialTLSContext = func(ctx context.Context, network, addr string) (net.Conn, error) {
		deadline, hasDeadline := ctx.Deadline()
		var dialCtx context.Context
		if hasDeadline {
			timeout := deadline.Sub(time.Now())
			if timeout > 0 {
				var cancel context.CancelFunc
				dialCtx, cancel = context.WithTimeout(context.Background(), timeout)
				defer cancel()
			} else {
				dialCtx = ctx
			}
		} else {
			dialCtx = ctx
		}

		return dialer.DialContext(dialCtx, network, addr)
	}
}

The text was updated successfully, but these errors were encountered:

seankhliao · 2023-03-14T02:39:41Z

cc @neild

mark4z · 2023-04-04T01:52:05Z

same question when using nginx as the server

neild · 2023-04-06T21:31:43Z

We shouldn't be using an individual request's context to dial a connection which may be reused for many requests, to avoid problems exactly like this.

I suspect that someone is relying on passing values down to the dial via the context, however, so we likely need to decouple the request context's values from its cancelation. (Conveniently, #40221 has been implemented now, so this is feasible.)

I'm not certain what the correct cancelation behavior is. Perhaps cancel a dial once there are no requests blocked on it? There might still be some access patterns where we repeatedly cancel an in-progress dial, but this would at least avoid the scenario where we cancel a dial call that a request was hoping to make use of.

jarreds · 2024-01-29T19:12:02Z

Using the ApplyFix outlined by @mpoindexter, we saw a significant drop in the number of aborted TLS handshakes. Red line is the deploy of that fix:

Those graphed metrics are directly out of net/http/httptrace.

One thing that's not clear to me at the moment is if those dialed connections are pooled as idle connections, because metrics on the connection pool are not readily accessible.

AlexanderYastrebov · 2024-03-22T14:20:46Z

I suspect that someone is relying on passing values down to the dial via the context

The transport itself uses it for client trace (but it could be propagated from transportRequest if necessary)

go/src/net/http/transport.go

Lines 1629 to 1639 in 27f41bb

    
           func (t *Transport) dialConn(ctx context.Context, cm connectMethod) (pconn *persistConn, err error) { 
        
           	pconn = &persistConn{ 
        
           		t:             t, 
        
           		cacheKey:      cm.key(), 
        
           		reqch:         make(chan requestAndChan, 1), 
        
           		writech:       make(chan writeRequest, 1), 
        
           		closech:       make(chan struct{}), 
        
           		writeErrCh:    make(chan error, 1), 
        
           		writeLoopDone: make(chan struct{}), 
        
           	} 
        
           	trace := httptrace.ContextClientTrace(ctx)

~~Also means that "ApplyFix" workaround likely breaks client tracing.~~ No, the fix replaces context after dialConn obtains client trace.

neild · 2024-04-01T23:12:25Z

Using context.WithoutCancel, it's fairly straightforward to change the transport dial to inherit the values from the request context, but not the cancelation. This seems like a decent balance between backwards compatibility (we still pass through values from the request context) and reasonable behavior (we don't cancel a dial just because a cached connection became available).

I have one question which I'm not sure on the answer to: How, if at all, should we bound the lifetime of dials? If a user creates a Transport, makes a request, and cancels the request, should the dial call continue indefinitely? Or should there be some way of cleaning up stale in-progress dials?

gopherbot · 2024-04-04T18:10:13Z

Change https://go.dev/cl/576555 mentions this issue: net/http: don't cancel Dials when requests are canceled

kassiansun · 2024-08-27T07:49:44Z

Maybe we can introduce an optional TLSDialTimeout to control the default timeout of TLS dialing process?

seankhliao added Performance NeedsInvestigation labels Mar 14, 2023

This comment was marked as spam.

# to view

neild self-assigned this Apr 6, 2023

neild added help wanted NeedsFix and removed NeedsInvestigation labels Apr 6, 2023

AlexanderYastrebov mentioned this issue Jul 22, 2023

net/http: http.Transport leaks context from the request #50798

Closed

This comment was marked as off-topic.

# to view

AlexanderYastrebov mentioned this issue Mar 21, 2024

net/http: dial should not be canceled if an idle connection is reused #66442

Closed

dmitshur added the FixPending label Apr 17, 2024

dmitshur added this to the Go1.23 milestone Apr 17, 2024

gopherbot closed this as completed in 334ce51 Apr 17, 2024

aarongable mentioned this issue Aug 22, 2024

Add go1.23.0 to CI letsencrypt/boulder#7665

Merged

mpoindexter mentioned this issue Aug 27, 2024

net/http: Transport wastes TLS handshake progress #50984

Open

neild mentioned this issue Feb 19, 2025

runtime: increased memory usage in 1.23 with AzCopy #71308

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

net/http: setting a timeout on http requests that use TLS can result in excessive requests to server #59017

net/http: setting a timeout on http requests that use TLS can result in excessive requests to server #59017

mpoindexter commented Mar 14, 2023 •

edited

Loading

seankhliao commented Mar 14, 2023

This comment was marked as spam.

mark4z commented Apr 4, 2023

neild commented Apr 6, 2023

This comment was marked as off-topic.

jarreds commented Jan 29, 2024 •

edited

Loading

AlexanderYastrebov commented Mar 22, 2024 •

edited

Loading

neild commented Apr 1, 2024

gopherbot commented Apr 4, 2024

kassiansun commented Aug 27, 2024

net/http: setting a timeout on http requests that use TLS can result in excessive requests to server #59017

net/http: setting a timeout on http requests that use TLS can result in excessive requests to server #59017

Comments

mpoindexter commented Mar 14, 2023 • edited Loading

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

seankhliao commented Mar 14, 2023

This comment was marked as spam.

mark4z commented Apr 4, 2023

neild commented Apr 6, 2023

This comment was marked as off-topic.

jarreds commented Jan 29, 2024 • edited Loading

AlexanderYastrebov commented Mar 22, 2024 • edited Loading

neild commented Apr 1, 2024

gopherbot commented Apr 4, 2024

kassiansun commented Aug 27, 2024

mpoindexter commented Mar 14, 2023 •

edited

Loading

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

jarreds commented Jan 29, 2024 •

edited

Loading

AlexanderYastrebov commented Mar 22, 2024 •

edited

Loading