Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Flaky test: 6/100K: Test/LRSClient #7241

Closed
easwars opened this issue May 17, 2024 · 8 comments · Fixed by #7559
Closed

Flaky test: 6/100K: Test/LRSClient #7241

easwars opened this issue May 17, 2024 · 8 comments · Fixed by #7559
Assignees

Comments

@easwars
Copy link
Contributor

easwars commented May 17, 2024

--- FAIL: Test (3.59s)
    --- FAIL: Test/LRSClient (0.31s)
        tlogger.go:116: INFO server.go:685 [core] [Server #1]Server created  (t=+21.534687ms)
        tlogger.go:116: INFO client_new.go:83 [xds] [xds-client 0x4000458aa0] Created client to xDS management server: 127.0.0.1:34355-insecure-xds_v3  (t=+22.759344ms)
        tlogger.go:116: INFO clientconn.go:305 [core] [Channel #2]Channel created  (t=+29.67366ms)
        tlogger.go:116: INFO clientconn.go:1684 [core] [Channel #2]original dial target is: "127.0.0.1:34355"  (t=+29.960129ms)
        tlogger.go:116: INFO clientconn.go:1689 [core] [Channel #2]dial target "127.0.0.1:34355" parse failed: parse "127.0.0.1:34355": first path segment in URL cannot contain colon  (t=+30.745429ms)
        tlogger.go:116: INFO clientconn.go:1710 [core] [Channel #2]fallback to scheme "passthrough"  (t=+30.930598ms)
        tlogger.go:116: INFO clientconn.go:1718 [core] [Channel #2]parsed dial target is: passthrough:///127.0.0.1:34355  (t=+31.903631ms)
        tlogger.go:116: INFO clientconn.go:1841 [core] [Channel #2]Channel authority set to "127.0.0.1:34355"  (t=+32.499704ms)
        tlogger.go:116: INFO server.go:881 [core] [Server #1 ListenSocket #3]ListenSocket created  (t=+38.2307ms)
        tlogger.go:116: INFO resolver_wrapper.go:197 [core] [Channel #2]Resolver state updated: {
              "Addresses": [
                {
                  "Addr": "127.0.0.1:34355",
                  "ServerName": "",
                  "Attributes": null,
                  "BalancerAttributes": null,
                  "Metadata": null
                }
              ],
              "Endpoints": [
                {
                  "Addresses": [
                    {
                      "Addr": "127.0.0.1:34355",
                      "ServerName": "",
                      "Attributes": null,
                      "BalancerAttributes": null,
                      "Metadata": null
                    }
                  ],
                  "Attributes": null
                }
              ],
              "ServiceConfig": null,
              "Attributes": null
            } (resolver returned new addresses)  (t=+44.936513ms)
        tlogger.go:116: INFO balancer_wrapper.go:103 [core] [Channel #2]Channel switches to new LB policy "pick_first"  (t=+48.023638ms)
        tlogger.go:116: INFO clientconn.go:849 [core] [Channel #2 SubChannel #4]Subchannel created  (t=+50.178177ms)
        tlogger.go:116: INFO clientconn.go:535 [core] [Channel #2]Channel Connectivity change to CONNECTING  (t=+51.25743ms)
        tlogger.go:116: INFO clientconn.go:305 [core] [Channel #2]Channel exiting idle mode  (t=+53.929595ms)
        tlogger.go:116: INFO transport.go:238 [xds] [xds-client 0x4000458aa0] [127.0.0.1:34355] Created transport to server "127.0.0.1:34355"  (t=+54.756943ms)
        tlogger.go:116: INFO clientconn.go:1210 [core] [Channel #2 SubChannel #4]Subchannel Connectivity change to CONNECTING  (t=+61.40104ms)
        tlogger.go:116: INFO clientconn.go:1326 [core] [Channel #2 SubChannel #4]Subchannel picks a new address "127.0.0.1:34355" to connect  (t=+62.048319ms)
        tlogger.go:116: INFO clientconn.go:1210 [core] [Channel #2 SubChannel #4]Subchannel Connectivity change to READY  (t=+86.290458ms)
        tlogger.go:116: INFO clientconn.go:535 [core] [Channel #2]Channel Connectivity change to READY  (t=+86.699138ms)
        tlogger.go:116: INFO loadreport.go:115 [xds] [xds-client 0x4000458aa0] [127.0.0.1:34355] Created LRS stream to server "127.0.0.1:34355"  (t=+89.240095ms)
        tlogger.go:116: INFO server.go:685 [core] [Server #7]Server created  (t=+95.62849ms)
        tlogger.go:116: INFO clientconn.go:305 [core] [Channel #8]Channel created  (t=+96.048701ms)
        tlogger.go:116: INFO clientconn.go:1684 [core] [Channel #8]original dial target is: "127.0.0.1:39569"  (t=+96.190278ms)
        tlogger.go:116: INFO clientconn.go:1689 [core] [Channel #8]dial target "127.0.0.1:39569" parse failed: parse "127.0.0.1:39569": first path segment in URL cannot contain colon  (t=+96.323208ms)
        tlogger.go:116: INFO clientconn.go:1710 [core] [Channel #8]fallback to scheme "passthrough"  (t=+96.440329ms)
        tlogger.go:116: INFO clientconn.go:1718 [core] [Channel #8]parsed dial target is: passthrough:///127.0.0.1:39569  (t=+96.575033ms)
        tlogger.go:116: INFO clientconn.go:1841 [core] [Channel #8]Channel authority set to "127.0.0.1:39569"  (t=+96.687054ms)
        tlogger.go:116: INFO transport.go:337 [xds] [xds-client 0x4000458aa0] [127.0.0.1:34355] ADS stream created  (t=+100.191736ms)
        tlogger.go:116: INFO server.go:881 [core] [Server #7 ListenSocket #9]ListenSocket created  (t=+100.993606ms)
        tlogger.go:116: INFO resolver_wrapper.go:197 [core] [Channel #8]Resolver state updated: {
              "Addresses": [
                {
                  "Addr": "127.0.0.1:39569",
                  "ServerName": "",
                  "Attributes": null,
                  "BalancerAttributes": null,
                  "Metadata": null
                }
              ],
              "Endpoints": [
                {
                  "Addresses": [
                    {
                      "Addr": "127.0.0.1:39569",
                      "ServerName": "",
                      "Attributes": null,
                      "BalancerAttributes": null,
                      "Metadata": null
                    }
                  ],
                  "Attributes": null
                }
              ],
              "ServiceConfig": null,
              "Attributes": null
            } (resolver returned new addresses)  (t=+101.514879ms)
        tlogger.go:116: INFO balancer_wrapper.go:103 [core] [Channel #8]Channel switches to new LB policy "pick_first"  (t=+101.803462ms)
        tlogger.go:116: INFO clientconn.go:849 [core] [Channel #8 SubChannel #10]Subchannel created  (t=+102.241828ms)
        tlogger.go:116: INFO clientconn.go:535 [core] [Channel #8]Channel Connectivity change to CONNECTING  (t=+102.45511ms)
        tlogger.go:116: INFO clientconn.go:305 [core] [Channel #8]Channel exiting idle mode  (t=+102.675185ms)
        tlogger.go:116: INFO transport.go:238 [xds] [xds-client 0x4000458aa0] [127.0.0.1:39569] Created transport to server "127.0.0.1:39569"  (t=+103.031776ms)
        tlogger.go:116: INFO clientconn.go:1210 [core] [Channel #8 SubChannel #10]Subchannel Connectivity change to CONNECTING  (t=+103.614374ms)
        tlogger.go:116: INFO clientconn.go:1326 [core] [Channel #8 SubChannel #10]Subchannel picks a new address "127.0.0.1:39569" to connect  (t=+103.742445ms)
        tlogger.go:116: INFO clientconn.go:1210 [core] [Channel #8 SubChannel #10]Subchannel Connectivity change to READY  (t=+118.145104ms)
        tlogger.go:116: INFO clientconn.go:535 [core] [Channel #8]Channel Connectivity change to READY  (t=+118.309694ms)
        tlogger.go:116: INFO loadreport.go:115 [xds] [xds-client 0x4000458aa0] [127.0.0.1:39569] Created LRS stream to server "127.0.0.1:39569"  (t=+118.522004ms)
        tlogger.go:116: INFO transport.go:337 [xds] [xds-client 0x4000458aa0] [127.0.0.1:39569] ADS stream created  (t=+119.393746ms)
        tlogger.go:116: INFO loadreport.go:86 [xds] [xds-client 0x4000458aa0] [127.0.0.1:39569] Stopping LRS stream  (t=+228.638594ms)
        tlogger.go:116: WARNING transport.go:479 [xds] [xds-client 0x4000458aa0] [127.0.0.1:39569] ADS stream closed: rpc error: code = Canceled desc = context canceled  (t=+232.482204ms)
        tlogger.go:116: INFO clientconn.go:535 [core] [Channel #8]Channel Connectivity change to SHUTDOWN  (t=+233.124013ms)
        tlogger.go:116: INFO resolver_wrapper.go:100 [core] [Channel #8]Closing the name resolver  (t=+233.361571ms)
        tlogger.go:116: INFO balancer_wrapper.go:135 [core] [Channel #8]ccBalancerWrapper: closing  (t=+233.662879ms)
        tlogger.go:116: INFO clientconn.go:1210 [core] [Channel #8 SubChannel #10]Subchannel Connectivity change to SHUTDOWN  (t=+234.526215ms)
        tlogger.go:116: INFO clientconn.go:1557 [core] [Channel #8 SubChannel #10]Subchannel deleted  (t=+234.780193ms)
        tlogger.go:116: INFO clientconn.go:305 [core] [Channel #8]Channel deleted  (t=+238.871431ms)
        loadreport_test.go:133: unexpected LRS request: &{ <nil>}, <nil>, want error canceled
        tlogger.go:116: INFO server.go:817 [core] [Server #7 ListenSocket #9]ListenSocket deleted  (t=+244.793196ms)
        tlogger.go:116: INFO loadreport.go:86 [xds] [xds-client 0x4000458aa0] [127.0.0.1:34355] Stopping LRS stream  (t=+246.060213ms)
        tlogger.go:116: WARNING loadreport.go:124 [xds] [xds-client 0x4000458aa0] [127.0.0.1:34355] Reading from LRS stream failed: failed to receive first LoadStatsResponse: rpc error: code = Canceled desc = context canceled  (t=+246.42993ms)
        tlogger.go:116: WARNING transport.go:479 [xds] [xds-client 0x4000458aa0] [127.0.0.1:34355] ADS stream closed: rpc error: code = Canceled desc = context canceled  (t=+246.845854ms)
        tlogger.go:116: INFO clientconn.go:535 [core] [Channel #2]Channel Connectivity change to SHUTDOWN  (t=+246.989184ms)
        tlogger.go:116: INFO resolver_wrapper.go:100 [core] [Channel #2]Closing the name resolver  (t=+247.109821ms)
        tlogger.go:116: INFO balancer_wrapper.go:135 [core] [Channel #2]ccBalancerWrapper: closing  (t=+247.23124ms)
        tlogger.go:116: INFO clientconn.go:1210 [core] [Channel #2 SubChannel #4]Subchannel Connectivity change to SHUTDOWN  (t=+247.48077ms)
        tlogger.go:116: INFO clientconn.go:1557 [core] [Channel #2 SubChannel #4]Subchannel deleted  (t=+247.610083ms)
        tlogger.go:116: INFO clientconn.go:305 [core] [Channel #2]Channel deleted  (t=+248.222988ms)
        tlogger.go:116: INFO clientimpl.go:100 [xds] [xds-client 0x4000458aa0] Shutdown  (t=+248.380284ms)
        tlogger.go:116: INFO server.go:817 [core] [Server #1 ListenSocket #3]ListenSocket deleted  (t=+248.595419ms)
FAIL
FAIL	google.golang.org/grpc/xds/internal/xdsclient	7.155s

https://github.com/grpc/grpc-go/actions/runs/9134961110/job/25121494558?pr=7240

@easwars
Copy link
Contributor Author

easwars commented May 17, 2024

No failures in 10K runs on Forge.

@easwars easwars changed the title Flaky test: Test/LRSClient Flaky test: 6/100K: Test/LRSClient May 17, 2024
@purnesh42H purnesh42H self-assigned this May 20, 2024
@easwars
Copy link
Contributor Author

easwars commented May 22, 2024

https://github.com/grpc/grpc-go/actions/runs/9134961110/job/25121494558?pr=7240

@easwars
Copy link
Contributor Author

easwars commented Aug 14, 2024

@arvindbr8
Copy link
Member

@dfawley
Copy link
Member

dfawley commented Aug 21, 2024

Happened to me today. Maybe this is more common now for some reason?

@dfawley dfawley removed the P3 label Aug 21, 2024
@arjan-bal
Copy link
Contributor

I suspect what's happening here is that that we're sometimes getting a second load report before the client closes the stream. We're setting the load report interval to 50 ms here:

Resp: &v3lrspb.LoadStatsResponse{
SendAllClusters: true,
LoadReportingInterval: &durationpb.Duration{Nanos: 50000000},
},

In the final assertion of the test, we're checking that the very next request present in the fake server's request channel is the client cancellation:

if u, err := fs2.LRSRequestChan.Receive(ctx); err != nil || status.Code(u.(*fakeserver.Request).Err) != codes.Canceled {
t.Errorf("unexpected LRS request: %v, %v, want error canceled", u, err)
}

When the the test takes more than 50 ms b/w sending the first LRS response with the load report interval and this final assertion, another load report can be sent by the client here:

func (t *Transport) sendLoads(ctx context.Context, stream lrsStream, clusterNames []string, interval time.Duration) {
tick := time.NewTicker(interval)
defer tick.Stop()
for {
select {
case <-tick.C:
case <-ctx.Done():
return
}
if err := t.sendLoadStatsRequest(stream, t.lrsStore.Stats(clusterNames)); err != nil {
t.logger.Warningf("Writing to LRS stream failed: %v", err)
return
}
}
}

I was able to make the test fail with a similar error message (loadreport_test.go:139: unexpected LRS request: &{ <nil>}, <nil>, want error canceled) by adding a time.Sleep(100 * time.Millisecond) just after receiving the first LRS request on the server here:

u, err := fs2.LRSRequestChan.Receive(ctx)
if err != nil {
t.Fatalf("unexpected LRS request: %v, %v, want error canceled", u, err)
}

A possible fix would be to read from fs2.LRSRequestChan.Receive until we get an error due to context expiration or see the expected client context cancellation response.

	// Server should receive a stream canceled error.
	for {
		u, err := fs2.LRSRequestChan.Receive(ctx)
		if err != nil {
			t.Fatalf("unexpected error while reading LRS request: %v", err)
		}
		if u.(*fakeserver.Request).Err == nil {
			continue
		}
		if status.Code(u.(*fakeserver.Request).Err) != codes.Canceled {
			t.Errorf("unexpected LRS request: %v, want error canceled", u)
		}
		break
	}

@arvindbr8
Copy link
Member

@arjan-bal
Copy link
Contributor

Verified 0 failures in 100k runs after the fix.

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants