Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Client balancer with linearized request does not work under network partition #8660

Closed
gyuho opened this issue Oct 6, 2017 · 1 comment · Fixed by #8669
Closed

Client balancer with linearized request does not work under network partition #8660

gyuho opened this issue Oct 6, 2017 · 1 comment · Fixed by #8669
Assignees
Milestone

Comments

@gyuho
Copy link
Contributor

gyuho commented Oct 6, 2017

Problem

With current master branch with health balancer(#8545):

  1. Set up 3-node etcd cluster: A, B, and C
  2. Client D is sending linearizable request to A,B,C
    • quorum get, mutable operations: put, txn, delete
    • keep sending requests in for-loop [1]
  3. Client balancer pins endpoint A, for instance
  4. Now, A gets network-partitioned
    • disconnected to/from B and C, but not to/from client D
     sudo iptables -A OUTPUT -p tcp --destination-port PEER_PORT -j DROP
     sudo iptables -A INPUT -p tcp --destination-port PEER_PORT -j DROP
    
  5. Client D errors

    clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded
    INFO: 2017/10/06 20:02:32 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
    INFO: 2017/10/06 20:02:32 clientv3/retry: retry for error rpc error: code = Unavailable desc = transport is closing
    INFO: 2017/10/06 20:02:32 clientv3/balancer: unpin A:2379 (grpc: failed with network I/O error)
    INFO: 2017/10/06 20:02:32 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = transport is closing

  6. Expect unpinned endpoint A is gray-listed, and then A does not get retried
  7. What is actually happening
    a. client D is not isolated from A; healthpb server is still serving
    b. thus, A is never gray-listed
    c. client D balancer retries endpoint A
    d. repeats step 5
    e. stuck with endpoint A (no endpoint switch)

See full logs [2].

Solution

Gray-list pinned endpoint on context.DeadlineExceeded error (in retry code path).

Workaround

Pass cli.Put(clientv3.WithRequireLeader(ctx), "foo", "bar") or cli.Get(clientv3.WithRequireLeader(ctx), "foo") makes etcd server return rpc error: code = Unavailable desc = etcdserver: no leader to client, triggering balancer to gray-list the isolated endpoint and switch to the other.

Reference

Directly related


[1] Click to expand

clientv3.SetLogger(grpclog.NewLoggerV2WithVerbosity(os.Stderr, os.Stderr, os.Stderr, 4))

cli, err := clientv3.New(clientv3.Config{
Endpoints: eps,
DialKeepAliveTime: 2 * time.Second,
DialKeepAliveTimeout: 2 * time.Second,
})
if err != nil {
panic(err)
}
defer cli.Close()

time.Sleep(3 * time.Second)

fmt.Println("CREATING GET!")
println()

for {
fmt.Println("READING...")
time.Sleep(5 * time.Second)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
_, err = cli.Get(ctx, "foo")
cancel()
if err != nil {
fmt.Println(err)
} else {
fmt.Println("success!")
}
time.Sleep(3 * time.Second)
}

[2] Click to expand

PUT

INFO: 2017/10/06 20:01:31 clientv3/balancer: pin 10.138.0.21:2379
INFO: 2017/10/06 20:01:31 clientv3/balancer: 10.138.0.22:2379 is up but not pinned (already pinned 10.138.0.21:2379)
INFO: 2017/10/06 20:01:31 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.21:2379)
CREATING PUT!

WRITING...
success!
WRITING...
success!
WRITING...
success!
WRITING...
success!
WRITING...
success!
WRITING...
INFO: 2017/10/06 20:02:24 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:02:24 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:02:32 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2017/10/06 20:02:32 clientv3/retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
INFO: 2017/10/06 20:02:32 clientv3/balancer: unpin 10.138.0.21:2379 (grpc: failed with network I/O error)
INFO: 2017/10/06 20:02:32 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
rpc error: code = Unavailable desc = transport is closing
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
INFO: 2017/10/06 20:02:32 get error from resetTransport grpc: the connection is closing, transportMonitor returning
INFO: 2017/10/06 20:02:32 clientv3/balancer: 10.138.0.21:2379 is healthy
INFO: 2017/10/06 20:02:32 clientv3/balancer: pin 10.138.0.21:2379
INFO: 2017/10/06 20:02:32 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.21:2379)
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:02:32 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WRITING...
INFO: 2017/10/06 20:02:45 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:02:45 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:02:58 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:02:58 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:11 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:11 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:19 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2017/10/06 20:03:19 clientv3/balancer: unpin 10.138.0.21:2379 (grpc: failed with network I/O error)
INFO: 2017/10/06 20:03:19 clientv3/retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
INFO: 2017/10/06 20:03:19 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = transport is closing (10.138.0.21:2379)
rpc error: code = Unavailable desc = transport is closing
INFO: 2017/10/06 20:03:19 get error from resetTransport grpc: the connection is closing, transportMonitor returning
INFO: 2017/10/06 20:03:19 clientv3/balancer: 10.138.0.21:2379 is healthy
INFO: 2017/10/06 20:03:19 clientv3/balancer: pin 10.138.0.21:2379
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:03:19 Failed to dial 10.138.0.23:2379: grpc: the connection is closing; please retry.
WRITING...
INFO: 2017/10/06 20:03:32 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:32 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:45 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:45 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...
INFO: 2017/10/06 20:03:58 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
INFO: 2017/10/06 20:03:58 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.21:2379)
context deadline exceeded
WRITING...

GET

INFO: 2017/10/06 20:10:24 clientv3/balancer: pin 10.138.0.23:2379
INFO: 2017/10/06 20:10:24 clientv3/balancer: 10.138.0.21:2379 is up but not pinned (already pinned 10.138.0.23:2379)
WARNING: 2017/10/06 20:10:24 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:10:24 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
CREATING GET!

READING...
success!
READING...
success!
READING...
success!
READING...
INFO: 2017/10/06 20:11:01 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
INFO: 2017/10/06 20:11:01 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
context deadline exceeded
READING...
INFO: 2017/10/06 20:11:09 Client received GoAway with http2.ErrCodeEnhanceYourCalm.
INFO: 2017/10/06 20:11:09 clientv3/balancer: unpin 10.138.0.23:2379 (grpc: failed with network I/O error)
INFO: 2017/10/06 20:11:09 clientv3/balancer: 10.138.0.23:2379 is healthy
INFO: 2017/10/06 20:11:09 clientv3/balancer: pin 10.138.0.23:2379
WARNING: 2017/10/06 20:11:09 Failed to dial 10.138.0.21:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:11:09 Failed to dial 10.138.0.21:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:11:09 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:11:09 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
INFO: 2017/10/06 20:11:14 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
INFO: 2017/10/06 20:11:14 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
context deadline exceeded
READING...
INFO: 2017/10/06 20:11:27 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
INFO: 2017/10/06 20:11:27 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
context deadline exceeded
READING...
INFO: 2017/10/06 20:11:40 clientv3/retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
INFO: 2017/10/06 20:11:40 clientv3/auth-retry: retry for error rpc error: code = DeadlineExceeded desc = context deadline exceeded (10.138.0.23:2379)
context deadline exceeded

GET with require leader

INFO: 2017/10/06 20:17:38 clientv3/balancer: pin 10.138.0.21:2379
INFO: 2017/10/06 20:17:38 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.21:2379)
WARNING: 2017/10/06 20:17:38 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
WARNING: 2017/10/06 20:17:38 Failed to dial 10.138.0.22:2379: grpc: the connection is closing; please retry.
CREATING GET!

READING...
success!
READING...
success!
READING...
success!
READING...
INFO: 2017/10/06 20:18:10 clientv3/retry: retry for error rpc error: code = Unavailable desc = etcdserver: no leader (10.138.0.21:2379)
INFO: 2017/10/06 20:18:10 clientv3/balancer: unpin 10.138.0.21:2379 (grpc: the connection is drained)
INFO: 2017/10/06 20:18:10 clientv3/auth-retry: retry for error rpc error: code = Unavailable desc = etcdserver: no leader (10.138.0.21:2379)
INFO: 2017/10/06 20:18:10 clientv3/balancer: pin 10.138.0.22:2379
INFO: 2017/10/06 20:18:10 clientv3/balancer: 10.138.0.23:2379 is up but not pinned (already pinned 10.138.0.22:2379)
INFO: 2017/10/06 20:18:10 clientv3/balancer: 10.138.0.21:2379 is healthy
INFO: 2017/10/06 20:18:10 clientv3/balancer: 10.138.0.21:2379 is up but not pinned (already pinned 10.138.0.22:2379)
success!
READING...

@xiang90
Copy link
Contributor

xiang90 commented Oct 7, 2017

good analysis. thanks!

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

2 participants