Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Race condition in function clusterNodes.Addrs() #3218

Open
shawnwgit opened this issue Jan 5, 2025 · 0 comments · May be fixed by #3219
Open

Race condition in function clusterNodes.Addrs() #3218

shawnwgit opened this issue Jan 5, 2025 · 0 comments · May be fixed by #3219

Comments

@shawnwgit
Copy link

A race condition exists in the redis cluster client involving c.activeAddrs(a slice that stores redis node addresses). Specifically, the cluster client’s GC function writes to c.activeAddrs, while concurrently, callers may invoke the Addrs function to read from it.

Write path

Function GC overwrites the activeAddrs from its beginning.

// GC removes unused nodes.
func (c *clusterNodes) GC(generation uint32) {
	//nolint:prealloc
	var collected []*clusterNode

	c.mu.Lock()

	c.activeAddrs = c.activeAddrs[:0] // <-------------
	for addr, node := range c.nodes {
		if node.Generation() >= generation {
			c.activeAddrs = append(c.activeAddrs, addr) // <-------------
			if c.opt.RouteByLatency {
				go node.updateLatency()
			}
			continue
		}

		delete(c.nodes, addr)
		collected = append(collected, node)
	}

	c.mu.Unlock()

	for _, node := range collected {
		_ = node.Client.Close()
	}
}

Read path

The function Addrs() returns a reference to a string slice. The returned addrs references either c.activeAddrs or c.addrs from the original object.

func (c *clusterNodes) Addrs() ([]string, error) {
	var addrs []string

	c.mu.RLock()
	closed := c.closed //nolint:ifshort
	if !closed {
		if len(c.activeAddrs) > 0 {
			addrs = c.activeAddrs // <-------------
		} else {
			addrs = c.addrs // <-------------
		}
	}
	c.mu.RUnlock()

	if closed {
		return nil, pool.ErrClosed
	}
	if len(addrs) == 0 {
		return nil, errClusterNoNodes
	}
	return addrs, nil // <-------------
}

Expected Behavior

There is no race condition between the "addrs" consumers and the GC function.

Current Behavior

WARNING: DATA RACE
Write at 0x00c00068a5d0 by goroutine 248:
  github.com/redis/go-redis/v9.(*clusterNodes).GC()
      /home/user/work/go-redis-shawnw/osscluster.go:521 +0x284
  github.com/redis/go-redis/v9.newClusterState.func1()
      /home/user/work/go-redis-shawnw/osscluster.go:690 +0x4c

Previous read at 0x00c00068a5d0 by goroutine 247:
  github.com/redis/go-redis/v9.(*ClusterClient).loadState()
      /home/user/work/go-redis-shawnw/osscluster.go:1221 +0x2be
  github.com/redis/go-redis/v9.(*ClusterClient).loadState-fm()
      <autogenerated>:1 +0x47
  github.com/redis/go-redis/v9.(*clusterStateHolder).Reload()
      /home/user/work/go-redis-shawnw/osscluster.go:848 +0x50
  github.com/redis/go-redis/v9.(*clusterStateHolder).LazyReload.func1()
      /home/user/work/go-redis-shawnw/osscluster.go:863 +0x8f

Possible Solution

Return a new copy of the addr slice instead of a reference to the original slice.

Steps to Reproduce

  1. Apply the attached patch: repro.patch
  2. Run the new test in the patch: go test -race -v --timeout 1m -run "TestGinkgoSuite"

Context (Environment)

The race condition has been detected in the real Redis cluster and is also reproducible with the above steps.

Possible Implementation

One possible solution is to modify the func (c *clusterNodes) Addrs() ([]string, error) function to return a copied slice instead of a reference.

func (c *clusterNodes) Addrs() ([]string, error) {
...
	if !closed {
		if len(c.activeAddrs) > 0 {
			addrs = c.activeAddrs  // make a copy
		} else {
			addrs = c.addrs // make a copy
		}
	}

The fix might slightly increase the runtime of the Addrs() function, as copying the original slice generally takes more time than creating a simple reference. However, this impact appears to be negligible. Details are as follows:

  • Benchmarking on an "AMD EPYC 7B13" host shows that copying a slice of 500 addresses takes approximately 2.5 microseconds.
  • The existing callers of the Addr() function, such as ClusterClient.loadState, clusterNodes.Random, and ClusterClient.cmdsInfo, are not on the perf critical path. Therefore, adding a few microseconds is unlikely to have any significant impact.
@shawnwgit shawnwgit linked a pull request Jan 5, 2025 that will close this issue
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant