Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Non deterministic behaviour of libp2p-kad-dht in TestStableNetworkRPC #3720

Closed
P1sar opened this issue Jan 22, 2024 · 2 comments · Fixed by #3756
Closed

Non deterministic behaviour of libp2p-kad-dht in TestStableNetworkRPC #3720

P1sar opened this issue Jan 22, 2024 · 2 comments · Fixed by #3756
Assignees
Labels
A-debug issue requires detective debug work to figure out what's going wrong. S-network issues related to the dot/network package. S-tests issue related to adding new tests. T-bug this issue covers unexpected and/or wrong behaviour.

Comments

@P1sar
Copy link
Member

P1sar commented Jan 22, 2024

Describe the bug

  • While fixing bug Flaky RPC test "TestStableNetworkRPC" #3616 another unexpected non deterministic behaviour was found. This happens much rarely than one mentioned before but never the less this might be a symptom of something wrong.
    Apparently related to lilbp2p DHT

Expected Behavior

  • Test TestStableNetworkRPC should deterministically work

Current Behavior

  • Test TestStableNetworkRP un deterministically fail with next error;

Possible Solution

  • It might be related to libp2p Kademilia

To Reproduce

Steps to reproduce the behaviour:

  1. run MODE=rpc go test ./tests/rpc/system_integration_test.go -v
  2. it might be necessary to run this several time since it is un deterministic problem

Log output

Log Output
=== RUN   TestStableNetworkRPC
  system_integration_test.go:53: Node alice-0 is ready
  system_integration_test.go:53: Node bob-1 is ready
  node.go:233: node alice-0 crashed: alice-0 encountered a runtime error: exit status 2
      command: /go/src/github.com/ChainSafe/gossamer/bin/gossamer --base-path /tmp/TestStableNetworkRPC3368486598/001 --chain /go/src/github.com/ChainSafe/gossamer/chain/westend-dev/westend-dev-spec-raw.json --role full --no-telemetry
      
      /tmp/TestStableNetworkRPC3368486598/001
      
      
      Logs:
      2024-01-22T12:23:07Z INFO     🕸️ initialising node with name Gossamer, id gssmr, base path /tmp/TestStableNetworkRPC3368486598/001 and chain-spec /tmp/TestStableNetworkRPC3368486598/001/chain-spec-raw.json...	pkg=dot
      2024-01-22T12:23:08Z INFO     block state hash genesis hash: 0x276bfa91f70859348285599321ea96afd3ae681f0be47d36196bac8075ea32e8	pkg=state
      2024-01-22T12:23:08Z INFO     node initialised with name Gossamer, id gssmr, base path /tmp/TestStableNetworkRPC3368486598/001, chain-spec /tmp/TestStableNetworkRPC3368486598/001/chain-spec-raw.json, block 0 and genesis hash 0x276bfa91f70859348285599321ea96afd3ae681f0be47d36196bac8075ea32e8	pkg=dot
      2024-01-22T12:23:08Z INFO     🕸️ initialising node services with global configuration name Gossamer, id gssmr and base path /tmp/TestStableNetworkRPC3368486598/001...	pkg=dot
      2024-01-22T12:23:08Z INFO     created state service with head 0x276bfa91f70859348285599321ea96afd3ae681f0be47d36196bac8075ea32e8, highest number 0 and genesis hash 0x276bfa91f70859348285599321ea96afd3ae681f0be47d36196bac8075ea32e8	pkg=state
      2024-01-22T12:23:08Z INFO     Generating p2p identity with seed 0 and key file /tmp/TestStableNetworkRPC3368486598/001/node.key	pkg=network
      2024-01-22T12:23:08Z WARN     Bootstrap is enabled but no bootstrap nodes are defined	pkg=network
      2024-01-22T12:23:11Z INFO     creating runtime with interpreter wazero...	pkg=dot
      2024-01-22T12:23:11Z INFO     registered notifications sub-protocol /276bfa91f70859348285599321ea96afd3ae681f0be47d36196bac8075ea32e8/grandpa/1	pkg=network
      2024-01-22T12:23:11Z INFO     creating BABE service as authority...	pkg=dot
      2024-01-22T12:23:11Z INFO     keystore with keys [0xc000443a50]	pkg=dot
      2024-01-22T12:23:11Z INFO     creating rpc service with host localhost, external=false, port 8540, modules system,author,chain,state,rpc,grandpa,offchain,childstate,syncstate,payment, ws port 8546 and ws external=true	pkg=dot
      2024-01-22T12:23:11Z INFO     starting node Gossamer...	pkg=cmd
      2024-01-22T12:23:11Z INFO     🕸️ starting node services...	pkg=dot
      2024-01-22T12:23:11Z INFO     Starting services: [*system.Service *network.Service *digest.Handler *core.Service *grandpa.Service *sync.Service *babe.Service *rpc.HTTPServer *state.Service]	pkg=dot,services
      2024-01-22T12:23:11Z INFO     registered notifications sub-protocol /dot/block-announces/1	pkg=network
      2024-01-22T12:23:11Z INFO     registered notifications sub-protocol /dot/transactions/1	pkg=network
      2024-01-22T12:23:11Z INFO     Started listening on /ip4/127.0.0.1/tcp/7004/p2p/12D3KooWRkzukwr3U9e4xcfTpMUfuejYPvXCESr8cf1tZ5xWyrTv	pkg=network
      2024-01-22T12:23:11Z INFO     Started listening on /ip4/65.109.106.232/tcp/7004/p2p/12D3KooWRkzukwr3U9e4xcfTpMUfuejYPvXCESr8cf1tZ5xWyrTv	pkg=network
      2024-01-22T12:23:12Z INFO     started network service with supported protocols /ipfs/ping/1.0.0, /ipfs/id/1.0.0, /ipfs/id/push/1.0.0, /276bfa91f70859348285599321ea96afd3ae681f0be47d36196bac8075ea32e8/grandpa/1, /dot/sync/2, /dot/light/2, /dot/block-announces/1, /dot/transactions/1	pkg=network
      2024-01-22T12:23:12Z INFO     Starting HTTP Server on host localhost and port 8540...	pkg=rpc
      2024-01-22T12:23:12Z INFO     Starting WebSocket Server on host localhost and port 8546...	pkg=rpc
      2024-01-22T12:23:12Z INFO     initiating epoch 0 with start slot 284321032	pkg=babe
      2024-01-22T12:23:12Z WARN     target=runtime::inclusion-inherent message=ParentBlockRandomness did not provide entropy	ext_logging_log_version_1	pkg=runtime module=wazero
      2024-01-22T12:23:12Z WARN     target=runtime::inclusion-inherent message=ParentBlockRandomness did not provide entropy	ext_logging_log_version_1	pkg=runtime module=wazero
      2024-01-22T12:23:12Z WARN     target=runtime::inclusion-inherent message=ParentBlockRandomness did not provide entropy	ext_logging_log_version_1	pkg=runtime module=wazero
      2024-01-22T12:23:16Z INFO     built block 1 with hash 0x6a00c9595bd6df4f42278a5596b385aa41bbcff070815d09e5200fbadf176062, state root 0x012fe83ae98751a3522bfbfe3a[271](https://github.com/ChainSafe/gossamer/actions/runs/7610987482/job/20725498644?pr=3718#step:4:272)d15206511bb12e4ce2389fd21ba763c2713, epoch 0 and slot [284](https://github.com/ChainSafe/gossamer/actions/runs/7610987482/job/20725498644?pr=3718#step:4:285)321032	pkg=babe
      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xf98afc]
      
      goroutine 1043 [running]:
      github.com/libp2p/go-libp2p-kad-dht/dual.(*DHT).FindPeer.func3()
      	github.com/libp2p/go-libp2p-kad-dht@v0.25.2/dual/dual.go:271 +0x7c
      created by github.com/libp2p/go-libp2p-kad-dht/dual.(*DHT).FindPeer in goroutine 451
      	github.com/libp2p/go-libp2p-kad-dht@v0.25.2/dual/dual.go:269 +0x38b
  system_integration_test.go:53: failed to start node charlie-2: failed waiting: node did not start: failed after 5 tries during 5s (context canceled)
  system_integration_test.go:53: Node bob-1 terminated
panic: test timed out after 10m0s
paste log output...

Specification

  • go version:
  • gossamer version:
  • gossamer commit tag:
  • gossamer commit hash:
  • operating system:
  • additional links:
@P1sar P1sar added T-bug this issue covers unexpected and/or wrong behaviour. S-network issues related to the dot/network package. A-debug issue requires detective debug work to figure out what's going wrong. S-tests issue related to adding new tests. labels Jan 22, 2024
@P1sar P1sar self-assigned this Jan 22, 2024
@P1sar
Copy link
Member Author

P1sar commented Jan 31, 2024

Based on my investigation this is what happening:
When Network services starts it is running peer discovery asynchronously here
There is not communication channel with this goroutine and pretty often during the test this peer discovery is not discovring any peers, so NetworkService.host.discovery.dht is nill which leads to nill pointer panic here. Apparently this happens when other peers know about the peer and send it messages.

Generally speaking I think that there is no obvious solutions for this problem. Some are:

  1. making sure that NetworkService.host.discovery.dht is always set to empty DHT this will prevent panic and reject message processing, although it is hard to tell the real consequences of not processing messages.
  2. Problem is somewhere deeper in DHT peer search and we should maybe solve this as root case.
  3. Maybe problem happening (make sense) only locally, one possible way to overcome it is to set list of bootnodes, since they all are starting locally we can hardcode list of bootnodes

Looking forward for your advice

Copy link

github-actions bot commented Mar 1, 2024

🎉 This issue has been resolved in version 0.9.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
A-debug issue requires detective debug work to figure out what's going wrong. S-network issues related to the dot/network package. S-tests issue related to adding new tests. T-bug this issue covers unexpected and/or wrong behaviour.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant