This repository has been archived by the owner on May 3, 2024. It is now read-only.

Cortx-33875: Confd containers restart during cluster restart #2078

Merged: 5 commits, Aug 17, 2022

nkommuri

Problem Statement

  • Cortx-33875: During cluster restart, the confd container hits an assert and restarts.

Design

  • During startup, confd tries to establish a connection with every other service in the cluster. While establishing a session with a remote service, m0_rpc_item_send() observed the M0_NC_FAILED state for that service in the conf cache and called m0_rpc_item_failed(), which eventually called m0_rpc_session_establish_reply_received().
    m0_rpc_session_establish_reply_received() expects the session to be in the M0_RPC_SESSION_ESTABLISHING state, but at that point it is still in M0_RPC_SESSION_INITIALISED, hence the assert. The session is moved to M0_RPC_SESSION_ESTABLISHING only after the m0_rpc__fop_post() call, and the assert was hit before that call completed.

The session can therefore still be in the M0_RPC_SESSION_INITIALISED state when m0_rpc_session_establish_reply_received() runs, because of the M0_NC_FAILED state in the conf cache. Motr should handle this situation instead of asserting.
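
The ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Motr code: every `toy_*` name, the `peer_marked_failed` flag, and the choice of `-ECANCELED` and `-EINVAL` as error values are hypothetical and exist only for illustration. It shows a reply handler that logs a state mismatch and fails the session rather than asserting.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy states; they echo the names in the description but are not Motr's. */
enum toy_session_state {
	TOY_SESSION_INITIALISED,
	TOY_SESSION_ESTABLISHING,
	TOY_SESSION_IDLE,
	TOY_SESSION_FAILED,
};

struct toy_session {
	enum toy_session_state ts_state;
	int                    ts_rc;    /* error recorded on failure */
};

/*
 * Hypothetical reply handler. Instead of asserting that the session is
 * ESTABLISHING, it logs the mismatch and fails the session.
 */
static void toy_establish_reply_received(struct toy_session *s, int rc)
{
	if (s->ts_state != TOY_SESSION_ESTABLISHING)
		fprintf(stderr, "session state is %d, expected %d\n",
			(int)s->ts_state, (int)TOY_SESSION_ESTABLISHING);

	if (rc == 0 && s->ts_state == TOY_SESSION_ESTABLISHING) {
		s->ts_state = TOY_SESSION_IDLE;
	} else {
		/* Assumption: a mismatch with rc == 0 is reported as -EINVAL. */
		s->ts_rc = rc != 0 ? rc : -EINVAL;
		s->ts_state = TOY_SESSION_FAILED;
	}
}

/*
 * Hypothetical post step. When the peer is already marked failed, the
 * item is failed at once, so the reply handler runs before the caller
 * has had a chance to move the session to ESTABLISHING.
 */
static int toy_fop_post(struct toy_session *s, bool peer_marked_failed)
{
	/* In this model, posting is only done on a fresh session. */
	assert(s->ts_state == TOY_SESSION_INITIALISED);

	if (peer_marked_failed) {
		toy_establish_reply_received(s, -ECANCELED);
		return -ECANCELED;
	}
	return 0;
}
```

In this model, calling `toy_fop_post()` with `peer_marked_failed` set on a session in `TOY_SESSION_INITIALISED` leaves the session in `TOY_SESSION_FAILED` with `ts_rc` equal to `-ECANCELED`, and no assert fires in the reply handler.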

Coding

Checklist for Author

  • Coding conventions are followed and code is consistent

Testing

Checklist for Author

  • Unit and System Tests are added
  • Test Cases cover Happy Path, Non-Happy Path and Scalability
  • Testing was performed with RPM

Impact Analysis

Checklist for Author/Reviewer/GateKeeper

  • Interface change (if any) are documented
  • Side effects on other features (deployment/upgrade)
  • Dependencies on other component(s)

Review Checklist

Checklist for Author

  • JIRA number/GitHub Issue added to PR
  • PR is self reviewed
  • Jira and state/status is updated and JIRA is updated with PR link
  • Check if the description is clear and explained

Documentation

Checklist for Author

  • Changes done to WIKI / Confluence page / Quick Start Guide

Naga Kishore Kommuri added 2 commits August 10, 2022 03:25
Issue: During session establish reply, we expect session state to
be M0_RPC_SESSION_ESTABLISHING and assert otherwise. But, during
m0_rpc_post(), if we decide to cancel the session based on confc obj
status, then session state will still be in M0_RPC_SESSION_INITIALISED.
Converted assert into debug log msg. Session state will be moved to
M0_RPC_SESSION_FAILED, in such case.

Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
No need to call session_failed() if m0_rpc__fop_post() returns failure
as it would have been already called reply received function.

Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>

cla-bot bot commented Aug 16, 2022

Thanks for your contribution!
The CLA bot has flagged your contribution as not having a Contributor License Agreement
in place. Note that this is not needed in the overwhelming majority of instances and this warning will usually be ignored.
The code reviewers will make a determination and may ask you to sign a CLA or may choose to ignore this warning.
More information about this can be found here.

Converted DEBUG msg to ERROR and logging only if session's
state doesn't match with the expected state.

Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>

@rkothiya
Contributor

retest this please


@rkothiya
Contributor

Jenkins CI Result : Motr#1594

Motr Test Summary

Test Result: Count

❌ Failed: 2

04motr-single-node/49motr-rpc-cancel
01motr-single-node/00userspace-tests

🏁 Skipped: 32

01motr-single-node/28sys-kvs
01motr-single-node/35m0singlenode
01motr-single-node/04initscripts
01motr-single-node/37protocol
02motr-single-node/51kem
02motr-single-node/20rpc-session-cancel
02motr-single-node/10pver-assign
02motr-single-node/21fsync-single-node
02motr-single-node/13dgmode-io
02motr-single-node/14poolmach
02motr-single-node/11m0t1fs
02motr-single-node/26motr-user-kernel-tests
02motr-single-node/08spiel
03motr-single-node/06conf
03motr-single-node/36spare-reservation
04motr-single-node/34sns-repair-1n-1f
04motr-single-node/08spiel-sns-repair-quiesce
04motr-single-node/28sys-kvs-kernel
04motr-single-node/11m0t1fs-rconfc-fail
04motr-single-node/08spiel-sns-repair
04motr-single-node/19sns-repair-abort
04motr-single-node/22sns-repair-ios-fail
05motr-single-node/18sns-repair-quiesce
05motr-single-node/12fwait
05motr-single-node/16sns-repair-multi
05motr-single-node/07mount-fail
05motr-single-node/15sns-repair-single
05motr-single-node/23sns-abort-quiesce
05motr-single-node/17sns-repair-concurrent-io
05motr-single-node/07mount
05motr-single-node/07mount-multiple
05motr-single-node/12fsync

✔️ Passed: 41

01motr-single-node/43m0crate
01motr-single-node/05confgen
01motr-single-node/06hagen
01motr-single-node/52motr-singlenode-sanity
01motr-single-node/01net
01motr-single-node/01kernel-tests
01motr-single-node/03console
01motr-single-node/02rpcping
02motr-single-node/07m0d-fatal
02motr-single-node/67fdmi-plugin-multi-filters
02motr-single-node/53clusterusage-alert
02motr-single-node/41motr-conf-update
03motr-single-node/61sns-repair-motr-1n-1f
03motr-single-node/72spiel-sns-motr-repair-quiesce
03motr-single-node/08spiel-multi-confd
03motr-single-node/69sns-repair-motr-quiesce
03motr-single-node/62sns-repair-motr-mf
03motr-single-node/70sns-failure-after-repair-quiesce
03motr-single-node/63sns-repair-motr-1k-1f
03motr-single-node/60sns-repair-motr-1f
03motr-single-node/66sns-repair-motr-abort-quiesce
03motr-single-node/24motr-dix-repair-lookup-insert-spiel
03motr-single-node/68sns-repair-motr-shutdown
03motr-single-node/64sns-repair-motr-ios-fail
03motr-single-node/71spiel-sns-motr-repair
03motr-single-node/24motr-dix-repair-lookup-insert-m0repair
03motr-single-node/04sss
03motr-single-node/65sns-repair-motr-abort
04motr-single-node/48motr-raid0-io
04motr-single-node/25m0kv
04motr-single-node/44motr-rm-lock-cc-io
04motr-single-node/45motr-rmw
05motr-single-node/23dix-repair-m0repair
05motr-single-node/43motr-sync-replication
05motr-single-node/42motr-utils
05motr-single-node/45motr-sns-repair-N-1
05motr-single-node/40motr-dgmode
05motr-single-node/23dix-repair-quiesce-m0repair
05motr-single-node/23spiel-dix-repair-quiesce
05motr-single-node/44motr-sns-repair
05motr-single-node/23spiel-dix-repair

Total: 75

CppCheck Summary

   Cppcheck: No new warnings found 👍


@rkothiya changed the title from "33875" to "Cortx-33875: Confd containers restart during cluster restart" on Aug 17, 2022
@rkothiya merged commit b65638c into Seagate:main on Aug 17, 2022
kiwionly2 pushed a commit to kiwionly2/cortx-motr that referenced this pull request Aug 30, 2022
…#2078)

Problem: 
Confd containers restart during cluster restart

Solution : 
During session establish reply, we expect session state to
be M0_RPC_SESSION_ESTABLISHING and assert otherwise. But, during
m0_rpc_post(), if we decide to cancel the session based on confc obj
status, then session state will still be in M0_RPC_SESSION_INITIALISED.
Converted assert into debug log msg. Session state will be moved to
M0_RPC_SESSION_FAILED, in such case.

No need to call session_failed() if m0_rpc__fop_post() returns failure
as it would have been already called reply received function.

Converted DEBUG msg to ERROR and logging only if session's
state doesn't match with the expected state.

Signed-off-by: Naga Kishore Kommuri <nagakishore.kommuri@seagate.com>
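
The second point in the commit message, that the caller need not fail the session again when the post step returns an error, can be sketched as follows. This is a minimal illustration under assumptions, not the merged change: all `demo_*` names, the `peer_failed` flag, the `ds_fail_calls` counter, and the use of `-ECANCELED` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

enum demo_state { DEMO_INITIALISED, DEMO_ESTABLISHING, DEMO_FAILED };

struct demo_session {
	enum demo_state ds_state;
	int             ds_rc;         /* error recorded on failure */
	int             ds_fail_calls; /* times the failure transition ran */
};

/* Hypothetical failure transition; counts how often it is invoked. */
static void demo_session_failed(struct demo_session *s, int rc)
{
	s->ds_state = DEMO_FAILED;
	s->ds_rc = rc;
	s->ds_fail_calls++;
}

/* Logs at error level only when the state is not the expected one. */
static void demo_reply_received(struct demo_session *s, int rc)
{
	if (s->ds_state != DEMO_ESTABLISHING)
		fprintf(stderr, "ERROR: session state %d, expected %d\n",
			(int)s->ds_state, (int)DEMO_ESTABLISHING);
	if (rc != 0)
		demo_session_failed(s, rc);
}

/* On an early failure the reply handler has already run on return. */
static int demo_post(struct demo_session *s, bool peer_failed)
{
	if (!peer_failed)
		return 0;
	demo_reply_received(s, -ECANCELED);
	return -ECANCELED;
}

static int demo_session_establish(struct demo_session *s, bool peer_failed)
{
	int rc;

	assert(s->ds_state == DEMO_INITIALISED);
	rc = demo_post(s, peer_failed);
	if (rc == 0)
		s->ds_state = DEMO_ESTABLISHING;
	/*
	 * Deliberately no demo_session_failed(s, rc) here: when the post
	 * step fails, the reply handler has already failed the session.
	 */
	return rc;
}
```

With `peer_failed` set, `demo_session_establish()` returns `-ECANCELED` and `ds_fail_calls` ends at 1; adding a second failure call in the caller would raise it to 2 for a single failed establish.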
@nkommuri deleted the 33875 branch on September 14, 2022 08:43