-
Notifications
You must be signed in to change notification settings - Fork 55
fix: Avoid race when removing interfaces via NNCP #2347
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
WalkthroughThe changes introduce two new methods in the Changes
Suggested labels
Suggested reviewers
Tip ⚡🧪 Multi-step agentic review comment chat (experimental)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
Report bugs in Issues The following are automatically added:
Available user actions:
Supported /retest check runs
Supported labels
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
ocp_resources/node_network_configuration_policy.py (2)
466-467
: Consider extracting the date format as a constant.The date format string "%Y-%m-%dT%H:%M:%SZ" could be extracted as a class constant for better maintainability and reusability.
+ class NodeNetworkConfigurationPolicy(Resource): + api_group = Resource.ApiGroup.NMSTATE_IO + DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ" ... def _wait_for_nncp_with_different_transition_time(self, initial_transition_time): - date_format = "%Y-%m-%dT%H:%M:%SZ" for condition in self.instance.get("status", {}).get("conditions", []): if ( condition and condition["type"] == NodeNetworkConfigurationPolicy.Conditions.Type.AVAILABLE - and datetime.strptime(condition["lastTransitionTime"], date_format) - > datetime.strptime(initial_transition_time, date_format) + and datetime.strptime(condition["lastTransitionTime"], self.DATE_FORMAT) + > datetime.strptime(initial_transition_time, self.DATE_FORMAT) ):
452-460
: Add null check for initial_transition_time.The current implementation assumes
_get_nncp_configured_last_transition_time()
will always return a value, but if no condition matches, it could return None, potentially causing issues in the update method.def update(self, resource_dict=None): initial_transition_time = self._get_nncp_configured_last_transition_time() super().update(resource_dict=resource_dict) - self._wait_for_nncp_with_different_transition_time(initial_transition_time=initial_transition_time) + if initial_transition_time: + self._wait_for_nncp_with_different_transition_time(initial_transition_time=initial_transition_time) + else: + self.logger.warning(f"No initial transition time found for NNCP {self.name}, skipping transition time check")Also applies to: 477-480
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ocp_resources/node_network_configuration_policy.py
(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: conventional-title
- GitHub Check: conventional-title
- GitHub Check: conventional-title
- GitHub Check: conventional-title
- GitHub Check: conventional-title
- GitHub Check: python-module-install
- GitHub Check: tox
- GitHub Check: conventional-title
🔇 Additional comments (4)
ocp_resources/node_network_configuration_policy.py (4)
2-2
: Appropriate import additions to support the new functionality.The added imports are necessary for the new timing mechanism to handle race conditions:
datetime
for timestamp processingTIMEOUT_1MINUTE, TIMEOUT_5SEC
constants for retry configurationretry
for implementing the retry mechanismAlso applies to: 6-6, 14-14
452-460
: Well-implemented helper method to retrieve last transition time.This helper method identifies when a configuration has been successfully applied by checking that:
- The condition type is "Available"
- The status is "True"
- The reason is "SuccessfullyConfigured"
This provides a robust way to determine when configuration was last successfully applied.
461-475
: Good implementation of time-based verification with retry mechanism.The retry pattern with timeout handling effectively addresses the race condition described in the PR by:
- Using proper datetime parsing and comparison
- Implementing appropriate timeouts (1 minute with 5-second intervals)
- Clearly checking that the transition time has changed since the initial value
477-480
: This change successfully addresses the race condition.The updated method now properly:
- Captures the initial transition time before executing the update
- Invokes the parent class update method
- Waits for a different transition time, ensuring the NNCP's status reflects the most recent change
This implementation should effectively prevent premature deletion of the NNCP before configuration changes are fully applied, addressing the lingering interfaces issue mentioned in the PR.
/assign-reviewer @EdDev |
not adding reviewer EdDev by user comment, EdDev is not part of contributers |
/verified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
ocp_resources/node_network_configuration_policy.py (2)
326-328
: Fix typos in the comments.There are minor spelling errors in the comments.
- # The time-stamp that is returned by _get_last_successful_transition_time() will chnage after the call to - # update(), therefore it must be fetched and stored before, and comapred with the new time-stamp after. + # The time-stamp that is returned by _get_last_successful_transition_time() will change after the call to + # update(), therefore it must be fetched and stored before, and compared with the new time-stamp after. initial_success_status_time = self._get_last_successful_transition_time()
348-361
: Add status check in the condition evaluation.The method checks for conditions with type AVAILABLE and newer timestamps, but doesn't verify the status is "True". This could potentially match an AVAILABLE condition with status "False" or "Unknown", which might not accurately represent a successful update.
for condition in self.instance.get("status", {}).get("conditions", []): if ( condition["type"] == self.Conditions.Type.AVAILABLE + and condition["status"] == Resource.Condition.Status.TRUE and datetime.strptime(condition["lastTransitionTime"], date_format) > formatted_initial_transition_time ): return True
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ocp_resources/node_network_configuration_policy.py
(2 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
ocp_resources/node_network_configuration_policy.py (2)
ocp_resources/node_network_state.py (1) (1)
NodeNetworkState
(12-112)ocp_resources/resource.py (8) (8)
Resource
(301-1292)update
(908-922)update
(1451-1498)instance
(1007-1018)instance
(1380-1391)status
(896-906)get
(961-1004)get
(1327-1377)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: conventional-title
- GitHub Check: conventional-title
- GitHub Check: conventional-title
- GitHub Check: conventional-title
- GitHub Check: conventional-title
- GitHub Check: conventional-title
- GitHub Check: tox
- GitHub Check: python-module-install
- GitHub Check: conventional-title
🔇 Additional comments (6)
ocp_resources/node_network_configuration_policy.py (6)
2-2
: LGTM: Import addition for the datetime functionality.The datetime module is appropriately added to support the timestamp comparison functionality needed for fixing the race condition.
6-6
: LGTM: Required constant imports added.The
TIMEOUT_1MINUTE
andTIMEOUT_5SEC
constants are correctly imported to support the retry mechanism in the new methods.
14-14
: LGTM: Added retry import.The
retry
decorator is appropriately imported to implement the timeout and retry mechanism for the NNCP status verification.
333-336
: LGTM: Appropriate check for NNCP status update.The code correctly ensures that NNCP status updates are tracked only when there was a valid initial status time. The comment explains the rationale clearly - if the NNCP failed on setup, it wouldn't be in AVAILABLE status initially, and the next time it becomes AVAILABLE will necessarily have a new timestamp.
338-346
: LGTM: Well-implemented method to get the last successful transition time.This method correctly retrieves the lastTransitionTime of a condition that meets specific criteria (Available type, True status, and SuccessfullyConfigured reason). The proper return type annotation
str | None
helps clarify that this method might not find a matching condition.
348-352
: LGTM: Appropriate retry mechanism.The retry decorator with a 1-minute timeout and 5-second sleep interval is a good choice for this operation, allowing sufficient time for the status to update while not waiting unnecessarily long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
ocp_resources/node_network_configuration_policy.py (1)
347-362
: Effective implementation of status update detectionThe retry mechanism with appropriate timeouts ensures reliable detection of status changes. The timestamp comparison logic correctly identifies when a new AVAILABLE status has been created after an update operation.
A small suggestion for better maintainability:
Consider extracting the date format string into a constant at the class level to avoid magic strings:
class NodeNetworkConfigurationPolicy(Resource): api_group = Resource.ApiGroup.NMSTATE_IO + DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ" # Then use self.DATE_FORMAT in the method def _wait_for_nncp_status_update(self, initial_transition_time: str) -> bool: - date_format = "%Y-%m-%dT%H:%M:%SZ" - formatted_initial_transition_time = datetime.strptime(initial_transition_time, date_format) + formatted_initial_transition_time = datetime.strptime(initial_transition_time, self.DATE_FORMAT)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ocp_resources/node_network_configuration_policy.py
(2 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
ocp_resources/node_network_configuration_policy.py (2)
ocp_resources/node_network_state.py (1) (1)
NodeNetworkState
(12-112)ocp_resources/resource.py (8) (8)
Resource
(301-1292)update
(908-922)update
(1451-1498)instance
(1007-1018)instance
(1380-1391)status
(896-906)get
(961-1004)get
(1327-1377)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: conventional-title
- GitHub Check: conventional-title
- GitHub Check: tox
- GitHub Check: python-module-install
- GitHub Check: conventional-title
🔇 Additional comments (3)
ocp_resources/node_network_configuration_policy.py (3)
2-2
: Import changes look goodThe added imports for
datetime
, additional timeout constants, andretry
decorator support the new functionality to detect NNCP state changes.Also applies to: 6-6, 14-14
326-336
: Well-structured fix for the race conditionThe added code correctly addresses the race condition when removing interfaces by comparing timestamps. The comments clearly explain the purpose of storing the initial timestamp before the update and checking for a new timestamp afterward.
The conditional check at line 334 properly handles the case when no initial success status time is available, avoiding potential errors.
337-346
: Good implementation of the transition time retrievalThis method correctly handles finding the last successful transition time from the NNCP conditions, with proper type annotation and null handling.
/verified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm.
Removing an interface that was created using an NNCP, is done by editing the same NNCP. This sometimes resulted in a race, in which the NNCP success status actually presented the prvious status, leading to deleting the NNCP before the configuration was completed, leaving hanging interfaces in the cluster nodes, with node native interfaces occupied as the ports of these tests-created interfaces. A recent PR made this failed flow to always occur. This PR aims to assure that the timestamp of the AVAIALBLE status is updated for the recent change (the interface removal) and not the previous change (setup or modification). This PR is based on the fix that was presented in RedHatQE/openshift-virtualization-tests#430.
Removing an interface that was created using an NNCP, is done by editing the same NNCP. This sometimes resulted in a race, in which the NNCP success status actually presented the prvious status, leading to deleting the NNCP before the configuration was completed, leaving hanging interfaces in the cluster nodes, with node native interfaces occupied as the ports of these tests-created interfaces. A recent PR made this failed flow to always occur. This PR aims to assure that the timestamp of the AVAIALBLE status is updated for the recent change (the interface removal) and not the previous change (setup or modification). This PR is based on the fix that was presented in RedHatQE/openshift-virtualization-tests#430.
Remoivng an interface that was created using an NNCP, is done by editing the same NNCP. This sometimes resulted in a race, in which the NNCP success status actually presented the prvious status, leading to deleting the NNCP before the configuration was completed, leaving hanging interfaces in the cluster nodes, with node native interfaces occupied as the ports of these tests-created interfaces. A recent PR made this failed flow to always occur. This PR aims to assure that the timestamp of the AVAIALBLE status is updated for the recent change (the interface removal) and not the previous change (setup or modification).
This PR is based on the fix that was presented in
RedHatQE/openshift-virtualization-tests#430.
Short description:
More details:
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for reviewer:
Bug:
Summary by CodeRabbit
datetime
function from certain checks.