
fix: Avoid race when removing interfaces via NNCP #2347


Merged
merged 11 commits into from
Mar 19, 2025

Conversation

yossisegev
Contributor

@yossisegev yossisegev commented Mar 17, 2025

Removing an interface that was created using an NNCP is done by editing the same NNCP. This sometimes resulted in a race in which the NNCP success status actually reflected the previous status. The NNCP was then deleted before the configuration had completed, leaving dangling interfaces on the cluster nodes, with node-native interfaces still occupied as ports of these test-created interfaces. A recent PR made this failing flow occur every time. This PR aims to ensure that the timestamp of the AVAILABLE status reflects the most recent change (the interface removal) and not the previous change (setup or modification).
This PR is based on the fix that was presented in
RedHatQE/openshift-virtualization-tests#430.


Summary by CodeRabbit

  • New Features
    • Improved network configuration updates now include an added verification step that ensures changes are fully applied, enhancing reliability during updates.
  • Chores
    • Updated Flake8 configuration to exclude the datetime function from certain checks.


coderabbitai bot commented Mar 17, 2025

Walkthrough

The changes introduce two new methods in the NodeNetworkConfigurationPolicy class: _get_last_successful_transition_time and _wait_for_nncp_status_update. The first method retrieves the last transition time of the configuration when its status is "Available" and the reason is "SuccessfullyConfigured." The second method implements a retry mechanism to check for changes in the transition time. The apply method is modified to utilize these new methods, and import statements are updated to support the new functionality.

Changes

File Summary
ocp_resources/.../node_network_configuration_policy.py Added _get_last_successful_transition_time and _wait_for_nncp_status_update methods; updated the apply method to capture the initial transition time and verify it changes post-update; updated import statements.
.flake8 Added datetime to the fcn_exclude_functions list to exclude it from certain checks.

Suggested labels

size/L, verified, can-be-merged, approved-rnetser

Suggested reviewers

  • dbasunag
  • rnetser


@redhat-qe-bot
Contributor

Report bugs in Issues

The following are automatically added:

  • Add reviewers from OWNER file (in the root of the repository) under reviewers section.
  • Set PR size label.
  • New issue is created for the PR. (Closed when PR is merged/closed)
  • Run pre-commit if .pre-commit-config.yaml exists in the repo.

Available user actions:

  • To mark the PR as WIP, comment /wip on the PR. To remove the mark, comment /wip cancel.
  • To block merging of the PR, comment /hold. To unblock merging, comment /hold cancel.
  • To mark the PR as verified, comment /verified. To un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target branch to cherry-pick to> on the PR.
    • Multiple target branches can be cherry-picked, separated by spaces. (/cherry-pick branch1 branch2)
    • The cherry-pick starts when the PR is merged.
  • To build and push a container image, comment /build-and-push-container on the PR (the tag will be the PR number).
    • You can add extra args to the Podman build command.
      • Example: /build-and-push-container --build-arg OPENSHIFT_PYTHON_WRAPPER_COMMIT=<commit_hash>
  • To add a label by comment, use /<label name>. To remove it, use /<label name> cancel.
  • To assign reviewers based on the OWNERS file, use /assign-reviewers.
  • To check if the PR can be merged, use /check-can-merge.
  • To assign a reviewer to the PR, use /assign-reviewer @<reviewer>.
Supported /retest check runs
  • /retest tox: Retest tox
  • /retest python-module-install: Retest python-module-install
  • /retest conventional-title: Retest pre-commit
  • /retest all: Retest all
Supported labels
  • hold
  • verified
  • wip
  • lgtm

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
ocp_resources/node_network_configuration_policy.py (2)

466-467: Consider extracting the date format as a constant.

The date format string "%Y-%m-%dT%H:%M:%SZ" could be extracted as a class constant for better maintainability and reusability.

+ class NodeNetworkConfigurationPolicy(Resource):
+     api_group = Resource.ApiGroup.NMSTATE_IO
+     DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

...

  def _wait_for_nncp_with_different_transition_time(self, initial_transition_time):
-     date_format = "%Y-%m-%dT%H:%M:%SZ"
      for condition in self.instance.get("status", {}).get("conditions", []):
          if (
              condition
              and condition["type"] == NodeNetworkConfigurationPolicy.Conditions.Type.AVAILABLE
-             and datetime.strptime(condition["lastTransitionTime"], date_format)
-             > datetime.strptime(initial_transition_time, date_format)
+             and datetime.strptime(condition["lastTransitionTime"], self.DATE_FORMAT)
+             > datetime.strptime(initial_transition_time, self.DATE_FORMAT)
          ):
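
To make the timestamp comparison in this suggestion concrete, here is a minimal standalone sketch. The helper name `is_newer` and the sample timestamps are made up for illustration and are not part of the PR. Note that the comparison is strict, and that this format has one-second granularity, so two transitions within the same second would not compare as newer.

```python
from datetime import datetime

# Format string from the suggestion above; the trailing "Z" is matched literally.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def is_newer(candidate: str, baseline: str) -> bool:
    """Return True only if candidate is strictly later than baseline."""
    return datetime.strptime(candidate, DATE_FORMAT) > datetime.strptime(baseline, DATE_FORMAT)


print(is_newer("2025-03-17T10:00:01Z", "2025-03-17T10:00:00Z"))  # True
print(is_newer("2025-03-17T10:00:00Z", "2025-03-17T10:00:00Z"))  # False
```

A timestamp with fractional seconds would not match this format and `strptime` would raise `ValueError`.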

452-460: Add null check for initial_transition_time.

The current implementation assumes _get_nncp_configured_last_transition_time() will always return a value, but if no condition matches, it could return None, potentially causing issues in the update method.

  def update(self, resource_dict=None):
      initial_transition_time = self._get_nncp_configured_last_transition_time()
      super().update(resource_dict=resource_dict)
-     self._wait_for_nncp_with_different_transition_time(initial_transition_time=initial_transition_time)
+     if initial_transition_time:
+         self._wait_for_nncp_with_different_transition_time(initial_transition_time=initial_transition_time)
+     else:
+         self.logger.warning(f"No initial transition time found for NNCP {self.name}, skipping transition time check")

Also applies to: 477-480

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 981333c and 2953317.

📒 Files selected for processing (1)
  • ocp_resources/node_network_configuration_policy.py (2 hunks)
🔇 Additional comments (4)
ocp_resources/node_network_configuration_policy.py (4)

2-2: Appropriate import additions to support the new functionality.

The added imports are necessary for the new timing mechanism to handle race conditions:

  • datetime for timestamp processing
  • TIMEOUT_1MINUTE, TIMEOUT_5SEC constants for retry configuration
  • retry for implementing the retry mechanism

Also applies to: 6-6, 14-14


452-460: Well-implemented helper method to retrieve last transition time.

This helper method identifies when a configuration has been successfully applied by checking that:

  1. The condition type is "Available"
  2. The status is "True"
  3. The reason is "SuccessfullyConfigured"

This provides a robust way to determine when configuration was last successfully applied.
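
The three criteria listed above can be sketched on plain dictionaries as follows. This is an illustration under assumptions, not the code from the PR: the function name, the literal strings, and the sample conditions are hypothetical stand-ins for the wrapper's constants and a real NNCP status.

```python
from typing import Optional


def get_last_successful_transition_time(conditions: list) -> Optional[str]:
    """Return lastTransitionTime of the first condition matching all three criteria."""
    for condition in conditions:
        if (
            condition.get("type") == "Available"
            and condition.get("status") == "True"
            and condition.get("reason") == "SuccessfullyConfigured"
        ):
            return condition.get("lastTransitionTime")
    # No matching condition, e.g. the policy was never successfully configured.
    return None


# Hypothetical status conditions, for illustration only.
conditions = [
    {"type": "Degraded", "status": "False", "reason": "SuccessfullyConfigured",
     "lastTransitionTime": "2025-03-17T09:59:00Z"},
    {"type": "Available", "status": "True", "reason": "SuccessfullyConfigured",
     "lastTransitionTime": "2025-03-17T10:00:00Z"},
]
print(get_last_successful_transition_time(conditions))  # 2025-03-17T10:00:00Z
```

Returning `None` when nothing matches lets the caller decide whether a baseline exists at all.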


461-475: Good implementation of time-based verification with retry mechanism.

The retry pattern with timeout handling effectively addresses the race condition described in the PR by:

  1. Using proper datetime parsing and comparison
  2. Implementing appropriate timeouts (1 minute with 5-second intervals)
  3. Clearly checking that the transition time has changed since the initial value
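
As a rough illustration of this retry pattern, the sketch below polls with a plain loop instead of the `retry` decorator used in the PR. The function name, the `get_conditions` callable, and the defaults (60 seconds, 5-second sleep) are assumptions chosen to mirror the description above.

```python
import time
from datetime import datetime
from typing import Callable

DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def wait_for_newer_available(
    get_conditions: Callable[[], list],
    initial_transition_time: str,
    timeout: float = 60,
    sleep: float = 5,
) -> bool:
    """Poll until an Available condition has a timestamp later than the initial one."""
    initial = datetime.strptime(initial_transition_time, DATE_FORMAT)
    deadline = time.monotonic() + timeout
    while True:
        for condition in get_conditions():
            if (
                condition.get("type") == "Available"
                and datetime.strptime(condition["lastTransitionTime"], DATE_FORMAT) > initial
            ):
                return True
        if time.monotonic() >= deadline:
            raise TimeoutError("Available condition was not refreshed in time")
        time.sleep(sleep)
```

Passing the conditions as a callable keeps the sketch independent of any cluster client: each poll simply re-reads whatever the callable returns.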

477-480: This change successfully addresses the race condition.

The updated method now properly:

  1. Captures the initial transition time before executing the update
  2. Invokes the parent class update method
  3. Waits for a different transition time, ensuring the NNCP's status reflects the most recent change

This implementation should effectively prevent premature deletion of the NNCP before configuration changes are fully applied, addressing the lingering interfaces issue mentioned in the PR.
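
The ordering of the three steps matters, which a small in-memory sketch can show. `FakePolicy` and both functions are hypothetical and simplified: the fake refreshes its timestamp synchronously inside `update`, whereas a real NNCP is reconciled asynchronously.

```python
class FakePolicy:
    """Hypothetical in-memory stand-in for an NNCP; not the wrapper's real class."""

    def __init__(self) -> None:
        self.transition_time = "2025-03-17T10:00:00Z"

    def last_successful_transition_time(self) -> str:
        return self.transition_time

    def update(self, resource_dict: dict) -> None:
        # Simulate the controller reconciling and refreshing the timestamp.
        self.transition_time = "2025-03-17T10:05:00Z"

    def status_changed_since(self, initial: str) -> bool:
        return self.transition_time != initial


def update_and_confirm(policy: FakePolicy, resource_dict: dict) -> bool:
    initial = policy.last_successful_transition_time()  # 1. capture before the update
    policy.update(resource_dict)                         # 2. apply the change
    return policy.status_changed_since(initial)          # 3. confirm the status is fresh


def update_then_capture(policy: FakePolicy, resource_dict: dict) -> bool:
    """Wrong order, shown for contrast: the baseline is already the new timestamp."""
    policy.update(resource_dict)
    initial = policy.last_successful_transition_time()
    return policy.status_changed_since(initial)


print(update_and_confirm(FakePolicy(), {}))   # True
print(update_then_capture(FakePolicy(), {}))  # False
```

Capturing the baseline after the update compares the new timestamp with itself, so the change can never be detected.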

@yossisegev
Contributor Author

/assign-reviewer @EdDev

@redhat-qe-bot1

not adding reviewer EdDev by user comment, EdDev is not part of contributors

@yossisegev
Contributor Author

/verified

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
ocp_resources/node_network_configuration_policy.py (2)

326-328: Fix typos in the comments.

There are minor spelling errors in the comments.

-        # The time-stamp that is returned by _get_last_successful_transition_time() will chnage after the call to
-        # update(), therefore it must be fetched and stored before, and comapred with the new time-stamp after.
+        # The time-stamp that is returned by _get_last_successful_transition_time() will change after the call to
+        # update(), therefore it must be fetched and stored before, and compared with the new time-stamp after.
         initial_success_status_time = self._get_last_successful_transition_time()

348-361: Add status check in the condition evaluation.

The method checks for conditions with type AVAILABLE and newer timestamps, but doesn't verify the status is "True". This could potentially match an AVAILABLE condition with status "False" or "Unknown", which might not accurately represent a successful update.

        for condition in self.instance.get("status", {}).get("conditions", []):
            if (
                condition["type"] == self.Conditions.Type.AVAILABLE
+               and condition["status"] == Resource.Condition.Status.TRUE
                and datetime.strptime(condition["lastTransitionTime"], date_format) > formatted_initial_transition_time
            ):
                return True
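
The effect of the suggested status check can be illustrated with a standalone sketch. The function name, the `require_true_status` switch, and the sample condition are hypothetical and exist only to contrast the two behaviours; they are not part of the PR.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def has_newer_available(conditions, initial_transition_time, require_true_status=True):
    """Return True if an Available condition is newer than the initial timestamp."""
    initial = datetime.strptime(initial_transition_time, DATE_FORMAT)
    for condition in conditions:
        if condition["type"] != "Available":
            continue
        if require_true_status and condition["status"] != "True":
            continue
        if datetime.strptime(condition["lastTransitionTime"], DATE_FORMAT) > initial:
            return True
    return False


# Hypothetical snapshot taken while a change is still being applied.
in_progress = [
    {"type": "Available", "status": "Unknown", "lastTransitionTime": "2025-03-17T10:00:30Z"},
]
print(has_newer_available(in_progress, "2025-03-17T10:00:00Z", require_true_status=False))  # True
print(has_newer_available(in_progress, "2025-03-17T10:00:00Z"))  # False
```

Without the status check, the newer timestamp alone is enough to match, even though the condition does not report success.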
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9706dcc and 477df8f.

📒 Files selected for processing (1)
  • ocp_resources/node_network_configuration_policy.py (2 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
ocp_resources/node_network_configuration_policy.py (2)
ocp_resources/node_network_state.py (1)
  • NodeNetworkState (12-112)
ocp_resources/resource.py (8)
  • Resource (301-1292)
  • update (908-922)
  • update (1451-1498)
  • instance (1007-1018)
  • instance (1380-1391)
  • status (896-906)
  • get (961-1004)
  • get (1327-1377)
🔇 Additional comments (6)
ocp_resources/node_network_configuration_policy.py (6)

2-2: LGTM: Import addition for the datetime functionality.

The datetime module is appropriately added to support the timestamp comparison functionality needed for fixing the race condition.


6-6: LGTM: Required constant imports added.

The TIMEOUT_1MINUTE and TIMEOUT_5SEC constants are correctly imported to support the retry mechanism in the new methods.


14-14: LGTM: Added retry import.

The retry decorator is appropriately imported to implement the timeout and retry mechanism for the NNCP status verification.


333-336: LGTM: Appropriate check for NNCP status update.

The code correctly ensures that NNCP status updates are tracked only when there was a valid initial status time. The comment explains the rationale clearly - if the NNCP failed on setup, it wouldn't be in AVAILABLE status initially, and the next time it becomes AVAILABLE will necessarily have a new timestamp.


338-346: LGTM: Well-implemented method to get the last successful transition time.

This method correctly retrieves the lastTransitionTime of a condition that meets specific criteria (Available type, True status, and SuccessfullyConfigured reason). The proper return type annotation str | None helps clarify that this method might not find a matching condition.


348-352: LGTM: Appropriate retry mechanism.

The retry decorator with a 1-minute timeout and 5-second sleep interval is a good choice for this operation, allowing sufficient time for the status to update while not waiting unnecessarily long.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
ocp_resources/node_network_configuration_policy.py (1)

347-362: Effective implementation of status update detection

The retry mechanism with appropriate timeouts ensures reliable detection of status changes. The timestamp comparison logic correctly identifies when a new AVAILABLE status has been created after an update operation.

A small suggestion for better maintainability:

Consider extracting the date format string into a constant at the class level to avoid magic strings:

class NodeNetworkConfigurationPolicy(Resource):
    api_group = Resource.ApiGroup.NMSTATE_IO
+    DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

    # Then use self.DATE_FORMAT in the method
    def _wait_for_nncp_status_update(self, initial_transition_time: str) -> bool:
-        date_format = "%Y-%m-%dT%H:%M:%SZ"
-        formatted_initial_transition_time = datetime.strptime(initial_transition_time, date_format)
+        formatted_initial_transition_time = datetime.strptime(initial_transition_time, self.DATE_FORMAT)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 477df8f and b8fc779.

📒 Files selected for processing (1)
  • ocp_resources/node_network_configuration_policy.py (2 hunks)
🔇 Additional comments (3)
ocp_resources/node_network_configuration_policy.py (3)

2-2: Import changes look good

The added imports for datetime, additional timeout constants, and retry decorator support the new functionality to detect NNCP state changes.

Also applies to: 6-6, 14-14


326-336: Well-structured fix for the race condition

The added code correctly addresses the race condition when removing interfaces by comparing timestamps. The comments clearly explain the purpose of storing the initial timestamp before the update and checking for a new timestamp afterward.

The conditional check at line 334 properly handles the case when no initial success status time is available, avoiding potential errors.


337-346: Good implementation of the transition time retrieval

This method correctly handles finding the last successful transition time from the NNCP conditions, with proper type annotation and null handling.

@yossisegev
Contributor Author

/verified

Contributor

@sbahar619 sbahar619 left a comment

lgtm.

@myakove myakove merged commit 065ffde into RedHatQE:main Mar 19, 2025
5 of 7 checks passed
yossisegev added a commit to yossisegev/openshift-python-wrapper that referenced this pull request Mar 19, 2025
myakove pushed a commit that referenced this pull request Mar 19, 2025

8 participants