MS maintenance improvements #10417

Draft
wants to merge 5 commits into
base: main

sureshanaparti
Contributor

Description

This PR makes the following improvements to management server (MS) maintenance:

  • Sends a 503 (Service Unavailable) response status once maintenance or shutdown is initiated, so that any load balancer in a clustered environment can avoid routing requests to this MS node
  • Migrates systemvm agents before routing host agents
  • Updates last agents (using the msid)
  • Adds events for maintenance and shutdown operations
  • Includes some code improvements
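
The migration order in the list above (systemvm agents first, routing host agents second) can be pictured with a minimal sketch. This is not the PR's code: the `Agent` class, the `kind` labels, and the sample identifiers are hypothetical and exist only to show the ordering.

```python
from dataclasses import dataclass

# Hypothetical agent kinds, named for this sketch only.
SYSTEMVM = "systemvm"
ROUTING = "routing"


@dataclass(frozen=True)
class Agent:
    uuid: str
    kind: str  # SYSTEMVM or ROUTING


def migration_order(agents):
    """Return agents with systemvm agents first, then routing host agents.

    sorted() is stable, so agents of the same kind keep their relative order.
    """
    return sorted(agents, key=lambda a: 0 if a.kind == SYSTEMVM else 1)


# Made-up sample data, not taken from the test environment.
agents = [
    Agent("host-1", ROUTING),
    Agent("ssvm-1", SYSTEMVM),
    Agent("host-2", ROUTING),
    Agent("cpvm-1", SYSTEMVM),
]
ordered = [a.uuid for a in migration_order(agents)]
print(ordered)
```

A stable sort is used so that only the systemvm-before-routing rule is imposed; any further ordering within each group is left untouched.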

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Manually tested the changes.

(cmk) 🐱 > list managementservers filter=uuid,name,state,pendingjobscount,agentscount,
count = 3
managementserver:
+------+-----------------------------------------+-------------+------------------+-------------+
| UUID |                  NAME                   |    STATE    | PENDINGJOBSCOUNT | AGENTSCOUNT |
+------+-----------------------------------------+-------------+------------------+-------------+
|      | ref-trl-7940-k-m7-suresh-anaparti-mgmt1 | Maintenance |                0 |           0 |
|      | ref-trl-7940-k-m7-suresh-anaparti-mgmt2 | Up          |                0 |           8 |
|      | ref-trl-7940-k-m7-suresh-anaparti-mgmt3 | Up          |                0 |           4 |
+------+-----------------------------------------+-------------+------------------+-------------+

(cmk) 🐱 > list managementserversmetrics filter=name,agentcount,agents,lastagents
count = 3
managementserver:
+-------------------------------------------------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                            NAME                             | AGENTCOUNT |                                                                                                                                                          AGENTS                                                                                                                                                           |                                                                          LASTAGENTS                                                                           |
+-------------------------------------------------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ref-trl-7940-k-m7-suresh-anaparti-mgmt1 |          0 | []                                                                                                                                                                                                                                                                                                                        | ["4933c5ad-1b83-43ad-8f84-e3419b235657","8b57630e-6202-4c51-914c-6f4ddce52d79","b671e13f-a5eb-4ed2-b99d-080f3b776d0e","7d476ad8-c03d-4fdb-8c0d-f34f0f172d0b"] |
| ref-trl-7940-k-m7-suresh-anaparti-mgmt2 |          8 | ["12462284-f84f-4b24-befa-4b75d0982015","13a5b8ae-2b4b-4977-91d7-f97adb7db564","e8de2144-efb3-408c-99a8-a2592fe2c7d9","794f8261-0a73-4edc-a829-5764deb266e8","bb832b60-f1e7-4601-b50f-8385636ada99","095d3cb6-51f6-4283-9960-e633b8d72cec","d72a4703-1770-445c-8d83-5193eb39cad7","c93e0bd4-9997-44d1-a730-697c7a11512f"] |                                                                                                                                                               |
| ref-trl-7940-k-m7-suresh-anaparti-mgmt3 |          4 | ["4933c5ad-1b83-43ad-8f84-e3419b235657","8b57630e-6202-4c51-914c-6f4ddce52d79","b671e13f-a5eb-4ed2-b99d-080f3b776d0e","7d476ad8-c03d-4fdb-8c0d-f34f0f172d0b"]                                                                                                                                                             |                                                                                                                                                               |
+-------------------------------------------------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
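
One way to read the listing above: mgmt1 is in maintenance with no agents, and its LASTAGENTS column holds the four agents that now appear under mgmt3. The sketch below checks that reading against the data copied from the table. The dictionary layout, the shortened `mgmtN` names, and the helper `current_owner` are assumptions made for this illustration, not part of the API response.

```python
# Data copied from the listing above; server names shortened to their suffix.
metrics = {
    "mgmt1": {
        "agents": [],
        "lastagents": [
            "4933c5ad-1b83-43ad-8f84-e3419b235657",
            "8b57630e-6202-4c51-914c-6f4ddce52d79",
            "b671e13f-a5eb-4ed2-b99d-080f3b776d0e",
            "7d476ad8-c03d-4fdb-8c0d-f34f0f172d0b",
        ],
    },
    "mgmt2": {
        "agents": [
            "12462284-f84f-4b24-befa-4b75d0982015",
            "13a5b8ae-2b4b-4977-91d7-f97adb7db564",
            "e8de2144-efb3-408c-99a8-a2592fe2c7d9",
            "794f8261-0a73-4edc-a829-5764deb266e8",
            "bb832b60-f1e7-4601-b50f-8385636ada99",
            "095d3cb6-51f6-4283-9960-e633b8d72cec",
            "d72a4703-1770-445c-8d83-5193eb39cad7",
            "c93e0bd4-9997-44d1-a730-697c7a11512f",
        ],
        "lastagents": [],
    },
    "mgmt3": {
        "agents": [
            "4933c5ad-1b83-43ad-8f84-e3419b235657",
            "8b57630e-6202-4c51-914c-6f4ddce52d79",
            "b671e13f-a5eb-4ed2-b99d-080f3b776d0e",
            "7d476ad8-c03d-4fdb-8c0d-f34f0f172d0b",
        ],
        "lastagents": [],
    },
}


def current_owner(metrics, agent_uuid):
    """Return the name of the server that currently lists this agent, if any."""
    for name, entry in metrics.items():
        if agent_uuid in entry["agents"]:
            return name
    return None


# Map each of mgmt1's last agents to the server that holds it now.
moved = {uuid: current_owner(metrics, uuid) for uuid in metrics["mgmt1"]["lastagents"]}
for uuid, owner in moved.items():
    print(uuid, "->", owner)
```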

503 Service Unavailable response:

(cmk) 🐱 > create volume diskofferingid=b8199316-68a0-4328-b97a-de50a9f65474 zoneid=02302e8a-be46-42b3-91fd-30c59b4a530d
🙈 Error: (HTTP 503, error code 9999) Maintenance or Shutdown has been initiated on this management server. Can not accept new jobs

Request:
GET /client/api/?zoneid=02302e8a-be46-42b3-91fd-30c59b4a530d&diskofferingid=b8199316-68a0-4328-b97a-de50a9f65474&name=testvol&command=createVolume&response=json&sessionkey=xGzNFCzSuhI1eh-uG9Ps593r2bY HTTP/1.1

Response:
HTTP/1.1 503 Service Unavailable
Content-Type: application/json;charset=utf-8
X-Description: Maintenance or Shutdown has been initiated on this management server. Can not accept new jobs
...
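
As a rough illustration of how a load balancer or client might act on this 503, the sketch below skips any node whose probe reports HTTP 503 and picks the next one. The function `pick_server`, the `probe` callback, and the status values are hypothetical; a real setup would issue an actual HTTP request and would likely handle other failure codes as well.

```python
def pick_server(servers, probe):
    """Return the first server whose probe does not report HTTP 503.

    `probe` is any callable that maps a server name to an HTTP status code.
    Returns None when every server reports 503.
    """
    for server in servers:
        if probe(server) != 503:
            return server
    return None


# Hypothetical probe results: mgmt1 is in maintenance, the others are up.
statuses = {"mgmt1": 503, "mgmt2": 200, "mgmt3": 200}
chosen = pick_server(["mgmt1", "mgmt2", "mgmt3"], statuses.get)
print(chosen)
```

Passing the probe in as a callable keeps the selection logic independent of how the status is obtained, so the same function works with a stub, as here, or with a real health check.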

How did you try to break this feature and the system with this change?

@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.


codecov bot commented Feb 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 3.99%. Comparing base (1c1dad9) to head (9ef1c12).

❗ There is a different number of reports uploaded between BASE (1c1dad9) and HEAD (9ef1c12). Click for more details.

HEAD has 1 upload less than BASE
| Flag | BASE (1c1dad9) | HEAD (9ef1c12) |
| --- | --- | --- |
| unittests | 1 | 0 |
Additional details and impacted files
@@              Coverage Diff              @@
##               main   #10417       +/-   ##
=============================================
- Coverage     16.17%    3.99%   -12.18%     
=============================================
  Files          5668      398     -5270     
  Lines        498179    32581   -465598     
  Branches      60290     5776    -54514     
=============================================
- Hits          80581     1302    -79279     
+ Misses       408578    31129   -377449     
+ Partials       9020      150     -8870     
| Flag | Coverage Δ |
| --- | --- |
| uitests | 3.99% <ø> (ø) |
| unittests | ? |


@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12497

@sureshanaparti
Contributor Author

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-12458)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 51474 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10417-t12458-kvm-ol8.zip
Smoke tests completed. 139 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_11_isolated_network_with_dynamic_routed_mode | Error | 2.29 | test_ipv4_routing.py |
| test_12_vpc_and_tier_with_dynamic_routed_mode | Error | 2.40 | test_ipv4_routing.py |
| test_12_vpc_and_tier_with_dynamic_routed_mode | Error | 2.41 | test_ipv4_routing.py |
| test_06_purge_expunged_vm_background_task | Failure | 386.49 | test_purge_expunged_vms.py |

- Block new agent connections while the MS is preparing for maintenance
- Maintain the avoid MS list
- Propagate the updated management servers list and LB algorithm (the host and indirect.agent.lb.algorithm settings, respectively) to systemvm (non-routing) agents
- Update setup MS list and migrate agent connections to use an executor service
- Migrate agent connections through an executor, and send the answer to the MS host that initiated the migration
- Re-initialize the SSL handshake executor if it has been shut down
- Don't allow prepare for maintenance or shutdown when other management server nodes are in preparing states
- Don't allow triggering shutdown when the management server is up and other management server nodes are in preparing states
- Stop the agent connections monitor on MS maintenance
- Update the avoid MS list in the ready command
- Update the connected host from the client connection
- Update last agents in MS metrics from the database
- Update some agent config descriptions
- Update the last management server in the hosts during shutdown
- Add agents and lastagents to the management server response
- Update management server maintenance and shutdown unit tests
- Some code improvements
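
The guard described in the list above (refuse prepare for maintenance or shutdown while another node is already in a preparing state) could look roughly like the sketch below. The state names in `PREPARING_STATES` and the function `can_prepare` are assumptions for illustration; the actual check in the PR is not shown in this conversation.

```python
# Assumed names for the "preparing" states; treat them as placeholders.
PREPARING_STATES = {"PreparingForMaintenance", "PreparingForShutDown"}


def can_prepare(this_server, states):
    """Allow the operation only if no *other* node is in a preparing state.

    `states` maps each management server name to its current state string.
    """
    return not any(
        state in PREPARING_STATES
        for name, state in states.items()
        if name != this_server
    )


# Hypothetical cluster snapshots.
all_up = {"mgmt1": "Up", "mgmt2": "Up", "mgmt3": "Up"}
one_preparing = {"mgmt1": "Up", "mgmt2": "PreparingForMaintenance", "mgmt3": "Up"}

print(can_prepare("mgmt1", all_up))         # no other node is preparing
print(can_prepare("mgmt1", one_preparing))  # mgmt2 is already preparing
```
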
@sureshanaparti force-pushed the ms-maintenance-improvements branch from 0f0f8e7 to 9ef1c12 on March 6, 2025 09:39
@sureshanaparti
Contributor Author

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12677

@sureshanaparti
Contributor Author

@blueorangutan test

@rohityadavcloud
Member

@blueorangutan test

@blueorangutan

@rohityadavcloud a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

4 participants