
Update Solr out of memory threshold to fix false-positive alerts #5063

Open
neilmb opened this issue Feb 4, 2025 · 10 comments

neilmb (Contributor) commented Feb 4, 2025

Our Solr cluster regularly alerts as being "out of memory" at a threshold of 25000, but then seems to recover very soon after. These false-positive alerts are a drain on limited cognitive resources.

How to reproduce

Look in your email inbox for messages titled "ALARM: "Solr--Follower--MemoryThreshold""

Expected behavior

When the Solr cluster is running normally, I expect to get no email alerts.

Actual behavior

I get an email alert about a condition that does not appear to be any sort of problem because it resolves itself.

Sketch

The threshold for these alerts is set in https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/cloudwatch.tf#L45. We could raise it there, make a new release of that brokerpak, bump the brokerpak's version number in datagov-ssb, and apply those changes to our actual running infrastructure.
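
For orientation, a minimal sketch of what a CloudWatch alarm like this looks like in Terraform. The resource name, metric, namespace, and SNS topic here are illustrative assumptions, not the actual brokerpak code:

```hcl
# Illustrative sketch only; names, namespace, and arguments are assumed,
# not copied from datagov-brokerpak-solr.
resource "aws_cloudwatch_metric_alarm" "solr_follower_memory" {
  alarm_name          = "Solr--Follower--MemoryThreshold"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  namespace           = "ECS/ContainerInsights"
  metric_name         = "MemoryUtilized"
  period              = 60
  statistic           = "Average"
  threshold           = 25000 # raising this value is the change proposed here
  alarm_actions       = [aws_sns_topic.solr_alerts.arn] # hypothetical topic
}
```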

@neilmb neilmb added the bug Software defect or bug label Feb 4, 2025
@neilmb neilmb self-assigned this Feb 4, 2025
@neilmb neilmb moved this to 📥 Queue in data.gov team board Feb 4, 2025
FuhuXia (Member) commented Feb 4, 2025

We need the 25000 threshold. There is a memory leak somewhere in the Solr ECS setup. Without the threshold and auto-restart mechanism, memory usage will keep going up and exhaust the 28000 limit very soon. Then you will need manual intervention to bring Solr back.

(attached image: memory usage graph)

We can make some improvements to the Terraform script, though. We can relax the threshold a little, to 27000, buying us more time between restarts. We can also set a random threshold between 25000 and 27000 so that the Solr instances do not all restart at the same time. And we can stop sending the alerts to the Slack channel (email is enough), since a Solr restart no longer causes any problems. A sketch of the staggered-threshold idea follows.
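
A minimal sketch of the staggered-threshold idea, assuming the hashicorp/random provider is available; the input variable is hypothetical:

```hcl
# Illustrative sketch: pick a per-instance threshold between 25000 and 27000
# so all Solr instances do not restart at the same time. The keeper pins the
# random value to the instance so it stays stable across terraform applies.
resource "random_integer" "memory_threshold" {
  min = 25000
  max = 27000
  keepers = {
    instance = var.instance_name # hypothetical input variable
  }
}

resource "aws_cloudwatch_metric_alarm" "solr_follower_memory" {
  # ...same arguments as the existing alarm...
  threshold = random_integer.memory_threshold.result
}
```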

nickumia commented Feb 9, 2025

I agree with @FuhuXia: the alert is a trigger to keep Solr functional. The alerts don't require human intervention.

@neilmb neilmb moved this from 📥 Queue to 🏗 In Progress [8] in data.gov team board Feb 14, 2025
neilmb (Contributor, Author) commented Feb 14, 2025

As Fuhu said, we do need this CloudWatch threshold to restart Solr workers that are about to run out of memory. (Restarts are done by an AWS Lambda that triggers off an AWS SNS topic, https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/restarts.tf#L5, using code that is templated in by the brokerpak Terraform: https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/app_template.tf. A sketch of that wiring follows.)
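
For context, the standard alarm-to-restart wiring looks roughly like this in Terraform; this is an assumption about the pattern, with hypothetical names, not the brokerpak's actual resources:

```hcl
# Hypothetical sketch of the alarm -> SNS -> Lambda restart path.
resource "aws_sns_topic" "solr_alerts" {
  name = "solr-memory-alerts"
}

resource "aws_sns_topic_subscription" "restart_lambda" {
  topic_arn = aws_sns_topic.solr_alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.restart_solr.arn
}

# SNS needs explicit permission to invoke the function.
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.restart_solr.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.solr_alerts.arn
}
```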

Slack notifications are performed by that same Lambda function, under control of a cloud.gov service configuration variable called "slackNotification": https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/solr-on-ecs.yml#L83. The webhook URL that the notification posts to lives in an AWS SecretsManager item called slackSolrEventNotificationUrl. If we want to change the destination of the Slack notification, we need to change the value of that secret and then convince the broker to re-run the Terraform so the Lambda picks up the new value; a sketch of the assumed mechanism is below.
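
As a sketch of that mechanism (assumed, not verified against the brokerpak), the secret could flow into the Lambda's environment at apply time, which would explain why the broker has to re-run Terraform after the secret changes:

```hcl
# Illustrative sketch: the Lambda only sees the secret value captured at
# terraform apply time, so changing the secret requires re-running terraform.
data "aws_secretsmanager_secret_version" "slack_webhook" {
  secret_id = "slackSolrEventNotificationUrl"
}

resource "aws_lambda_function" "restart_solr" {
  function_name = "solr-restart"            # hypothetical name
  handler       = "app.handler"
  runtime       = "python3.9"
  filename      = "lambda.zip"              # packaging elided
  role          = aws_iam_role.restart.arn  # hypothetical role

  environment {
    variables = {
      SLACK_WEBHOOK_URL = data.aws_secretsmanager_secret_version.slack_webhook.secret_string
    }
  }
}
```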

We can't see what Terraform actions the brokerpak will take for a given cf update-service, so we need to test in development while observing the effects in AWS to be sure it is doing what we want.

nickumia commented
From what I remember, you are right, @neilmb: the CSB doesn't support any terraform plan-ning capability. However, if you wanted to, you could launch a standalone version of the terraform code, make the terraform variable input changes, and see what the difference in the plan would be and whether it would be destructive to other resources. My gut tells me it should be a safe change, but instructions for testing this change are outlined in Practical Learning > WORKFLOW 1 of

As a general sketch,

  • Bring up Solr using terraform apply
  • Make changes to terraform code
  • Do a terraform plan to see the changes
  • Make a decision on if the changes require special handling

In the rare case that it is destructive, the only downside would be some downtime, and the harvester just needs to be turned off while resources are restarting/recreating. The restart Lambda is completely separate from the ECS definition, which gives me confidence that it is safe. But you can also make the change in the development-ssb space and then see what the broker's logs have. That doesn't give you a pre-plan, but it should show you the output of the terraform run, and then you can see what changed.

In terms of how to change the terraform code, it sounds like the code already supports all of the features, but if it doesn't, let me know if there's anything specific that's needed.


@neilmb neilmb moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Mar 10, 2025
neilmb (Contributor, Author) commented Mar 10, 2025

GSA-TTS/datagov-brokerpak-solr#99 is a step toward being able to do this.

@neilmb neilmb moved this from 👀 Needs Review [2] to 🏗 In Progress [8] in data.gov team board Mar 12, 2025
neilmb (Contributor, Author) commented Mar 21, 2025

For a service set up with slackNotification: false, running cf update-service <name> -c '{"slackNotification": true}' did not change the Lambda code to add the notification logic. Unfortunately, further research is needed to figure out whether we can update brokered resources like this or not.

neilmb (Contributor, Author) commented Mar 21, 2025

Next step: go back to datagov-brokerpak-solr, run the CSB in Docker inside that repo, try out this API method (https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#updating-a-service-instance), and watch what happens on the backend with the Terraform code to understand what gets called.

neilmb (Contributor, Author) commented Mar 28, 2025

It has been good (and genuinely important) to learn about the brokers, but this particular ticket is a red herring: the Slack "out of memory" notifications are actually coming from an email integration set up manually on our SNS topics, which sends email into our alerts channel using a special datagov-alerts-...@gsa.org.slack.com address.

The slackNotification code in the AWS Lambda has, as far as I can tell, never worked. The slack_sdk Python module it tries to use to post notifications does not get installed, and this error shows up in the Lambda's CloudWatch logs when the service is set up with slackNotification: true:

[ERROR] ModuleNotFoundError: No module named 'slack_sdk'
Traceback (most recent call last):
  File "/var/task/app.py", line 21, in handler
    notifySlack(message_json, service_dimensions['ClusterName'], service_dimensions['ServiceName'])
  File "/var/task/app.py", line 54, in notifySlack
    from slack_sdk.webhook import WebhookClient
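
For what it's worth, a common way to make such an import succeed is to vendor the dependency into the Lambda deployment package at build time. A hypothetical sketch (not the brokerpak's actual packaging; file paths and names are assumed) might look like:

```hcl
# Hypothetical sketch: install slack_sdk into the Lambda source dir before
# zipping, so `from slack_sdk.webhook import WebhookClient` can succeed.
resource "null_resource" "lambda_deps" {
  triggers = {
    requirements = filemd5("${path.module}/lambda/requirements.txt") # hypothetical file
  }
  provisioner "local-exec" {
    command = "pip install -r ${path.module}/lambda/requirements.txt -t ${path.module}/lambda"
  }
}

data "archive_file" "lambda_zip" {
  type        = "zip"
  source_dir  = "${path.module}/lambda"
  output_path = "${path.module}/lambda.zip"
  depends_on  = [null_resource.lambda_deps]
}
```

Given the proposal below to remove the slackNotification code entirely, this is context for the failure rather than a recommended fix.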

neilmb (Contributor, Author) commented Mar 28, 2025

Documentation and a proposed restructuring of the management spaces are in https://docs.google.com/document/d/1ydoq9BTfg9oz7-nrZkBxCWcBn9IAS11WzneovxhyA-8/edit?tab=t.0

neilmb (Contributor, Author) commented Mar 28, 2025

For the actual notification issue, we can remove the SNS email subscriptions to stop those messages from showing up in our alerts channel. Then I propose removing the slackNotification code entirely from the brokerpak, testing and deploying that brokerpak code all the way up to management, and updating all of the existing services across the development, staging, and prod spaces.
