Update Solr out of memory threshold to fix false-positive alerts #5063

Open
neilmb opened this issue Feb 4, 2025 · 5 comments

@neilmb
Contributor

neilmb commented Feb 4, 2025

Our Solr cluster alerts regularly as being "out of memory" at a threshold of 25000, but then seems to recover very soon after. These "false-positive" alerts are a drain on limited cognitive resources.

How to reproduce

Look in your email inbox for messages titled "ALARM: "Solr--Follower--MemoryThreshold""

Expected behavior

When the Solr cluster is running normally, I expect to get no email alerts.

Actual behavior

I get an email alert about a condition that does not appear to be any sort of problem because it resolves itself.

Sketch

The threshold for these alerts is set in https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/cloudwatch.tf#L45 and we could raise it there, then make a new release of that brokerpak, bump the version number of that brokerpak in datagov-ssb, and apply those changes to our actual running infrastructure.

@neilmb neilmb added the bug Software defect or bug label Feb 4, 2025
@neilmb neilmb self-assigned this Feb 4, 2025
@neilmb neilmb moved this to 📥 Queue in data.gov team board Feb 4, 2025
@FuhuXia
Member

FuhuXia commented Feb 4, 2025

We need the 25000 threshold. There is a memory leak somewhere in the Solr ECS setup. Without the threshold and the auto-restart mechanism, memory usage will keep climbing and exhaust the available 28000 very soon. At that point, manual intervention is needed to bring Solr back.

We can make some improvements to the Terraform script. We can relax the threshold a little and make it 27000, buying us more time between restarts. We can also set a random threshold between 25000 and 27000 so that the Solr instances do not all restart at the same time. We can stop sending the alerts to the Slack channel (email is enough), since a Solr restart does not cause any problems anymore.
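
The randomized-threshold idea could be sketched as below. This is a minimal Python illustration rather than the brokerpak's Terraform: the function name `staggered_thresholds`, the `seed` parameter, and the instance names are hypothetical, and only the 25000 to 27000 range and the 28000 figure come from this thread.

```python
import random

MEMORY_LIMIT = 28000       # figure quoted in this thread
LOW, HIGH = 25000, 27000   # proposed threshold range from this thread


def staggered_thresholds(instance_names, seed=None):
    """Pick one restart threshold per instance, uniformly in [LOW, HIGH].

    Hypothetical helper: a fixed seed makes the assignment reproducible.
    """
    rng = random.Random(seed)
    return {name: rng.randint(LOW, HIGH) for name in instance_names}


if __name__ == "__main__":
    # Instance names are made up for illustration.
    names = ["follower-0", "follower-1", "follower-2"]
    for name, value in staggered_thresholds(names, seed=5063).items():
        print(name, value, "headroom:", MEMORY_LIMIT - value)
```

In Terraform itself, the `random_integer` resource from the hashicorp/random provider is one possible way to express the same thing, though whether it fits the existing module is untested here.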

@nickumia

nickumia commented Feb 9, 2025

I agree with @FuhuXia: the alert is a trigger to keep Solr functional. The alerts don't require human intervention.

@neilmb neilmb moved this from 📥 Queue to 🏗 In Progress [8] in data.gov team board Feb 14, 2025
@neilmb
Contributor Author

neilmb commented Feb 14, 2025

As Fuhu said, we do need this CloudWatch threshold to restart Solr workers that are about to run out of memory. (Restarts are done by an AWS Lambda that triggers off an AWS SNS topic https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/restarts.tf#L5 using code that is templated in by the brokerpak Terraform https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/app_template.tf.)

Slack notifications are performed by that same Lambda function, under control of a cloud.gov service configuration variable called "slackNotification" https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/solr-on-ecs.yml#L83. The webhook URL that the notification posts to is in an AWS Secrets Manager item called slackSolrEventNotificationUrl. If we want to change the destination for the Slack notification, we need to change the value of that secret and then convince the broker to run the Terraform that updates the value of that secret in the Lambda.
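
To make the flow concrete, here is a rough Python sketch of the decision such a handler makes. It is not the code templated in by the brokerpak: the function `plan_actions` and its `slack_notification` argument are hypothetical stand-ins, and it performs no AWS or Slack calls. The only thing it relies on is the standard shape of a CloudWatch alarm delivered through SNS to Lambda, where the alarm arrives as a JSON string in `Records[].Sns.Message` with fields such as `AlarmName` and `NewStateValue`.

```python
import json


def plan_actions(event, slack_notification=False):
    """Return the actions a handler would take for an SNS-delivered alarm event.

    Hypothetical helper: it only decides what to do and touches no services.
    """
    actions = []
    for record in event.get("Records", []):
        # The CloudWatch alarm payload is a JSON string inside Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # skip OK and INSUFFICIENT_DATA transitions
        alarm = message.get("AlarmName", "unknown")
        actions.append(("restart", alarm))
        if slack_notification:
            # Mirrors the "slackNotification" switch described above.
            actions.append(("notify_slack", alarm))
    return actions


if __name__ == "__main__":
    sample = {
        "Records": [
            {"Sns": {"Message": json.dumps({
                "AlarmName": "Solr--Follower--MemoryThreshold",
                "NewStateValue": "ALARM",
            })}}
        ]
    }
    print(plan_actions(sample, slack_notification=False))
```

With the flag off, the restart still happens and only the Slack post is skipped, which matches the suggestion to keep the alarm but quiet the channel.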

We can't see what Terraform actions the brokerpak will take for a given cf update-service, so we need to test in development while observing the effects in AWS to be sure it is doing what we want.

@nickumia

From what I remember, you are right @neilmb: the CSB doesn't support any terraform plan capability. However, if you want to, you can launch a standalone version of the Terraform code, change the Terraform variable inputs, and see what the difference in the plan would be and whether it would be destructive to other resources. My gut tells me it should be a safe change, but instructions for testing this change are outlined in Practical Learning > WORKFLOW 1 of

As a general sketch,

  • Bring up Solr using terraform apply
  • Make changes to terraform code
  • Do a terraform plan to see the changes
  • Make a decision on if the changes require special handling
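
For the last two steps, one option is to read the plan in machine-readable form. The sketch below assumes a plan saved with terraform plan -out=tfplan and converted with terraform show -json tfplan, whose output lists each resource under "resource_changes" with a "change.actions" array. The function name `destructive_changes` and the resource addresses in the sample are hypothetical.

```python
def destructive_changes(plan):
    """List (address, actions) for resources whose planned actions include a delete.

    `plan` is the parsed JSON from `terraform show -json <planfile>`.
    Replacements appear as ["delete", "create"] or ["create", "delete"].
    """
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append((change["address"], actions))
    return flagged


if __name__ == "__main__":
    # Made-up sample: one in-place update and one replacement.
    sample_plan = {
        "resource_changes": [
            {"address": "aws_cloudwatch_metric_alarm.example",
             "change": {"actions": ["update"]}},
            {"address": "aws_ecs_service.example",
             "change": {"actions": ["delete", "create"]}},
        ]
    }
    for address, actions in destructive_changes(sample_plan):
        print("needs special handling:", address, actions)
```

An empty result suggests the change is in-place only; anything flagged is a candidate for the special handling mentioned in the last step.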

In the rare case that it is destructive, the only downside would be some downtime; the harvester just needs to be turned off while resources are restarting or being recreated. The restart Lambda is completely separate from the ECS definition, which gives me confidence that it is safe. But you can also make the change in the development-ssb space and then see what the logs in the broker show. It doesn't give you a pre-plan, but it should tell you the output of the Terraform code, and then you can see what has changed.

In terms of how to change the terraform code, it sounds like the code already supports all of the features, but if it doesn't, let me know if there's anything specific that's needed.

@neilmb neilmb moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Mar 10, 2025
@neilmb
Contributor Author

neilmb commented Mar 10, 2025

GSA-TTS/datagov-brokerpak-solr#99 is on the way towards being able to do this.

@neilmb neilmb moved this from 👀 Needs Review [2] to 🏗 In Progress [8] in data.gov team board Mar 12, 2025