Update Solr out of memory threshold to fix false-positive alerts #5063
Comments
We need the We can make some improvements to the Terraform script. We can relax the threshold a little bit and make the
I agree with @FuhuXia: the alert is a trigger to keep Solr functional. The alerts don't require human intervention.
As Fuhu said, we do need this CloudWatch threshold to restart Solr workers that are about to run out of memory. Restarts are done by an AWS Lambda function that triggers off an AWS SNS topic (https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/restarts.tf#L5), using code that is templated in by the brokerpak Terraform (https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/app_template.tf). Slack notifications are performed by that same Lambda function, under control of a cloud.gov service configuration variable called "slackNotification" (https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/solr-on-ecs.yml#L83). The webhook URL that the notification posts to is in an AWS Secrets Manager item called We can't see what terraform actions the brokerpak will take for a given
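The alarm-to-restart flow described in this comment can be sketched roughly as follows. This is a minimal illustration, not the brokerpak's actual Lambda code: `plan_actions` and its `slack_notification` parameter are hypothetical names, and the sketch only decides what to do rather than calling AWS or Slack. It assumes the usual SNS-to-Lambda event shape, where the CloudWatch alarm arrives as a JSON string in `Records[].Sns.Message`.

```python
import json


def plan_actions(event, slack_notification=False):
    """Return (alarm_name, action) pairs for alarms in an SNS-delivered event.

    Hypothetical helper: it only plans actions; it does not restart anything.
    """
    actions = []
    for record in event.get("Records", []):
        # CloudWatch alarm notifications arrive as a JSON string in Sns.Message.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # ignore OK / INSUFFICIENT_DATA transitions
        name = message.get("AlarmName", "unknown")
        actions.append((name, "restart"))
        if slack_notification:
            # Mirrors the idea of the "slackNotification" switch (assumed boolean).
            actions.append((name, "notify_slack"))
    return actions
```

Keeping the restart and the notification as separate planned actions reflects the point made in the thread: the restart is needed to keep Solr functional, while the human-facing notification is optional.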
From what I remember, you are right, @neilmb: the CSB doesn't support any terraform As a general sketch,
In the rare case that it is destructive, the only downside would be some downtime; the harvester just needs to be turned off while resources are restarting or being recreated. The restart Lambda is completely separate from the ECS definition, which gives me confidence that it is safe. But you can also make the change in the In terms of how to change the Terraform code, it sounds like the code already supports all of the features, but if it doesn't, let me know if anything specific is needed.
GSA-TTS/datagov-brokerpak-solr#99 is on the way towards being able to do this.
Our Solr cluster regularly raises "out of memory" alerts at a threshold of 25000, but then seems to recover very soon after. These "false-positive" alerts are a drain on limited cognitive resources.
How to reproduce
Look in your email inbox for messages titled "ALARM: "Solr--Follower--MemoryThreshold""
Expected behavior
When the Solr cluster is running normally, I expect to get no email alerts.
Actual behavior
I get an email alert about a condition that does not appear to be any sort of problem because it resolves itself.
Sketch
The threshold for these alerts is set in https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/cloudwatch.tf#L45. We could raise it there, make a new release of that brokerpak, bump the brokerpak version number in datagov-ssb, and apply those changes to our running infrastructure.
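To illustrate why a brief spike can trip the alarm and then clear, here is a rough sketch of threshold evaluation. It is not the brokerpak's configuration: the sample readings, the `breaches` helper, and the assumption that the alarm fires when the metric rises above the threshold are all hypothetical. The `evaluation_periods` and `datapoints_to_alarm` parameter names mirror CloudWatch's "M out of N" alarm settings; whether the real alarm uses them is not stated in this issue.

```python
def breaches(samples, threshold, evaluation_periods=1, datapoints_to_alarm=1):
    """Return True if the most recent window would put the alarm in ALARM.

    Assumes a "greater than threshold" comparison (an assumption, not taken
    from cloudwatch.tf).
    """
    window = samples[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return breaching >= datapoints_to_alarm


# Made-up memory readings with a single short spike over 25000.
readings = [21000, 22000, 26000, 23000, 22500]

# Evaluate after each new reading, as the alarm would.
single = [breaches(readings[:i], 25000) for i in range(1, len(readings) + 1)]
sustained = [
    breaches(readings[:i], 25000, evaluation_periods=3, datapoints_to_alarm=2)
    for i in range(1, len(readings) + 1)
]
```

With a one-datapoint alarm, the spike fires once and clears on the next reading, which matches the pattern reported here. Raising the threshold or requiring several breaching datapoints are two ways to suppress it, at the cost of reacting later to a genuine memory problem.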