
Update Solr out of memory threshold to fix false-positive alerts #5063

Open
neilmb opened this issue Feb 4, 2025 · 10 comments

neilmb (Contributor) commented Feb 4, 2025

Our Solr cluster regularly alerts as being "out of memory" at a threshold of 25000, but then seems to recover very soon after. These false-positive alerts are a drain on limited cognitive resources.

How to reproduce

Look in your email inbox for messages titled "ALARM: "Solr--Follower--MemoryThreshold""

Expected behavior

When the Solr cluster is running normally, I expect to get no email alerts.

Actual behavior

I get an email alert about a condition that does not appear to be any sort of problem because it resolves itself.

Sketch

The threshold for these alerts is set in https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/cloudwatch.tf#L45. We could raise it there, make a new release of that brokerpak, bump the brokerpak's version number in datagov-ssb, and apply those changes to our actual running infrastructure.
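
For orientation, a minimal sketch of what a CloudWatch alarm like this looks like in Terraform. The resource name, metric, namespace, and SNS topic here are illustrative assumptions, not the actual brokerpak code:

```hcl
# Illustrative sketch only; names, namespace, and arguments are assumed,
# not copied from datagov-brokerpak-solr.
resource "aws_cloudwatch_metric_alarm" "solr_follower_memory" {
  alarm_name          = "Solr--Follower--MemoryThreshold"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  namespace           = "ECS/ContainerInsights"
  metric_name         = "MemoryUtilized"
  period              = 60
  statistic           = "Average"
  threshold           = 25000 # raising this value is the change proposed here
  alarm_actions       = [aws_sns_topic.solr_alerts.arn] # hypothetical topic
}
```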

@neilmb neilmb added the bug Software defect or bug label Feb 4, 2025
@neilmb neilmb self-assigned this Feb 4, 2025
@neilmb neilmb moved this to 📥 Queue in data.gov team board Feb 4, 2025
FuhuXia (Member) commented Feb 4, 2025

We need the 25000 threshold. There is a memory leak somewhere in the Solr ECS setup. Without the threshold and auto-restart mechanism, memory usage will keep going up and exhaust the 28000 limit very soon. Then you will need manual intervention to bring Solr back.

(attached image: memory usage graph)

We can make some improvements to the Terraform script, though. We can relax the threshold a little, to 27000, buying us more time between restarts. We can also set a random threshold between 25000 and 27000 so that the Solr instances do not all restart at the same time. And we can stop sending the alerts to the Slack channel (email is enough), since a Solr restart no longer causes any problems. A sketch of the staggered-threshold idea follows.
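
A minimal sketch of the staggered-threshold idea, assuming the hashicorp/random provider is available; the input variable is hypothetical:

```hcl
# Illustrative sketch: pick a per-instance threshold between 25000 and 27000
# so all Solr instances do not restart at the same time. The keeper pins the
# random value to the instance so it stays stable across terraform applies.
resource "random_integer" "memory_threshold" {
  min = 25000
  max = 27000
  keepers = {
    instance = var.instance_name # hypothetical input variable
  }
}

resource "aws_cloudwatch_metric_alarm" "solr_follower_memory" {
  # ...same arguments as the existing alarm...
  threshold = random_integer.memory_threshold.result
}
```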

nickumia commented Feb 9, 2025

I agree with @FuhuXia: the alert is a trigger to keep Solr functional. The alerts don't require human intervention.

@neilmb neilmb moved this from 📥 Queue to 🏗 In Progress [8] in data.gov team board Feb 14, 2025
neilmb (Contributor, Author) commented Feb 14, 2025

As Fuhu said, we do need this CloudWatch threshold to restart Solr workers that are about to run out of memory. (Restarts are done by an AWS Lambda that triggers off an AWS SNS topic, https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/restarts.tf#L5, using code that is templated in by the brokerpak Terraform: https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/terraform/ecs/provision/app_template.tf. A sketch of that wiring follows.)
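
For context, the standard alarm-to-restart wiring looks roughly like this in Terraform; this is an assumption about the pattern, with hypothetical names, not the brokerpak's actual resources:

```hcl
# Hypothetical sketch of the alarm -> SNS -> Lambda restart path.
resource "aws_sns_topic" "solr_alerts" {
  name = "solr-memory-alerts"
}

resource "aws_sns_topic_subscription" "restart_lambda" {
  topic_arn = aws_sns_topic.solr_alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.restart_solr.arn
}

# SNS needs explicit permission to invoke the function.
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.restart_solr.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.solr_alerts.arn
}
```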

Slack notifications are performed by that same Lambda function, under control of a cloud.gov service configuration variable called "slackNotification": https://github.com/GSA-TTS/datagov-brokerpak-solr/blob/main/solr-on-ecs.yml#L83. The webhook URL that the notification posts to lives in an AWS SecretsManager item called slackSolrEventNotificationUrl. If we want to change the destination of the Slack notification, we need to change the value of that secret and then convince the broker to re-run the Terraform so the Lambda picks up the new value; a sketch of the assumed mechanism is below.
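
As a sketch of that mechanism (assumed, not verified against the brokerpak), the secret could flow into the Lambda's environment at apply time, which would explain why the broker has to re-run Terraform after the secret changes:

```hcl
# Illustrative sketch: the Lambda only sees the secret value captured at
# terraform apply time, so changing the secret requires re-running terraform.
data "aws_secretsmanager_secret_version" "slack_webhook" {
  secret_id = "slackSolrEventNotificationUrl"
}

resource "aws_lambda_function" "restart_solr" {
  function_name = "solr-restart"            # hypothetical name
  handler       = "app.handler"
  runtime       = "python3.9"
  filename      = "lambda.zip"              # packaging elided
  role          = aws_iam_role.restart.arn  # hypothetical role

  environment {
    variables = {
      SLACK_WEBHOOK_URL = data.aws_secretsmanager_secret_version.slack_webhook.secret_string
    }
  }
}
```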

We can't see what Terraform actions the brokerpak will take for a given cf update-service, so we need to test in development while observing the effects in AWS to be sure it is doing what we want.

nickumia commented
From what I remember, you are right, @neilmb: the CSB doesn't support any terraform plan-ning capability. However, if you wanted to, you could launch a standalone version of the terraform code, make the terraform variable input changes, and see what the difference in the plan would be and whether it would be destructive to other resources. My gut tells me it should be a safe change, but instructions for testing this change are outlined in Practical Learning > WORKFLOW 1 of

As a general sketch,

  • Bring up Solr using terraform apply
  • Make changes to terraform code
  • Do a terraform plan to see the changes
  • Make a decision on if the changes require special handling

In the rare case that it is destructive, the only downside would be some downtime, and the harvester just needs to be turned off while resources are restarting/recreating. The restart Lambda is completely separate from the ECS definition, which gives me confidence that it is safe. But you can also make the change in the development-ssb space and then see what the broker's logs have. That doesn't give you a pre-plan, but it should show you the output of the terraform run, and then you can see what changed.

In terms of how to change the terraform code, it sounds like the code already supports all of the features, but if it doesn't, let me know if there's anything specific that's needed.


@neilmb neilmb moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Mar 10, 2025
neilmb (Contributor, Author) commented Mar 10, 2025

GSA-TTS/datagov-brokerpak-solr#99 is a step toward being able to do this.

@neilmb neilmb moved this from 👀 Needs Review [2] to 🏗 In Progress [8] in data.gov team board Mar 12, 2025
neilmb (Contributor, Author) commented Mar 21, 2025

For a service set up with slackNotification: false, running cf update-service <name> -c '{"slackNotification": true}' did not change the Lambda code to add the notification logic. Unfortunately, further research is needed to figure out whether we can update brokered resources like this or not.

neilmb (Contributor, Author) commented Mar 21, 2025

Next step: go back to datagov-brokerpak-solr, run the CSB in Docker inside that repo, try out this API method (https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#updating-a-service-instance), and watch what happens on the backend with the Terraform code to understand what gets called.

neilmb (Contributor, Author) commented Mar 28, 2025

It has been good (and genuinely important) to learn about the brokers, but this particular ticket is a red herring: the Slack "out of memory" notifications are actually coming from an email integration set up manually on our SNS topics, which sends email into our alerts channel using a special datagov-alerts-...@gsa.org.slack.com address.

The slackNotification code in the AWS Lambda has, as far as I can tell, never worked. The slack_sdk Python module it tries to use to post notifications does not get installed, and this error shows up in the Lambda's CloudWatch logs when the service is set up with slackNotification: true:

[ERROR] ModuleNotFoundError: No module named 'slack_sdk'
Traceback (most recent call last):
  File "/var/task/app.py", line 21, in handler
    notifySlack(message_json, service_dimensions['ClusterName'], service_dimensions['ServiceName'])
  File "/var/task/app.py", line 54, in notifySlack
    from slack_sdk.webhook import WebhookClient
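
For what it's worth, a common way to make such an import succeed is to vendor the dependency into the Lambda deployment package at build time. A hypothetical sketch (not the brokerpak's actual packaging; file paths and names are assumed) might look like:

```hcl
# Hypothetical sketch: install slack_sdk into the Lambda source dir before
# zipping, so `from slack_sdk.webhook import WebhookClient` can succeed.
resource "null_resource" "lambda_deps" {
  triggers = {
    requirements = filemd5("${path.module}/lambda/requirements.txt") # hypothetical file
  }
  provisioner "local-exec" {
    command = "pip install -r ${path.module}/lambda/requirements.txt -t ${path.module}/lambda"
  }
}

data "archive_file" "lambda_zip" {
  type        = "zip"
  source_dir  = "${path.module}/lambda"
  output_path = "${path.module}/lambda.zip"
  depends_on  = [null_resource.lambda_deps]
}
```

Given the proposal below to remove the slackNotification code entirely, this is context for the failure rather than a recommended fix.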

neilmb (Contributor, Author) commented Mar 28, 2025

Documentation and a proposed restructuring of the management spaces are in https://docs.google.com/document/d/1ydoq9BTfg9oz7-nrZkBxCWcBn9IAS11WzneovxhyA-8/edit?tab=t.0

neilmb (Contributor, Author) commented Mar 28, 2025

For the actual notification issue, we can remove the SNS email subscriptions to stop those messages from showing up in our alerts channel. Then I propose removing the slackNotification code entirely from the brokerpak, testing and deploying that brokerpak code all the way up to management, and updating all of the existing services across the development, staging, and prod spaces.
