Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests #10691

Open
wants to merge 7 commits into
base: main

Contributor

@santihernandezc santihernandezc commented Feb 19, 2025

Description

If the grafana_alertmanager_conditionally_skip_tenant_suffix option is configured and a Grafana Alertmanager tenant doesn't have a promoted, non-default, non-empty configuration, we skip initializing it.
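The skip condition above can be sketched as a small predicate. This is a minimal sketch, not the code from this PR: `tenantConfig`, `usable`, and `shouldSkip` are hypothetical names, and it assumes (based on the option name and the test setup below) that only tenants whose ID ends with the configured suffix are candidates for skipping.

```go
package main

import (
	"fmt"
	"strings"
)

// tenantConfig is a hypothetical, simplified view of a tenant's Grafana
// Alertmanager configuration state. The real Mimir types differ.
type tenantConfig struct {
	Promoted bool   // the configuration was promoted
	Default  bool   // the configuration is the default one
	Raw      string // raw configuration; empty means no configuration
}

// usable reports whether the configuration is promoted, non-default and non-empty.
func (c tenantConfig) usable() bool {
	return c.Promoted && !c.Default && strings.TrimSpace(c.Raw) != ""
}

// shouldSkip reports whether initialization would be skipped for a tenant.
// Assumption: an empty suffix means the option is not configured, and only
// tenants whose ID ends with the suffix can be skipped.
func shouldSkip(tenantID, suffix string, cfg tenantConfig) bool {
	if suffix == "" || !strings.HasSuffix(tenantID, suffix) {
		return false
	}
	return !cfg.usable()
}

func main() {
	const suffix = "-grafana"
	custom := tenantConfig{Promoted: true, Raw: "route: {receiver: team-b}"}

	fmt.Println(shouldSkip("team-a", suffix, tenantConfig{}))         // suffix does not match
	fmt.Println(shouldSkip("team-a-grafana", suffix, tenantConfig{})) // matches, empty config
	fmt.Println(shouldSkip("team-b-grafana", suffix, custom))         // matches, usable config
}
```

Keeping the check as a pure function of tenant ID, suffix, and configuration state makes it easy to unit test the three tenant groups used in the Testing section.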

The problem is that clients making requests against an uninitialized Alertmanager get 406s. Unless a Grafana Alertmanager has a "usable" configuration, users won't be able to test templates, test and get receivers, create silences, etc.

This PR makes the multi-tenant Alertmanager start per-tenant Grafana Alertmanagers on incoming requests. This way, requests can be handled even if tenants were initially skipped.
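One way to picture starting an Alertmanager on an incoming request is a registry that creates the per-tenant handler on first use. The sketch below is an illustration under assumptions, not the implementation in this PR: `lazyAlertmanagers`, `getOrStart`, and `demo` are hypothetical names, the `start` callback stands in for starting a real per-tenant Alertmanager, and the request path is only an example. The tenant is read from the `X-Scope-OrgID` header.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// lazyAlertmanagers is a hypothetical registry of per-tenant handlers that
// are started on demand instead of at configuration sync time.
type lazyAlertmanagers struct {
	mu      sync.Mutex
	running map[string]http.Handler
	start   func(tenantID string) http.Handler // stands in for starting an Alertmanager
}

func newLazyAlertmanagers(start func(tenantID string) http.Handler) *lazyAlertmanagers {
	return &lazyAlertmanagers{running: make(map[string]http.Handler), start: start}
}

// getOrStart returns the tenant's handler, starting it if it is not running.
func (l *lazyAlertmanagers) getOrStart(tenantID string) http.Handler {
	l.mu.Lock()
	defer l.mu.Unlock()
	if h, ok := l.running[tenantID]; ok {
		return h
	}
	h := l.start(tenantID)
	l.running[tenantID] = h
	return h
}

// isRunning reports whether a handler exists for the tenant.
func (l *lazyAlertmanagers) isRunning(tenantID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	_, ok := l.running[tenantID]
	return ok
}

// ServeHTTP resolves the tenant from the X-Scope-OrgID header and forwards
// the request, starting the tenant's handler first if needed.
func (l *lazyAlertmanagers) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	tenantID := r.Header.Get("X-Scope-OrgID")
	if tenantID == "" {
		http.Error(w, "missing tenant ID", http.StatusUnauthorized)
		return
	}
	l.getOrStart(tenantID).ServeHTTP(w, r)
}

// demo sends one request for a tenant that has not been started yet.
func demo() (runningBefore bool, status int, body string, runningAfter bool) {
	ams := newLazyAlertmanagers(func(tenantID string) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprintf(w, "handled by %s", tenantID)
		})
	})
	const tenant = "team-a-grafana"
	runningBefore = ams.isRunning(tenant)

	req := httptest.NewRequest(http.MethodGet, "/alertmanager/api/v2/silences", nil)
	req.Header.Set("X-Scope-OrgID", tenant)
	rec := httptest.NewRecorder()
	ams.ServeHTTP(rec, req)

	return runningBefore, rec.Code, rec.Body.String(), ams.isRunning(tenant)
}

func main() {
	fmt.Println(demo())
}
```

Holding the lock across the lookup and the start keeps two concurrent first requests for the same tenant from starting it twice, at the cost of serializing startups.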

It also adds a grace period for idle Alertmanagers. Whenever a skipped Alertmanager gets a request, we start the Alertmanager and keep a timestamp indicating when this request was received. After the grace period elapses, we shut down the Alertmanager.
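The grace-period bookkeeping can be sketched as a map of last-request timestamps that is swept periodically. This is a rough sketch under assumptions: `idleTracker`, `touch`, and `expire` are hypothetical names, the 5-minute grace period is an arbitrary value chosen for the example (the PR's default is not stated here), and it assumes each new request refreshes the timestamp. Only tenants that were initially skipped would be tracked this way, so a tenant with a usable configuration is never stopped by the sweep.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// idleTracker is a hypothetical record of when each lazily started tenant
// last received a request.
type idleTracker struct {
	mu          sync.Mutex
	gracePeriod time.Duration
	lastRequest map[string]time.Time
}

func newIdleTracker(gracePeriod time.Duration) *idleTracker {
	return &idleTracker{gracePeriod: gracePeriod, lastRequest: make(map[string]time.Time)}
}

// touch records that the tenant received a request at the given time.
func (t *idleTracker) touch(tenantID string, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastRequest[tenantID] = now
}

// expire returns, in sorted order, the tenants whose grace period has elapsed
// at the given time and forgets them. The caller is expected to stop them.
func (t *idleTracker) expire(now time.Time) []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var idle []string
	for tenantID, last := range t.lastRequest {
		if now.Sub(last) >= t.gracePeriod {
			idle = append(idle, tenantID)
			delete(t.lastRequest, tenantID)
		}
	}
	sort.Strings(idle)
	return idle
}

// demo runs two sweeps against two tenants with different last-request times.
func demo() (first, second []string) {
	start := time.Date(2025, time.February, 20, 12, 0, 0, 0, time.UTC)
	tracker := newIdleTracker(5 * time.Minute) // assumed value, for illustration only

	tracker.touch("team-a-grafana", start)
	tracker.touch("team-b-grafana", start.Add(4*time.Minute))

	first = tracker.expire(start.Add(6 * time.Minute))   // only team-a has been idle long enough
	second = tracker.expire(start.Add(10 * time.Minute)) // team-b has now been idle for 6 minutes
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println(first, second)
}
```

Passing the current time in as an argument, rather than calling `time.Now()` inside the tracker, keeps the sweep deterministic in tests.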

Testing

I tested this PR by spinning up two Alertmanagers (read-write mode) with:

  • multitenancy_enabled: true
  • grafana_alertmanager_compatibility_enabled: true
  • grafana_alertmanager_conditionally_skip_tenant_suffix: -grafana

I then created 200 Alertmanager tenants:

  • 100 of them not matching the configured suffix and using an empty configuration
  • 99 of them matching the suffix and using an empty configuration
  • 1 of them matching the suffix and using a promoted, non-default, non-empty config

The 99 tenants with empty configuration and matching the suffix were initially skipped. I then sent test alerts for each tenant matching the suffix. Alertmanagers for each of them were started.

After the default grace period passed, all Alertmanagers for tenants matching the configured suffix were stopped, except the one using a promoted, non-default, non-empty configuration.

Query for tenants skipped per instance

Screenshot 2025-02-20 at 5 20 16 PM

Query for active Alertmanagers by type (Mimir/Grafana)

Screenshot 2025-02-20 at 5 23 46 PM

Contributor

github-actions bot commented Feb 20, 2025

@@ -104,7 +104,6 @@ type Config struct {
 	PersisterConfig PersisterConfig

 	GrafanaAlertmanagerCompatibility bool
-	GrafanaAlertmanagerTenantSuffix string
Contributor Author

Unrelated fix, this was not being used here.

@santihernandezc santihernandezc changed the title from "(WIP) Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests" to "Alertmanager: Initialize skipped Grafana Alertmanagers receiving requests" Feb 20, 2025
@santihernandezc santihernandezc marked this pull request as ready for review February 20, 2025 16:32
@santihernandezc santihernandezc requested review from a team and tacole02 as code owners February 20, 2025 16:33
Contributor

@tacole02 tacole02 left a comment

Docs look great! Thank you!

2 participants