Manager metrics are inconsistent #2191

mykaul · 2024-02-29T10:04:09Z

In v4.6.2, right next to the Manager stats, there is a green '100%' with no description or label.

tzach · 2024-02-29T13:12:28Z

Good question

amnonh · 2024-02-29T14:18:03Z

@Michal-Leszczynski The issue we are facing is that we don't have a consistent way of knowing when a repair/backup is in progress.
The overview dashboard uses two different metrics, each of them returning a different value.
What would be a trusted and consistent way of knowing:
a. a repair/backup is in progress
b. present progress

Note that when the repair/backup is done, we will show 0%

Michal-Leszczynski · 2024-02-29T15:34:50Z

a. a repair/backup is in progress

There is a metric scylla_manager_scheduler_run_indicator labeled by cluster ID, task type, task ID which shows 1 for a running task and 0 otherwise (paused tasks are not running). Why can't it be used?

b. present progress

There are some metrics that can be used for progress calculation, although they might be confusing (e.g. token ranges aren't weighted by table size). We can add a general scylla_manager_{task_type}_progress metric to solve this problem.

amnonh · 2024-02-29T16:35:23Z

Back to the previous question: what is the simplest way to know if a repair/backup is in progress, and how far along it is?

For a future metric, let's see how we can reduce the number of metrics as much as possible and have fewer metrics that tell us what is happening. Thousands of metrics are not helpful.

Michal-Leszczynski · 2024-03-01T10:12:04Z

There is a metric scylla_manager_scheduler_run_indicator labeled by cluster ID, task type, task ID which shows 1 for a running task and 0 otherwise (paused tasks are not running). Why can't it be used?

This can be used to check if a task is running.

In terms of progress:

repair - scylla_manager_repair_progress labeled only by cluster ID.
backup - there is no dedicated progress metric, but we have other metrics that can help with that:

	filesSizeBytes: g("Total size of backup files in bytes.",
		"files_size_bytes", "cluster", "keyspace", "table", "host"),
	filesUploadedBytes: g("Number of bytes uploaded to backup location.",
		"files_uploaded_bytes", "cluster", "keyspace", "table", "host"),
	filesSkippedBytes: g("Number of deduplicated bytes already uploaded to backup location.",
		"files_skipped_bytes", "cluster", "keyspace", "table", "host"),
	filesFailedBytes: g("Number of bytes failed to upload to backup location.",
		"files_failed_bytes", "cluster", "keyspace", "table", "host"),

so progress = (files_uploaded_bytes + files_skipped_bytes + files_failed_bytes) / files_size_bytes
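
As a rough illustration of the formula above, here is a minimal Python sketch. It assumes the four values have already been summed over the keyspace, table, and host labels for one cluster; the function name `backup_progress` and the guard for a non-positive total are illustrative choices, not part of Scylla Manager.

```python
def backup_progress(size_bytes, uploaded_bytes, skipped_bytes, failed_bytes):
    """Return backup progress as a fraction in [0, 1].

    Implements (uploaded + skipped + failed) / size from the comment above.
    Inputs are assumed to be per-cluster sums of the corresponding gauges.
    """
    if size_bytes <= 0:
        # No data yet, or the gauges hold a reset value: report no progress
        # instead of dividing by zero or returning a negative ratio.
        return 0.0
    done = uploaded_bytes + skipped_bytes + failed_bytes
    # Clamp in case the counters briefly overshoot the reported total.
    return min(done / size_bytes, 1.0)


# 600 B uploaded, 300 B deduplicated, 100 B failed, out of 2000 B in total.
print(backup_progress(2000, 600, 300, 100))  # 0.5
```

Note that, as in the formula, failed bytes count as processed, so this measures how much of the backup has been attempted rather than how much succeeded.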

amnonh · 2024-03-03T09:11:56Z

The change that causes the current issue is #2122
scylla_manager_repair_progress shows 100 but no repair is running

@Michal-Leszczynski, the issues around the manager metrics have been haunting us for a few years now; please look seriously at the problem.

The manager creates a lot of metrics no one looks at and lacks the few metrics we need.

I also see that we are using scylla_manager_task_active_count. From what version is scylla_manager_scheduler_run_indicator valid?

amnonh · 2024-03-05T11:03:50Z

@Michal-Leszczynski ping, I would like to have it resolved for the 4.7 release.

Michal-Leszczynski · 2024-03-06T09:35:42Z

The change that causes the current issue is #2122
scylla_manager_repair_progress shows 100 but no repair is running

At the beginning of each task run, SM resets (sets to -1) all metrics of the given task type for this cluster, but it doesn't reset them at the end of the task run. So the 100% is just a leftover from the previous repair.

I can change it so SM resets task metrics at the end of each task run as well. Paused or failed tasks won't have their metrics reset. Would that be ok?
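
To make the current and proposed behavior concrete, here is a toy Python model of a single progress gauge. It is only an illustration of the description above, not Scylla Manager's code: the class name, the `reset_at_end` switch, and the status strings `"done"`, `"paused"`, and `"failed"` are all hypothetical.

```python
RESET = -1  # value written when a metric is reset, per the comment above


class ProgressGauge:
    """Toy stand-in for one task-type progress metric."""

    def __init__(self):
        self.value = RESET

    def start_run(self):
        # Current behavior: metrics are reset at the start of each run.
        self.value = RESET

    def set_progress(self, percent):
        self.value = percent

    def end_run(self, status, reset_at_end):
        # Proposed behavior: also reset at the end of a run,
        # except when the run was paused or failed.
        if reset_at_end and status not in ("paused", "failed"):
            self.value = RESET


gauge = ProgressGauge()
gauge.start_run()
gauge.set_progress(100)
gauge.end_run("done", reset_at_end=False)
print(gauge.value)  # 100: stale value left over after the run finished

gauge.end_run("done", reset_at_end=True)
print(gauge.value)  # -1: reset once the run has completed
```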

I also see that we are using scylla_manager_task_active_count. From what version is scylla_manager_scheduler_run_indicator valid?

From SM 3.0.

The manager creates a lot of metrics no one looks at and lacks the few metrics we need.

Ref: scylladb/scylla-manager#3732, please answer my question there so that we can decide how to approach this problem.

amnonh · 2024-03-06T10:01:24Z

@Michal-Leszczynski I still don't have an answer. We need to have a solution for the next monitoring release (4.7); manager information is currently broken.

So regardless of future improvements, with what we have today, how can we tell that a repair/backup is currently running, and what its progress is?

Michal-Leszczynski · 2024-03-06T10:13:03Z

Currently running task:
scylla_manager_scheduler_run_indicator

Task progress:
#2191 (comment)

Michal-Leszczynski · 2024-03-06T10:14:14Z

Perhaps I don't understand the problem, because I think I have answered those questions a few times by now?
Why are those answers not useful for this issue?

amnonh · 2024-03-06T10:50:25Z

@Michal-Leszczynski, because it's inconsistent and unclear: we have scylla_manager_task_active_count and scylla_manager_scheduler_run_indicator.

We don't show the task ID; we need to know running or not running for repair/backup. How exactly do I determine that?

How exactly do I determine the progress? Showing 100% when nothing is running is not good enough.

Michal-Leszczynski · 2024-03-06T12:57:41Z

we have scylla_manager_task_active_count

Actually, I believe that this metric has been deleted starting with SM 3.0.

We don't show the task ID; we need to know running or not running for repair/backup. How exactly do I determine that?

You can use something like:

is_task_type_running = sum(scylla_manager_scheduler_run_indicator where cluster=ID and type=repair and task=*) > 0
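
The expression above is pseudocode. A minimal Python sketch of the same aggregation follows, assuming the samples of scylla_manager_scheduler_run_indicator are available as (labels, value) pairs and that the label names are `cluster`, `type`, and `task` as written in the expression; the sample data is made up.

```python
def is_task_type_running(samples, cluster, task_type):
    """Return True if any task of the given type is running in the cluster.

    samples: iterable of (labels, value) pairs for the run indicator,
    where value is 1 for a running task and 0 otherwise.
    """
    total = sum(
        value
        for labels, value in samples
        # Match on cluster and type only; the task label is a wildcard.
        if labels.get("cluster") == cluster and labels.get("type") == task_type
    )
    return total > 0


# Made-up samples: two repair tasks (one running) and one idle backup task.
samples = [
    ({"cluster": "c1", "type": "repair", "task": "t1"}, 0),
    ({"cluster": "c1", "type": "repair", "task": "t2"}, 1),
    ({"cluster": "c1", "type": "backup", "task": "t3"}, 0),
]
print(is_task_type_running(samples, "c1", "repair"))  # True
print(is_task_type_running(samples, "c1", "backup"))  # False
```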

How exactly do I determine the progress? Showing 100% when nothing is running is not good enough.

You can multiply progress from scylla_manager_repair_progress by is_task_type_running from above. It would take care of the 100% progress when the task is not running - but it would also mean that there is always 0% progress when the task is paused. The same goes for backup progress formula described here.

Without any changes to SM, it is not possible to see the difference between paused and finished tasks by just looking at the metrics.
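
A small Python sketch of this gating, under the assumption that the progress value and the running flag have already been read from the metrics; the function name `displayed_progress` is illustrative only.

```python
def displayed_progress(progress, running):
    """Progress to show on a dashboard: raw progress gated by the run indicator."""
    return progress * (1 if running else 0)


print(displayed_progress(100, False))  # 0: finished repair, stale 100 is hidden
print(displayed_progress(40, True))    # 40: repair in progress
print(displayed_progress(40, False))   # 0: paused at 40%, looks the same as finished
```

The first and last calls show the limitation described above: with gating alone, a paused task and a finished task both display 0.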

mykaul added the bug Something isn't working right label Feb 29, 2024

mykaul assigned amnonh Feb 29, 2024

amnonh changed the title ~~100% is confusing - 100% of what?~~ Manager metrics are inconsistent Feb 29, 2024

amnonh removed their assignment Feb 29, 2024

amnonh assigned Michal-Leszczynski Feb 29, 2024

Michal-Leszczynski mentioned this issue Mar 6, 2024

Aggregated metric issues scylladb/scylla-manager#3744

Open

amnonh mentioned this issue Mar 7, 2024

types.json: my back/repair consistent #2206

Merged

amnonh closed this as completed in #2206 Mar 7, 2024

amnonh added this to the Monitoring 4.7 milestone Mar 7, 2024
