Manager metrics are inconsistent #2191

Closed
Tracked by #3744
mykaul opened this issue Feb 29, 2024 · 13 comments · Fixed by #2206

@mykaul
Contributor

mykaul commented Feb 29, 2024

In v4.6.2, right next to the Manager stats, there is a green '100%' with no description or anything:
image

@mykaul mykaul added the bug Something isn't working right label Feb 29, 2024
@tzach
Contributor

tzach commented Feb 29, 2024

Good question

@amnonh amnonh changed the title from "100% is confusing - 100% of what?" to "Manager metrics are inconsistent" Feb 29, 2024
@amnonh amnonh removed their assignment Feb 29, 2024
@amnonh
Collaborator

amnonh commented Feb 29, 2024

@Michal-Leszczynski The issue we are facing is that we don't have a consistent way of knowing when a repair/backup is in progress.
The overview dashboard uses two different metrics, each of them returning a different value.
What would be a trusted and consistent way of knowing:
a. a repair/backup is in progress
b. present progress

Note that when the repair/backup is done, we will show 0%

@Michal-Leszczynski

a. a repair/backup is in progress

There is a metric scylla_manager_scheduler_run_indicator labeled by cluster ID, task type, task ID which shows 1 for a running task and 0 otherwise (paused tasks are not running). Why can't it be used?

b. present progress

There are some metrics that can be used for progress calculation, although they might be confusing (e.g. token ranges aren't weighted by table size). We can add a general scylla_manager_{task_type}_progress metric to solve this problem.
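
To illustrate why a progress figure based on unweighted token ranges can be confusing, here is a small sketch. The table sizes, range counts, and function names are made up for the example and are not taken from Scylla Manager; the point is only that counting ranges and weighting by size can give very different answers.

```python
def range_progress(tables):
    """Progress as done ranges / total ranges, ignoring table size."""
    done = sum(t["ranges_done"] for t in tables)
    total = sum(t["ranges_total"] for t in tables)
    return done / total


def size_weighted_progress(tables):
    """Progress where each table contributes in proportion to its size."""
    total_size = sum(t["size_gb"] for t in tables)
    weighted = sum(t["size_gb"] * t["ranges_done"] / t["ranges_total"] for t in tables)
    return weighted / total_size


# Hypothetical cluster: one tiny table fully processed, one huge table untouched.
tables = [
    {"size_gb": 1, "ranges_done": 100, "ranges_total": 100},
    {"size_gb": 99, "ranges_done": 0, "ranges_total": 100},
]
print(range_progress(tables))          # 0.5
print(size_weighted_progress(tables))  # 0.01
```

With these made-up numbers, the range-based figure reports 50% while only about 1% of the data has been processed.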

@amnonh
Collaborator

amnonh commented Feb 29, 2024

Back to the previous question: what is the simplest way to know if a repair/backup is in progress and how far along it is?

For a future metric, let's see how we can reduce the number of metrics as much as possible and have fewer metrics that tell us what is happening. Thousands of metrics are not helpful.

@Michal-Leszczynski

There is a metric scylla_manager_scheduler_run_indicator labeled by cluster ID, task type, task ID which shows 1 for a running task and 0 otherwise (paused tasks are not running). Why can't it be used?

This can be used to check if a task is running.

In terms of progress:

  • repair - scylla_manager_repair_progress labeled only by cluster ID.
  • backup - there is no dedicated progress metric, but we have other metrics that can help with that:
		filesSizeBytes: g("Total size of backup files in bytes.",
			"files_size_bytes", "cluster", "keyspace", "table", "host"),
		filesUploadedBytes: g("Number of bytes uploaded to backup location.",
			"files_uploaded_bytes", "cluster", "keyspace", "table", "host"),
		filesSkippedBytes: g("Number of deduplicated bytes already uploaded to backup location.",
			"files_skipped_bytes", "cluster", "keyspace", "table", "host"),
		filesFailedBytes: g("Number of bytes failed to upload to backup location.",
			"files_failed_bytes", "cluster", "keyspace", "table", "host"),

so progress = (files_uploaded_bytes + files_skipped_bytes + files_failed_bytes) / files_size_bytes
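
As a rough sketch of the formula above, assuming the four gauge values have already been summed across the keyspace, table, and host labels for one cluster. The function name, the zero-size handling, and the clamping are illustrative assumptions, not Scylla Manager behavior:

```python
def backup_progress(size_bytes, uploaded_bytes, skipped_bytes, failed_bytes):
    """Return backup progress as a fraction in [0, 1].

    Inputs are assumed to be the per-cluster sums of files_size_bytes,
    files_uploaded_bytes, files_skipped_bytes and files_failed_bytes.
    """
    if size_bytes <= 0:
        # Assumption: with no known backup size there is nothing to report.
        return 0.0
    processed = uploaded_bytes + skipped_bytes + failed_bytes
    # Clamp in case the gauges were scraped at slightly different moments.
    return min(processed / size_bytes, 1.0)


print(backup_progress(1000, 400, 100, 0))  # 0.5
```

Note that failed bytes count toward progress in this formula, so a value of 1.0 means every file was handled, not that every upload succeeded.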

@amnonh
Collaborator

amnonh commented Mar 3, 2024

The change that causes the current issue is #2122
scylla_manager_repair_progress shows 100 but no repair is running

@Michal-Leszczynski, the issues around the Manager metrics have been haunting us for a few years now; please look seriously at the problem.

The manager creates a lot of metrics no one looks at and lacks the few metrics we need.

I also see that we are using scylla_manager_task_active_count. From what version is scylla_manager_scheduler_run_indicator valid?

@amnonh
Collaborator

amnonh commented Mar 5, 2024

@Michal-Leszczynski ping, I would like to have it resolved for the 4.7 release

@Michal-Leszczynski

Michal-Leszczynski commented Mar 6, 2024

The change that causes the current issue is #2122
scylla_manager_repair_progress shows 100 but no repair is running

At the beginning of each task run, SM resets (sets to -1) all metrics of the given task type for the cluster, but it doesn't reset them at the end of the task run. So the 100% is just a leftover from the previous repair.

I can change it so SM resets task metrics at the end of each task run as well. Paused or failed tasks won't have their metrics reset. Would that be ok?
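
To make the leftover value concrete, here is a toy model of the behavior described above. It is a sketch only: the class and method names are hypothetical and do not come from the Scylla Manager code base. Only the reset to -1 at the start of a run, and the proposed extra reset at the end of a completed run, are taken from this comment.

```python
UNSET = -1  # value SM is described as writing when it resets a task metric


class RepairProgressGauge:
    """Hypothetical stand-in for scylla_manager_repair_progress."""

    def __init__(self, reset_on_finish):
        self.value = UNSET
        self.reset_on_finish = reset_on_finish

    def start_run(self):
        # Described behavior: metrics are reset at the beginning of each run.
        self.value = UNSET

    def report(self, percent):
        self.value = percent

    def finish_run(self):
        # Proposed change: also reset when the run completes.
        # Paused or failed runs would skip this and keep their last value.
        if self.reset_on_finish:
            self.value = UNSET


def run_to_completion(gauge):
    gauge.start_run()
    gauge.report(100)
    gauge.finish_run()
    return gauge.value


print(run_to_completion(RepairProgressGauge(reset_on_finish=False)))  # 100
print(run_to_completion(RepairProgressGauge(reset_on_finish=True)))   # -1
```

The first result is the stale 100 that the dashboard currently picks up; the second is what a scrape would see after the proposed change.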

I also see that we are using scylla_manager_task_active_count. From what version is scylla_manager_scheduler_run_indicator valid?

From SM 3.0.

The manager creates a lot of metrics no one looks at and lacks the few metrics we need.

Ref: scylladb/scylla-manager#3732, please answer my question there so that we can decide how to approach this problem.

@amnonh
Collaborator

amnonh commented Mar 6, 2024

@Michal-Leszczynski I still don't have an answer. We need to have a solution for the next monitoring release (4.7); Manager information is currently broken.

So, regardless of future improvements, with what we have today, how can we tell that a repair/backup is currently running, and what its progress is?

@Michal-Leszczynski

Currently running task:
scylla_manager_scheduler_run_indicator

Task progress:
#2191 (comment)

@Michal-Leszczynski

Perhaps I don't understand the problem, because I think I have answered those questions a few times by now?
Why are those answers not useful for this issue?

@amnonh
Collaborator

amnonh commented Mar 6, 2024

@Michal-Leszczynski, because it's inconsistent and unclear: we have scylla_manager_task_active_count and scylla_manager_scheduler_run_indicator.

We don't show the task ID; we need to know running or not running for repair/backup. How exactly do I determine that?

How exactly do I determine the progress? Showing 100% when nothing is running is not good enough.

@Michal-Leszczynski

we have scylla_manager_task_active_count

Actually, I believe that this metric has been deleted starting with SM 3.0.

We don't show the task ID; we need to know running or not running for repair/backup. How exactly do I determine that?

You can use something like:

is_task_type_running = sum(scylla_manager_scheduler_run_indicator where cluster=ID and type=repair and task=*) > 0

How exactly do I determine the progress? Showing 100% when nothing is running is not good enough.

You can multiply the progress from scylla_manager_repair_progress by is_task_type_running from above. It would take care of the 100% progress when the task is not running - but it would also mean that there is always 0% progress when the task is paused. The same goes for the backup progress formula described earlier in this thread.

Without any changes to SM, it is not possible to see the difference between paused and finished tasks by just looking at the metrics.
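
The gating described in this comment could be sketched as follows. This is an illustration, not a dashboard query: it assumes the scylla_manager_scheduler_run_indicator samples have already been fetched and filtered to one cluster and one task type, and the function names are hypothetical.

```python
def is_task_type_running(run_indicators):
    """run_indicators: run indicator samples (0 or 1), one per task ID,
    already filtered to a single cluster and task type."""
    return sum(run_indicators) > 0


def displayed_progress(progress, run_indicators):
    """Gate the raw progress value by whether any task of that type is running."""
    return progress * int(is_task_type_running(run_indicators))


print(displayed_progress(100, [0, 0]))  # 0: stale 100% from a finished repair is hidden
print(displayed_progress(42, [0, 1]))   # 42: a repair is running
print(displayed_progress(42, [0, 0]))   # 0: a paused repair also shows 0
```

The last case shows the limitation mentioned above: a paused task and a finished task are indistinguishable once the value is gated this way.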
