Manager metrics are inconsistent #2191
Comments
Good question |
@Michal-Leszczynski The issue we are facing is that we don't have a consistent way of knowing when a repair/backup is in progress. Note that when the repair/backup is done, we will show 0% |
There is a metric
There are some metrics that can be used for progress calculation, although they might be confusing (e.g. token ranges aren't weighted by table size). We can add a general |
Back to the previous question: what is the simplest way to know whether a repair/backup is in progress, and how far along it is? For a future metric, let's see how we can reduce the number of metrics as much as possible and have fewer metrics that tell us what is happening. Thousands of metrics are not helpful. |
This can be used to check if a task is running. In terms of progress:
filesSizeBytes: g("Total size of backup files in bytes.",
"files_size_bytes", "cluster", "keyspace", "table", "host"),
filesUploadedBytes: g("Number of bytes uploaded to backup location.",
"files_uploaded_bytes", "cluster", "keyspace", "table", "host"),
filesSkippedBytes: g("Number of deduplicated bytes already uploaded to backup location.",
"files_skipped_bytes", "cluster", "keyspace", "table", "host"),
filesFailedBytes: g("Number of bytes failed to upload to backup location.",
"files_failed_bytes", "cluster", "keyspace", "table", "host"), so |
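As a rough illustration of how the four gauges above could be folded into a single backup progress figure, here is a minimal sketch. It assumes that progress is (uploaded + skipped) / total size summed over all tables and hosts, and that a negative value is the -1 "reset" state mentioned elsewhere in this thread. The `BackupBytes` type and `BackupProgress` function are hypothetical names used only for illustration; they are not part of Scylla Manager.

```go
package main

// BackupBytes holds one sample of the gauges quoted above for a single
// (cluster, keyspace, table, host) combination. Hypothetical helper type.
type BackupBytes struct {
	Size     float64 // files_size_bytes
	Uploaded float64 // files_uploaded_bytes
	Skipped  float64 // files_skipped_bytes
	Failed   float64 // files_failed_bytes (not counted as progress here)
}

// BackupProgress returns an overall percentage in the range 0..100.
// ok is false when no usable data exists, so a caller can show
// "unknown" instead of a misleading number.
// Assumption: deduplicated (skipped) bytes count as already done.
func BackupProgress(samples []BackupBytes) (percent float64, ok bool) {
	var total, done float64
	for _, s := range samples {
		// Assumption: negative values mean the metric was reset (-1).
		if s.Size < 0 || s.Uploaded < 0 || s.Skipped < 0 {
			continue
		}
		total += s.Size
		done += s.Uploaded + s.Skipped
	}
	if total == 0 {
		return 0, false
	}
	p := done / total * 100
	if p > 100 {
		p = 100
	}
	return p, true
}
```

Summing bytes before dividing weights each table by its size, so one large table that is half uploaded is not averaged away by many small finished ones.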
The change that causes the current issue is #2122. @Michal-Leszczynski, the issues around the manager metrics have been haunting us for a few years now; please look seriously at the problem. The manager creates a lot of metrics no one looks at and lacks the few metrics we need. I also see that we are using |
@Michal-Leszczynski ping, I would like to have it resolved for the 4.7 release |
At the beginning of each task run, SM resets (sets to -1) all metrics of the given task type for this cluster, but it doesn't reset them at the end of the task run. So the 100% is just a leftover from the previous repair. I can change it so that SM resets task metrics at the end of each task run as well. Paused or failed tasks won't have their metrics reset. Would that be ok?
From SM 3.0.
Ref: scylladb/scylla-manager#3732, please answer my question there so that we can decide how to approach this problem. |
@Michal-Leszczynski I still don't have an answer. We need to have a solution for the next monitoring release (4.7); manager information is currently broken. So regardless of future improvements, with what we have today, how can we tell that a repair/backup is currently running, and what its progress is? |
Currently running task: Task progress: |
Perhaps I don't understand the problem, because I think I have answered those questions a few times by now? |
@Michal-Leszczynski, because it's inconsistent and unclear. We have scylla_manager_task_active_count and scylla_manager_scheduler_run_indicator. We don't show the task ID; we need to know running or not running for repair/backup. How exactly do I determine that? How exactly do I determine the progress? Showing 100% when nothing is running is not good enough. |
Actually, I believe that this metric has been deleted starting with SM 3.0.
You can use something like:
You can multiply progress from Without any changes to SM, it is not possible to see the difference between paused and finished tasks by just looking at the metrics. |
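To make the "running or not, and how far" question concrete, here is a minimal sketch of gating the displayed progress on a run indicator, so that a leftover 100% from a previous run is never shown while nothing is running. It assumes that scylla_manager_scheduler_run_indicator is 1 while a task runs and 0 otherwise, and that a negative progress value is the -1 reset state. `TaskStatus` is a hypothetical helper, not an SM or monitoring API.

```go
package main

import "fmt"

// TaskStatus renders what a dashboard panel could show for one task type.
// runIndicator: assumed to be 1 while the task runs, 0 otherwise.
// progress: percentage 0..100, or a negative value when reset/unknown.
func TaskStatus(runIndicator, progress float64) string {
	if runIndicator < 1 {
		// Hide stale progress left over from the previous run.
		return "not running"
	}
	if progress < 0 {
		return "running (progress unknown)"
	}
	return fmt.Sprintf("running: %.0f%%", progress)
}
```

This only hides stale values. As noted above, it still cannot tell a paused task from a finished one, because both report as not running.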
v4.6.2, just next to Manager stats, we have a green '100%' without description or anything:
[image]