scylla-advanced: Add a panel for scylla_io_queue_flow_ratio #2312

Merged
merged 1 commit into from
Jun 17, 2024


amnonh
Collaborator

@amnonh amnonh commented Jun 10, 2024

This patch adds a panel that shows scylla_io_queue_flow_ratio.
[Screenshot: Screenshot_20240610_214114]

Fixes #2306

@amnonh amnonh requested a review from vladzcloudius June 10, 2024 19:18
}
],
"description": "This graph shows the ratio of dispatch rate to completion rate. It is expected to be 1.0, growing larger on reactor stalls or disk problems.\n\nscylla_io_queue_flow_ratio",
"title": "I/O Group [[iogroup]] Queue flow ratio"
Contributor

We should have this graph collapsed by default, similar to Tombstones and MVs on the Detailed dashboard.

Collaborator Author

A panel cannot be collapsed; only a row can. I can collapse the row that contains this graph together with scylla_io_queue_consumption.

This patch adds a panel that shows scylla_io_queue_flow_ratio.

Fixes scylladb#2306

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
@amnonh
Collaborator Author

amnonh commented Jun 10, 2024

@vladzcloudius, how can we progress with this panel? The upcoming Scylla monitoring release will have the option to scroll through the tooltips, but I think that will not be enough. We should show only the average and remove anything that is close to 1, so the normal case is not shown. I'd appreciate your thoughts.

@vladzcloudius
Contributor

vladzcloudius commented Jun 12, 2024

> @vladzcloudius, how can we progress with this panel? The upcoming Scylla monitoring release will have the option to scroll through the tooltips, but I think that will not be enough. We should show only the average and remove anything that is close to 1, so the normal case is not shown. I'd appreciate your thoughts.

I agree that we should aggregate.
However I think it should not be an average.
Imagine that you have values [0.7, 1.3]. Your average is going to be 1.0 and you'd not notice that you have a problem.

STDEV, on the other hand, should be more informative.
Take a look at the examples below:

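| | Noisy set | Ideal set | Quiet set | One outlier |
|---|---|---|---|---|
| | 0.7 | 1 | 1.1 | 1 |
| | 1.3 | 1 | 0.9 | 1 |
| | 1.5 | 1 | 1.05 | 1 |
| | 0.5 | 1 | 0.95 | 1 |
| | 0.6 | 1 | 1.005 | 1 |
| | 1.4 | 1 | 0.995 | 1.4 |
| Average | 1 | 1 | 1 | 1.07 |
| STDEV | 0.45 | 0.00 | 0.07 | 0.16 |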
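
The figures in the table can be reproduced with a short script. This is only a sketch for checking the arithmetic: the set names and the `summarize` helper are made up for illustration, and it assumes the STDEV row is the sample standard deviation (dividing by n - 1), which is what matches the numbers above.

```python
from statistics import mean, stdev

# The four example sets from the table (names are illustrative only).
SETS = {
    "noisy": [0.7, 1.3, 1.5, 0.5, 0.6, 1.4],
    "ideal": [1, 1, 1, 1, 1, 1],
    "quiet": [1.1, 0.9, 1.05, 0.95, 1.005, 0.995],
    "one_outlier": [1, 1, 1, 1, 1, 1.4],
}


def summarize(values):
    """Return (average, sample standard deviation), rounded to 2 decimals."""
    return round(mean(values), 2), round(stdev(values), 2)


summary = {name: summarize(values) for name, values in SETS.items()}

for name, (avg, sd) in summary.items():
    print(f"{name:12s} average={avg:.2f} stdev={sd:.2f}")
```

The noisy and quiet sets share the same average but differ clearly in STDEV, which is the point of the comparison. A population standard deviation (dividing by n) would give slightly smaller values, about 0.41 for the noisy set, so any threshold should be chosen against whichever form the aggregation actually computes.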

@xemul WDYT?
I would not want to go all the way to histograms here; they would probably give the best statistical visibility but would also be the most expensive.
If you think STDEV is a good option - can you try to estimate what would be a "good" range threshold after which we should start digging deeper?

What we are looking for here is a clear way for Support people to work with this metric.
The instruction + aggregation should be able to answer the following simple question: "When does what I see mean that something bad is going on?"

@amnonh
Collaborator Author

amnonh commented Jun 12, 2024

@vladzcloudius I wrote multiple ideas, and none of them was average.
I think we should show the average as one line, and either show just the outliers (e.g. remove anything too close to the average) or let the user pick the aggregate function.

@vladzcloudius
Contributor

> @vladzcloudius I wrote multiple ideas, and none of them was average. I think we should show the average as one line, and either show just the outliers (e.g. remove anything too close to the average) or let the user pick the aggregate function.

The problem is that it may end up showing all shards' values. If we can't find anything better, we can start with this. However, I was hoping we WOULD find something better.

@amnonh
Collaborator Author

amnonh commented Jun 12, 2024

> The problem is that it may end up showing all shards' values. If we can't find anything better, we can start with this. However, I was hoping we WOULD find something better.

Do you have some real-life examples of what would be a safe range to remove, e.g. (0.9-1.1)?
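
As a rough sketch of what "remove the range" would mean for per-shard values: the function name, the shard-keyed dictionary, and the sample numbers below are hypothetical, and the 0.9-1.1 band is just the example range from this comment, not a validated threshold.

```python
def outside_band(ratios_by_shard, low=0.9, high=1.1):
    """Keep only shards whose flow ratio falls outside [low, high].

    The default band is the example range from the discussion; it is a
    placeholder until a real threshold is agreed on.
    """
    return {
        shard: ratio
        for shard, ratio in ratios_by_shard.items()
        if ratio < low or ratio > high
    }


# Hypothetical per-shard flow ratios, for illustration only.
sample = {0: 1.0, 1: 0.95, 2: 1.4, 3: 0.7}
flagged = outside_band(sample)
print(flagged)
```

With these sample values only shards 2 and 3 remain, so the normal case near 1.0 would not be drawn.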

@vladzcloudius
Contributor

> > The problem is that it may end up showing all shards' values. If we can't find anything better, we can start with this. However, I was hoping we WOULD find something better.
>
> Do you have some real-life examples of what would be a safe range to remove, e.g. (0.9-1.1)?

You can pick any I/O heavy SC or FC cluster.
I asked @xemul for a threshold here: #2312 (comment)

@amnonh amnonh merged commit dcd45b5 into scylladb:master Jun 17, 2024
@amnonh amnonh deleted the io_groups branch June 17, 2024 18:33
Linked issue: Add a graph for scylla_io_queue_flow_ratio