scylla-advanced: Add a panel for scylla_io_queue_flow_ratio #2312

amnonh · 2024-06-10T19:18:16Z

This patch adds a panel that shows scylla_io_queue_flow_ratio.

vladzcloudius · 2024-06-10T21:36:03Z

grafana/scylla-advanced.template.json

+                            }
+                        ],
+                        "description": "This graph shows the ratio of dispatch rate to completion rate. It is expected to be 1.0, growing larger on reactor stalls or disk problems.\n\nscylla_io_queue_flow_ratio",
+                        "title": "I/O Group [[iogroup]] Queue flow ratio"


We should have this graph collapsed by default similarly to Tombstones and MVs on the Detailed dashboard.

A single panel cannot be collapsed; I can collapse the row that contains this graph together with the scylla_io_queue_consumption panel.

This patch adds a panel that shows scylla_io_queue_flow_ratio.

Fixes scylladb#2306

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

amnonh · 2024-06-10T23:20:45Z

@vladzcloudius, how can we progress with this panel? The upcoming Scylla monitoring release will have the option to scroll through the tooltips. But I think it will not be enough and we should only show the average and remove anything that is close to 1, so the normal case will not be shown; I'll appreciate your thoughts.

vladzcloudius · 2024-06-12T21:00:39Z

> @vladzcloudius, how can we progress with this panel? The upcoming Scylla monitoring release will have the option to scroll through the tooltips. But I think it will not be enough and we should only show the average and remove anything that is close to 1, so the normal case will not be shown; I'll appreciate your thoughts.

I agree that we should aggregate.
However, I don't think it should be an average.
Imagine that you have the values [0.7, 1.3]. The average is 1.0, so you would not notice that you have a problem.

STDEV, on the other hand, should be more informative.
Take a look at the examples below:

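| | Noisy set | Ideal set | Quiet set | One outlier |
|---|---|---|---|---|
| | 0.7 | 1 | 1.1 | 1 |
| | 1.3 | 1 | 0.9 | 1 |
| | 1.5 | 1 | 1.05 | 1 |
| | 0.5 | 1 | 0.95 | 1 |
| | 0.6 | 1 | 1.005 | 1 |
| | 1.4 | 1 | 0.995 | 1.4 |
| Average | 1 | 1 | 1 | 1.07 |
| STDEV | 0.45 | 0.00 | 0.07 | 0.16 |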
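As a quick check of the figures in the table, here is a minimal Python sketch that recomputes the average and standard deviation for the four example sets. It assumes the STDEV row is the sample standard deviation (n - 1 in the denominator, as spreadsheet `STDEV` computes it); the names `samples`, `summarize`, and `summary` are made up for this illustration and are not part of the patch.

```python
from statistics import mean, stdev

# The four example sets from the table above.
samples = {
    "Noisy set": [0.7, 1.3, 1.5, 0.5, 0.6, 1.4],
    "Ideal set": [1, 1, 1, 1, 1, 1],
    "Quiet set": [1.1, 0.9, 1.05, 0.95, 1.005, 0.995],
    "One outlier": [1, 1, 1, 1, 1, 1.4],
}


def summarize(values):
    """Return (average, sample standard deviation), rounded to 2 places."""
    return round(mean(values), 2), round(stdev(values), 2)


summary = {name: summarize(values) for name, values in samples.items()}

for name, (avg, sd) in summary.items():
    print(f"{name:12s} average={avg:.2f} stdev={sd:.2f}")
```

The noisy set and the ideal set share the same average, and only the standard deviation tells them apart, which is the point being made above.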

@xemul WDYT?
I would not want to go all the way to histograms here: they would probably give the best statistical visibility, but they would also be the most expensive.
If you think STDEV is a good option, can you try to estimate what a "good" range is, i.e. the threshold beyond which we should start digging deeper?

What we are looking for here is a clear way for Support people to work with this metric.
The instruction + aggregation should be able to answer the following simple question: "When does what I see mean that something bad is going on?"

amnonh · 2024-06-12T22:09:38Z

@vladzcloudius I wrote multiple ideas, and none of them was average.
I think we should show the average as one line, and either show just the outliers (e.g. remove anything too close to the average) or let the user pick the aggregate function.

vladzcloudius · 2024-06-12T23:33:41Z

> @vladzcloudius I wrote multiple ideas, and none of them was average. I think we should show the average as one line, and show either just the outliers (e.g. remove anything too close to the average) or let the user pick aggregate function.

The problem is that it may end up showing all shards' values. If we can't find anything better, we can start with this. However, I was hoping we WOULD find something better.

amnonh · 2024-06-12T23:38:20Z

> The problem is that it may end up showing all shards values. If we can't find anything better - we can start with this. However I was hoping we WOULD find something better.

Do you have some real-life examples of what would be a safe threshold to remove, e.g. removing the range (0.9-1.1)?
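The "average line plus outliers" idea discussed in this thread can be sketched roughly as follows. This is only an illustration in Python, not the dashboard query: the function name `outliers_outside_band`, the per-shard sample values, and the default band are assumptions, with 0.9-1.1 taken from the example range mentioned above rather than from any agreed threshold.

```python
from statistics import mean


def outliers_outside_band(ratios, low=0.9, high=1.1):
    """Return (average, {shard: ratio}) keeping only ratios outside [low, high].

    `ratios` maps a shard id to its flow ratio. The band is a placeholder,
    not a recommended threshold.
    """
    average = mean(ratios.values())
    flagged = {shard: r for shard, r in ratios.items() if r < low or r > high}
    return average, flagged


# Hypothetical per-shard flow ratios, invented for this example.
per_shard = {0: 1.0, 1: 0.98, 2: 1.4, 3: 1.05, 4: 0.7}

average, flagged = outliers_outside_band(per_shard)
print(f"average={average:.3f}")
print(f"outliers={flagged}")
```

With a wide enough spread every shard falls outside the band, so this approach can still end up showing all shards, which is the concern raised earlier in the thread.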

vladzcloudius · 2024-06-14T13:01:17Z

> > The problem is that it may end up showing all shards values. If we can't find anything better - we can start with this. However I was hoping we WOULD find something better.
>
> Do you have some real life examples for what would be a safe threshold to remove? e.g. remove the range (0.9-1.1)

You can pick any I/O-heavy SC or FC cluster.
I asked @xemul for a threshold here: #2312 (comment)

amnonh requested a review from vladzcloudius June 10, 2024 19:18

vladzcloudius suggested changes Jun 10, 2024


amnonh force-pushed the io_groups branch from d7d3b6b to 28f43db Compare June 10, 2024 22:41

amnonh merged commit dcd45b5 into scylladb:master Jun 17, 2024

amnonh deleted the io_groups branch June 17, 2024 18:33
