
timeouts and latencies per shards panels are missing #1294

Closed
tarzanek opened this issue Feb 26, 2021 · 17 comments · Fixed by #1301
Labels: bug (Something isn't working right), High

Comments

@tarzanek commented Feb 26, 2021

Hello Amnon,
it seems all versions since 3.4 are affected (i.e. 3.5 and 3.6):
they don't show proper latencies and timeouts per shard.

For the hq eu cluster and node hq eu 7, on 24.2.2021,
shard 6 is overloaded and monitoring won't show it (in any view / instance / shard),
while in 3.4 you can see the problematic latencies and timeouts immediately.

Prometheus data can be found in upload fffca528-18ec-45d6-ba02-7510114b32e7
(accessible with Scylla Google auth; gsutil can be used too):
https://console.cloud.google.com/storage/browser/_details/upload.scylladb.com/fffca528-18ec-45d6-ba02-7510114b32e7/feb-24-timeout-data-hq-eu-7.tgz?project=upload-179716
(or: gsutil cp gs://upload.scylladb.com/fffca528-18ec-45d6-ba02-7510114b32e7/feb-24-timeout-data-hq-eu-7.tgz .)

@tarzanek added the "bug" label on Feb 26, 2021
@vladzcloudius (Contributor)

@amnonh FYI
This technically makes it impossible to use 3.4 and 3.5.
Please give it priority.

@amnonh (Collaborator) commented Feb 26, 2021

Can you explain (i.e., which dashboard and panels) what the exact issue is?
Is it a panel/dashboard that was changed or is missing, or is there a problem with a metric?

Please follow the bug report template and add the dashboard name and panel; make sure I have all the info and I'll push the fix to 3.6.2.

@amnonh added this to the monitoring 3.6.2 milestone on Feb 28, 2021
@tarzanek (Author) commented Mar 2, 2021

Installation details
Panel Name: Timeouts and Errors - Coordinator
Dashboard Name: detailed-2020-1
Scylla-Monitoring Version (can be found at the bottom of the overview dashboard): 3.6.1, downgraded to 3.4 (where this works)
Scylla-Version: 2020.1.6

In 3.6.1 the timeouts are not shown properly on the panel.
In 3.4 they are (though still capped at 8s or some similar upper limit - I expect this comes from Prometheus (the Scylla scrapes), so it's not a monitoring bug).

@tarzanek assigned amnonh and unassigned tarzanek on Mar 2, 2021
@tarzanek (Author) commented Mar 2, 2021

And similarly on the latencies panel in Overview
(Latencies - Coordinator / Overview dashboard).

@amnonh (Collaborator) commented Mar 2, 2021

@tarzanek think about someone reading this issue: what do you mean by "not shown properly"? You can add an image if it's hard to explain, but please make it clear.

@tarzanek (Author) commented Mar 3, 2021

No problem, Amnon. I usually do think about people who read the issue, but this time I forgot to take the picture and note the exact time span.
You're right, it will be easier for you with a picture; I'll attach them.

@tarzanek (Author) commented Mar 3, 2021

So 3.4 shows these pictures:
[screenshot]

  • This is the Detailed dashboard, from=1614088559792&to=1614098159755, instance,shard view, node escylla-hq-graph-prod-eu-7
    (I know, the timestamp is from 23.2., but the same can be found on 24.2.)
    For the same view I then click the overview button and get:

[screenshot]

Note the 7.5s latencies seen on only a few shards.
That is a critical view when troubleshooting unexpected timeouts; this view is now lost with the new overview screen.
I unfortunately lack new screenshots, since I reverted the monitoring to 3.4, so the ones above are from 3.4,
but in 3.5 or 3.6 there is no way for me to detect that, while I see those timeouts, the latencies on some shards spike far too high - and, funnily enough, they impact the whole cluster
(the above is a system_auth caching issue with lots of Spark clients).
So with 3.5 or 3.6 there would be no way for me to troubleshoot this unless I build a custom dashboard.
So is there a way to get the old overview dashboard back? Or at least this per-shard view on top of 95th percentile latencies?
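
(For illustration: a custom Grafana panel that restores this per-shard view could be built from a PromQL query roughly like the sketch below. The metric name, label names and the $node template variable are assumptions for the example, not necessarily what the stock Scylla Monitoring dashboards use.)

    # Hypothetical per-shard p95 read latency, one series per instance/shard.
    # Metric and label names are assumptions; adjust them to what your Scylla
    # version actually exposes.
    histogram_quantile(0.95,
      sum by (instance, shard, le) (
        rate(scylla_storage_proxy_coordinator_read_latency_bucket{instance="$node"}[1m])
      )
    )
    # With a legend like {{instance}}-{{shard}}, a single overloaded shard
    # (e.g. shard 6 above) shows up as its own outlier series.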

@tarzanek (Author) commented Mar 3, 2021

And one more screenshot - I just realized all shards are busy there, but mostly it's just a few of them (those that hold the hot partition of system_auth).
[screenshot]

@tarzanek (Author) commented Mar 3, 2021

By the way, what those pictures show cannot be seen at all with the new overview dashboard.

@tarzanek (Author) commented Mar 3, 2021

I have also uploaded data_23_2_hq_eu_7.tgz
to fffca528-18ec-45d6-ba02-7510114b32e7 now, so you have the data from 23.2. too and can use the timestamps above.

@tarzanek removed the "Need Info" label on Mar 3, 2021
@tarzanek (Author) commented Mar 3, 2021

Also, I am puzzled by that 7.5s limit - is this something we have in Prometheus (Grafana?), in Scylla, or internally?
Note that those boxes were set with 20s timeouts for read, write and range queries,
but since the above is an internal system_auth query, it is likely hardcoded (or cut off by monitoring? not sure).

But it impacted all the other queries in the system, which definitely had 20s timeouts, so this 7.5s top limit is something weird (and probably deserves its own bug).

@amnonh (Collaborator) commented Mar 3, 2021

@tarzanek I'm sorry that this is taking so long, but nowhere in this conversation have you explained what is wrong.

What do you mean by the word "properly"? I'm trying to read this issue and I don't see a simple answer to the basic questions anywhere:

  • Are you missing a panel that was there and is now gone?
  • Does an existing panel work differently?
    ("Properly" is not an explanation.)

About the 7.5s: latencies are calculated using a histogram; what you see is that the 99th percentile fell inside that histogram bucket. My guess is that the timeout is in that bucket's range.
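
(To illustrate the bucket effect: panels of this kind are typically driven by a histogram_quantile() query roughly like the sketch below; the metric name here is an assumption, not necessarily the one the dashboard uses.)

    # Hypothetical p99 over a latency histogram; metric name is an assumption.
    histogram_quantile(0.99,
      sum by (le) (
        rate(scylla_storage_proxy_coordinator_read_latency_bucket[1m])
      )
    )
    # histogram_quantile() can only interpolate inside a bucket. If the real
    # p99 falls into the last (+Inf) bucket, Prometheus returns the upper
    # bound of the highest finite bucket, so the graph shows a flat ceiling
    # (e.g. ~7.5s) regardless of the configured 20s timeouts.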

@tarzanek (Author) commented Mar 3, 2021

No problem, let's try to clarify the concern.
On the Overview dashboard we used to have separate latency panels; now they are merged into a single pane,
and this single pane doesn't show what you see above with the old panels.
So this is what I am missing - a proper view of latencies with shard-level resolution.

@tarzanek (Author) commented Mar 3, 2021

E.g. I didn't find a way to see the above view with the same data when I upgrade to 3.5 / 3.6.1.

And without that view, troubleshooting is now very hard (you need to build your own dashboards and bring back the old panels).

@amnonh changed the title from "improper timeouts and latencies shown on metrics" to "timeouts and latencies per shards panels are missing" on Mar 3, 2021
@amnonh (Collaborator) commented Mar 3, 2021

This is what I'm planning to do (open and enlarge the image):
I'll add the latencies to the detailed dashboard; the timeouts will get their own row, and an additional row with 95th/99th percentile latencies will be added.
[screenshot]

@tarzanek (Author) commented Mar 3, 2021

this looks like an awesome plan!
@vladzcloudius any comments from your side?

@vladzcloudius (Contributor)

> this looks like an awesome plan!
> @vladzcloudius any comments from your side?

I think Detailed is pretty loaded already.

I think that we should always keep in mind that we should not fix what ain't broken. ;)
Some people may love dashboards as they are (were!) now, @amnonh. ;)

Now more seriously: while it makes sense to see high-percentile latency graphs close to queuing-related graphs (fore/background xxx), queuing is not the only latency hog.
It makes sense to see latencies near every queuing-related graph:

  • I/O class queue lengths.
  • Execution stage queue lengths.
  • Disk await graphs.

So, it would make sense to have all these on the same dashboard too.

Another thought - it's not a crime to have the same graph on multiple dashboards if it makes sense.

I think that we'd rather have aspect-specific dashboards like we have for I/O and OS instead of opaque "Overview" ones.

So, Overview was always more like a "Latency" dashboard up until recently.
And "Detailed" was much more of an overview than the actual "Overview" dashboard because you are likely going to start debugging with looking at the "Detailed" since it is the most informative dashboard.

So, I think we may want to take a step back and rethink:

  • How we want those dashboards to be used.
  • What we want to see on those dashboards.

And IMO pushing more and more stuff onto the Detailed dashboard isn't going to get us where we want to be.
