
timeouts and latencies per shards panels are missing #1294

Closed
tarzanek opened this issue Feb 26, 2021 · 17 comments · Fixed by #1301
Labels: bug (Something isn't working right), High

Comments

@tarzanek commented Feb 26, 2021

Hello Amnon,
it seems all versions since 3.4 are affected (i.e. 3.5 and 3.6):
they don't show proper latencies and timeouts per shard.

For the hq eu cluster and node hq eu 7, on 24.2.2021,
shard 6 is overloaded and monitoring won't show it (in any view / instance / shard),
while in 3.4 you can see the problematic latencies and timeouts immediately.

Prometheus data can be found in upload fffca528-18ec-45d6-ba02-7510114b32e7
(accessible with Scylla Google auth; gsutil can be used too):
https://console.cloud.google.com/storage/browser/_details/upload.scylladb.com/fffca528-18ec-45d6-ba02-7510114b32e7/feb-24-timeout-data-hq-eu-7.tgz?project=upload-179716
(or: gsutil cp gs://upload.scylladb.com/fffca528-18ec-45d6-ba02-7510114b32e7/feb-24-timeout-data-hq-eu-7.tgz .)

@tarzanek added the "bug" label on Feb 26, 2021
@vladzcloudius (Contributor)

@amnonh FYI
This technically makes it impossible to use 3.4 and 3.5.
Please give it priority.

@amnonh (Collaborator) commented Feb 26, 2021

Can you explain (i.e., which dashboard and panels) what the exact issue is?
Is it a panel/dashboard that was changed or is missing, or is there a problem with a metric?

Please follow the bug report template and add the dashboard name and panel; make sure I have all the info and I'll push the fix to 3.6.2.

@amnonh added this to the monitoring 3.6.2 milestone on Feb 28, 2021
@tarzanek (Author) commented Mar 2, 2021

Installation details
Panel Name: Timeouts and Errors - Coordinator
Dashboard Name: detailed-2020-1
Scylla-Monitoring Version (can be found at the bottom of the overview dashboard): 3.6.1, downgraded to 3.4 (where this works)
Scylla-Version: 2020.1.6

In 3.6.1 the timeouts are not shown properly on the panel.
In 3.4 they are (though still capped at 8s or some similar upper limit - I expect this comes from Prometheus (the Scylla scrapes), so it's not a monitoring bug).

@tarzanek assigned amnonh and unassigned tarzanek on Mar 2, 2021
@tarzanek (Author) commented Mar 2, 2021

And similarly on the latencies panel in Overview
(Latencies - Coordinator / Overview dashboard).

@amnonh (Collaborator) commented Mar 2, 2021

@tarzanek think about someone reading this issue: what do you mean by "not shown properly"? You can add an image if it's hard to explain, but please make it clear.

@tarzanek (Author) commented Mar 3, 2021

No problem, Amnon. I usually do think about people who read the issue, but this time I forgot to take the picture and note the exact time span.
You're right, it will be easier for you with a picture; I'll attach them.

@tarzanek (Author) commented Mar 3, 2021

So 3.4 shows these pictures:
[screenshot]

  • This is the Detailed dashboard, from=1614088559792&to=1614098159755, instance,shard view, node escylla-hq-graph-prod-eu-7
    (I know, the timestamp is from 23.2., but the same can be found on 24.2.)
    For the same view I then click the overview button and get:

[screenshot]

Note the 7.5s latencies seen on only a few shards.
That is a critical view when troubleshooting unexpected timeouts; this view is now lost with the new overview screen.
I unfortunately lack new screenshots, since I reverted the monitoring to 3.4, so the ones above are from 3.4,
but in 3.5 or 3.6 there is no way for me to detect that, while I see those timeouts, the latencies on some shards spike far too high - and, funnily enough, they impact the whole cluster
(the above is a system_auth caching issue with lots of Spark clients).
So with 3.5 or 3.6 there would be no way for me to troubleshoot this unless I build a custom dashboard.
So is there a way to get the old overview dashboard back? Or at least this per-shard view on top of 95th percentile latencies?
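
(For illustration: a custom Grafana panel that restores this per-shard view could be built from a PromQL query roughly like the sketch below. The metric name, label names and the $node template variable are assumptions for the example, not necessarily what the stock Scylla Monitoring dashboards use.)

    # Hypothetical per-shard p95 read latency, one series per instance/shard.
    # Metric and label names are assumptions; adjust them to what your Scylla
    # version actually exposes.
    histogram_quantile(0.95,
      sum by (instance, shard, le) (
        rate(scylla_storage_proxy_coordinator_read_latency_bucket{instance="$node"}[1m])
      )
    )
    # With a legend like {{instance}}-{{shard}}, a single overloaded shard
    # (e.g. shard 6 above) shows up as its own outlier series.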

@tarzanek (Author) commented Mar 3, 2021

And one more screenshot - I just realized all shards are busy there, but mostly it's just a few of them (those that hold the hot partition of system_auth).
[screenshot]

@tarzanek (Author) commented Mar 3, 2021

By the way, what those pictures show cannot be seen at all with the new overview dashboard.

@tarzanek (Author) commented Mar 3, 2021

I have also uploaded data_23_2_hq_eu_7.tgz
to fffca528-18ec-45d6-ba02-7510114b32e7 now, so you have the data from 23.2. too and can use the timestamps above.

@tarzanek removed the "Need Info" label on Mar 3, 2021
@tarzanek (Author) commented Mar 3, 2021

Also, I am puzzled by that 7.5s limit - is this something we have in Prometheus (Grafana?), in Scylla, or internally?
Note that those boxes were set with 20s timeouts for read, write and range queries,
but since the above is an internal system_auth query, it is likely hardcoded (or cut off by monitoring? not sure).

But it impacted all the other queries in the system, which definitely had 20s timeouts, so this 7.5s top limit is something weird (and probably deserves its own bug).

@amnonh (Collaborator) commented Mar 3, 2021

@tarzanek I'm sorry that this is taking so long, but nowhere in this conversation have you explained what is wrong.

What do you mean by the word "properly"? I'm trying to read this issue and I don't see a simple answer to the basic questions anywhere:

  • Are you missing a panel that was there and is now gone?
  • Does an existing panel work differently?
    ("Properly" is not an explanation.)

About the 7.5s: latencies are calculated using a histogram; what you see is that the 99th percentile fell inside that histogram bucket. My guess is that the timeout is in that bucket's range.
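
(To illustrate the bucket effect: panels of this kind are typically driven by a histogram_quantile() query roughly like the sketch below; the metric name here is an assumption, not necessarily the one the dashboard uses.)

    # Hypothetical p99 over a latency histogram; metric name is an assumption.
    histogram_quantile(0.99,
      sum by (le) (
        rate(scylla_storage_proxy_coordinator_read_latency_bucket[1m])
      )
    )
    # histogram_quantile() can only interpolate inside a bucket. If the real
    # p99 falls into the last (+Inf) bucket, Prometheus returns the upper
    # bound of the highest finite bucket, so the graph shows a flat ceiling
    # (e.g. ~7.5s) regardless of the configured 20s timeouts.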

@tarzanek (Author) commented Mar 3, 2021

No problem, let's try to clarify the concern.
On the Overview dashboard we used to have separate latency panels; now they are merged into a single pane,
and this single pane doesn't show what you see above with the old panels.
So this is what I am missing - a proper view of latencies with shard-level resolution.

@tarzanek (Author) commented Mar 3, 2021

E.g. I didn't find a way to see the above view with the same data when I upgrade to 3.5 / 3.6.1.

And without that view, troubleshooting is now very hard (you need to build your own dashboards and bring back the old panels).

@amnonh changed the title from "improper timeouts and latencies shown on metrics" to "timeouts and latencies per shards panels are missing" on Mar 3, 2021
@amnonh (Collaborator) commented Mar 3, 2021

This is what I'm planning to do (open and enlarge the image):
I'll add the latencies to the detailed dashboard; the timeouts will get their own row, and an additional row with 95th/99th percentile latencies will be added.
[screenshot]

@tarzanek (Author) commented Mar 3, 2021

this looks like an awesome plan!
@vladzcloudius any comments from your side?

@vladzcloudius (Contributor)

> this looks like an awesome plan!
> @vladzcloudius any comments from your side?

I think Detailed is pretty loaded already.

I think that we should always keep in mind that we should not fix what ain't broken. ;)
Some people may love dashboards as they are (were!) now, @amnonh. ;)

Now more seriously: while it makes sense to see high-percentile latency graphs close to queuing-related graphs (fore/background xxx), queuing is not the only latency hog.
It makes sense to see latencies near every queuing-related graph:

  • I/O class queue lengths.
  • Execution stage queue lengths.
  • Disk await graphs.

So, it would make sense to have all these on the same dashboard too.

Another thought - it's not a crime to have the same graph on multiple dashboards if it makes sense.

I think that we'd rather have aspect-specific dashboards like we have for I/O and OS instead of opaque "Overview" ones.

So, Overview was always more like a "Latency" dashboard up until recently.
And "Detailed" was much more of an overview than the actual "Overview" dashboard because you are likely going to start debugging with looking at the "Detailed" since it is the most informative dashboard.

So, I think we may want to take a step back and rethink:

  • How we want those dashboards to be used.
  • What we want to see on those dashboards.

And IMO pushing more and more stuff onto the Detailed dashboard isn't going to get us where we want to be.
