timeouts and latencies per shards panels are missing #1294
Comments
@amnonh FYI |
Can you explain (i.e. dashboard and panels) what the exact issue is? Please follow the bug report template, adding the dashboard name and panel. Make sure I have all the info and I'll push it to 3.6.2 |
Installation details: in 3.6.1, the timeouts are not shown properly on the panel |
and similarly on the latencies panel in Overview |
@tarzanek think about someone reading this issue: what do you mean by "not shown properly"? You can add an image if it's hard to explain, but please make it clear |
no prob Amnon, I usually think about people who read the issue, but this time I forgot to take the picture and note the exact time span |
note the 7.5s latencies seen only on a few shards |
btw, those pictures cannot be seen at all when you have the new overview dashboard |
I have also uploaded data_23_2_hq_eu_7.tgz |
also I am puzzled by that 7.5s limit: is this something we have in prometheus (grafana?) or in scylla? It impacted all other queries in the system, which definitely had 20s timeouts ... so this 7.5s top limit is something weird (and probably deserves its own bug) |
@tarzanek I'm sorry that it takes so long, but nowhere in this conversation did you explain what is wrong. What do you mean by the word "properly"? I'm trying to read this issue and I don't see anywhere a simple explanation to the basic questions:
About the 7.5s: latencies are calculated using a histogram. What you see is that the 99th percentile fell inside that histogram bucket; my guess is that the timeout is in that bucket range. |
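To illustrate the histogram point above, here is a minimal sketch of how a percentile can be estimated from cumulative histogram buckets, using linear interpolation inside the bucket that contains the requested rank. The bucket bounds and counts are made up for illustration (they are not Scylla's actual buckets), and `estimate_quantile` is a hypothetical helper, not a function from Scylla, Prometheus, or Grafana. The interpolation is similar in spirit to what Prometheus's `histogram_quantile` does, but this is not its implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound.
    The last bucket may use math.inf as its upper bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # No way to interpolate in an open-ended bucket:
                # report the highest finite bound instead.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return upper
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical bucket bounds in seconds; 7.5 is assumed to be the top finite bound.
inside_bucket = [(1.0, 900), (2.5, 950), (5.0, 980), (7.5, 1000), (math.inf, 1000)]
# Same bounds, but 40 requests were slower than 7.5s (for example, 20s timeouts).
above_top = [(1.0, 900), (2.5, 950), (5.0, 960), (7.5, 960), (math.inf, 1000)]

p99_inside = estimate_quantile(0.99, inside_bucket)
p99_above = estimate_quantile(0.99, above_top)
print(p99_inside, p99_above)
```

With these made-up numbers, the first estimate lands between 5.0 and 7.5, and the second is reported as 7.5 even though the slow requests took longer than that. Under these assumptions, a graph would never show a value above the top finite bucket bound, which is one possible reading of the 7.5s ceiling.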
no prob, let's try to clarify the concern |
e.g. I didn't find a way to see the above view with the same data after upgrading to 3.5 / 3.6.1, and without that view troubleshooting is now very hard (you need to make your own dashboards and bring back the old panels) |
this looks like an awesome plan! |
I think Detailed is pretty loaded already. I think that we should always keep in mind that we should not fix what ain't broken. ;) Now more seriously: while it makes sense to see high-percentile latency graphs close to queuing-related graphs (fore/background xxx), queuing is not the only latency hog.
So, it would make sense to have all these on the same dashboard too. Another thought - it's not a crime to have the same graph on multiple dashboards if it makes sense. I think that we'd rather have aspect-specific dashboards like we have for I/O and OS instead of opaque "Overview" ones. So, Overview was always more like a "Latency" dashboard up until recently. So, I think we may want to take a step back and rethink:
And IMO pushing more and more stuff to the Detailed dashboard isn't going to get us where we want to be. |
Hello Amnon
it seems all versions since 3.4 (so 3.5 and 3.6) are affected:
they don't show proper latencies and timeouts per shard
for the hq eu cluster, node hq eu 7, on 24.2.2020
shard 6 is overloaded and monitoring won't show it (in any view / instance / shard)
while in 3.4 you can see the problematic latencies and timeouts immediately
prometheus data can be found in upload fffca528-18ec-45d6-ba02-7510114b32e7
(accessible to scylla gauth , gsutil can be used too)
https://console.cloud.google.com/storage/browser/_details/upload.scylladb.com/fffca528-18ec-45d6-ba02-7510114b32e7/feb-24-timeout-data-hq-eu-7.tgz?project=upload-179716
(or gsutil gs://upload.scylladb.com/fffca528-18ec-45d6-ba02-7510114b32e7/feb-24-timeout-data-hq-eu-7.tgz )