large workflow hot shard detection #5166
Co-authored-by: Steven L <imgroxx@gmail.com>
service/history/execution/context.go
@@ -852,6 +854,7 @@ func (c *contextImpl) UpdateWorkflowExecutionWithNew(
domainName,
resp.MutableStateUpdateSessionStats,
)
c.emitLargeWorkflowShardIDStats(currentWorkflowSize-oldWorkflowSize, oldWorkflowHistoryCount, oldWorkflowSize)
while I'm exploring the code in here:
would any of these be more-correct if they came from newWorkflow? and/or would using newWorkflow mean we wouldn't need to maintain as much separately?
I'm not quite sure what the difference is tbh. I'd have to read more carefully. I don't think it'll be dangerous to use the wrong one, just possibly misleading (due to subtly-wrong values, or due to unnecessary duplicate calculations that could drift)
I don't think it matters. From what I can tell new workflow kind of derives the numbers the same way I am. Maybe it would be less duplicated code but at most it's saving 5 lines and makes it a bit more confusing to read imo.
the main reason to prefer using the new-workflow's data is that it'd make it clear what the source of truth is, and get rid of any risk of the calculations drifting away from that source of truth due to future changes.
which is an "if". tbh I'm not sure if it's more-correct or not.
I could try testing it later? The other 2 metric emission fns are kinda calculated the same way so I wanted to make it consistent. I don't wanna just change part of it, because then it's super confusing if we never change the rest.
which ones are those?
but yea, later's fine (if ever)
if blobSize > blobSizeWarn {
c.logger.SampleInfo("Workflow writing a large blob", c.shard.GetConfig().SampleLoggingRate(), tag.WorkflowDomainName(c.GetDomainName()),
tag.WorkflowID(c.workflowExecution.GetWorkflowID()), tag.ShardID(c.shard.GetShardID()))
c.metricsClient.Scope(metrics.LargeExecutionBlobShardScope, metrics.ShardIDTag(shardIDStr)).IncCounter(metrics.LargeHistoryBlobCount)
kinda odd at a glance that this takes a string, but meh. maybe there's a reason for it.
👍 yep, lgtm now. one last Q on what the other metrics-y funcs are for my own reading, but I think this is ready to go
commit 9d01035
Author: allenchen2244 <102192478+allenchen2244@users.noreply.github.com>
Date: Wed Mar 29 20:50:38 2023 -0700

    large workflow hot shard detection (#5166)

    Metrics for large workflows

commit dd51c53
Author: David Porter <david.porter@uber.com>
Date: Wed Mar 29 18:30:06 2023 -0700

    fix build (#5180)

commit 7b281c2
Author: David Porter <david.porter@uber.com>
Date: Mon Mar 27 10:38:37 2023 -0700

    Adds a small test to catch issues with deadlocks (#5171)

    * Adds a small test to catch issues with deadlocks

commit f1e2476
Author: sonpham96 <sonpham1996@gmail.com>
Date: Sat Mar 18 05:32:01 2023 +0700

    Upgrade Golang base image to 1.18 to remediate CVEs (#5035)

    Co-authored-by: David Porter <david.porter@uber.com>

commit 1519ace
Author: charlese-instaclustr <76502507+charlese-instaclustr@users.noreply.github.com>
Date: Fri Mar 17 22:11:27 2023 +0000

    Fix type validation in configstore DC client value updating (#5110)

    * Remove misleading type check, Add more detailed log message
    * removing debugging logging
    * Handle nil update edge case

    ---------

    Co-authored-by: allenchen2244 <102192478+allenchen2244@users.noreply.github.com>
    Co-authored-by: Zijian <Shaddoll@users.noreply.github.com>

commit a3e2774
Author: charlese-instaclustr <76502507+charlese-instaclustr@users.noreply.github.com>
Date: Fri Mar 17 19:02:40 2023 +0000

    Add Canary TLS support (#5086)

    * add support for TLS connections by Canary, add development config for Canary with TLS
    * update README to include new config option
    * remove testing config

    ---------

    Co-authored-by: David Porter <david.porter@uber.com>
    Co-authored-by: Shijie Sheng <shengs@uber.com>
    Co-authored-by: Zijian <Shaddoll@users.noreply.github.com>

commit ff4eab2
Author: Shijie Sheng <shengs@uber.com>
Date: Thu Mar 16 20:10:54 2023 -0700

    [history] more cautious in deciding domain state to make decisions on dropping queued tasks (#5164)

    What changed?
    When domain cache returned entity not found error, don't drop queued tasks to be more conservative.

    Why?
    In cases when the cache is dubious, we shouldn't drop the queued tasks.

commit 55a8d93
Author: neil-xie <104041627+neil-xie@users.noreply.github.com>
Date: Thu Mar 16 14:18:35 2023 -0700

    Add Pinot docker files, table config and schema (#5163)

    * Initial checkin for pinot config files

commit 1304570
Author: Mantas Šidlauskas <mantass@netapp.com>
Date: Thu Mar 16 15:20:29 2023 +0200

    Set poll interval for filebased dynamic config if not set (#5160)

    * Set poll interval for filebased dynamic config if not set
    * update unit test

commit 42a14b1
Author: Mantas Šidlauskas <mantass@netapp.com>
Date: Thu Mar 16 10:49:21 2023 +0200

    Elasticsearch: reduce code duplication (#5137)

    * Elasticsearch: reduce code duplication
    * address comments

    ---------

    Co-authored-by: Zijian <Shaddoll@users.noreply.github.com>

commit cbf0d14
Author: bowen xiao <xbowen@uber.com>
Date: Wed Mar 15 10:19:34 2023 -0700

    fix samples documentation (#5088)

commit ba19a29
Author: Mantas Šidlauskas <mantass@netapp.com>
Date: Wed Mar 15 12:52:29 2023 +0200

    Add ShardID to valid attributes (#5161)

commit a25cba8
Author: Mantas Šidlauskas <mantass@netapp.com>
Date: Wed Mar 15 10:56:50 2023 +0200

    ES: single interface for different ES/OpenSearch versions (#5158)

    * ES: single interface for different ES/OpenSearch versions
    * make fmt

commit e3ac246
Author: Ketsia <115650494+ketsiambaku@users.noreply.github.com>
Date: Tue Mar 14 12:47:40 2023 -0700

    added logging with workflow/domain tags (#5159)

commit 9581488
Author: Ketsia <115650494+ketsiambaku@users.noreply.github.com>
Date: Mon Mar 13 16:56:45 2023 -0700

    Consistent query pershard metric (#5143)

    * added and update consistent query per shard metric
    * testing pershard metric
    * move sample logger into persistence metric client for cleaness
    * fix test
    * fix lint
    * fix test again
    * fix lint
    * sample logging with workflowid tag
    * added domain tag to logger
    * metric completed
    * addressing comments
    * fix lint
    * Revert "fix lint"
      This reverts commit 1e96944.
    * fix lint second attempt

    ---------

    Co-authored-by: Allen Chen <allenchen2244@uber.com>
What changed?
Added per-shard metrics that flag shards whose workflows have large blobs, a large history size, or a large history count.
Why?
To identify shards that are under heavy load.
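For readers skimming the thread, here is a rough, self-contained sketch of the idea, not the code in this PR. Everything in it is hypothetical: the `warnThresholds` and `shardCounters` types, the metric names, and the threshold numbers are invented for illustration, and `shardCounters` stands in for the real metrics client. It also assumes the old history count and size are passed in so that a counter fires only when a workflow crosses a threshold, which this conversation does not confirm.

```go
package main

import (
	"fmt"
	"strconv"
)

// warnThresholds holds hypothetical warning limits. The real limits are not
// shown in this conversation; the numbers used in main are made up.
type warnThresholds struct {
	blobSize     int64 // bytes written by a single update
	historySize  int64 // total history bytes
	historyCount int64 // total history events
}

// shardCounters is a stand-in for a metrics client. It counts increments
// per metric name and per shard ID tag.
type shardCounters map[string]map[string]int

func (s shardCounters) inc(metric, shardID string) {
	if s[metric] == nil {
		s[metric] = make(map[string]int)
	}
	s[metric][shardID]++
}

// crossed reports whether a value moved from below the limit to at or above it.
func crossed(oldValue, newValue, limit int64) bool {
	return oldValue < limit && newValue >= limit
}

// emitLargeWorkflowStats increments shard-tagged counters when an update
// writes a large blob or pushes the history count or size over its limit.
// Assumption: crossing semantics, so a workflow that is already large does
// not increment the history counters again on every update.
func emitLargeWorkflowStats(
	counters shardCounters,
	limits warnThresholds,
	shardID int,
	blobSize, oldCount, newCount, oldSize, newSize int64,
) {
	shardIDStr := strconv.Itoa(shardID) // the tag is keyed by a string
	if blobSize > limits.blobSize {
		counters.inc("large_blob", shardIDStr)
	}
	if crossed(oldCount, newCount, limits.historyCount) {
		counters.inc("large_history_count", shardIDStr)
	}
	if crossed(oldSize, newSize, limits.historySize) {
		counters.inc("large_history_size", shardIDStr)
	}
}

func main() {
	counters := shardCounters{}
	limits := warnThresholds{blobSize: 100, historySize: 1000, historyCount: 10}

	// Shard 7: one update writes a large blob and crosses both history limits.
	emitLargeWorkflowStats(counters, limits, 7, 150, 9, 12, 900, 1200)
	// Shard 7 again: already over the limits, small blob, so nothing new fires.
	emitLargeWorkflowStats(counters, limits, 7, 20, 12, 13, 1200, 1220)
	// Shard 3: a small workflow that stays under every limit.
	emitLargeWorkflowStats(counters, limits, 3, 20, 1, 2, 100, 120)

	fmt.Println(counters)
}
```

Tagging each counter with the shard ID is what lets a dashboard group by shard and spot the ones that keep receiving large workflows.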
How did you test it?
Auto tests pass. Fully tested everything end to end in staging using cadence-killers to test different pathological scenarios
Potential risks
Just metrics and logs. Can't think of anything