large workflow hot shard detection #5166

Merged
merged 33 commits into from
Mar 30, 2023

Conversation

allenchen2244 (Contributor) commented Mar 17, 2023

What changed?
Added metrics that flag, per shardID, workflows with large blobs, large history size, or high history count.

Why?
Identify shards that are under heavy load.

How did you test it?
Auto tests pass. Fully tested everything end to end in staging using cadence-killers to test different pathological scenarios

Potential risks
Just metrics and logs. Can't think of any risks.

coveralls commented Mar 17, 2023

Pull Request Test Coverage Report for Build 0187308e-5976-434b-913d-d8baabb8ff4c

  • 29 of 37 (78.38%) changed or added relevant lines in 4 files are covered.
  • 113 unchanged lines in 10 files lost coverage.
  • Overall coverage increased (+0.09%) to 57.173%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| service/history/execution/context_util.go | 18 | 26 | 69.23% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| common/task/weightedRoundRobinTaskScheduler.go | 1 | 89.64% |
| common/task/fifoTaskScheduler.go | 2 | 85.57% |
| service/history/queue/timer_gate.go | 3 | 95.83% |
| service/history/task/fetcher.go | 4 | 91.24% |
| service/history/decision/task_handler.go | 5 | 72.33% |
| common/persistence/nosql/nosqlplugin/cassandra/workflow.go | 6 | 59.55% |
| common/persistence/nosql/nosqlplugin/cassandra/workflowUtils.go | 12 | 76.83% |
| service/history/task/task_util.go | 20 | 70.57% |
| service/history/execution/mutable_state_task_refresher.go | 21 | 62.66% |
| common/persistence/nosql/nosqlplugin/cassandra/workflowParsingUtils.go | 39 | 81.99% |

Totals Coverage Status
Change from base Build 01873022-f3f6-47bc-a05c-9ee5d87ab662: 0.09%
Covered Lines: 85388
Relevant Lines: 149351

💛 - Coveralls

@allenchen2244 allenchen2244 changed the title from "large workflow hot shard detection" to "WIP large workflow hot shard detection" Mar 22, 2023
allenchen2244 and others added 2 commits March 28, 2023 18:00
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
allenchen2244 and others added 2 commits March 28, 2023 18:31
@@ -852,6 +854,7 @@ func (c *contextImpl) UpdateWorkflowExecutionWithNew(
domainName,
resp.MutableStateUpdateSessionStats,
)
c.emitLargeWorkflowShardIDStats(currentWorkflowSize-oldWorkflowSize, oldWorkflowHistoryCount, oldWorkflowSize)

Member

while I'm exploring the code in here:
would any of these be more-correct if they came from newWorkflow? and/or would using newWorkflow mean we wouldn't need to maintain as much separately?

I'm not quite sure what the difference is tbh. I'd have to read more carefully. I don't think it'll be dangerous to use the wrong one, just possibly misleading (due to subtly-wrong values, or due to unnecessary duplicate calculations that could drift)

Contributor Author

I don't think it matters. From what I can tell new workflow kind of derives the numbers the same way I am. Maybe it would be less duplicated code but at most it's saving 5 lines and makes it a bit more confusing to read imo.

Member
@Groxx Groxx Mar 29, 2023

the main reason to prefer using the new-workflow's data is that it'd make it clear what the source of truth is, and get rid of any risk of the calculations drifting away from that source of truth due to future changes.

which is an "if". tbh I'm not sure if it's more-correct or not.

Contributor Author
@allenchen2244 allenchen2244 Mar 29, 2023

I could try testing it later? The other 2 metric emission fns are kinda calculated the same way, so I wanted to make it consistent. I don't wanna just change part of it, because then it's super confusing if we never change the rest.

Member

which ones are those?
but yea, later's fine (if ever)

@allenchen2244 allenchen2244 enabled auto-merge (squash) March 29, 2023 02:34
if blobSize > blobSizeWarn {
	c.logger.SampleInfo("Workflow writing a large blob", c.shard.GetConfig().SampleLoggingRate(), tag.WorkflowDomainName(c.GetDomainName()),
		tag.WorkflowID(c.workflowExecution.GetWorkflowID()), tag.ShardID(c.shard.GetShardID()))
	c.metricsClient.Scope(metrics.LargeExecutionBlobShardScope, metrics.ShardIDTag(shardIDStr)).IncCounter(metrics.LargeHistoryBlobCount)

Member

kinda odd at a glance that this takes a string, but meh. maybe there's a reason for it.
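
One possible reason, assuming the metrics backend models tags as plain string key/value pairs (common, but not confirmed in this thread): a numeric shard ID has to be formatted somewhere before it can become a tag value. The sketch below uses a hypothetical `tag` type and helper names, not the actual `metrics` package API, to show where that conversion could live.

```go
package main

import (
	"fmt"
	"strconv"
)

// tag is a hypothetical string key/value metric tag.
type tag struct {
	key   string
	value string
}

// shardIDTag mirrors the shape seen in the diff: the caller passes
// the shard ID already formatted as a string.
func shardIDTag(shardID string) tag {
	return tag{key: "shard_id", value: shardID}
}

// shardIDTagFromInt is a hypothetical convenience wrapper that does
// the int-to-string conversion in one place instead of at each call site.
func shardIDTagFromInt(shardID int) tag {
	return shardIDTag(strconv.Itoa(shardID))
}

func main() {
	fmt.Println(shardIDTagFromInt(7))
}
```

Taking a string keeps the tag constructor trivial, while a wrapper like the second function would save each caller from building `shardIDStr` by hand.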

Member
@Groxx Groxx left a comment

👍 yep, lgtm now. one last Q on what the other metrics-y funcs are for my own reading, but I think this is ready to go

@allenchen2244 allenchen2244 merged commit 9d01035 into master Mar 30, 2023
@allenchen2244 allenchen2244 deleted the large-history-shards branch March 30, 2023 03:50
davidporter-id-au added a commit that referenced this pull request Mar 30, 2023
commit 9d01035
Author: allenchen2244 <102192478+allenchen2244@users.noreply.github.com>
Date:   Wed Mar 29 20:50:38 2023 -0700

    large workflow hot shard detection (#5166)

    Metrics for large workflows

commit dd51c53
Author: David Porter <david.porter@uber.com>
Date:   Wed Mar 29 18:30:06 2023 -0700

    fix build (#5180)

commit 7b281c2
Author: David Porter <david.porter@uber.com>
Date:   Mon Mar 27 10:38:37 2023 -0700

    Adds a small test to catch issues with deadlocks (#5171)

    * Adds a small test to catch issues with deadlocks

commit f1e2476
Author: sonpham96 <sonpham1996@gmail.com>
Date:   Sat Mar 18 05:32:01 2023 +0700

    Upgrade Golang base image to 1.18 to remediate CVEs (#5035)

    Co-authored-by: David Porter <david.porter@uber.com>

commit 1519ace
Author: charlese-instaclustr <76502507+charlese-instaclustr@users.noreply.github.com>
Date:   Fri Mar 17 22:11:27 2023 +0000

    Fix type validation in configstore DC client value updating (#5110)

    * Remove misleading type check, Add more detailed log message

    * removing debugging logging

    * Handle nil update edge case

    ---------

    Co-authored-by: allenchen2244 <102192478+allenchen2244@users.noreply.github.com>
    Co-authored-by: Zijian <Shaddoll@users.noreply.github.com>

commit a3e2774
Author: charlese-instaclustr <76502507+charlese-instaclustr@users.noreply.github.com>
Date:   Fri Mar 17 19:02:40 2023 +0000

    Add Canary TLS support (#5086)

    * add support for TLS connections by Canary, add development config for Canary with TLS

    * update README to include new config option

    * remove testing config

    ---------

    Co-authored-by: David Porter <david.porter@uber.com>
    Co-authored-by: Shijie Sheng <shengs@uber.com>
    Co-authored-by: Zijian <Shaddoll@users.noreply.github.com>

commit ff4eab2
Author: Shijie Sheng <shengs@uber.com>
Date:   Thu Mar 16 20:10:54 2023 -0700

    [history] more cautious in deciding domain state to make decisions on dropping queued tasks (#5164)

    What changed?

    When domain cache returned entity not found error, don't drop queued tasks to be more conservative.

    Why?

    In cases when the cache is dubious, we shouldn't drop the queued tasks.

commit 55a8d93
Author: neil-xie <104041627+neil-xie@users.noreply.github.com>
Date:   Thu Mar 16 14:18:35 2023 -0700

    Add Pinot docker files, table config and schema (#5163)

    * Initial checkin for pinot config files

commit 1304570
Author: Mantas Šidlauskas <mantass@netapp.com>
Date:   Thu Mar 16 15:20:29 2023 +0200

    Set poll interval for filebased dynamic config if not set (#5160)

    * Set poll interval for filebased dynamic config if not set

    * update unit test

commit 42a14b1
Author: Mantas Šidlauskas <mantass@netapp.com>
Date:   Thu Mar 16 10:49:21 2023 +0200

    Elasticsearch: reduce code duplication (#5137)

    * Elasticsearch: reduce code duplication

    * address comments

    ---------

    Co-authored-by: Zijian <Shaddoll@users.noreply.github.com>

commit cbf0d14
Author: bowen xiao <xbowen@uber.com>
Date:   Wed Mar 15 10:19:34 2023 -0700

    fix samples documentation (#5088)

commit ba19a29
Author: Mantas Šidlauskas <mantass@netapp.com>
Date:   Wed Mar 15 12:52:29 2023 +0200

    Add ShardID to valid attributes (#5161)

commit a25cba8
Author: Mantas Šidlauskas <mantass@netapp.com>
Date:   Wed Mar 15 10:56:50 2023 +0200

    ES: single interface for different ES/OpenSearch versions (#5158)

    * ES: single interface for different ES/OpenSearch versions

    * make fmt

commit e3ac246
Author: Ketsia <115650494+ketsiambaku@users.noreply.github.com>
Date:   Tue Mar 14 12:47:40 2023 -0700

    added logging with workflow/domain tags (#5159)

commit 9581488
Author: Ketsia <115650494+ketsiambaku@users.noreply.github.com>
Date:   Mon Mar 13 16:56:45 2023 -0700

    Consistent query pershard metric (#5143)

    * added and update consistent query per shard metric

    * testing pershard metric

    * move sample logger into persistence metric client for cleaness

    * fix test

    * fix lint

    * fix test again

    * fix lint

    * sample logging with workflowid tag

    * added domain tag to logger

    * metric completed

    * addressing comments

    * fix lint

    * Revert "fix lint"

    This reverts commit 1e96944.

    * fix lint second attempt

    ---------

    Co-authored-by: Allen Chen <allenchen2244@uber.com>