
Wait for user data propagation in deployment workflow #6978

Merged
dnr merged 11 commits into temporalio:main on Dec 16, 2024

Conversation

dnr (Member) commented Dec 12, 2024

What changed?

The deployment workflow waits for user data to propagate to all task queue partitions before updating its state.

Why?

We should ensure that the desired dispatch semantics will be in effect on all task queue partitions.

How did you test it?

Existing tests, plus a new unit test.

select {
case <-ctx.Done():
	// the wait was cancelled or timed out
	return ctx.Err()
case err := <-complete:
	// propagation finished; err is nil today (see discussion below)
	return err
}
Collaborator:
Should we log the error? Also, I'm going to create a task for adding metrics and alerts; we need them across the board for effective monitoring of user data processing and the deployment WFs.

dnr (Member, Author):

err can't be anything other than nil here; I just had it return an error for future extensibility. This never returns an error that isn't a context error. The semantics are slightly different from the other long-poll code, which can return "empty"/"nothing yet" as a success response. I can do that if desired, but it seems like extra complication?

If the matchingClient call above fails, the client itself will log "matching client encountered error", so we don't have to. Well, it will not log DeadlineExceeded/Canceled, which I think is what we want.

It will log FailedPrecondition, which I want to ignore here. I think we can live with that.
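For reference, a minimal sketch of the logging rule being described, using a hypothetical shouldLog helper; this is illustrative, not the actual matching-client interceptor:

package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shouldLog mirrors the behavior described above: stay quiet about
// context-style errors, log everything else (including
// FailedPrecondition, even though the caller ignores it here).
func shouldLog(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return false
	}
	switch status.Code(err) {
	case codes.Canceled, codes.DeadlineExceeded:
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldLog(context.DeadlineExceeded))                        // false: not logged
	fmt.Println(shouldLog(status.Error(codes.FailedPrecondition, "stale"))) // true: gets logged
}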

@@ -302,11 +311,19 @@ func (d *DeploymentWorkflowRunner) handleSyncState(ctx workflow.Context, args *d
 		}
 	}
 	activityCtx := workflow.WithActivityOptions(ctx, defaultActivityOptions)
-	err = workflow.ExecuteActivity(activityCtx, d.a.SyncUserData, syncReq).Get(ctx, nil)
+	var syncRes deploymentspb.SyncUserDataResponse
+	err = workflow.ExecuteActivity(activityCtx, d.a.SyncUserData, syncReq).Get(ctx, &syncRes)
Collaborator:
Do you think we'd need to limit the number of TQs we pass to each activity call here? Passing 1000 TQs at once seems like it could become very fragile.

dnr (Member, Author):
Yeah, I can add batching in the next PR.
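A hedged sketch of what that batching could look like: split the task queue list into fixed-size chunks so each SyncUserData activity call carries a bounded number of task queues. The chunk helper, the size of 100, and the task queue names are assumptions, not the follow-up PR's code.

package main

import "fmt"

// chunk splits items into slices of at most size elements.
func chunk[T any](items []T, size int) [][]T {
	var out [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	taskQueues := make([]string, 1000) // the 1000-TQ case from the comment above
	for i := range taskQueues {
		taskQueues[i] = fmt.Sprintf("tq-%d", i)
	}
	// One SyncUserData activity per batch instead of one giant call.
	batches := chunk(taskQueues, 100)
	fmt.Printf("%d activities of up to 100 task queues each\n", len(batches))
}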

if err != nil {
	// TODO: if this fails, should we roll back anything?
	return nil, err
}
// wait for propagation
err = workflow.ExecuteActivity(activityCtx, d.a.CheckUserDataPropagation, &deploymentspb.CheckUserDataPropagationRequest{
Collaborator:
Same for this call. Here the problem would be even more severe, because there is another level of fanout to all partitions.

dnr (Member, Author):
Good point, batching sounds good.
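For the second level of fanout, one common way to keep it tame is bounded concurrency: check partitions with a limited number of in-flight calls instead of all at once. A hedged sketch; checkPartition, the limit of 10, and the partition count are illustrative, not the real CheckUserDataPropagation implementation.

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// checkPartition stands in for a per-partition user data version check.
func checkPartition(ctx context.Context, partition int) error {
	return nil
}

func checkAllPartitions(ctx context.Context, numPartitions int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(10) // at most 10 in-flight partition checks
	for p := 0; p < numPartitions; p++ {
		p := p // capture for the closure (pre-Go 1.22)
		g.Go(func() error {
			return checkPartition(ctx, p)
		})
	}
	return g.Wait()
}

func main() {
	fmt.Println(checkAllPartitions(context.Background(), 100))
}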

(The comment below is on the same CheckUserDataPropagation hunk quoted above.)
Collaborator:
We need to patch the workflow for this change, right?

dnr (Member, Author):
Ugh, I was assuming we wouldn't start using this before these changes, or that we could just reset everything...

How much do we have to be concerned about an in-progress deployment, with an activity task from the new worker getting sent to an old worker that doesn't know about it?

Actually, we can handle both at once: only do this if syncRes.TaskQueueMaxVersions has data. Replays and older activity handlers will both have no data there. That seems to work.
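A sketch of that data-driven guard, following the field names in the diff above; the request field pass-through and the surrounding error handling are assumptions, not necessarily the merged code:

// Run the new propagation check only when the sync activity returned
// per-task-queue versions. On replay of old histories, and with older
// activity handlers, TaskQueueMaxVersions is empty, so the activity is
// skipped and no explicit workflow patch marker is needed.
if len(syncRes.TaskQueueMaxVersions) > 0 {
	err = workflow.ExecuteActivity(activityCtx, d.a.CheckUserDataPropagation,
		&deploymentspb.CheckUserDataPropagationRequest{
			TaskQueueMaxVersions: syncRes.TaskQueueMaxVersions, // assumed pass-through
		}).Get(ctx, nil)
	if err != nil {
		return nil, err
	}
}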

dnr marked this pull request as ready for review December 16, 2024 18:15
dnr requested a review from a team as a code owner December 16, 2024 18:15
dnr merged commit e44457b into temporalio:main Dec 16, 2024
49 checks passed