Skip to content

Respect WorkflowClientInterceptors in newUntypedWorkflowStub(WorkflowExecution, ...) #813

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 4 commits into from
Oct 13, 2021

Conversation

nagl-stripe
Copy link
Contributor

What was changed

This PR contains a bug report in the form of a test case, and also a proposed fix.

WARNING: This is only a suggested fix. Be paranoid.

Given some of the @Deprecated warnings around these calls, I'm not sure if the fix I made is the right one or not. Maybe the problem is actually somewhere in TestWorkflowEnvironmentInternal? Maybe it's in TimeLockingInterceptor itself? (Perhaps it should overload workflowClientCallsInterceptor?).

Context

I have a java user at Stripe who is writing tests for a flow that starts a long-lived workflow, stores its wf-id/run-id, and checks up on it later. I think they've found a bug around TestWorkflowEnvironment, where timeskipping is enabled or not based on which overload of newUntypedWorkflowStub you use.

Why?

This change makes it so that all 3 overloads of newUntypedWorkflowStub wire in the WorkflowClientInterceptors. This is particularly important for the TestWorkflowEnvironment, which uses an interceptor to enable timeskipping.

Checklist

  1. Closes (...I didn't file a bug, but I'm happy to make one, especially if the fix is more involved)

  2. How was this tested:
    ./gradlew test passes, and I added a test to TestWorkflowEnvironmentSleepTest that fails if you remove the fix.

  3. Any docs updates needed?
    No

@Spikhalskiy
Copy link
Contributor

Spikhalskiy commented Oct 12, 2021

@nagl-stripe Thanks for the PR! TestWorkflowEnvironmentSleepTest::timeskippingIsAlwaysOn is failing in buildkite. Let me know if you need details of the failure.

@nagl-stripe
Copy link
Contributor Author

@nagl-stripe Thanks for the PR! TestWorkflowEnvironmentSleepTest::timeskippingIsAlwaysOn is failing in buildkite. Let me know if you need details of the failure.

Yes please - it passes locally.

@Spikhalskiy
Copy link
Contributor

@nagl-stripe


io.temporal.testing.TestWorkflowEnvironmentSleepTest > timeskippingIsAlwaysOn FAILED
    io.temporal.client.WorkflowFailedException: workflowId='3efb9877-d205-4fbe-b593-5e27900315b4', runId='e6f5cca5-4e67-4417-9451-18157391d223', workflowType='ConfigurableSleepWorkflow', retryState=RETRY_STATE_TIMEOUT, workflowTaskCompletedEventId=0
        at io.temporal.internal.common.WorkflowExecutionUtils.getResultFromCloseEvent(WorkflowExecutionUtils.java:121)
        at io.temporal.internal.client.WorkflowClientLongPollHelper.getWorkflowExecutionResult(WorkflowClientLongPollHelper.java:74)
        at io.temporal.internal.client.RootWorkflowClientInvoker.getResult(RootWorkflowClientInvoker.java:93)
        at io.temporal.internal.sync.WorkflowStubImpl.getResult(WorkflowStubImpl.java:243)
        at io.temporal.internal.sync.WorkflowStubImpl.getResult(WorkflowStubImpl.java:234)
        at io.temporal.testing.TestWorkflowEnvironmentInternal$TimeLockingInterceptor$TimeLockingWorkflowStub.getResult(TestWorkflowEnvironmentInternal.java:337)
        at io.temporal.testing.TestWorkflowEnvironmentSleepTest.waitForWorkflow(TestWorkflowEnvironmentSleepTest.java:231)
        at io.temporal.testing.TestWorkflowEnvironmentSleepTest.timeskippingIsAlwaysOn(TestWorkflowEnvironmentSleepTest.java:216)
 
        Caused by:
        io.temporal.failure.TimeoutFailure: timeoutType=TIMEOUT_TYPE_START_TO_CLOSE
            at io.temporal.internal.common.WorkflowExecutionUtils.getResultFromCloseEvent(WorkflowExecutionUtils.java:121)
            ... 7 more

io.temporal.testing.TestWorkflowEnvironmentSleepTest > timeskippingIsAlwaysOn STANDARD_ERROR		
	Stored Workflows:	
	ExecutionId{namespace='default', execution=workflow_id: "3efb9877-d205-4fbe-b593-5e27900315b4"	
	run_id: "e6f5cca5-4e67-4417-9451-18157391d223"	
	}	
		
	event_id: 1	
	event_time { seconds: 1634001160 nanos: 411000000 }	
	event_type: EVENT_TYPE_WORKFLOW_EXECUTION_STARTED	
	workflow_execution_started_event_attributes {	
	workflow_type { name: "ConfigurableSleepWorkflow" }	
	task_queue { name: "EXAMPLE" }	
	input {	
	payloads {	
	metadata {	
	key: "encoding"	
	value: "json/plain"	
	}	
	data: "600000"	
	}	
	}	
	workflow_execution_timeout { seconds: 30 }	
	workflow_run_timeout { seconds: 30 }	
	workflow_task_timeout { seconds: 10 }	
	original_execution_run_id: "e6f5cca5-4e67-4417-9451-18157391d223"	
	identity: "11958@a13065ac7065"	
	attempt: 1	
	header { }	
	}	
		
	event_id: 2	
	event_time { seconds: 1634001160 nanos: 411000000 }	
	event_type: EVENT_TYPE_WORKFLOW_TASK_SCHEDULED	
	workflow_task_scheduled_event_attributes {	
	task_queue { name: "EXAMPLE" }	
	start_to_close_timeout { seconds: 10 }	
	attempt: 1	
	}	
		
	event_id: 3	
	event_time { seconds: 1634001160 nanos: 411000000 }	
	event_type: EVENT_TYPE_WORKFLOW_TASK_STARTED	
	workflow_task_started_event_attributes {	
	scheduled_event_id: 2	
	identity: "11958@a13065ac7065"	
	}	
		
	event_id: 4	
	event_time { seconds: 1634001160 nanos: 423000000 }	
	event_type: EVENT_TYPE_WORKFLOW_TASK_COMPLETED	
	workflow_task_completed_event_attributes {	
	scheduled_event_id: 2	
	identity: "11958@a13065ac7065"	
	}	
		
	event_id: 5	
	event_time { seconds: 1634001160 nanos: 423000000 }	
	event_type: EVENT_TYPE_TIMER_STARTED	
	timer_started_event_attributes {	
	timer_id: "e07d5acd-394c-39c6-9c39-5eef38a79c29"	
	start_to_fire_timeout { seconds: 600 }	
	workflow_task_completed_event_id: 3	
	}	
		
	event_id: 6	
	event_time { seconds: 1634001190 nanos: 411000000 }	
	event_type: EVENT_TYPE_WORKFLOW_EXECUTION_TIMED_OUT	
	workflow_execution_timed_out_event_attributes {	
	retry_state: RETRY_STATE_TIMEOUT	
	}	

@nagl-stripe
Copy link
Contributor Author

The CI failure was just a misalignment between two timeouts. I'd made a last minute change to one, but not both, and hadn't tested after.

@Spikhalskiy Spikhalskiy merged commit bbc40f2 into temporalio:master Oct 13, 2021
@nmeylan
Copy link

nmeylan commented Dec 14, 2021

Hello, after having upgrade from 1.1.0 to 1.6.0 and a long day of debugging, I found out, using git bisect that this PR was leading to one of my test failing.
The test is an integration test where I make an activity been retries 3 times, before succeed, then my workflow await for a signal to be sent from somewhere (where I call workflowClient.newUntypedWorkflowStub(Type, workflowId) ).
It fails because I have following code in my test to await the end of the workflow:

WorkflowStub workflow = testWorkflowClient.newUntypedWorkflowStub(workflowId, Optional.empty(), Optional.empty() );
try {
	return workflow.getResult(workflowReturnType); // await for workflow to complete
} catch(WorkflowFailedException e) {
	// This exception is thrown where => e.getRetryState().equals(RetryState.RETRY_STATE_TIMEOUT)
}

The only way I found to make it pass again, except by removing this commit, was to do add a sleep before the "newUntypedWorkflowStub" where sleep ends after all activities run, i found that useful comment https://github.com/temporalio/sdk-java/blob/v1.5.0/temporal-sdk/src/test/java/io/temporal/workflow/cancellationTests/ActivityCancellationTest.java#L59
But I don't like this option, to put a sleep.

While I understand the goal of this PR, it seems to not work for such use case. Is there a method to remove timeskipping interceptors? maybe it is not related to that
Maybe the right question would be: How am I suppose to do to await the completion of a workflow with running activities?

EDIT: it is fixed in 1.7.0, thanks!

@Spikhalskiy
Copy link
Contributor

Spikhalskiy commented Dec 15, 2021

@nmeylan Right now time-skipping is always on by default for tests, yes. And it was this bug that it wasn't for only this one method of creating the stubs.
Me or @nagl-stripe are going to expose methods that will allow you to disable time-skipping in an upcoming couple of days, stay tuned.

@nagl-stripe
Copy link
Contributor Author

See #937

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants