investigate reliability of connector syncs on m1 k8s #10122

Closed
jrhizor opened this issue Feb 5, 2022 · 1 comment
Labels
area/platform (issues related to the platform) · needs-triage · priority/medium (Medium priority) · team/platform-move · type/bug (Something isn't working)

Comments

@jrhizor
Contributor

jrhizor commented Feb 5, 2022

I'm running into something strange on M1 using the published PokeAPI + JSON Destination images on a locally built platform.

If I run on docker-compose, I can run it as many times as I want and see successful runs (even 180+ in a row). However, if I run on Kubernetes, the syncs sometimes get stuck. Specifically, the source completes and closes, but the destination never completes: the destination's remote-stdin container closes, but the pod just hangs. This doesn't really look like a platform pod-orchestration problem (?), because the destination's main loop just sits there indefinitely and the platform waits as expected. This happens fairly reliably (I almost always hit it within 10 runs of this sync).
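
For reference, a quick way to see which containers in a stuck destination pod have already exited and which are still running (the pod name below is from one of my stuck runs, so adjust it; container names may differ slightly):

❯ kubectl get pod destination-local-json-sync-6-0-zqark \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
❯ kubectl describe pod destination-local-json-sync-6-0-zqark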

This transient behavior of getting stuck happens on both the non-container orchestrator and container orchestrator approaches with similar frequencies and in the same place.

I haven't attempted to reproduce this with other connectors (doing so would better point to where the issue lies). I think the issue must be either specific to Docker Desktop on M1 or a race condition in the entrypoint. The only reason I think it may be M1-specific is that we haven't seen this precise issue in builds (linux amd64) (worth confirming) or in cloud (linux amd64).

The remote-stdin container only logs the following before closing:

❯ kubectl logs -f destination-local-json-sync-6-0-zqark remote-stdin
2022/02/04 23:43:02 socat[1] W ioctl(5, IOCTL_VM_SOCKETS_GET_LOCAL_CID, ...): Function not implemented

but that line is also logged on successful runs.

I think the first steps would be to run the syncs 10-20x (see the loop sketch after this list) on:

  1. M1 docker-for-desktop with non-JSON destinations, to see if this is a destination-specific problem (seems unlikely but worth ruling out)
  2. Kubernetes on the ec2 build runners
  3. docker-for-desktop Kubernetes on non-M1 hardware
  4. minikube or any other non-docker-desktop/kind Kubernetes on M1
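
For the repetition itself, a simple loop against the local API should do. This is only a sketch: it assumes the default airbyte-server port and the standard connections/sync endpoint, and <connection-id> is a placeholder for the test connection's id:

  # kick off the same connection sync 20 times (placeholder connection id)
  for i in $(seq 1 20); do
    echo "starting sync attempt $i"
    curl -s -X POST http://localhost:8001/api/v1/connections/sync \
      -H "Content-Type: application/json" \
      -d '{"connectionId": "<connection-id>"}'
    sleep 120   # crude wait between attempts; polling the job status would be better
  done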

No matter which environment we narrow it down to, we'll probably need to add diagnostic logging to the entrypoint to pinpoint exactly where termination gets stuck.
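
Concretely I'm imagining timestamped trace lines plus an exit trap so we can see exactly how far termination gets. Rough sketch only, assuming the entrypoint is a bash script; the pipe path and destination command below are placeholders, not the real ones:

  #!/usr/bin/env bash
  # illustrative trace logging only; paths and commands are placeholders
  log() { echo "$(date -u +%FT%TZ) [entrypoint-debug] $*"; }

  trap 'log "exit trap fired with status $?"' EXIT

  log "waiting on data from remote-stdin"
  destination-command < /pipes/stdin   # placeholder for the real destination invocation
  log "destination command returned $?"

  log "starting termination/cleanup"
  # ...existing termination steps, each bracketed with a log call...
  log "termination complete"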

I'm putting this at high priority since it's a non-trivial barrier to trusting the results of platform code changes locally when developing on M1.

@jrhizor added the type/bug, priority/medium, area/platform, and needs-triage labels on Feb 5, 2022
@jrhizor changed the title from "investigate reliability of connector syncs on m1" to "investigate reliability of connector syncs on m1 k8s" on Feb 5, 2022
@davinchia
Contributor

duplicate of #2017
