Levant deploy stuck in loop when deployment object is empty #263
I am facing this issue again. Below are the logs from the deploy command.
It stays stuck in this loop forever. From what I can see, this is because the deployment endpoint from Nomad returned an empty response.
This may also be because the Nomad cluster went into an inconsistent state, but I would like to see Levant work around it. |
Another update. Today, I hit this again even when the |
No, I don't think they are related. Because recently I have also seen the issue even without |
Ok. I know that jrasell has been swamped lately. |
Hi @msvbhat, sorry for the delay. As @stevenscg has pointed out, I have been pretty busy in my day job and have let Levant slide a little, so apologies for that. I am dedicating some time tomorrow to fix this problem, which I believe will be solved by adding some timeout logic to the code where it checks for a deployment object, rather than continually trying to get a deployment ID. I am thinking this can be short, maybe 30s, before exiting and stating that no deployment ID was generated. If you have any thoughts please let me know. |
Hi @jrasell, thanks for replying. I understand about the job, so please don't feel you have to apologise. :) I think timeout logic would be easy to implement and would fix the problem at hand. I can also try to send a PR for the timeout. But I was trying to understand why a deployment watcher is necessary when there is no change in the job file and no deployment is triggered in Nomad. |
In situations where the deployed evaluations didn't invoke a deployment, the response from the Nomad eval endpoint would include an empty deployment ID. Levant would continue to retry until the deployment ID was populated, which might never happen, causing Levant to get stuck in a loop forever. This change adds a timeout to the function which performs the above work, so that if no deployment ID has been returned after 60s, Levant will exit with a useful message. Closes #263
Fix deploy loop bug when evaluation didn't include a deployment ID.
We had a Nomad job which was last deployed a couple of months ago. For some reason, /v1/job//deployments returned an empty object. So when I tried to deploy the job (with no changes to the job definition), Levant got stuck in a loop for more than 20 minutes. The command I ran for the Levant deploy is
levant deploy -ignore-no-changes -var-file <file> job.nomad
levant version
Levant v0.2.5
Date: 2018-10-25T13:22:22Z
Commit: 0514741
Branch: 0.2.5
State: 0.2.5
Summary: 0514741
nomad version
Nomad v0.8.5 (90fbfaba6a6d9af7febc39082b95ed832d8b8bd6)
Debug log outputs from Levant:
I don't have the DEBUG log output any more (we lost it when we did the workaround), but the debug log had a lot of the output below:
levant/deploy: Nomad returned an empty deployment for evaluation ; retrying
When I checked the levant code, I observed that the DeploymentID is empty and that is why it got stuck.
https://github.com/jrasell/levant/blob/cc275cb120fda9dfaf20ffaebb36c12305495acb/levant/deploy.go#L446
The steps to reproduce would be to have a job that doesn't have a deployment (I am not sure how a job can end up in this state), and then try to deploy that job without making any changes to the job definition, using the -ignore-no-changes option. If I am able to reproduce this again in our environment, I will provide more information.