/v1/jobs/statuses does not return all matching jobs / has inconsistent job lists compared with nomad status #23895
I can confirm that the problem still exists in version 1.8.3.
Hi @shoeffner, and thanks for raising this issue with the awesome information. I have not been able to reproduce this locally yet, so I will move this onto our board, marking that we need to investigate it further. I tried locally using 9 jobs, spread across two namespaces, some with matching job names:
When I run a query, filtering for either job name I get the results I would expect:
The UI also responds and filters correctly. I wonder whether this only appears at a certain number of jobs and therefore needs to be reproduced at that scale.
We have also seen this in Nomad 1.9.1+ent for jobs with the same name but in different namespaces. The UI shows only one of them, while the CLI shows both. The cluster we've seen this on runs ~150 jobs total and is ~75% EC2 Spot instances, so we get a good amount of churn as nodes are replaced and things get moved about/redeployed (perhaps that factors into it). Annoyingly, it seems intermittent too: going back to check after writing this, the UI is showing all jobs correctly.
Nomad version
CLI (for this report tested on macOS, but reports from within our institute suggest it is the same on Windows and Linux):
Server/Client (note that we apply an unrelated patch for Docker volume mounts and build Nomad ourselves, so the revision might not be accurate for 1.8.2):
Operating system and Environment details
CLI: macOS, Windows, Linux
Server/Client: Linux
Issue
The new /v1/jobs/statuses endpoint filters out too many jobs. We have multiple batch jobs which run for a week or so; they are often named with usernames, hashes, and numbers. I changed the username to mine for privacy reasons, but the gist is the same (and, as far as I can tell, not unique to these name patterns):
curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/jobs/statuses?filter=Name%20contains%20\"shoeffner-\"&namespace=*&per_page=25" \
  | jq '.[].Name'
nomad status | grep shoeffner-

I expect all three jobs to show up in the UI. Our users found workarounds in the UI by checking the evaluations or randomly clicking through the clients to navigate to the jobs (essentially navigation paths where other APIs are used). We recommend using the CLI for now, but most users prefer the UI.
While checking the code for possible changes, I stumbled over 7d3ce7e, which says that namespace filtering was not done correctly without a management token. However, a) two jobs were found, and b) even with a management token, the issue persists.
The release notes for 1.8.3 also make no mention of an issue like this, so I assume it is still present in 1.8.3.
Reproduction steps
I don't know yet how to reproduce this properly; maybe just run a few jobs with similar names? Either way, I tried calling /v1/jobs/statuses without the extra parameters and it still does not return all jobs:
while

nomad status | grep shoeffner-

lists all three of them, as shown above.

I also tried ?page_size=500 to ensure that pagination is not the problem (at least not on the caller's side): in the past, with different software, I have seen similar issues where pagination was applied before filtering, causing records to seemingly disappear at random. But that does not seem to be the case here, or pagination has a limit that overrides the parameter.
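As an aside, the pagination-before-filtering failure mode mentioned above is easy to illustrate with a toy shell example (sample data only; this is not Nomad's code, just the general pattern):

```shell
# Toy illustration (not Nomad code): if a backend paginates before
# filtering, matching records can fall outside the requested page.
printf 'alice-job\nshoeffner-1\nbob-job\ncarol-job\nshoeffner-2\n' > jobs.txt

# Correct order: filter first, then take a page -> both matches survive.
grep 'shoeffner-' jobs.txt | head -n 2
# prints: shoeffner-1, shoeffner-2

# Broken order: take a page of 2 first, then filter -> one match is lost.
head -n 2 jobs.txt | grep 'shoeffner-'
# prints only: shoeffner-1
```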
I tried to at least see whether the numbers of returned jobs differ in general, and yes, they do:
So I checked the differences between the two job lists and, to my surprise, each list contains a few jobs which are not in the other:
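One way to compare the two lists is comm on sorted name lists (a sketch with made-up sample names; in practice api.txt would come from the curl/jq call above and cli.txt from the first column of nomad status):

```shell
# Sketch: diff two job-name lists with comm (which requires sorted input).
# api.txt / cli.txt stand in for the real API and CLI outputs; the job
# names here are invented for the example.
printf 'shoeffner-aaa\nshoeffner-bbb\n'                | sort > api.txt
printf 'shoeffner-aaa\nshoeffner-bbb\nshoeffner-ccc\n' | sort > cli.txt

comm -13 api.txt cli.txt   # names only in cli.txt (missing from the API)
comm -23 api.txt cli.txt   # names only in api.txt (missing from the CLI)
```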
It could be that some of the jobs are in different states and the two commands filter different ones by default; since I don't know the exact details of the commands, take this with a grain of salt. Still, the three jobs in question should all match the same filter, are all in the same state (up and running), etc. – and they all show up in nomad status, but not all of them in /v1/jobs/statuses.

Expected Result
nomad status and /v1/jobs/statuses should contain the same jobs, and the filters for /v1/jobs/statuses should match all matching jobs.

Actual Result
See above: the relevant job is not part of the output and the job counts differ.
Job file (if appropriate)
n/a
Nomad Server logs (if appropriate)
no specific log output available
Nomad Client logs (if appropriate)
no specific log output available