
/v1/jobs/statuses does not return all matching jobs / has inconsistent job lists compared with nomad status #23895

Open
shoeffner opened this issue Aug 30, 2024 · 3 comments

Comments

@shoeffner

Nomad version

CLI (for this report tested on macOS, but reports from within our institute imply it's the same on Windows and Linux):

Nomad v1.8.3
BuildDate 2024-08-13T07:37:30Z
Revision 63b636e5cbaca312cf6ea63e040f445f05f00478

Server/Client (Note that we apply an unrelated patch for docker volume mounts and build nomad ourselves, so the revision might not be accurate for 1.8.2):

Nomad v1.8.2
BuildDate 2024-07-26T12:22:15Z
Revision 919bd4e7602ed1c6e26e865186be6a51f5dc33e1

Operating system and Environment details

CLI: macOS, Windows, Linux
Server/Client: Linux

Issue

The new /v1/jobs/statuses endpoint filters out too many jobs. We have multiple batch jobs which run for a week or so, and they are often named by usernames, some hashes, and some numbers. I modified the username to match mine for privacy reasons, but the gist is the same (and not unique to these name patterns, as far as I can tell):

  • When I type "shoeffner-" into the search bar of the UI, it calls something more or less equivalent to: curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/jobs/statuses?filter=Name%20contains%20\"shoeffner-\"&namespace=*&per_page=25"
  • The response contains two results (here with | jq '.[].Name'):
    "shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_03"
    "shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_02"
    
  • But I am missing _01, which shows up when I do nomad status | grep shoeffner-:
     shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_01      batch                50        running         2024-08-28T16:39:29+02:00
     shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_02      batch                50        running         2024-08-26T09:54:44+02:00
     shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_03      batch                50        running         2024-08-29T08:41:29+02:00
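
The hand-escaped query string in the first bullet can also be built by curl itself. The following is a sketch of the same request using curl's `-G` and `--data-urlencode` options, which percent-encode the spaces and quotes in the filter expression. `query_statuses` is a hypothetical helper name, not part of Nomad or curl.

```shell
# Query /v1/jobs/statuses with a filter expression, letting curl do the URL encoding.
# $1: filter expression, e.g. 'Name contains "shoeffner-"'
# Assumes NOMAD_ADDR and NOMAD_TOKEN are set in the environment.
query_statuses() {
  # -G turns the --data-urlencode pairs into URL query parameters (GET request).
  curl -s -G -H "X-Nomad-Token: ${NOMAD_TOKEN}" \
    --data-urlencode "filter=$1" \
    --data-urlencode "namespace=*" \
    --data-urlencode "per_page=25" \
    "${NOMAD_ADDR}/v1/jobs/statuses"
}
```

It would be called as `query_statuses 'Name contains "shoeffner-"' | jq '.[].Name'`.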
    

I expect all three jobs to show up in the UI. Our users found workarounds in the UI by checking the evaluations or randomly clicking through the clients to navigate to the jobs (basically navigation paths where other APIs are used). We recommend using the CLI right now, but most users prefer using the UI.

While checking the code for possible changes, I came across 7d3ce7e, which says that namespace filtering was not done correctly without a management token. That does not seem to explain this case, though: a) two of the three jobs were found, and b) the issue persists even with a management token.

The update notes for 1.8.3 also make no mention of an issue like this, so I assume it is still the case for 1.8.3.

Reproduction steps

I don't know yet how to reproduce this properly; maybe just run a few jobs with similar names? Either way, I tried calling /v1/jobs/statuses without the extra parameters and it still does not return all jobs:

$ curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/jobs/statuses?namespace=*" | jq '.[].Name' | grep shoeffner-
"shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_03"
"shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_02"

while nomad status | grep shoeffner- lists all three of them, as shown above.
I also tried ?page_size=500 to ensure that pagination is not the problem (at least not on the caller's side). In the past, with different software, I had similar issues where pagination was applied before filtering, causing random records to disappear. That does not seem to be the case here, or else the pagination has a limit that overrides my value.
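
For reference, Nomad's list endpoints are paginated with the `per_page` and `next_token` query parameters, and the token for the following page comes back in the `X-Nomad-NextToken` response header. Below is a minimal sketch that walks all pages, assuming `/v1/jobs/statuses` follows that scheme. The function names and the page size of 100 are illustrative, not taken from the report.

```shell
# Fetch one page of job statuses.
# $1: next_token ("" for the first page), $2: file that receives the response headers.
# Assumes NOMAD_ADDR and NOMAD_TOKEN are set in the environment.
fetch_page() {
  fp_token=$1
  fp_headers=$2
  set -- --data-urlencode "namespace=*" --data-urlencode "per_page=100"
  if [ -n "$fp_token" ]; then
    set -- "$@" --data-urlencode "next_token=$fp_token"
  fi
  curl -s -G -D "$fp_headers" -H "X-Nomad-Token: ${NOMAD_TOKEN}" "$@" \
    "${NOMAD_ADDR}/v1/jobs/statuses"
}

# Extract the next-page token from a header dump; prints nothing if absent.
next_token_from_headers() {
  # Header names are case-insensitive and lines end in CRLF.
  tr -d '\r' < "$1" | awk -F': ' 'tolower($1) == "x-nomad-nexttoken" { print $2 }'
}

# Print every page's JSON array, one per line, until no token is returned.
fetch_all_pages() {
  page_token=""
  page_headers=$(mktemp)
  while :; do
    fetch_page "$page_token" "$page_headers"
    echo  # terminate the page's JSON with a newline
    page_token=$(next_token_from_headers "$page_headers")
    [ -n "$page_token" ] || break
  done
  rm -f "$page_headers"
}
```

Because jq accepts a stream of JSON values, `fetch_all_pages | jq -r '.[].Name' | wc -l` would count the jobs across all pages.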

I also checked whether the total number of returned jobs differs, and it does:

$ NOMAD_NAMESPACE=* nomad status | tail -n +2 | wc -l 
     204
$ curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/jobs/statuses?namespace=*" | jq -r '.[].Name' | wc -l
     193

So I tried to check for the differences in the job list and to my surprise, both lists contain a few jobs which are not in the other list:

$ LEFT=$(curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/jobs/statuses?namespace=*" | jq -r '.[].Name' | sort)
$ RIGHT=$(NOMAD_NAMESPACE=* nomad status | tail -n +2 | cut -d' ' -f 1 | sort)
$ comm -13 <(echo "$LEFT") <(echo "$RIGHT") | wc -l
      20
$ comm -23 <(echo "$LEFT") <(echo "$RIGHT") | wc -l
       9

It could be that some of the jobs are in different states and the two commands filter different ones by default; since I don't know the exact details of the commands, this is to be taken with a grain of salt. Still, the three jobs in question should all match the same filter and are all in the same state (up and running). They all show up in nomad status, but not all of them show up in /v1/jobs/statuses.
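
One way to rule out artifacts in the comparison itself is to compare on namespace plus job ID rather than on Name alone (the same name can exist in several namespaces), and to sort both lists with the same collation. The sketch below assumes each input file holds one `namespace/job-id` key per line. `job_key_diff` and that key format are hypothetical, and extracting the keys from each source is left out.

```shell
# Print the keys that appear in only one of two lists.
# $1, $2: files with one "namespace/job-id" key per line (hypothetical format).
job_key_diff() {
  left_sorted=$(mktemp)
  right_sorted=$(mktemp)
  # comm needs both inputs sorted with the same collation; LC_ALL=C makes
  # the order byte-wise and independent of the user's locale.
  LC_ALL=C sort -u "$1" > "$left_sorted"
  LC_ALL=C sort -u "$2" > "$right_sorted"
  # -23 keeps lines unique to the first file, -13 lines unique to the second.
  LC_ALL=C comm -23 "$left_sorted" "$right_sorted" | sed 's/^/only-left  /'
  LC_ALL=C comm -13 "$left_sorted" "$right_sorted" | sed 's/^/only-right /'
  rm -f "$left_sorted" "$right_sorted"
}
```

With the keys from /v1/jobs/statuses in one file and the keys from the CLI listing in another, `job_key_diff statuses.txt cli.txt` lists exactly which jobs each side is missing.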

Expected Result

nomad status -namespace '*'

and

curl -s -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/jobs/statuses?namespace=*"

should contain the same jobs, and the filters for /v1/jobs/statuses should match all matching jobs.

Actual Result

See above: the relevant job is not part of the output and the job counts differ.

Job file (if appropriate)

n/a

Nomad Server logs (if appropriate)

no specific log output available

Nomad Client logs (if appropriate)

no specific log output available


qk4l commented Sep 3, 2024

I can confirm that the problem exists in version 1.8.3.

Member

jrasell commented Sep 18, 2024

Hi @shoeffner and thanks for raising this issue with the awesome information. I have not been able to reproduce this yet locally, so I will move this onto our board, marking that we need to investigate it further.

I tried locally using 9 jobs, spread across 2 namespaces, some with matching job names:

$ nomad status -namespace="*"
ID                                                           Namespace  Type     Priority  Status   Submit Date
jrasell-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_01    default    service  50        running  2024-09-18T09:10:39+01:00
jrasell-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_02    default    service  50        running  2024-09-18T09:10:42+01:00
jrasell-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_03    default    service  50        running  2024-09-18T09:11:22+01:00
shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_01  default    service  50        running  2024-09-18T09:10:19+01:00
shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_01  gh-23895   service  50        running  2024-09-18T09:15:34+01:00
shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_02  default    service  50        running  2024-09-18T09:10:23+01:00
shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_02  gh-23895   service  50        running  2024-09-18T09:15:37+01:00
shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_03  default    service  50        running  2024-09-18T09:10:25+01:00
shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_03  gh-23895   service  50        running  2024-09-18T09:16:10+01:00

When I run a query filtering for either job name, I get the results I would expect:

$ curl -s "http://localhost:4646/v1/jobs/statuses?filter=Name%20contains%20\"shoeffner-\"&namespace=*&per_page=25" | jq '.[].Name'
"shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_03"
"shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_02"
"shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_01"
"shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_03"
"shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_02"
"shoeffner-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_01"

$ curl -s "http://localhost:4646/v1/jobs/statuses?filter=Name%20contains%20\"jrasell-\"&namespace=*&per_page=25" | jq '.[].Name'
"jrasell-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_03"
"jrasell-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_02"
"jrasell-e800b8876ae3bc974ab8bd16f6c1b52281b82f4e-1.2.3_01"

The UI also responds and filters correctly. I wonder whether this can only be reproduced once the cluster reaches a certain number of jobs.

@jrasell jrasell added theme/api HTTP API and SDK issues stage/needs-investigation labels Sep 18, 2024
@jrasell jrasell moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Sep 18, 2024
Contributor

t-davies commented Nov 5, 2024

We have also seen this in Nomad 1.9.1+ent for jobs with the same name, but in different namespaces. The UI shows only one of them, but the CLI will show both of them. The cluster we've seen this on is running ~150 jobs total. The cluster is also ~75% EC2 Spot instances, so we get a good amount of churn with things being moved about/redeployed/etc. as nodes are replaced (thinking maybe this could factor into things).

Annoyingly, it seems intermittent too: going back to check after writing this, the UI is showing all jobs correctly.

Projects
Status: Needs Roadmapping
Development

No branches or pull requests

5 participants