Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

The task-topology plugin cannot handle the tasks whose name contains - #2939

Closed
loheagn opened this issue Jun 27, 2023 · 4 comments · Fixed by #2940
Closed

The task-topology plugin cannot handle the tasks whose name contains - #2939

loheagn opened this issue Jun 27, 2023 · 4 comments · Fixed by #2940
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@loheagn
Copy link

loheagn commented Jun 27, 2023

What happened:

Assume that we have a kubernetes cluster with two nodes and a simple job with topology annotations:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    volcano.sh/task-topology-affinity: "nginx-worker,ubuntu-worker"
    volcano.sh/task-topology-anti-affinity: "nginx-worker"
  name: example-job-1
spec:
  minAvailable: 5
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 2
      name: nginx-worker
      template:
        spec:
          containers:
            - image: nginx
              name: nginx-main
          restartPolicy: OnFailure
    - replicas: 3
      name: ubuntu-worker
      template:
        spec:
          containers:
            - command:
                - sleep
                - inf
              image: ubuntu
              name: ubuntu-main
          restartPolicy: OnFailure

The names of the tasks in the job all contain char -.

And the scheduler config file:

actions: "enqueue, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: task-topology

After applying the job yaml, the task-topology plugin not works and we get the error log like:

I0627 12:21:06.479367   43679 topology.go:342] start to init task topology plugin, weight[1], defined order map[0:4 1:1 2:2 3:3]
I0627 12:21:06.479469   43679 topology.go:286] Job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3> affinity key invalid: task nginx-worker do not exist in job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3>.
I0627 12:21:06.479489   43679 topology.go:311] Job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3> affinity key invalid: task nginx-worker do not exist in job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3>.
I0627 12:21:06.479501   43679 topology.go:227] Failed to read task topology from job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3> annotations, error: task nginx-worker do not exist in job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3>.

And the pods are not scheduled as expect:

image

What you expected to happen:

The task-topology should work well and the nginx-worker pods should be scheduled into different nodes.

How to reproduce it (as minimally and precisely as possible):

As described above.

Anything else we need to know?:

I check the code, and find the code may cause this bug

func affinityCheck(job *api.JobInfo, affinity [][]string) error {
if job == nil || affinity == nil {
return fmt.Errorf("empty input, job: %v, affinity: %v", job, affinity)
}
var taskNumber = len(job.Tasks)
var taskRef = make(map[string]bool, taskNumber)
for _, task := range job.Tasks {
tmpStrings := strings.Split(task.Name, "-")
if _, exist := taskRef[tmpStrings[len(tmpStrings)-2]]; !exist {
taskRef[tmpStrings[len(tmpStrings)-2]] = true
}
}
for _, aff := range affinity {
affTasks := make(map[string]bool, len(aff))
for _, task := range aff {
if len(task) == 0 {
continue
}
if _, exist := taskRef[task]; !exist {
return fmt.Errorf("task %s do not exist in job <%s/%s>", task, job.Namespace, job.Name)
}
if _, exist := affTasks[task]; exist {
return fmt.Errorf("task %s is duplicated in job <%s/%s>", task, job.Namespace, job.Name)
}
affTasks[task] = true
}
}
return nil
}

Environment:

  • Volcano Version: v1.7.0
  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"darwin/amd64"}
    Kustomize Version: v5.0.1
    Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:33:12Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: minikube using hyperkit driver on macOS(Intel)
  • Kernel (e.g. uname -a): Linux volcano-demo 5.10.57 #1 SMP Mon Apr 3 23:35:10 UTC 2023 x86_64 GNU/Linux
@loheagn loheagn added the kind/bug Categorizes issue or PR as related to a bug. label Jun 27, 2023
@loheagn
Copy link
Author

loheagn commented Jun 27, 2023

If this is a bug not a feature, I'd like to submit a PR to fix it.

@hwdef
Copy link
Member

hwdef commented Jun 27, 2023

Yes, this looks like a bug, you are welcome to fix it

@stale
Copy link

stale bot commented Oct 15, 2023

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2023
@Monokaix
Copy link
Member

Monokaix commented Feb 6, 2024

/remove-lifecycle-stale

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 6, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants