fix: kubelet 1.20 device checkpoint support #62

Merged (1 commit) on Jan 29, 2021


aisensiy (Contributor) commented Jan 27, 2021

The format of the checkpoint file /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint changed in kubelet 1.20 (https://github.com/kubernetes/kubernetes/blob/d72c056260e771e8cd01220203582de0c0015786/pkg/kubelet/cm/devicemanager/checkpoint/checkpoint.go#L33-L43), which makes the current code crash when it parses the file.

The file format changes from the original:

{
  "Data": {
    "PodDeviceEntries": [
      {
        "PodUID": "...",
        "ContainerName": "...",
        "ResourceName": "...",
        "DeviceIDs": [ # <-------- KEY PART
          "GPU-xxxx"
        ],
        "AllocResp": "..."
      }
    ],
    "RegisteredDevices": {
      "nvidia.com/gpu": [
        ...
      ]
    }
  }
}

To:

{
  "Data": {
    "PodDeviceEntries": [
      {
        "PodUID": "...",
        "ContainerName": "...",
        "ResourceName": "...",
        "DeviceIDs": { # <-------- KEY PART
          "0": [
            "GPU-xxxx"
          ]
        },
        "AllocResp": "..."
      }
    ],
    "RegisteredDevices": {
      "nvidia.com/gpu": [
        ...
      ]
    }
  }
}
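
As an illustration of the difference between the two layouts above, here is a minimal Go sketch of a decoder that accepts either shape of "DeviceIDs". It is not the code from this PR: the names deviceIDs, podDevicesEntry, checkpoint, and parseCheckpoint are hypothetical, only the fields shown in the samples are modeled, and the new-format keys (such as "0") are assumed to be integer NUMA node IDs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// deviceIDs is a hypothetical helper type that flattens both layouts
// of the "DeviceIDs" field into a plain list of device IDs.
type deviceIDs []string

// UnmarshalJSON accepts either a JSON array (old layout) or a JSON
// object mapping a node ID to an array (kubelet 1.20 layout).
func (d *deviceIDs) UnmarshalJSON(data []byte) error {
	// Old layout: ["GPU-xxxx", ...]
	var flat []string
	if err := json.Unmarshal(data, &flat); err == nil {
		*d = flat
		return nil
	}

	// New layout: {"0": ["GPU-xxxx", ...], ...}
	// Assumption: the keys are integer NUMA node IDs.
	var perNode map[int64][]string
	if err := json.Unmarshal(data, &perNode); err != nil {
		return fmt.Errorf("unrecognized DeviceIDs layout: %w", err)
	}

	// Visit nodes in ascending order so the result is deterministic.
	nodes := make([]int64, 0, len(perNode))
	for n := range perNode {
		nodes = append(nodes, n)
	}
	sort.Slice(nodes, func(i, j int) bool { return nodes[i] < nodes[j] })

	all := []string{}
	for _, n := range nodes {
		all = append(all, perNode[n]...)
	}
	*d = all
	return nil
}

// podDevicesEntry models only the fields this sketch needs; other
// fields in the file (for example AllocResp) are ignored on decode.
type podDevicesEntry struct {
	PodUID        string
	ContainerName string
	ResourceName  string
	DeviceIDs     deviceIDs
}

// checkpoint mirrors the top-level shape shown in the samples.
type checkpoint struct {
	Data struct {
		PodDeviceEntries  []podDevicesEntry
		RegisteredDevices map[string][]string
	}
}

// parseCheckpoint decodes a checkpoint in either layout.
func parseCheckpoint(raw []byte) (*checkpoint, error) {
	cp := &checkpoint{}
	if err := json.Unmarshal(raw, cp); err != nil {
		return nil, err
	}
	return cp, nil
}

func main() {
	newFormat := []byte(`{"Data":{"PodDeviceEntries":[{"PodUID":"pod-1",
		"ContainerName":"c","ResourceName":"nvidia.com/gpu",
		"DeviceIDs":{"0":["GPU-xxxx"]}}]}}`)

	cp, err := parseCheckpoint(newFormat)
	if err != nil {
		panic(err)
	}
	fmt.Println(cp.Data.PodDeviceEntries[0].DeviceIDs)
}
```

Trying the array form first keeps files written by older kubelets readable, and the map form is only attempted when that fails. Flattening discards the per-node grouping, which is acceptable only if the caller needs just the device IDs.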

mYmNeo (Contributor) commented Jan 27, 2021

Thank you for your commit, but you are misunderstanding the checkpoint file of gpu-manager. gpu-manager uses this file to recover the deviceID; it doesn't care about other fields.

aisensiy (Contributor, Author) commented

> Thank you for your commit, but you are misunderstanding the checkpoint file of gpu-manager. gpu-manager uses this file to recover the deviceID; it doesn't care about other fields.

I am not trying to change the structure of /etc/gpu-manager/checkpoint/gpumanager_internal_checkpoint. I am trying to fix the parsing of /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint.

mYmNeo (Contributor) commented Jan 29, 2021

> > Thank you for your commit, but you are misunderstanding the checkpoint file of gpu-manager. gpu-manager uses this file to recover the deviceID; it doesn't care about other fields.
>
> I am not trying to change the structure of /etc/gpu-manager/checkpoint/gpumanager_internal_checkpoint. I am trying to fix the parsing of /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint.

I get your point.

mYmNeo merged commit c63bbfd into tkestack:master on Jan 29, 2021
mYmNeo pushed a commit that referenced this pull request May 10, 2021