fix(eks): pods become CrashLoopBackOff when using INFERENTIA or TRAINIUM instance type #29651
The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.
A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed, add Clarification Request to a comment.
private addNeuronDevicePluginRbac() {
  if (!this._neuronDevicePluginRbacClusterRole) {
    const clusterRoleFileContents = fs.readFileSync(path.join(__dirname, 'addons', 'neuron-device-plugin-rbac-cluster-role.yaml'), 'utf8');
    const sanitizedClusterRole = YAML.parse(clusterRoleFileContents);
If I use parseAllDocuments, I don't need to split k8s-neuron-device-plugin-rbac.yml into three files, but the return type of parseAllDocuments differs from the return type of parse, so the addManifest function cannot handle the parsed YAML.
I think splitting k8s-neuron-device-plugin-rbac.yml into three files and using parse is the simplest solution.
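To illustrate the trade-off described above, here is a minimal sketch of splitting a multi-document YAML stream into single-document strings, so that each piece could be handed to parse on its own. The helper name splitYamlDocuments is hypothetical and not part of aws-cdk or the yaml package, and it assumes documents are separated by lines consisting only of `---`; it is not the approach taken in this PR, which splits the file by hand.

```typescript
// Hypothetical helper (an assumption for illustration, not part of aws-cdk
// or the `yaml` package). Splits a multi-document YAML stream into one
// string per document.
// Assumption: documents are separated by lines that contain only `---`.
function splitYamlDocuments(source: string): string[] {
  const documents: string[] = [];
  let current: string[] = [];

  // Push the lines collected so far as one document, skipping empty ones.
  const flush = (): void => {
    if (current.some((line) => line.trim() !== '')) {
      const text = current
        .join('\n')
        .replace(/^\s*\n/, '') // drop leading blank lines, keep indentation
        .trimEnd();
      documents.push(text);
    }
    current = [];
  };

  for (const line of source.split(/\r?\n/)) {
    if (/^---\s*$/.test(line)) {
      // A separator line closes the current document.
      flush();
    } else {
      current.push(line);
    }
  }
  flush();
  return documents;
}

// Usage sketch: three small documents in one stream.
const rbacStream = [
  'kind: ClusterRole',
  '---',
  'kind: ServiceAccount',
  '---',
  'kind: ClusterRoleBinding',
  '',
].join('\n');
const rbacDocuments = splitYamlDocuments(rbacStream);
console.log(rbacDocuments.length);
```

Each returned string holds a single document, so it could in principle be passed to parse and the result to addManifest. A splitter this naive would need hardening before real use, which is one reason keeping three separate files may indeed be simpler.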
Exemption Request: I updated
This PR has been in the CHANGES REQUESTED state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.
This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.
The pull request linter fails with the following errors:
PRs must pass status checks before we can provide a meaningful review. If you would like to request an exemption from the status checks or clarification on feedback, please leave a comment on this PR containing Exemption Request and/or Clarification Request.
✅ An exemption request has been requested. Please wait for a maintainer's review.
Issue # (if applicable)
#29262
Reason for this change
When we use an INFERENTIA or TRAINIUM instance type, https://github.com/aws/aws-cdk/blob/main/packages/aws-cdk-lib/aws-eks/lib/addons/neuron-device-plugin.yaml is applied to the cluster, but the Pod goes into CrashLoopBackOff (detailed log: #29262 (comment)).
The source of the current YAML, https://github.com/aws-neuron/aws-neuron-sdk/blob/master/docs/neuron-container-tools/k8s-neuron-device-plugin.yml, now returns "File not found".
aws-cdk/packages/aws-cdk-lib/aws-eks/lib/addons/neuron-device-plugin.yaml
Line 1 in dffedca
Description of changes
Download k8s-neuron-device-plugin.yml and k8s-neuron-device-plugin-rbac.yml from https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-setup.html and copy their contents into the repository
Add a function to apply the YAML files for RBAC
Add unit tests
Update integ.eks-inference-nodegroup and integ.eks-inference
Description of how you validated changes
Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license