fix(eks): pods become CrashLoopBackOff when using INFERENTIA or TRAINIUM instance type #29651

Closed
wants to merge 4 commits into from

Conversation

wafuwafu13
Contributor

@wafuwafu13 wafuwafu13 commented Mar 29, 2024

Issue # (if applicable)

#29262

Reason for this change

When an INFERENTIA or TRAINIUM instance type is used, https://github.com/aws/aws-cdk/blob/main/packages/aws-cdk-lib/aws-eks/lib/addons/neuron-device-plugin.yaml is applied to the cluster, but the Pods go into CrashLoopBackOff (detailed log: #29262 (comment)).

The upstream source of the current YAML, https://github.com/aws-neuron/aws-neuron-sdk/blob/master/docs/neuron-container-tools/k8s-neuron-device-plugin.yml, now returns "File not found".

# source: https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-container-tools/k8s-neuron-device-plugin.yml

Description of changes

Description of how you validated changes

  • Pass unit tests
  • Pass integ tests

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added p2 valued-contributor [Pilot] contributed between 6-12 PRs to the CDK labels Mar 29, 2024
@aws-cdk-automation aws-cdk-automation requested a review from a team March 29, 2024 10:24
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment

The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed add Clarification Request to a comment.

@wafuwafu13 wafuwafu13 changed the title from "fix(aws-eks): Pods become CrashLoopBackOff when using INFERENTIA or TRAINIUM instance type" to "fix(eks): pods become CrashLoopBackOff when using INFERENTIA or TRAINIUM instance type" Mar 29, 2024
private addNeuronDevicePluginRbac() {
  if (!this._neuronDevicePluginRbacClusterRole) {
    const clusterRoleFileContents = fs.readFileSync(path.join(__dirname, 'addons', 'neuron-device-plugin-rbac-cluster-role.yaml'), 'utf8');
    const sanitizedClusterRole = YAML.parse(clusterRoleFileContents);
Contributor Author

@wafuwafu13 wafuwafu13 Mar 29, 2024

If I used parseAllDocuments, I would not need to split k8s-neuron-device-plugin-rbac.yml into three files. However, the return type of parseAllDocuments differs from the return type of parse, so the addManifest function cannot handle the parsed YAML.
I think splitting k8s-neuron-device-plugin-rbac.yml into three files and using parse is the simplest solution.
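
For comparison, the gap between the two return types could in principle be bridged with a small conversion step. The sketch below is not part of this PR. It assumes that each document returned by parseAllDocuments exposes an `errors` array and a `toJSON()` method, as the `yaml` package's Document objects do. The names `ParsedYamlDocument`, `Manifest`, and `documentsToManifests` are hypothetical and chosen only for illustration.

```typescript
// Structural stand-in for the Document objects returned by
// YAML.parseAllDocuments; only the members used below are declared.
interface ParsedYamlDocument {
  errors: unknown[];
  toJSON(): unknown;
}

// Plain-object shape that a manifest API such as addManifest accepts.
type Manifest = Record<string, unknown>;

// Hypothetical helper (an assumption, not code from this PR): turns the
// parsed documents of a multi-document YAML file into plain objects.
function documentsToManifests(docs: ParsedYamlDocument[]): Manifest[] {
  const manifests: Manifest[] = [];
  for (const doc of docs) {
    // Fail early if any document in the file did not parse cleanly.
    if (doc.errors.length > 0) {
      throw new Error(`YAML document failed to parse: ${String(doc.errors[0])}`);
    }
    const value = doc.toJSON();
    // Keep only mapping documents; skip empty ones (for example the
    // document produced by a trailing '---' separator).
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      manifests.push(value as Manifest);
    }
  }
  return manifests;
}
```

With the `yaml` package, the call would look roughly like `documentsToManifests(YAML.parseAllDocuments(fileContents))`, and the resulting array could be spread into addManifest. Whether that is preferable to three separate files read with parse is a judgment call; the three-file approach keeps the existing code path unchanged.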

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 91507a4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@wafuwafu13
Contributor Author

Exemption Request: I updated integ.eks-inference-nodegroup and integ.eks-inference

@aws-cdk-automation aws-cdk-automation added pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. labels Mar 29, 2024
This was referenced Apr 1, 2024
@aws-cdk-automation
Collaborator

This PR has been in the CHANGES REQUESTED state for 3 weeks, and looks abandoned. To keep this PR from being closed, please continue work on it. If not, it will automatically be closed in a week.

@shikha372 shikha372 self-assigned this Apr 22, 2024
@aws-cdk-automation
Collaborator

This PR has been deemed to be abandoned, and will be automatically closed. Please create a new PR for these changes if you think this decision has been made in error.

@aws-cdk-automation aws-cdk-automation added the closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. label Apr 27, 2024
@aws-cdk-automation aws-cdk-automation removed the pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. label Apr 27, 2024
@aws-cdk-automation
Collaborator

The pull request linter fails with the following errors:

❌ Fixes must contain a change to an integration test file and the resulting snapshot.

PRs must pass status checks before we can provide a meaningful review.

If you would like to request an exemption from the status checks or clarification on feedback, please leave a comment on this PR containing Exemption Request and/or Clarification Request.

✅ An exemption request has been requested. Please wait for a maintainer's review.

Labels
closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. p2 pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. valued-contributor [Pilot] contributed between 6-12 PRs to the CDK
3 participants