Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Make KFTO pytorch-job MultiNode training mnist test compatible for disconnected environment #297

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

abhijeet-dhumal
Copy link
Contributor

@abhijeet-dhumal abhijeet-dhumal commented Dec 30, 2024

Description

Depends on : #295
Once above PR is merged, this PR can be rebased with main origin
Updates made in this PR : Commit : 170cd07

How Has This Been Tested?

export NOTEBOOK_IMAGE=quay.io/modh/odh-generic-data-science-notebook@sha256:7c1a4ca213b71d342a2d1366171304e469da06d5f15710fab5dd3ce013aa1b73 \
CODEFLARE_TEST_RAY_IMAGE=quay.io/rhoai/ray@sha256:b0e129cd2f4cdea7ad7a7859031357ffd9915410551f94fbcb942af2198cdf78    # CUDA image \
FMS_HF_TUNING_IMAGE=quay.io/modh/fms-hf-tuning@sha256:6f98907f9095db72932caa54094438eae742145f4b66c28d15887d5303ff1186 \
AWS_DEFAULT_ENDPOINT=<cluster-url> \ 
AWS_ACCESS_KEY_ID=<access-key> \
AWS_SECRET_ACCESS_KEY=<secret-key> \ 
AWS_STORAGE_BUCKET=rhoai-dw \  #Storage-bucket-name
AWS_STORAGE_BUCKET_MNIST_DIR=mnist-datasets  #Datasets directory

For running test using AMD GPUs, use ROCM image : quay.io/modh/ray@sha256:db667df1bc437a7b0965e8031e905d3ab04b86390d764d120e05ea5a5c18d1b4

go test -v ./tests/kfto -run TestPyTorchJobMnistCpu
go test -v ./tests/kfto -run TestPyTorchJobMnistWithROCm
go test -v ./tests/kfto -run TestPyTorchJobMnistWithCuda

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link

openshift-ci bot commented Dec 30, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sutaakar for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abhijeet-dhumal abhijeet-dhumal force-pushed the update-multinode-mnist-test-for-disconnected branch from 170cd07 to f023590 Compare January 3, 2025 12:18
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review January 3, 2025 12:19
@abhijeet-dhumal
Copy link
Contributor Author

abhijeet-dhumal commented Jan 3, 2025

@@ -298,6 +304,53 @@ func createKFTOPyTorchMnistJob(test Test, namespace string, config corev1.Config
}
}

if storage_bucket_endpoint_exists && storage_bucket_access_key_id_exists && storage_bucket_secret_key_exists && storage_bucket_name_exists && storage_bucket_mnist_dir_exists {
tuningJob.Spec.PyTorchReplicaSpecs["Master"].Template.Spec.Containers[0].Env = []corev1.EnvVar{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach overrides existing env variables.
Can you rather append the AWS env variables to existing env variables?

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants