[Docker] Docker image as runtime fails on GCP #3934

Closed
Michaelvll opened this issue Sep 11, 2024 · 4 comments

@Michaelvll
Collaborator

A user reported that launching on GCP can fail even when the default PyTorch Docker image is used as the runtime. We should investigate.

E 08-15 12:23:17 subprocess_utils.py:84] /bin/sh: -c: line 0: syntax error near unexpected token `('
E 08-15 12:23:17 subprocess_utils.py:84] /bin/sh: -c: line 0: 
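
The root cause is not stated in this report. As a minimal sketch of how this class of error can arise in general, the Python example below passes a string containing parentheses to `/bin/sh -c`, first unquoted and then wrapped with `shlex.quote`. This is not SkyPilot code: `run_in_sh` and `payload` are hypothetical names chosen for illustration, and the exact wording of the shell's error message varies between shells.

```python
import shlex
import subprocess


def run_in_sh(command: str) -> subprocess.CompletedProcess:
    """Run a command string through /bin/sh -c and capture its output."""
    return subprocess.run(
        ["/bin/sh", "-c", command],
        capture_output=True,
        text=True,
        check=False,
    )


# Hypothetical value that ends up interpolated into a shell command.
payload = "hello (world)"

# Unquoted: the shell parses "(" as syntax, so the command is rejected
# before echo ever runs.
unquoted = run_in_sh(f"echo {payload}")

# Quoted: shlex.quote wraps the value so the shell treats it as one
# literal argument.
quoted = run_in_sh(f"echo {shlex.quote(payload)}")

print("unquoted return code:", unquoted.returncode)
print("unquoted stderr:", unquoted.stderr.strip())
print("quoted stdout:", quoted.stdout.strip())
```

If a failure of this kind is suspected, comparing the command string before and after quoting is usually a quick way to confirm it.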
@cblmemo
Collaborator

cblmemo commented Sep 11, 2024

Is there a reproducible image ID for this bug?

@Michaelvll
Collaborator Author

Can we try out the default PyTorch Docker image?

@cblmemo
Collaborator

cblmemo commented Sep 16, 2024

sky launch --cloud gcp --gpus T4 --image-id docker:pytorch/pytorch works fine for me on the latest master. Let me try an image with the default /bin/sh instead, as in the related issue.

$ sky launch --cloud gcp --gpus T4 --image-id docker:pytorch/pytorch
I 09-16 09:58:37 optimizer.py:719] == Optimizer ==
I 09-16 09:58:37 optimizer.py:730] Target: minimizing cost
I 09-16 09:58:37 optimizer.py:742] Estimated cost: $0.6 / hour
I 09-16 09:58:37 optimizer.py:742] 
I 09-16 09:58:37 optimizer.py:867] Considered resources (1 node):
I 09-16 09:58:37 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-16 09:58:37 optimizer.py:937]  CLOUD   INSTANCE       vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
I 09-16 09:58:37 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-16 09:58:37 optimizer.py:937]  GCP     n1-highmem-4   4       26        T4:1           us-central1-a   0.59          ✔     
I 09-16 09:58:37 optimizer.py:937] ---------------------------------------------------------------------------------------------
I 09-16 09:58:37 optimizer.py:937] 
Launching a new cluster 'sky-73cd-txia'. Proceed? [Y/n]: 
I 09-16 09:58:37 cloud_vm_ray_backend.py:4397] Creating a new cluster: 'sky-73cd-txia' [1x GCP(n1-highmem-4, {'T4': 1}, image_id={'us-central1': 'docker:pytorch/pytorch'})].
I 09-16 09:58:37 cloud_vm_ray_backend.py:4397] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 09-16 09:58:40 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /home/txia/sky_logs/sky-2024-09-16-09-58-35-756478/provision.log
I 09-16 09:58:43 provisioner.py:65] Launching on GCP us-central1 (us-central1-a)
I 09-16 10:01:45 provisioner.py:450] Successfully provisioned or found existing instance.
I 09-16 10:05:58 provisioner.py:552] Successfully provisioned cluster: sky-73cd-txia
I 09-16 10:05:58 cloud_vm_ray_backend.py:3406] Run commands not specified or empty.
I 09-16 10:05:58 cloud_vm_ray_backend.py:3450] 
I 09-16 10:05:58 cloud_vm_ray_backend.py:3450] Cluster name: sky-73cd-txia
I 09-16 10:05:58 cloud_vm_ray_backend.py:3450] To log into the head VM: ssh sky-73cd-txia
I 09-16 10:05:58 cloud_vm_ray_backend.py:3450] To submit a job:         sky exec sky-73cd-txia yaml_file
I 09-16 10:05:58 cloud_vm_ray_backend.py:3450] To stop the cluster:     sky stop sky-73cd-txia
I 09-16 10:05:58 cloud_vm_ray_backend.py:3450] To teardown the cluster: sky down sky-73cd-txia
Clusters
NAME                          LAUNCHED    RESOURCES                                                                  STATUS   AUTOSTOP  COMMAND                       
sky-73cd-txia                 < 1 sec     1x GCP(n1-highmem-4, {'T4': 1}, image_id={'us-central1': 'docker:pytor...  UP       -         sky launch --cloud gcp --...  
sky-344a-txia                 4 days ago  1x Azure(Standard_NV18ads_A10_v5, {'A10': 0.5})                            STOPPED  -         sky exec sky-344a-txia sl...  
sky-jobs-controller-4a0782e9  1 week ago  1x GCP(n2-standard-8, disk_size=50)                                        STOPPED  10m       sky jobs launch -n t-mana... 

@cblmemo
Collaborator

cblmemo commented Sep 16, 2024

This should be fixed by #3867. Closing now.

cblmemo closed this as completed Sep 16, 2024