SkyPilot v0.7.0: 3x faster, reservation support, observability, admin policies, new AI hardware, new UX, and more!
We are excited to announce the release of SkyPilot v0.7.0! This release brings significant performance improvements and many new features:
- Up to 3x faster provisioning
- Reservation support: AWS Capacity Reservations, AWS Capacity Blocks, GCP reservations, GCP Dynamic Workload Scheduler (DWS), and more
- Observability features
- Admin policy enforcement
- Support for H100 Mega, TPU v6, TPU v5, gVNIC, Azure Blob Storage, faster disks, and more
- New UX for the `sky` CLI
and many bug fixes and enhancements!
Release Highlights
Performance
We have made 2-3x performance improvements across cloud providers through optimizations in our provisioning stack and the images we use.
| Cloud | Provisioning Time | Speedup |
|---|---|---|
| AWS | 1 min 10s | 3x |
| GCP | 1 min 15s | 3x |
| Azure | 2 min 16s | 2x |
| Kubernetes | 52s | 2.5x |
Reservations
SkyPilot now supports short-term and long-term reservations across clouds:
- AWS Capacity Reservations
- AWS Capacity Blocks
- GCP reservations
- GCP Dynamic Workload Scheduler (DWS)
- Bring your own VMs or Kubernetes clusters
SkyPilot's failover includes these reservations, so they can be combined with spot instances or any other resources/clouds to create a resilient and cost-effective infrastructure.
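As a rough sketch of how reservations could be wired into failover, the config below lists one reservation per cloud. The reservation ID and project path are placeholders, and the `specific_reservations` and `prioritize_reservations` key names are assumptions based on the SkyPilot config reference, so verify them against the docs for your version.

```yaml
# ~/.sky/config.yaml (illustrative; every ID below is a placeholder)
aws:
  # Assumed semantics: try reserved capacity before other capacity.
  prioritize_reservations: true
  specific_reservations:
    - cr-0123456789abcdef0
gcp:
  prioritize_reservations: true
  specific_reservations:
    - projects/my-project/reservations/my-reservation
```

With a config along these lines, a task that also allows spot or other clouds would still fall back to them when the reservation has no free capacity.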
Observability on Kubernetes
SkyPilot now has two new observability features on Kubernetes:
- `sky status --kubernetes` shows all SkyPilot resources on the cluster. (#4040, #4079)

      $ sky status --cloud kubernetes
      Kubernetes cluster state (context: mycluster)
      SkyPilot clusters
      USER   NAME                           LAUNCHED    RESOURCES                                  STATUS
      alice  infer-svc-1                    23 hrs ago  1x Kubernetes(cpus=1, mem=1, {'L4': 1})    UP
      alice  sky-jobs-controller-80b50983   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
      alice  sky-serve-controller-80b50983  23 hrs ago  1x Kubernetes(cpus=4, mem=4)               UP
      bob    dev                            1 day ago   1x Kubernetes(cpus=2, mem=8, {'H100': 1})  UP
      bob    multinode-dev                  1 day ago   2x Kubernetes(cpus=2, mem=2)               UP
      bob    sky-jobs-controller-2ea485ea   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP

      Managed jobs
      In progress tasks: 1 STARTING
      USER   ID  TASK  NAME      RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
      alice  1   -     eval      1x[CPU:1+]  2 days ago  49s            8s            0            SUCCEEDED
      bob    4   -     pretrain  1x[H100:4]  1 day ago   1h 1m 11s      1h 14s        0            SUCCEEDED
      bob    3   -     bigjob    1x[CPU:16]  1 day ago   1d 21h 11m 4s  -             0            STARTING
      bob    2   -     failjob   1x[CPU:1+]  1 day ago   54s            9s            0            FAILED
      bob    1   -     shortjob  1x[CPU:1+]  2 days ago  1h 1m 19s      1h 16s        0            SUCCEEDED

- `sky show-gpus --cloud kubernetes` shows detailed GPU availability information on the cluster. (#3816, #4085)

      $ sky show-gpus --cloud kubernetes
      Kubernetes GPUs
      GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
      L4    1, 2, 4                   8           8
      H100  1, 2, 4, 8                16          16

      Kubernetes per node GPU availability
      NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
      my-cluster-0  L4        4           4
      my-cluster-1  L4        4           4
      my-cluster-2  H100      8           8
      my-cluster-3  H100      8           8
Admin policy enforcement
SkyPilot has a new admin policy mechanism (#3966) that admins can use to enforce policies on users’ SkyPilot usage. These policies apply custom validation and mutation logic to a user’s tasks and SkyPilot config.
Example policies:
- Add Labels for all Tasks on Kubernetes
- Always Disable Public IP for AWS Tasks
- Use Spot for all GPU Tasks
- Enforce Autostop for all Tasks
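A policy is implemented as a Python class and then enabled through the SkyPilot config. The fragment below is a minimal sketch of the enabling step only. It assumes the `admin_policy` config key, and `my_policies.EnforceAutostop` is a hypothetical import path standing in for a policy class that your admins package and install.

```yaml
# ~/.sky/config.yaml (illustrative)
# `my_policies.EnforceAutostop` is a hypothetical module path, not a class
# shipped with SkyPilot.
admin_policy: my_policies.EnforceAutostop
```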
Azure Blob Storage support
In addition to S3, GCS and R2, you can now use Azure Blob Storage as a storage backend for storing and accessing data. (#3032)
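For illustration, a task could mount an Azure Blob Storage container in the same way as the other storage backends. This sketch assumes that the store identifier is `azure` and that the usual `file_mounts` fields apply. The container name is a placeholder.

```yaml
# task.yaml (illustrative)
file_mounts:
  /data:
    name: my-sky-container  # placeholder container name
    store: azure            # assumed identifier for Azure Blob Storage
    mode: MOUNT
```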
New AI hardware support
- New accelerators: TPU v6 (#4115), TPU v5 (#3814), H100 Mega (#4099)
- Faster networking on GCP with gVNIC (#4095)
- Faster disks: new disk tier `ultra` (#3860) for GCP and AWS.
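A minimal sketch of requesting the new disk tier in a task follows. The `disk_tier: ultra` setting comes from these release notes. The cloud and disk size are illustrative choices.

```yaml
# task.yaml (illustrative)
resources:
  cloud: gcp        # `ultra` is available on GCP and AWS
  disk_size: 512    # GB; illustrative value
  disk_tier: ultra
```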
UX revamp
The SkyPilot CLI is now cleaner, simpler, and easier to parse. (#4023)
New LLM Recipes
- Llama 3.1 and Llama 3.2 recipes (#3990, #3779, #3780)
- llm.c training for GPT-2 (#3611)
- Pixtral (#3938, #3940)
- Qwen2-VL and Qwen 2.5 support (#3961, #3959)
- Yi model family support (#3958)
- NeMo GPT (#3743)
- Other examples: Airflow (#3982), AWS Neuron Accelerator (#4020), and DeepSpeed with k8s support (#4124)
Deprecation Notice
- All `SKY_*` environment variables are deprecated in favor of `SKYPILOT_*` variables.
  - All `SKY_*` variables will be removed in v0.9.0.
  - See docs for list of currently supported variables.
Backend
New Features
- Managed jobs can now recover from job-level failures (e.g., GPU errors or non-zero exit codes). (#3919)
  - Set `max_restarts_on_errors` to specify the number of times SkyPilot should try to restart the job.

        resources:
          job_recovery:
            max_restarts_on_errors: 3  # Retry 3 times before marking the job as failed

- Nvidia GPUs can now disable ECC (#3676)
- New environment variable `SKYPILOT_NUM_NODES` to fetch the number of nodes in the cluster. (#3656)
- SkyPilot config can now be overridden in the task definition with `experimental.config_override` (#3689)

      experimental:
        config_override:
          docker:
            run_options: ...
          kubernetes:
            pod_config: ...
            provision_timeout: ...
          gcp:
            managed_instance_group: ...
          nvidia_gpus:
            disable_ecc: ...
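As a small illustration of the new `SKYPILOT_NUM_NODES` variable, the task sketch below prints the node count from each node. The two-node setup and the message are illustrative.

```yaml
# task.yaml (illustrative)
num_nodes: 2

run: |
  # SKYPILOT_NUM_NODES is provided by SkyPilot in the job environment.
  echo "This cluster has ${SKYPILOT_NUM_NODES} node(s)"
```

A launcher for distributed training could pass the same variable as its node count instead of hard-coding it in the script.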
Enhancements
- `AddKeysToAgent` is now set for SSH keys in the SSH config file and in SSH commands (#3985)
- SkyPilot runtime is now installed in a separate conda environment, reducing interference with user's environment. (#3639)
- `docker.run_options` now allows users to pass additional options when running docker containers. (#3682)
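A sketch of what `docker.run_options` might look like in the SkyPilot config follows. The specific flags are ordinary `docker run` options chosen for illustration, and the list form is an assumption to check against the config reference.

```yaml
# ~/.sky/config.yaml (illustrative)
docker:
  run_options:
    # Each entry is passed through as an extra `docker run` option.
    - -v /var/run/docker.sock:/var/run/docker.sock
    - --shm-size=2g
```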
Fixes
- Fix `sky cancel` not terminating all child processes (#3919)
- Fix provisioning failures when multiple versions of SkyPilot are installed (#3866)
- Shell autocomplete installation is now more robust (#3892, #3893)
Kubernetes
New Features
- Observability improvements:
- SkyPilot now helps you set up your clusters for running SkyPilot jobs.
  - If you already have a list of IPs and their SSH keys, `sky local up` can now automatically set them up as a cluster to be used for running jobs. (#3926)
  - If you don't have a cluster yet, we provide a simple one-click setup script to deploy VMs with Kubernetes on the cloud of your choice. (#3929)
- SkyPilot job output is now piped to the container logs (#3758)
  - Use your existing logging tooling (`kubectl logs`, filebeat, etc.) to view SkyPilot job outputs.
- Support for Nvidia GPU operator labels (`nvidia.com/gpu.product`) for detecting GPU types. (#3493)
  - You no longer need to label GPUs if you have the Nvidia GPU operator installed.
- Spot instances are now supported on GKE clusters (#3675)
- [Experimental] Multi-context support (#3913, #3968, #3897, #3772, #4013)
Performance improvements:
- New command runner: 3x faster command submission for Kubernetes pods. (#3157)
- `sky local up` for GPUs is now ~5x faster, provisioning in 2 min 30s instead of 12 min (#3664)
- Our GPU images are now 3x smaller (1.5 GB), reducing the time to pull the image (#3665)
- SSH jump pod is no longer required for `port-forward` mode (#3657)
- SSH setup is now parallelized to speed up multi-node provisioning (#4158)
Enhancements and fixes
- H100 Mega support on GKE (#3891, #3627)
- Better handling for context names with special characters (#4147)
- `--k8s` is now a valid alias for `--cloud kubernetes` (#4151)
- Init containers are now supported on Kubernetes (#3762)
- Auth: robust service account support and updated docs on minimal permissions (#3632)
- Custom metadata annotations are now propagated to services, allowing configuration of internal load balancer services on cloud hosted Kubernetes clusters (#3767)
- Provisioning errors are now surfaced clearly (#3590, #3795, #3821)
- Cluster attributes (autodown, idle-minutes-to-autostop) are now added as annotations to the pod (#3870)
- SkyServe controller is now automatically terminated when all replicas are terminated. (#3984)
- Create namespace permission is no longer required in cluster launch flow (#3714)
- If your cluster does not support `apparmor`, SkyPilot will now retry without requesting it. (#4176)
Cloud: GCP
New Features
- New accelerators supported:
- Dynamic Workload Scheduler (DWS) support (#3574, #3835)
  - DWS helps get better availability on GCE through queuing and reservations.
- Faster `pd-extreme` disks with `disk_tier: ultra` (#3860)
- New config `gcp.force_enable_external_ips` to force enable external IPs (#3699)
  - This is useful when communication within a VPC is desired and the VM needs to make calls to the public internet.
- TPU VMs can now run docker containers (#4115)
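The GCP options above live in the SkyPilot config. The fragment below is a sketch, not a reference: `gcp.force_enable_external_ips` and `gcp.managed_instance_group` appear in these notes, while the `run_duration` and `provision_timeout` sub-keys and their values are assumptions about how DWS is configured and should be checked against the GCP section of the config docs.

```yaml
# ~/.sky/config.yaml (illustrative)
gcp:
  # Give VMs external IPs even when a VPC is used for internal traffic.
  force_enable_external_ips: true
  # Assumed shape of the DWS settings; key names and values are not confirmed here.
  managed_instance_group:
    run_duration: 3600      # seconds the instances are expected to run
    provision_timeout: 900  # seconds to wait in the queue for capacity
```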
Enhancements
- Provisioning is now 3x faster on GCP (#4027)
- Faster networking support with gVNIC (#4095)
  - Up to ~2x faster in PyTorch distributed benchmarks
Cloud: AWS
New Features
- Capacity blocks and capacity reservations are now supported. (#3852, #3853)
  - You don’t have to wake up at 4:30am PDT to launch your job on a newly available capacity block: SkyPilot will wait for you until the start time of the capacity block.
- Faster `io2` disks with `disk_tier: ultra` (#3860)
- Security groups: you can now specify security groups for your resources at a finer granularity. (#3501)
- SkyPilot can now use encrypted EBS volumes (#3765)
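For the security group and EBS encryption features, a config along the following lines may apply. This is a sketch under assumptions: the `security_group_name` pattern list and the `disk_encrypted` key are recalled from the AWS section of the config reference rather than from these notes, and all security group names and cluster name patterns are placeholders.

```yaml
# ~/.sky/config.yaml (illustrative; key names are assumptions)
aws:
  # Request encrypted EBS volumes for launched instances.
  disk_encrypted: true
  # Map cluster name patterns to security groups.
  security_group_name:
    - my-cluster-*: my-sg-1
    - "*": my-default-sg
```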
Enhancements
- Performance: provisioning is now 3x faster on AWS (#4091)
- Buckets created by SkyPilot are now tagged with labels specified in `~/.sky/config.yaml` (#3922)
- Label validation now handles `:` and other special characters. (#3734)
Cloud: Azure
New Features
- You can now use any Azure community image with `--image-id` (#4145)
- Azure Blob Storage is now supported (#3032, #3796, #3807)
- Fractional A10 instance types are now supported (#3877)
- You can now specify a resource group for Azure instance provisioning (#3764)
- Faster `Premium_LRS` disks with `disk_tier: high` (#3921)
Enhancements
- Performance: provisioning is now 2x faster on Azure with our new provisioner and custom images (#3697, #3704, #3696, #3700, #4139, #4167, #4205)
- Improved support for A10 GPUs (#3707)
- SkyPilot now waits for an Azure resource group to be deleted instead of erroring out (#3712)
SkyServe
- Readiness probe timeout can now be set in the service spec (#3472)
- You can now tear down a specific replica with `sky serve down --replica-id` (#4032)
- SkyServe controller region is now chosen from the replica resources (#4053)
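As an illustration of the readiness probe timeout, a service spec might look like the sketch below. The `timeout_seconds` key name and all values are assumptions. Consult the SkyServe service spec reference for the exact field.

```yaml
# service.yaml (illustrative fragment)
service:
  readiness_probe:
    path: /
    initial_delay_seconds: 60
    timeout_seconds: 15  # assumed key; per-probe timeout in seconds
  replicas: 2
```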
Storage
- Azure Blob Storage is now supported. (#3032, #3796, #3807)
- `.skyignore` support (#4038)
  - You can now add files to a `.skyignore` file to skip uploading them to cloud storage.
- GCSFuse is updated to 2.2.0, bringing better performance and reliability. (#3619)
Other clouds
- Lambda Cloud support has been migrated to our new and more reliable provisioner (#3865, #3889)
- Lambda Cloud now supports docker images (#4115)
- CUDO now supports opening ports (#3717)
- RunPod now supports opening ports (#3748) and custom docker images (#3728).
- FluidStack provisioning has been updated to their new API (#3799)
- Paperspace now supports A4000 and P4000 GPUs (#3991)
- OCI: bug fixes and improvements (#4074, #4080)
Thanks to all contributors!
New contributors: @winglian, @Ultramann, @jucor, @BitPhinix, @sethkimmel3, @hyoxt121, @BabyChouSr, @wizenheimer, @gurcangercek, @shashank2000, @ckgresla, @bernardwin, @kmushegi, @Conless, @JayThomason, @colinjc, @mtaran, @Haijian06, @KrishivPiduri, @zpoint
Many thanks to everyone who contributed to this release!
Contributors: @Michaelvll, @romilbhardwaj, @cblmemo, @landscapepainter, @asaiacai, @andylizf, @yika, @concretevitamin, @colinjc, @fozziethebeat, @MaoZiming, @JGSweets, @Ultramann, @Conless, @jucor, @wizenheimer, @Haijian06, @HysunHe, @gurcangercek, @bernardwin, @JungleCatSW, @BabyChouSr, @hyoxt121, @winglian, @sethkimmel3, @mjibril, @shashank2000, @ckgresla, @zpoint, @mtaran, @KrishivPiduri, @JayThomason, @BitPhinix, @kmushegi
Full Changelog: v0.6.0...v0.7.0