SkyPilot v0.7.0

@romilbhardwaj released this 02 Nov 01:29
3f62588

SkyPilot v0.7.0: 3x faster, reservation support, observability, admin policies, new AI hardware, new UX, and more!

We are excited to announce the release of SkyPilot v0.7.0! This release brings significant performance improvements and many new features:

  • Up to 3x faster provisioning
  • Reservation support: AWS Capacity Reservations, AWS Capacity Blocks, GCP reservations, GCP Dynamic Workload Scheduler (DWS), and more
  • Observability features
  • Admin policy enforcement
  • Support for H100 Mega, TPU v6, TPU v5, gVNIC, Azure Blob Storage, faster disks, and more
  • New UX for sky CLI

and many bug fixes and enhancements!

Release Highlights

Performance

We have made 2-3x performance improvements across cloud providers through optimizations in our provisioning stack and the images we use.

Cloud       Provisioning Time  Speedup
AWS         1 min 10s          3x
GCP         1 min 15s          3x
Azure       2 min 16s          2x
Kubernetes  52s                2.5x

Reservations

SkyPilot now supports short-term and long-term reservations across clouds:

  • AWS Capacity Reservations
  • AWS Capacity Blocks
  • GCP reservations
  • GCP Dynamic Workload Scheduler (DWS)
  • Bring your own VMs or Kubernetes clusters

SkyPilot's failover includes these reservations, so they can be combined with spot instances or any other resources/clouds to create a resilient and cost-effective infrastructure.
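
The failover idea above can be sketched as a loop over an ordered list of candidates. This is a minimal illustration, not SkyPilot's implementation: the candidate names, their ordering, and the `try_provision` callback are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def provision_with_failover(
    candidates: Sequence[str],
    try_provision: Callable[[str], bool],
) -> Optional[str]:
    """Try each candidate in order and return the first one that provisions.

    Returns None if every candidate fails.
    """
    for candidate in candidates:
        if try_provision(candidate):
            return candidate
    return None


# Hypothetical ordering: a reservation first, then spot, then on-demand.
candidates = ["aws-capacity-block", "gcp-spot", "azure-on-demand"]
available = {"gcp-spot", "azure-on-demand"}  # pretend the reservation is full
chosen = provision_with_failover(candidates, lambda c: c in available)
print(chosen)
```

Because reservations are just additional entries in the candidate list, a full reservation falls through to the next option instead of failing the launch.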

Observability on Kubernetes

SkyPilot now has two new observability features on Kubernetes:

  • sky status --kubernetes shows all SkyPilot resources on the cluster. (#4040, #4079)
    $ sky status --cloud kubernetes
    Kubernetes cluster state (context: mycluster)
    SkyPilot clusters
    USER     NAME                           LAUNCHED    RESOURCES                                  STATUS
    alice    infer-svc-1                    23 hrs ago  1x Kubernetes(cpus=1, mem=1, {'L4': 1})    UP
    alice    sky-jobs-controller-80b50983   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
    alice    sky-serve-controller-80b50983  23 hrs ago  1x Kubernetes(cpus=4, mem=4)               UP
    bob      dev                            1 day ago   1x Kubernetes(cpus=2, mem=8, {'H100': 1})  UP
    bob      multinode-dev                  1 day ago   2x Kubernetes(cpus=2, mem=2)               UP
    bob      sky-jobs-controller-2ea485ea   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
    
    Managed jobs
    In progress tasks: 1 STARTING
    USER     ID  TASK  NAME      RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
    alice    1   -     eval      1x[CPU:1+]  2 days ago  49s            8s            0            SUCCEEDED
    bob      4   -     pretrain  1x[H100:4]  1 day ago   1h 1m 11s      1h 14s        0            SUCCEEDED
    bob      3   -     bigjob    1x[CPU:16]  1 day ago   1d 21h 11m 4s  -             0            STARTING
    bob      2   -     failjob   1x[CPU:1+]  1 day ago   54s            9s            0            FAILED
    bob      1   -     shortjob  1x[CPU:1+]  2 days ago  1h 1m 19s      1h 16s        0            SUCCEEDED
    
  • sky show-gpus --cloud kubernetes shows detailed GPU availability information on the cluster. (#3816, #4085)
    $ sky show-gpus --cloud kubernetes
    Kubernetes GPUs
    GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
    L4    1, 2, 4                   8           8
    H100  1, 2, 4, 8                16          16
    
    Kubernetes per node GPU availability
    NODE_NAME                  GPU_NAME  TOTAL_GPUS  FREE_GPUS
    my-cluster-0               L4        4           4
    my-cluster-1               L4        4           4
    my-cluster-2               H100      8           8
    my-cluster-3               H100      8           8
    

Admin policy enforcement

SkyPilot has a new admin policy mechanism (#3966) that admins can use to enforce policies on users’ SkyPilot usage. These policies apply custom validation and mutation logic to a user’s tasks and SkyPilot config.
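
To illustrate the validate-and-mutate pattern described above, here is a minimal, self-contained sketch. It does not use SkyPilot's real interface, which is described in the admin policy documentation; `UserRequest`, `PolicyViolation`, `apply_policy`, the `labels` config key, and the 1024 GB disk cap are all hypothetical stand-ins.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class UserRequest:
    """Hypothetical stand-in for a user's task and SkyPilot config."""
    task: Dict[str, Any]
    config: Dict[str, Any] = field(default_factory=dict)


class PolicyViolation(Exception):
    """Raised when a request fails the admin's validation rules."""


def apply_policy(request: UserRequest) -> UserRequest:
    """Validate a request, then return a mutated copy of it."""
    # Work on a copy so the caller's request is left untouched.
    mutated = copy.deepcopy(request)
    resources = mutated.task.setdefault("resources", {})

    # Validation: reject oversized disks (the 1024 GB cap is made up).
    if resources.get("disk_size", 0) > 1024:
        raise PolicyViolation("disk_size must be at most 1024 GB")

    # Mutation: force spot instances whenever accelerators are requested.
    if "accelerators" in resources:
        resources["use_spot"] = True

    # Mutation: stamp the config with a label identifying the policy.
    mutated.config.setdefault("labels", {})["managed-by"] = "admin-policy"
    return mutated
```

A request that passes validation comes back with the admin's changes applied, while a request that violates a rule is rejected before anything is launched.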

Example policies:

Azure Blob Storage support

In addition to S3, GCS and R2, you can now use Azure Blob Storage as a storage backend for storing and accessing data. (#3032)

New AI hardware support

  • New accelerators: TPU v6 (#4115), TPU v5 (#3814), H100 Mega (#4099)
  • Faster networking on GCP with gVNIC (#4095)
  • Faster disks: new disk tier ultra (#3860) for GCP and AWS.

UX revamp

The SkyPilot CLI is now cleaner, simpler, and easier to parse. (#4023)

New LLM Recipes

Deprecation Notice

  • All SKY_* environment variables are deprecated in favor of SKYPILOT_* variables.
    • All SKY_* variables will be removed in v0.9.0.
    • See docs for list of currently supported variables.
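
While both prefixes are still accepted, scripts that read these variables can prefer the new name and fall back to the old one. The helper below is a sketch under that assumption; `get_skypilot_var` is a hypothetical name, and it assumes the old and new variables differ only in their prefix.

```python
import os
import warnings
from typing import Mapping, Optional


def get_skypilot_var(
    suffix: str, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Read SKYPILOT_<suffix>, falling back to the deprecated SKY_<suffix>."""
    env = os.environ if env is None else env
    new_name = f"SKYPILOT_{suffix}"
    old_name = f"SKY_{suffix}"
    if new_name in env:
        return env[new_name]
    if old_name in env:
        # Warn so the fallback gets noticed before the old names are removed.
        warnings.warn(
            f"{old_name} is deprecated; use {new_name} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[old_name]
    return None
```

For example, `get_skypilot_var("NUM_NODES")` reads `SKYPILOT_NUM_NODES` from the process environment and returns `None` if neither form is set.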

Backend

New Features

  • Managed jobs can now recover from job-level failures (e.g., GPU errors, non-zero exit codes, etc.) (#3919)
    • Set max_restarts_on_errors to specify the number of times SkyPilot should try to restart the job.
    resources:
      job_recovery:
        max_restarts_on_errors: 3  # Retry 3 times before marking the job as failed
    
  • Nvidia GPUs can now disable ECC (#3676)
  • New environment variable SKYPILOT_NUM_NODES to fetch the number of nodes in the cluster. (#3656)
  • SkyPilot config can now be overridden in the task definition with experimental.config_override (#3689)
    experimental:
      config_override:
        docker:
          run_options: ...
        kubernetes:
          pod_config: ...
          provision_timeout: ...
        gcp:
          managed_instance_group: ...
        nvidia_gpus:
          disable_ecc: ...
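
The restart behavior of max_restarts_on_errors described above can be pictured as a bounded retry loop. This is a simplified sketch of the semantics, not SkyPilot's recovery code: `run_with_restarts` and the `run_job` callback are hypothetical, and a non-zero exit code stands in for any job-level failure.

```python
from typing import Callable


def run_with_restarts(run_job: Callable[[], int], max_restarts_on_errors: int) -> str:
    """Run a job, restarting on non-zero exit codes up to the given limit."""
    restarts = 0
    while True:
        exit_code = run_job()
        if exit_code == 0:
            return "SUCCEEDED"
        if restarts >= max_restarts_on_errors:
            # All allowed restarts are used up, so the job is marked failed.
            return "FAILED"
        restarts += 1


# A job that fails twice and then succeeds stays within a limit of 3 restarts.
exit_codes = iter([1, 1, 0])
status = run_with_restarts(lambda: next(exit_codes), max_restarts_on_errors=3)
print(status)
```

With a limit of 3, the job runs at most four times in total: the initial attempt plus three restarts.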
    

Enhancements

  • SSH keys: AddKeysToAgent is now set in the ssh config file and in ssh commands. (#3985)
  • SkyPilot runtime is now installed in a separate conda environment, reducing interference with the user's environment. (#3639)
    • Similarly, the environment pre-configured in your docker image is no longer shadowed by SkyPilot's runtime environment (#3874, #3867)
  • docker.run_options now allows users to pass additional options when running docker containers. (#3682)

Fixes

  • Fix sky cancel not terminating all child processes (#3919)
  • Fix provisioning failures when multiple versions of SkyPilot are installed (#3866)
  • Shell autocomplete installation is now more robust (#3892, #3893)

Kubernetes

New Features

  • Observability improvements:
    • sky status --cloud kubernetes shows all SkyPilot resources on the Kubernetes cluster. (#4040, #4079)
    • sky show-gpus --cloud kubernetes shows detailed GPU availability information on the cluster. (#3816, #4085)
  • SkyPilot now helps you set up your clusters for running SkyPilot jobs.
  • SkyPilot job output is now piped to the container logs (#3758)
    • Use your existing logging tooling (kubectl logs, filebeat, etc.) to view SkyPilot job outputs.
  • Support for Nvidia GPU operator labels (nvidia.com/gpu.product) for detecting GPU types. (#3493)
    • You no longer need to label GPUs if you have the Nvidia GPU operator installed.
  • Spot instances are now supported on GKE clusters (#3675)
  • [Experimental] Multi-context support (#3913, #3968, #3897, #3772, #4013)

Performance improvements:

  • New command runner: 3x faster command submission for Kubernetes pods. (#3157)
  • sky local up for GPUs is now ~5x faster, provisioning in 2min 30s instead of 12min (#3664)
  • Our GPU images are now 3x smaller (1.5 GB), reducing the time to pull the image (#3665)
  • SSH jump pod is no longer required for port-forward mode (#3657)
  • SSH setup is now parallelized to speed up multi-node provisioning (#4158)

Enhancements and fixes

  • H100 Mega support on GKE (#3891, #3627)
  • Better handling for context names with special characters (#4147)
  • --k8s is now a valid alias for --cloud kubernetes (#4151)
  • Init containers are now supported on Kubernetes (#3762)
  • Auth: robust service account support and updated docs on minimal permissions (#3632)
  • Custom metadata annotations are now propagated to services, allowing configuration of internal load balancer services on cloud-hosted Kubernetes clusters (#3767)
  • Provisioning errors are now surfaced clearly (#3590, #3795, #3821)
  • Cluster attributes (autodown, idle-minutes-to-autostop) are now added as annotations to the pod (#3870)
  • SkyServe controller is now automatically terminated when all replicas are terminated. (#3984)
  • Create namespace permission is no longer required in cluster launch flow (#3714)
  • If your cluster does not support apparmor, SkyPilot will now retry without requesting it. (#4176)

Cloud: GCP

New Features

  • New accelerators supported:
  • Dynamic Workload Scheduler (DWS) support (#3574, #3835)
    • DWS helps get better availability on GCE through queuing and reservations.
  • Faster pd-extreme disks with disk_tier: ultra (#3860)
  • New config gcp.force_enable_external_ips to force enable external IPs (#3699)
    • This is useful when communication within a VPC is desired and the VM needs to make calls to the public internet.
  • TPU VMs can now run docker containers (#4115)

Enhancements

Cloud: AWS

New Features

  • Capacity blocks and capacity reservations are now supported. (#3852, #3853)
    • You don’t have to wake up at 4:30am PDT to launch your job on a newly available capacity block: SkyPilot will wait for you until the start time of the capacity block.
  • Faster io2 disks with disk_tier: ultra (#3860)
  • Security groups: you can now specify security groups for your resources at a finer granularity. (#3501)
  • SkyPilot can now use encrypted EBS volumes (#3765)

Enhancements

  • Performance: provisioning now 3x faster on AWS (#4091)
  • Buckets created by SkyPilot are now tagged with labels specified in ~/.sky/config.yaml (#3922)
  • Label validation now handles : and other special characters. (#3734)

Cloud: Azure

New Features

  • You can now use any Azure community image with --image-id (#4145)
  • Azure Blob Storage is now supported (#3032, #3796, #3807)
  • Fractional A10 instance types are now supported (#3877)
  • You can now specify resource group for Azure instance provisioning (#3764)
  • Faster Premium_LRS disks with disk_tier: high (#3921)

Enhancements

  • Performance: provisioning is now 2x faster on Azure with our new provisioner and custom images (#3697, #3704, #3696, #3700, #4139, #4167, #4205)
  • Improved support for A10 GPUs (#3707)
  • SkyPilot now waits for an Azure resource group to be deleted instead of erroring out (#3712)

SkyServe

  • Readiness probe timeout can now be set in the service spec (#3472)
  • You can now tear down a specific replica with sky serve down --replica-id (#4032)
  • SkyServe controller region is now chosen from the replica resources (#4053)

Storage

  • Azure Blob Storage is now supported. (#3032, #3796, #3807)
  • .skyignore support (#4038)
    • You can now add files to a .skyignore file to skip uploading them to cloud storage.
  • GCSFuse is updated to 2.2.0, bringing better performance and reliability. (#3619)
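
The effect of a .skyignore file can be pictured as a filter over the files that would otherwise be uploaded. The sketch below is an illustration only: it uses Python's `fnmatch` glob matching as a stand-in, whereas SkyPilot's actual pattern syntax is defined in its documentation, and `parse_ignore_file` and `files_to_upload` are hypothetical helpers.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def parse_ignore_file(text: str) -> List[str]:
    """Return the patterns in an ignore file, skipping blanks and # comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def files_to_upload(paths: Iterable[str], patterns: List[str]) -> List[str]:
    """Keep only the paths that match none of the ignore patterns."""
    return [
        path
        for path in paths
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


ignore_text = "# large artifacts\n*.ckpt\n\ndata/*\n"
patterns = parse_ignore_file(ignore_text)
kept = files_to_upload(
    ["train.py", "model.ckpt", "data/raw.bin", "src/util.py"], patterns
)
print(kept)
```

Here the checkpoint and everything under data/ are skipped, so only the source files would be uploaded.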

Other clouds

  • Lambda Cloud support has been migrated to our new and more reliable provisioner (#3865, #3889)
  • Lambda Cloud now supports docker images (#4115)
  • CUDO now supports opening ports (#3717)
  • RunPod now supports opening ports (#3748) and custom docker images (#3728).
  • FluidStack provisioning has been updated to their new API (#3799)
  • Paperspace now supports A4000 and P4000 GPUs (#3991)
  • OCI: bug fixes and improvements (#4074, #4080)

Thanks to all contributors!

New contributors: @winglian, @Ultramann, @jucor, @BitPhinix, @sethkimmel3, @hyoxt121, @BabyChouSr, @wizenheimer, @gurcangercek, @shashank2000, @ckgresla, @bernardwin, @kmushegi, @Conless, @JayThomason, @colinjc, @mtaran, @Haijian06, @KrishivPiduri, @zpoint

Many thanks to everyone who contributed to this release!

Contributors: @Michaelvll, @romilbhardwaj, @cblmemo, @landscapepainter, @asaiacai, @andylizf, @yika, @concretevitamin, @colinjc, @fozziethebeat, @MaoZiming, @JGSweets, @Ultramann, @Conless, @jucor, @wizenheimer, @Haijian06, @HysunHe, @gurcangercek, @bernardwin, @JungleCatSW, @BabyChouSr, @hyoxt121, @winglian, @sethkimmel3, @mjibril, @shashank2000, @ckgresla, @zpoint, @mtaran, @KrishivPiduri, @JayThomason, @BitPhinix, @kmushegi

Full Changelog: v0.6.0...v0.7.0