# 💹 Auto-Scaling Endpoints Based on CPU Usage

Autoscaling your FastAPI apps brings significant benefits: improved performance, optimized resource usage, and lower costs. It lets you handle traffic spikes and varying loads efficiently, ensuring your application remains responsive at all times.

The fastapi-serve library has built-in support for auto-scaling based on CPU usage. You can configure the CPU target for scaling up and down, as well as the minimum and maximum number of replicas, by specifying them in a `jcloud.yml` file and passing it to the deployment with the `--config` flag.

```yaml
# jcloud.yml
instance: C3
autoscale:
  min: 1
  max: 2
  metric: cpu
  target: 40
```

The above configuration scales the app up to a maximum of 2 replicas when CPU usage exceeds the 40% target, and back down to 1 replica when CPU usage falls below it.
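For example, with the `jcloud.yml` above saved next to your app, the deployment command would look like this (a sketch using the `--config` flag described above):

```bash
# Deploy the app, passing the autoscaling config explicitly
fastapi-serve deploy jcloud main:app --config jcloud.yml
```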

Let's look at an example of how to auto-scale a FastAPI app based on CPU usage.

## 📈 Deploy a FastAPI app with auto-scaling based on CPU usage

This directory contains the following files:

```text
.
├── main.py             # The FastAPI app
├── jcloud.yml          # JCloud deployment config with the autoscaling config
└── README.md           # This README file
```
```python
# main.py
import os
import time

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

class Response(BaseModel):
    cpu_time: float
    result: int
    # The pod hostname lets us see which replica served the request.
    hostname: str = Field(default_factory=lambda: os.environ.get("HOSTNAME", "unknown"))

def _heavy_compute(count: int) -> int:
    # A deliberately CPU-bound loop (avoids shadowing the built-in `sum`).
    total = 0
    for i in range(count):
        total += i
    return total

@app.get("/load/{count}", response_model=Response)
def load_test(count: int = 1_000_000):
    _t1 = time.time()
    _sum = _heavy_compute(count)
    _t2 = time.time()
    _cpu_time = _t2 - _t1
    print(f"CPU time: {_cpu_time}")
    return Response(cpu_time=_cpu_time, result=_sum)
```

In the above example, the `/load/{count}` endpoint performs a CPU-bound computation. We will use this endpoint to simulate a CPU-intensive workload.
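Before deploying, you can sanity-check the endpoint locally. This is a minimal sketch, assuming `uvicorn` and `jq` are installed; the port and timings are arbitrary:

```bash
# Start the app locally, hit the endpoint once, then stop the server
uvicorn main:app --port 8000 &
SERVER_PID=$!
sleep 2  # give the server a moment to start
curl -s http://localhost:8000/load/1000000 | jq
kill $SERVER_PID
```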

## 🚀 Deploying to Jina AI Cloud

```bash
fastapi-serve deploy jcloud main:app
```

```text
╭─────────────────────────┬─────────────────────────────────────────────────────────────────────────────╮
│ App ID                  │                             fastapi-2a94b25a5f                              │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ Phase                   │                                   Serving                                   │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ Endpoint                │                   https://fastapi-2a94b25a5f.wolf.jina.ai                   │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ App logs                │                           https://cloud.jina.ai/                            │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ Base credits (per hour) │                                   10.104                                    │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ Swagger UI              │                https://fastapi-2a94b25a5f.wolf.jina.ai/docs                 │
├─────────────────────────┼─────────────────────────────────────────────────────────────────────────────┤
│ OpenAPI JSON            │            https://fastapi-2a94b25a5f.wolf.jina.ai/openapi.json             │
╰─────────────────────────┴─────────────────────────────────────────────────────────────────────────────╯
```

## 💻 Testing

Let's send a few requests to the /load endpoint to simulate a not-so-intense workload.

```bash
curl -sX GET https://fastapi-2a94b25a5f.wolf.jina.ai/load/1000000 | jq
```

```json
{
  "cpu_time": 0.4925811290740967,
  "result": 499999500000,
  "hostname": "gateway-00001-deployment-85589655bb-pn7b4"
}
```

This finishes in about half a second. Now let's send one request with a much more intense workload.

```bash
curl -sX GET https://fastapi-2a94b25a5f.wolf.jina.ai/load/10000000000 | jq
```

While the request is being processed, you can see the CPU usage rise in the CPU graph. Once it goes above the 40% target, the app will be scaled up to 2 replicas. Meanwhile, let's open another terminal and send a few more requests to the `/load` endpoint in a loop.

```bash
for i in {1..1000}; do curl -sX GET https://fastapi-2a94b25a5f.wolf.jina.ai/load/1000000 | jq; sleep 0.5; done
```

Eventually, you will see that requests are being served by 2 replicas, as indicated by the `hostname` field in the responses.

```json
{
  "cpu_time": 0.11650848388671875,
  "result": 499999500000,
  "hostname": "gateway-00001-deployment-85589655bb-pn7b4"
}
{
  "cpu_time": 0.1402430534362793,
  "result": 499999500000,
  "hostname": "gateway-00001-deployment-85589655bb-gr6sc"
}
```
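To get a quick tally of how requests are spread across replicas, you can extract just the `hostname` field over a batch of requests and count the distinct values (a small sketch built on the same endpoint):

```bash
# Count how many responses each replica served
for i in {1..20}; do
  curl -sX GET https://fastapi-2a94b25a5f.wolf.jina.ai/load/1000000 | jq -r .hostname
done | sort | uniq -c
```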

Note: You might see a message saying "The upstream server is timing out" during long-running requests. The request timeout can be configured with the `timeout` field in the `jcloud.yml` file; by default, requests time out after 120 seconds.
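For example, to allow longer-running requests, you could raise the timeout in the config. This is a sketch: the `timeout` field is mentioned above, but its exact placement and unit in `jcloud.yml` are assumptions here:

```yaml
# jcloud.yml
instance: C3
timeout: 300  # assumed: request timeout in seconds (default 120)
autoscale:
  min: 1
  max: 2
  metric: cpu
  target: 40
```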

## 📊 Observe the CPU usage

To view the CPU usage, go to Jina AI Cloud, click on the fastapi-2a94b25a5f app, and open the Charts tab. You can see the CPU usage in the CPU graph.

*Example CPU usage chart*

## 🎯 Wrapping Up

As we've seen in this example, CPU-based autoscaling can be a game changer for FastAPI applications. It helps you manage resources efficiently, handle traffic spikes, and keep your application responsive under heavy workloads. fastapi-serve makes it straightforward to leverage autoscaling, helping you build highly scalable, efficient, and resilient FastAPI applications with ease. Embrace the power of autoscaling with fastapi-serve today!