
[Serve] incorrect/nondeterministic __CALL__ OK [time] logs with blocking concurrent requests #48903

Closed
paul-twelvelabs opened this issue Nov 23, 2024 · 0 comments · Fixed by #48897
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), serve (Ray Serve Related Issue)

Comments

@paul-twelvelabs

What happened + What you expected to happen

Deployment logs of the form __CALL__ OK [time] can report an incorrect execution time if the deployment contains synchronous code that blocks the main asyncio event loop. A simple example is the following:

from time import sleep, perf_counter
from ray import serve

@serve.deployment
class BlockingDeployment:
    async def __call__(self):
        # block asyncio event loop
        sleep(1.)


app = BlockingDeployment.bind()
serve.run(app)


# -- benchmark: 5 concurrent requests
import requests
from concurrent.futures import ThreadPoolExecutor

def make_request(_):
    t0 = perf_counter()
    _ = requests.get("http://localhost:8000/")
    print(f">>>> GET {perf_counter() - t0:0.3f} s")


with ThreadPoolExecutor(max_workers=5) as exc:
    print(list(exc.map(make_request, range(5))))

In this toy example, I send 5 concurrent requests to the BlockingDeployment, which calls time.sleep(1.) and blocks the event loop. I'd expect the requests to queue up and take roughly 1, 2, 3, 4, and 5 s. I see the following output:

>>>> GET 1.040 s
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 1010.3ms
>>>> GET 2.037 s
>>>> GET 3.048 s
>>>> GET 4.050 s
>>>> GET 5.058 s
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5027.5ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5028.3ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5027.6ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5028.1ms

>>> GET" logs show">
The ">>>> GET" logs show that the requests return after the expected durations, but the __CALL__ logs are mostly wrong: the first correctly reports ~1 s, while all subsequent logs report ~5 s. I'd expect something like:

(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 1010.3ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 2027.5ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 3028.3ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 4027.6ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5028.1ms

Moreover, this output is nondeterministic and may change across runs, for example:

(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 3014.8ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 3015.0ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 3015.7ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5023.1ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5022.4ms
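
For comparison, here is a minimal non-blocking variant of the deployment above (my own sketch, not the fix). Offloading the blocking sleep to a worker thread with asyncio.to_thread keeps the event loop free, so each request logs its own ~1 s duration, assuming max_ongoing_requests allows them to run concurrently:

import asyncio
from time import sleep
from ray import serve

@serve.deployment
class NonBlockingDeployment:
    async def __call__(self):
        # Run the blocking call in a worker thread so the asyncio event
        # loop keeps servicing (and timing) other in-flight requests.
        await asyncio.to_thread(sleep, 1.)


app = NonBlockingDeployment.bind()
serve.run(app)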

Versions / Dependencies

Python 3.12
ray==2.38.0

Reproduction script

[see above.]

Issue Severity

Low: It annoys or frustrates me.

@paul-twelvelabs paul-twelvelabs added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 23, 2024
@jcotant1 jcotant1 added the serve Ray Serve Related Issue label Nov 25, 2024
@GeneDer GeneDer added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 25, 2024
@zcin zcin closed this as completed in 0f2c62c Nov 26, 2024
jecsand838 pushed a commit to jecsand838/ray that referenced this issue Dec 4, 2024
## Why are these changes needed?

Adds a feature flag to run sync user-defined methods in a threadpool by
default. This matches the existing behavior when using a FastAPI
ingress.

This should address a lot of user confusion and make it easier to write
performant code by default. For example, just sticking a torch model
call in a sync method will now provide reasonable performance out of the
box.

However, there may be some existing user code that is not thread safe,
so we need to do a gentle migration. This PR introduces the behavior
behind a feature flag, warns users about the upcoming default change,
and explains how to opt into the new behavior now or keep the existing
behavior once the default flips (just changing the method to `async def`
will do it).
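
To illustrate the two paths described above, here's a minimal sketch (mine, not part of the PR); the sleeps stand in for blocking and non-blocking model calls:

from time import sleep
import asyncio
from ray import serve

@serve.deployment
class SyncDeployment:
    # Sync method: with the feature flag enabled (and once the new default
    # lands), the blocking body runs in a threadpool worker instead of
    # stalling the replica's asyncio event loop.
    def __call__(self):
        sleep(1.)  # stand-in for a blocking torch model call


@serve.deployment
class AsyncDeployment:
    # `async def` keeps the existing behavior: the body runs directly on
    # the event loop, so it must avoid blocking calls.
    async def __call__(self):
        await asyncio.sleep(1.)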

I've opted to set the max thread pool size to `max_ongoing_requests`,
which seems like a reasonable policy. If needed we can add a user-facing
API for this in the future.
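
For example (illustrative only), under that policy a deployment configured as below would get a replica threadpool of at most 8 worker threads:

@serve.deployment(max_ongoing_requests=8)
class BoundedDeployment:
    def __call__(self):
        ...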

TODO before merging:

- [x] Get it working for sync generators.
- [x] Add warning for default change (people can keep behavior by
changing to async def).
- [x] Add/update UserCallableWrapper tests.
- [x] Add/update some integration tests (verify that request context is
set correctly!).
- [x] Set maximum thread pool size.

## Related issue number

Closes ray-project#44354
Closes ray-project#44403
Closes ray-project#48903

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
dentiny pushed a commit to dentiny/ray that referenced this issue Dec 7, 2024