
[Serve] incorrect/nondeterministic __CALL__ OK [time] logs with blocking concurrent requests #48903

Closed
paul-twelvelabs opened this issue Nov 23, 2024 · 0 comments · Fixed by #48897
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), serve (Ray Serve Related Issue)

Comments

@paul-twelvelabs

What happened + What you expected to happen

Deployment logs of the form __CALL__ OK [time] can report an incorrect execution time if the deployment contains synchronous code that blocks the main asyncio event loop. A simple example is the following:

from time import sleep, perf_counter
from ray import serve

@serve.deployment
class BlockingDeployment:
    async def __call__(self):
        # block asyncio event loop
        sleep(1.)


app = BlockingDeployment.bind()
serve.run(app)


# -- benchmark: 5 concurrent requests
import requests
from concurrent.futures import ThreadPoolExecutor

def make_request(_):
    t0 = perf_counter()
    _ = requests.get("http://localhost:8000/")
    print(f">>>> GET {perf_counter() - t0:0.3f} s")


with ThreadPoolExecutor(max_workers=5) as exc:
    print(list(exc.map(make_request, range(5))))

In this toy example, I send 5 concurrent requests to the BlockingDeployment, which calls time.sleep(1.) and blocks the event loop. I'd expect the requests to queue up and take roughly 1, 2, 3, 4, and 5 s. I see the following output:

>>>> GET 1.040 s
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 1010.3ms
>>>> GET 2.037 s
>>>> GET 3.048 s
>>>> GET 4.050 s
>>>> GET 5.058 s
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5027.5ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5028.3ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5027.6ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5028.1ms

>>> GET" logs show">
The ">>>> GET" logs show that the requests return after the expected durations, but the __CALL__ logs are mostly wrong: the first correctly reports ~1 s, while all subsequent logs report ~5 s. I'd expect something like:

(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 1010.3ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 2027.5ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 3028.3ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 4027.6ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5028.1ms

Moreover, this output is nondeterministic and may change across runs, for example:

(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 3014.8ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 3015.0ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 3015.7ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5023.1ms
(ServeReplica:default:BlockingDeployment pid=34233) INFO...replica.py:378 - __CALL__ OK 5022.4ms
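
For comparison, here is a minimal non-blocking variant of the deployment above (my own sketch, not the fix). Offloading the blocking sleep to a worker thread with asyncio.to_thread keeps the event loop free, so each request logs its own ~1 s duration, assuming max_ongoing_requests allows them to run concurrently:

import asyncio
from time import sleep
from ray import serve

@serve.deployment
class NonBlockingDeployment:
    async def __call__(self):
        # Run the blocking call in a worker thread so the asyncio event
        # loop keeps servicing (and timing) other in-flight requests.
        await asyncio.to_thread(sleep, 1.)


app = NonBlockingDeployment.bind()
serve.run(app)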

Versions / Dependencies

Python 3.12
ray==2.38.0

Reproduction script

[see above.]

Issue Severity

Low: It annoys or frustrates me.

@paul-twelvelabs paul-twelvelabs added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 23, 2024
@jcotant1 jcotant1 added the serve Ray Serve Related Issue label Nov 25, 2024
@GeneDer GeneDer added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 25, 2024
@zcin zcin closed this as completed in 0f2c62c Nov 26, 2024
jecsand838 pushed a commit to jecsand838/ray that referenced this issue Dec 4, 2024
## Why are these changes needed?

Adds a feature flag to run sync user-defined methods in a threadpool by
default. This matches the existing behavior when using a FastAPI
ingress.

This should address a lot of user confusion and make it easier to write
performant code by default. For example, just sticking a torch model
call in a sync method will now provide reasonable performance out of the
box.

However, there may be some existing user code that is not thread safe,
so we need to do a gentle migration. This PR introduces the behavior
behind a feature flag, warns users about the upcoming default change,
and explains how to opt into the new behavior now or keep the existing
behavior once the default flips (just changing the method to `async def`
will do it).
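
To illustrate the two paths described above, here's a minimal sketch (mine, not part of the PR); the sleeps stand in for blocking and non-blocking model calls:

from time import sleep
import asyncio
from ray import serve

@serve.deployment
class SyncDeployment:
    # Sync method: with the feature flag enabled (and once the new default
    # lands), the blocking body runs in a threadpool worker instead of
    # stalling the replica's asyncio event loop.
    def __call__(self):
        sleep(1.)  # stand-in for a blocking torch model call


@serve.deployment
class AsyncDeployment:
    # `async def` keeps the existing behavior: the body runs directly on
    # the event loop, so it must avoid blocking calls.
    async def __call__(self):
        await asyncio.sleep(1.)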

I've opted to set the max thread pool size to `max_ongoing_requests`,
which seems like a reasonable policy. If needed we can add a user-facing
API for this in the future.
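
For example (illustrative only), under that policy a deployment configured as below would get a replica threadpool of at most 8 worker threads:

@serve.deployment(max_ongoing_requests=8)
class BoundedDeployment:
    def __call__(self):
        ...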

TODO before merging:

- [x] Get it working for sync generators.
- [x] Add warning for default change (people can keep behavior by
changing to async def).
- [x] Add/update UserCallableWrapper tests.
- [x] Add/update some integration tests (verify that request context is
set correctly!).
- [x] Set maximum thread pool size.

## Related issue number

Closes ray-project#44354
Closes ray-project#44403
Closes ray-project#48903

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Connor Sanders <connor@elastiflow.com>
dentiny pushed a commit to dentiny/ray that referenced this issue Dec 7, 2024