Async driver keeping connections from pool open? #796

Dz0n1Pr0 · 2022-09-08T09:50:07Z

I encountered a new issue with the Async driver version 5.0.0a2.

The error states:

ERROR in app: Exception on request POST /recommend_contact
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/quart/app.py", line 1489, in handle_request
    return await self.full_dispatch_request(request_context)
  File "/usr/local/lib/python3.7/site-packages/quart/app.py", line 1514, in full_dispatch_request
    result = await self.handle_user_exception(error)
  File "/usr/local/lib/python3.7/site-packages/quart/app.py", line 964, in handle_user_exception
    raise error
  File "/usr/local/lib/python3.7/site-packages/quart/app.py", line 1512, in full_dispatch_request
    result = await self.dispatch_request(request_context)
  File "/usr/local/lib/python3.7/site-packages/quart/app.py", line 1557, in dispatch_request
    return await self.ensure_async(handler)(**request_.view_args)
  File "/usr/local/lib/python3.7/site-packages/quart/views.py", line 57, in view
    return await current_app.ensure_async(self.dispatch_request)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/quart_openapi/resource.py", line 71, in dispatch_request
    return await handler(*args, **kwargs)
  File "/api/app.py", line 119, in post
    recommendations = await asyncio.gather(*tasks)
  File "/api/app.py", line 108, in recommendation_task
    el['offsite'])
  File "/usr/local/lib/python3.7/site-packages/recommendation_engine-0.0.1-py3.7.egg/recommendation_engine/recommendation/recommender.py", line 32, in get_popular_contact
    ret = await self.graph_handler.run_query_async(graph_query, contact_id=contact_id, p=p, offsite=offsite)
  File "/usr/local/lib/python3.7/site-packages/recommendation_engine-0.0.1-py3.7.egg/recommendation_engine/graph_handler.py", line 636, in run_query_async
    return await session.read_transaction(self.run_transaction, query, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/neo4j/_meta.py", line 73, in inner
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/work/session.py", line 656, in read_transaction
    READ_ACCESS, transaction_function, *args, **kwargs
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/work/session.py", line 478, in _run_transaction
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/work/session.py", line 392, in _open_transaction
    await self._connect(access_mode=access_mode)
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/work/session.py", line 122, in _connect
    await super()._connect(access_mode, **access_kwargs)
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/work/workspace.py", line 194, in _connect
    self._connection = await self._pool.acquire(**acquire_kwargs_)
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/io/_pool.py", line 403, in acquire
    self.address, deadline, liveness_check_timeout
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/io/_pool.py", line 216, in _acquire
    "{!r}s (timeout)".format(deadline.original_timeout)
neo4j.exceptions.ClientError: {code: None} {message: None}

I suspected it is the issue with connections being kept alive or being alive for a prolonged period of time, as when I checked "/usr/local/lib/python3.7/site-packages/neo4j/_async/io/_pool.py" I noticed the error is thrown when the pool is full and there are no avaliable connections. Also the code and message are shown as None, so I am not sure if that is a bug also.

We are running Neo4j Version 4.4.10 Community edition as a single instance.

The text was updated successfully, but these errors were encountered:

robsdedude · 2022-09-09T13:02:15Z

Hi and thanks for contacting us.

The fact that the error looks off (with code and message being None) is indeed a bug that I thought I already fixed. Turns out it did so (it's not in 4.4 anymore) but I accidentally reverted it along with other things that were meant to be reverted. Expect a fix for that soon.

Your assessment of the error cause, however, is correct in that it is thrown in neo4j/_async/io/_pool.py L216 when the pool is full. Now this could be a resource leak as you described. But is could also just be the driver being overwhelmed. To figure out what it is, I need some more info:

Can you reproduce this consistently (preferably with a minimal example)? If so, please share how.
Does the error disappear when using 5.0.0a1?
What did you do to assert that this is not a legitimate case of the driver being overloaded and the pool being maxed out?

robsdedude · 2022-09-09T13:05:44Z

A shot in the dark: are you closing all sessions you open? https://stackoverflow.com/questions/72022910/neo4j-python-api-crashing-on-multiple-queries/72031241#72031241

Dz0n1Pr0 · 2022-09-09T16:19:57Z

I tried reproducing it locally now, and it took me a while but I could get the same error message as I did in production. I set the max_connection_pool_size to a low amount (5) and I have 4 workers. After I got a docker image up and running I started sending requests, and after sending requests for an hour or so I couldn't send any more. In production we have 100 for pool_size and also 4 workers, so significantly more.
It does disappear when we use 5.0.0a1 but that one has the old error I raised the issue about a week or so ago
We first noticed this error in production, where we were getting around a few requests a minute, and the usual response time to get what we need to that request is around a few seconds at most, so I can presume that this is not the driver being overloaded but something else. Also locally, after I reach the limit of threads and I start getting that error, even if I wait for some time to send a new request I still can't get a response back, so that's why I just assume that the pool is full and that it never get's emptied.

We are closing the sessions with "with session" option that is the second way, as it is shown in the answer on stackoverflow

robsdedude · 2022-09-12T09:54:12Z

So I've been trying to reproduce the issue with a synthetic workload using

Neo4j driver: Python 5.0.0a2
cPython 3.10.4
Neo4j server: (single instance) 4.4.11 enterprise (dockerized)

import asyncio
import datetime
import random
import time
import typing as t

import neo4j
# from neo4j.debug import watch


NUM_WORKERS = 4
POOL_SIZE = 5


async def work(
    tx: neo4j.AsyncManagedTransaction,
    query: str,
    **query_params: t.Any
) -> list[neo4j.Record]:
    result = await tx.run(query, **query_params)
    records = [rec async for rec in result]
    await result.consume()  # just because I can
    return records


async def test_program(driver: neo4j.AsyncDriver) -> None:
    while True:
        node_id = random.randint(1, 1000)
        clear = random.random() < 0.001  # 0.1% chance
        query = "MERGE (n:Node {id: $id})"
        query_params = {"id": node_id}
        if clear:
            query = "MATCH (n:Node) DETACH DELETE n"
            query_params = {}
        t0 = time.time()
        async with driver.session() as session:
            await session.execute_write(work, query, **query_params)
        t1 = time.time()
        took = t1 - t0
        if took >= 3:
            print(f"took {took} seconds. too long!")
            raise Exception("took too long")


async def main() -> None:
    async with neo4j.AsyncGraphDatabase.driver(
        "neo4j://localhost:7687", auth=("neo4j", "pass"),
        max_connection_pool_size=POOL_SIZE
    ) as driver:
        await driver.verify_connectivity()
        await asyncio.gather(*(
            test_program(driver)
            for _ in range(NUM_WORKERS)
        ))


if __name__ == '__main__':
    print("starting", datetime.datetime.now())
    try:
        # with watch("neo4j"):  # enable debug logging
        asyncio.run(main())
    finally:
        print("finished", datetime.datetime.now())

After over an hour, no issues arose for me. Can you try running the above example on your machine and see if it causes the issue? If not, can you alter it (as little as possible) towards your real world application until the error occurs and then share it with me? As you might know, it's quite hard to debug a problem without being able to reproduce it.

Also, make sure your environment matches or share with me in which environment you are able to reproduce the error.

Dz0n1Pr0 · 2022-09-13T14:25:48Z

Would it be possible for you to change your code to run with python 3.7.10? That is the version that we will be using for production. As for the rest of the requirments I will share them here:

jsonschema==4.4.0
ml-metrics==0.1.4
neo4j==5.0.0a2
numpy==1.21.5
pandas==1.3.5
py2neo==2021.2.3
requests==2.27.1
scikit-learn==1.0.2
hypercorn==0.13.2
quart==0.17.0
quart-openapi==1.7.2
swagger-ui-py==0.4.0
uvloop==0.16.0
Werkzeug==2.1.1
flask_caching==2.0.1

Using the above requirements and python 3.7.10 I am able to get the error locally using our own code that is run in production.

robsdedude · 2022-09-13T14:29:49Z

Thanks. I'm pretty positive the other packages will not affect this error, but I'll give it a try with, should the error not occur without.
Switching to cPython 3.7.10 shouldn't be a problem. I'll give it try and see what I get.

Out of pure curiosity: why are you having py2neo and the official driver in your project?

Dz0n1Pr0 · 2022-09-13T15:36:58Z

We are using py2neo to run queries in Neo4j that update/create our database. We don't need it in production as something else is handling that, but I guess we still have it locally to update our local Neo4j.

robsdedude · 2022-09-14T14:02:15Z

So I ran the above code with Python 3.7.10 for 2 hours (after replacing ) -> list[neo4j.Record]: with ) -> t.List[neo4j.Record]:). Once without and once with the dependencies you listed. No errors 🤷

I also switched the server to match yours:

docker run \
    --name testneo4j \
    -p7474:7474 -p7688:7687 \
    --rm -it \
    -v $HOME/data/neo4j/data:/data \
    -v $HOME/data/neo4j/logs:/logs \
    -v $HOME/data/neo4j/import:/var/lib/neo4j/import \
    -v $HOME/data/neo4j/plugins:/plugins \
    --env-file bin/neo4j.env \
    neo4j:4.4.10-community

(both import and plugins were empty; bin/neo4j.env only specifies NEO4J_AUTH=neo4j/pass)

I had to downgrade setuptools to <58 to make ml-metrics==0.1.4 not fail installation with error in ml_metrics setup command: use_2to3 is invalid.

So I think that pretty much matches your environment. (I'm running on Linux, not sure about you, but I'd be hugely surprised if the OS made a difference here).

Ideas what to try/investigate next:

Was there any other application/programming accessing the database while you ran the above script?
Can you try running it on the latest stable cPython (3.10.7) and see if the error persists?

Dz0n1Pr0 · 2022-09-15T14:26:08Z

I have tried running the code as well, but I also tried to lower the pool size to 2 or 3 and then I got the error code fairly quickly. If you can try that as well, I think it should work.

There was no other application accessing the database other than our API.

When the error is not received from the DBMS, but instead originates from somewhere in the driver, it might not have a code and a message. In that case, we fall back to the default Exception representation. Related: * neo4j#796

The reservation count was also increased if the pool was full and the reserveration would never be turned into a new connection. This lead to the pool clogging up after a while.

When the error is not received from the DBMS, but instead originates from somewhere in the driver, it might not have a code and a message. In that case, we fall back to the default Exception representation. Related: * neo4j#796

The reservation count was also increased if the pool was full and the reserveration would never be turned into a new connection. This lead to the pool clogging up after a while.

robsdedude · 2022-09-16T12:39:28Z

I finally managed to reproduce it and hunt down the bug. Configuring the driver with a small max_connection_lifetime triggered the bug much faster as the bug can only occur when connections get removed from the pool and are being replaced with new ones.

Thanks again for reaching out and helping me to get to the core of the issue 👏

The fix should be included in the (soon to be released) patch versions 5.0.1 and 4.4.8.

Dz0n1Pr0 · 2022-09-19T07:51:06Z

Perfect, glad I could help :)

The reservation count was also increased if the pool was full and the reservation would never be turned into a new connection. This lead to the pool clogging up after a while. Add unit tests

* Improve Noe4jError representation When the error is not received from the DBMS, but instead originates from somewhere in the driver, it might not have a code and a message. In that case, we fall back to the default Exception representation. Related: * #796 * Make Neo4jError representation more robust Even though, all errors that come from the server go through `Noej4jError.hydrate` which (currently) makes sure that `code` *and* `message` are set, this might change in the future.

This was referenced Sep 16, 2022

[4.4] Improve Neo4jError representation #805

Merged

Improve Neo4jError representation #806

Merged

robsdedude closed this as completed in 0cf6741 Sep 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Async driver keeping connections from pool open? #796

Async driver keeping connections from pool open? #796

Dz0n1Pr0 commented Sep 8, 2022 •

edited by fbiville

Loading

robsdedude commented Sep 9, 2022 •

edited

Loading

robsdedude commented Sep 9, 2022

Dz0n1Pr0 commented Sep 9, 2022 •

edited

Loading

robsdedude commented Sep 12, 2022 •

edited

Loading

Dz0n1Pr0 commented Sep 13, 2022

robsdedude commented Sep 13, 2022

Dz0n1Pr0 commented Sep 13, 2022

robsdedude commented Sep 14, 2022

Dz0n1Pr0 commented Sep 15, 2022 •

edited

Loading

robsdedude commented Sep 16, 2022 •

edited

Loading

Dz0n1Pr0 commented Sep 19, 2022

Async driver keeping connections from pool open? #796

Async driver keeping connections from pool open? #796

Comments

Dz0n1Pr0 commented Sep 8, 2022 • edited by fbiville Loading

robsdedude commented Sep 9, 2022 • edited Loading

robsdedude commented Sep 9, 2022

Dz0n1Pr0 commented Sep 9, 2022 • edited Loading

robsdedude commented Sep 12, 2022 • edited Loading

Dz0n1Pr0 commented Sep 13, 2022

robsdedude commented Sep 13, 2022

Dz0n1Pr0 commented Sep 13, 2022

robsdedude commented Sep 14, 2022

Dz0n1Pr0 commented Sep 15, 2022 • edited Loading

robsdedude commented Sep 16, 2022 • edited Loading

Dz0n1Pr0 commented Sep 19, 2022

Dz0n1Pr0 commented Sep 8, 2022 •

edited by fbiville

Loading

robsdedude commented Sep 9, 2022 •

edited

Loading

Dz0n1Pr0 commented Sep 9, 2022 •

edited

Loading

robsdedude commented Sep 12, 2022 •

edited

Loading

Dz0n1Pr0 commented Sep 15, 2022 •

edited

Loading

robsdedude commented Sep 16, 2022 •

edited

Loading