Skip to content

Async driver keeping connections from pool open? #796

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
Dz0n1Pr0 opened this issue Sep 8, 2022 · 11 comments
Closed

Async driver keeping connections from pool open? #796

Dz0n1Pr0 opened this issue Sep 8, 2022 · 11 comments

Comments

@Dz0n1Pr0
Copy link

Dz0n1Pr0 commented Sep 8, 2022

I encountered a new issue with the Async driver version 5.0.0a2.

The error states:

ERROR in app: Exception on request POST /recommend_contact
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/quart/app.py", line 1489, in handle_request
    return await self.full_dispatch_request(request_context)
  File "/usr/local/lib/python3.7/site-packages/quart/app.py", line 1514, in full_dispatch_request
    result = await self.handle_user_exception(error)
  File "/usr/local/lib/python3.7/site-packages/quart/app.py", line 964, in handle_user_exception
    raise error
  File "/usr/local/lib/python3.7/site-packages/quart/app.py", line 1512, in full_dispatch_request
    result = await self.dispatch_request(request_context)
  File "/usr/local/lib/python3.7/site-packages/quart/app.py", line 1557, in dispatch_request
    return await self.ensure_async(handler)(**request_.view_args)
  File "/usr/local/lib/python3.7/site-packages/quart/views.py", line 57, in view
    return await current_app.ensure_async(self.dispatch_request)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/quart_openapi/resource.py", line 71, in dispatch_request
    return await handler(*args, **kwargs)
  File "/api/app.py", line 119, in post
    recommendations = await asyncio.gather(*tasks)
  File "/api/app.py", line 108, in recommendation_task
    el['offsite'])
  File "/usr/local/lib/python3.7/site-packages/recommendation_engine-0.0.1-py3.7.egg/recommendation_engine/recommendation/recommender.py", line 32, in get_popular_contact
    ret = await self.graph_handler.run_query_async(graph_query, contact_id=contact_id, p=p, offsite=offsite)
  File "/usr/local/lib/python3.7/site-packages/recommendation_engine-0.0.1-py3.7.egg/recommendation_engine/graph_handler.py", line 636, in run_query_async
    return await session.read_transaction(self.run_transaction, query, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/neo4j/_meta.py", line 73, in inner
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/work/session.py", line 656, in read_transaction
    READ_ACCESS, transaction_function, *args, **kwargs
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/work/session.py", line 478, in _run_transaction
    timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/work/session.py", line 392, in _open_transaction
    await self._connect(access_mode=access_mode)
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/work/session.py", line 122, in _connect
    await super()._connect(access_mode, **access_kwargs)
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/work/workspace.py", line 194, in _connect
    self._connection = await self._pool.acquire(**acquire_kwargs_)
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/io/_pool.py", line 403, in acquire
    self.address, deadline, liveness_check_timeout
  File "/usr/local/lib/python3.7/site-packages/neo4j/_async/io/_pool.py", line 216, in _acquire
    "{!r}s (timeout)".format(deadline.original_timeout)
neo4j.exceptions.ClientError: {code: None} {message: None}

I suspected it is the issue with connections being kept alive or being alive for a prolonged period of time, as when I checked "/usr/local/lib/python3.7/site-packages/neo4j/_async/io/_pool.py" I noticed the error is thrown when the pool is full and there are no avaliable connections. Also the code and message are shown as None, so I am not sure if that is a bug also.

We are running Neo4j Version 4.4.10 Community edition as a single instance.

@robsdedude
Copy link
Member

robsdedude commented Sep 9, 2022

Hi and thanks for contacting us.

The fact that the error looks off (with code and message being None) is indeed a bug that I thought I already fixed. Turns out it did so (it's not in 4.4 anymore) but I accidentally reverted it along with other things that were meant to be reverted. Expect a fix for that soon.

Your assessment of the error cause, however, is correct in that it is thrown in neo4j/_async/io/_pool.py L216 when the pool is full. Now this could be a resource leak as you described. But is could also just be the driver being overwhelmed. To figure out what it is, I need some more info:

  1. Can you reproduce this consistently (preferably with a minimal example)? If so, please share how.
  2. Does the error disappear when using 5.0.0a1?
  3. What did you do to assert that this is not a legitimate case of the driver being overloaded and the pool being maxed out?

@robsdedude
Copy link
Member

A shot in the dark: are you closing all sessions you open? https://stackoverflow.com/questions/72022910/neo4j-python-api-crashing-on-multiple-queries/72031241#72031241

@Dz0n1Pr0
Copy link
Author

Dz0n1Pr0 commented Sep 9, 2022

  1. I tried reproducing it locally now, and it took me a while but I could get the same error message as I did in production. I set the max_connection_pool_size to a low amount (5) and I have 4 workers. After I got a docker image up and running I started sending requests, and after sending requests for an hour or so I couldn't send any more. In production we have 100 for pool_size and also 4 workers, so significantly more.
  2. It does disappear when we use 5.0.0a1 but that one has the old error I raised the issue about a week or so ago
  3. We first noticed this error in production, where we were getting around a few requests a minute, and the usual response time to get what we need to that request is around a few seconds at most, so I can presume that this is not the driver being overloaded but something else. Also locally, after I reach the limit of threads and I start getting that error, even if I wait for some time to send a new request I still can't get a response back, so that's why I just assume that the pool is full and that it never get's emptied.

We are closing the sessions with "with session" option that is the second way, as it is shown in the answer on stackoverflow

@robsdedude
Copy link
Member

robsdedude commented Sep 12, 2022

So I've been trying to reproduce the issue with a synthetic workload using

  • Neo4j driver: Python 5.0.0a2
  • cPython 3.10.4
  • Neo4j server: (single instance) 4.4.11 enterprise (dockerized)
import asyncio
import datetime
import random
import time
import typing as t

import neo4j
# from neo4j.debug import watch


NUM_WORKERS = 4
POOL_SIZE = 5


async def work(
    tx: neo4j.AsyncManagedTransaction,
    query: str,
    **query_params: t.Any
) -> list[neo4j.Record]:
    result = await tx.run(query, **query_params)
    records = [rec async for rec in result]
    await result.consume()  # just because I can
    return records


async def test_program(driver: neo4j.AsyncDriver) -> None:
    while True:
        node_id = random.randint(1, 1000)
        clear = random.random() < 0.001  # 0.1% chance
        query = "MERGE (n:Node {id: $id})"
        query_params = {"id": node_id}
        if clear:
            query = "MATCH (n:Node) DETACH DELETE n"
            query_params = {}
        t0 = time.time()
        async with driver.session() as session:
            await session.execute_write(work, query, **query_params)
        t1 = time.time()
        took = t1 - t0
        if took >= 3:
            print(f"took {took} seconds. too long!")
            raise Exception("took too long")


async def main() -> None:
    async with neo4j.AsyncGraphDatabase.driver(
        "neo4j://localhost:7687", auth=("neo4j", "pass"),
        max_connection_pool_size=POOL_SIZE
    ) as driver:
        await driver.verify_connectivity()
        await asyncio.gather(*(
            test_program(driver)
            for _ in range(NUM_WORKERS)
        ))


if __name__ == '__main__':
    print("starting", datetime.datetime.now())
    try:
        # with watch("neo4j"):  # enable debug logging
        asyncio.run(main())
    finally:
        print("finished", datetime.datetime.now())

After over an hour, no issues arose for me. Can you try running the above example on your machine and see if it causes the issue? If not, can you alter it (as little as possible) towards your real world application until the error occurs and then share it with me? As you might know, it's quite hard to debug a problem without being able to reproduce it.

Also, make sure your environment matches or share with me in which environment you are able to reproduce the error.

@Dz0n1Pr0
Copy link
Author

Would it be possible for you to change your code to run with python 3.7.10? That is the version that we will be using for production. As for the rest of the requirments I will share them here:

jsonschema==4.4.0
ml-metrics==0.1.4
neo4j==5.0.0a2
numpy==1.21.5
pandas==1.3.5
py2neo==2021.2.3
requests==2.27.1
scikit-learn==1.0.2
hypercorn==0.13.2
quart==0.17.0
quart-openapi==1.7.2
swagger-ui-py==0.4.0
uvloop==0.16.0
Werkzeug==2.1.1
flask_caching==2.0.1

Using the above requirements and python 3.7.10 I am able to get the error locally using our own code that is run in production.

@robsdedude
Copy link
Member

Thanks. I'm pretty positive the other packages will not affect this error, but I'll give it a try with, should the error not occur without.
Switching to cPython 3.7.10 shouldn't be a problem. I'll give it try and see what I get.

Out of pure curiosity: why are you having py2neo and the official driver in your project?

@Dz0n1Pr0
Copy link
Author

We are using py2neo to run queries in Neo4j that update/create our database. We don't need it in production as something else is handling that, but I guess we still have it locally to update our local Neo4j.

@robsdedude
Copy link
Member

So I ran the above code with Python 3.7.10 for 2 hours (after replacing ) -> list[neo4j.Record]: with ) -> t.List[neo4j.Record]:). Once without and once with the dependencies you listed. No errors 🤷

I also switched the server to match yours:

docker run \
    --name testneo4j \
    -p7474:7474 -p7688:7687 \
    --rm -it \
    -v $HOME/data/neo4j/data:/data \
    -v $HOME/data/neo4j/logs:/logs \
    -v $HOME/data/neo4j/import:/var/lib/neo4j/import \
    -v $HOME/data/neo4j/plugins:/plugins \
    --env-file bin/neo4j.env \
    neo4j:4.4.10-community

(both import and plugins were empty; bin/neo4j.env only specifies NEO4J_AUTH=neo4j/pass)

I had to downgrade setuptools to <58 to make ml-metrics==0.1.4 not fail installation with error in ml_metrics setup command: use_2to3 is invalid.

So I think that pretty much matches your environment. (I'm running on Linux, not sure about you, but I'd be hugely surprised if the OS made a difference here).

Ideas what to try/investigate next:

  • Was there any other application/programming accessing the database while you ran the above script?
  • Can you try running it on the latest stable cPython (3.10.7) and see if the error persists?

@Dz0n1Pr0
Copy link
Author

Dz0n1Pr0 commented Sep 15, 2022

I have tried running the code as well, but I also tried to lower the pool size to 2 or 3 and then I got the error code fairly quickly. If you can try that as well, I think it should work.

There was no other application accessing the database other than our API.

robsdedude added a commit to robsdedude/neo4j-python-driver that referenced this issue Sep 16, 2022
When the error is not received from the DBMS, but instead originates from
somewhere in the driver, it might not have a code and a message. In that case,
we fall back to the default Exception representation.

Related:
 * neo4j#796
robsdedude added a commit to robsdedude/neo4j-python-driver that referenced this issue Sep 16, 2022
The reservation count was also increased if the pool was full and the
reserveration would never be turned into a new connection. This lead to the pool
clogging up after a while.
robsdedude added a commit to robsdedude/neo4j-python-driver that referenced this issue Sep 16, 2022
When the error is not received from the DBMS, but instead originates from
somewhere in the driver, it might not have a code and a message. In that case,
we fall back to the default Exception representation.

Related:
 * neo4j#796
robsdedude added a commit to robsdedude/neo4j-python-driver that referenced this issue Sep 16, 2022
The reservation count was also increased if the pool was full and the
reserveration would never be turned into a new connection. This lead to the pool
clogging up after a while.
@robsdedude
Copy link
Member

robsdedude commented Sep 16, 2022

I finally managed to reproduce it and hunt down the bug. Configuring the driver with a small max_connection_lifetime triggered the bug much faster as the bug can only occur when connections get removed from the pool and are being replaced with new ones.

Thanks again for reaching out and helping me to get to the core of the issue 👏

The fix should be included in the (soon to be released) patch versions 5.0.1 and 4.4.8.

@Dz0n1Pr0
Copy link
Author

Perfect, glad I could help :)

robsdedude added a commit that referenced this issue Sep 19, 2022
The reservation count was also increased if the pool was full and the
reservation would never be turned into a new connection. This lead to the pool
clogging up after a while.

Add unit tests
robsdedude added a commit that referenced this issue Sep 19, 2022
* Improve Noe4jError representation

When the error is not received from the DBMS, but instead originates from
somewhere in the driver, it might not have a code and a message. In that case,
we fall back to the default Exception representation.

Related:
 * #796

* Make Neo4jError representation more robust

Even though, all errors that come from the server go through
`Noej4jError.hydrate` which (currently) makes sure that `code` *and* `message`
are set, this might change in the future.
robsdedude added a commit that referenced this issue Sep 19, 2022
* Improve Noe4jError representation

When the error is not received from the DBMS, but instead originates from
somewhere in the driver, it might not have a code and a message. In that case,
we fall back to the default Exception representation.

Related:
 * #796

* Make Neo4jError representation more robust

Even though, all errors that come from the server go through
`Noej4jError.hydrate` which (currently) makes sure that `code` *and* `message`
are set, this might change in the future.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants