Add built in connection pool #108

Merged
merged 9 commits
Jun 10, 2018

Conversation

@fishcakez (Member) commented Dec 11, 2017

Adds a built-in connection pool that increases throughput by 3-7% for most single queries and load levels in Ecto. Other workloads may see 0-10% improvements.

The pool uses an ETS queue and some other ETS trickery to avoid tracking busy connections/clients. The latter feature (if this were the only pool style) would allow us to stop using the pdict, so that we can ensure no data leaks for non-closure transactions.

We might be able to get better performance by removing either or both of these ETS use cases.

IO overload handling is better (and simpler) than the Sojourn pool's. In future we could improve the case where the database is down or on the other side of a netsplit, because we can know when there are no connections.

CPU overload handling is not as good as the Sojourn pool's, but could be improved by limiting the queue size and dropping all new checkouts above that size.

There is no overflow. This is something we could do in future if we have a simple and good algorithm.

Idle ping handling is improved over other pools because pings will never synchronize and happen as rarely as possible.

Timeout handling is improved over other pools because an absolute deadline can be used (useful for multiple-repo or multi-service calls) and the queue time is included in the timeout if not absolute. Also, queue timeouts are DBConnection.ConnectionError errors, like in the Sojourn pool, so they are easier to handle.

Process hibernation occurs efficiently and does not affect performance. It could be more efficient if we hibernated on backoff.

Overhead for pool_size: 1 is insignificant compared to DBConnection.Connection, so this pool could replace it.
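
As a rough sketch of the queuing idea described above (table and variable names are illustrative, not the PR's actual code): waiting callers live in an ordered_set ETS table keyed by {monotonic_time, unique_integer, from}, so the pool serves them in FIFO order via :ets.first/1 without tracking busy clients in its own state.

# Illustrative only: an ordered_set keeps keys sorted, so the oldest waiting
# caller is always returned by :ets.first/1.
queue = :ets.new(:checkout_queue, [:ordered_set, :protected])

enqueue = fn from ->
  :ets.insert(queue, {{System.monotonic_time(), System.unique_integer(), from}})
end

dequeue = fn ->
  case :ets.first(queue) do
    :"$end_of_table" ->
      :empty

    key ->
      :ets.delete(queue, key)
      {:dequeued, key}
  end
end

For example, enqueue.({self(), make_ref()}) followed by dequeue.() returns the oldest waiting caller first.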

@josevalim (Member)

This looks great! What is the plan here? Do we plan to remove support for at least the Sojourn pool then? I think we should come up with a plan to remove code before adding even more. :)

I also recall you saying this supports ownership out of the box. Should we force this pool to be used whenever we want the ownership style tests?

@fishcakez (Member Author)

What is the plan here?

There isn't a plan here yet, just that I knew we could have a faster pool.

Do we plan to remove support for at least the Sojourn pool then?

I would like to do this very much. I would actually like to remove all the pools and just use this one and ownership. Or some other big simplification.

An alternative would be to always supervise connections away from the actual pool process, like this PR does, so that most of the logic lives in the same shared off-tree worker, with a light message-passing/queuing loop in the pooled worker. By connecting in another process we can improve the queuing in all existing pools (except Sojourn, which already does it). I think it would also allow us to simplify Ownership by not having a nested pool and always running pool_size proxy processes.

I also recall you saying this supports ownership out of the box. Should we force this pool to be used whenever we want the ownership style tests?

🤔 this doesn't actually work with ownership because of #106 (n.b. this issue is trivial to solve for most cases by having a watcher per pool type). If we stopped allowing custom pool behaviors, it could work by using the same message-passing semantics, i.e. a contract on messages rather than functions. This would also solve the issue of having to pass the pool option.

@josevalim (Member)

I would like to do this very much. I would actually like to remove all the pools and just use this one and ownership. Or some other big simplification.

👍

If we stopped allowing custom pool behaviors it could work by using the same message passing semantics

👍

@fishcakez (Member Author)

We should add an expire_timeout or similar option that disconnects a connection on checkin/poll if it is older than a timeout.
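
A minimal sketch of what that check could look like (the option and function names here are hypothetical, not an existing DBConnection API):

defmodule ExpirySketch do
  # Hypothetical: decide on checkin/poll whether a connection started more than
  # expire_timeout milliseconds ago should be disconnected rather than reused.
  def action(started_at, expire_timeout, now \\ System.monotonic_time(:millisecond)) do
    if now - started_at > expire_timeout, do: :disconnect, else: :checkin
  end
end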

@michalmuskala (Member) commented Jan 30, 2018

Today @DavidAntaramian asked on Slack about gathering some statistics about the pool, in particular stats about the saturation and load of the pool for optimising and monitoring the application.

If we're designing a new pool, maybe it would make sense to also include some APIs to gather this data or at least provide some hooks so it could be integrated with some metrics library?

@michalmuskala (Member)

Also, regarding overflow: I think that for database connections a way to dynamically change the pool size would actually be more useful than poolboy-style overflow, because of the connection cost.

If we had a way to gather metrics, it would become possible to have a component regulating the pool size (this could be integrated as an external library), growing it during overload and decreasing during quiet time, but smoothly.

@fishcakez (Member Author) commented Jan 30, 2018

If we're designing a new pool, maybe it would make sense to also include some APIs to gather this data or at least provide some hooks so it could be integrated with some metrics library?

Yes, definitely, but I think we can worry about metrics after the implementation. It will be easier to do once we have removed poolboy/sojourn; we do not need to get ahead of ourselves :D.

If we had a way to gather metrics, it would become possible to have a component regulating the pool size (this could be integrated as an external library), growing it during overload and decreasing during quiet time, but smoothly.

This is what the Sojourn pool does already FWIW (and provides hookable metrics). I am slightly 👎 on any overflow feature to be honest because of the complexity.

Not required to monitor as give away and heir handles tracking caller
@fishcakez (Member Author)

@michalmuskala I removed the monitoring as it's not required. I'm unsure how small the performance gain from this is, but it simplifies the code slightly.

@michalmuskala (Member)

@fishcakez I got some failures when I tried the changes:

07:34:36.676 [error] GenServer #PID<0.187.0> terminating
** (CaseClauseError) no case clause matching: {-576460751509, #Reference<0.4040485667.3719692292.85506>}
    (db_connection) lib/db_connection/connection_pool.ex:155: DBConnection.ConnectionPool.handle_info/2
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: {:timeout, #Reference<0.4040485667.3719561220.85547>, {-576460750509, -576460751509}}
State: {:ready, #Reference<0.4040485667.3719692292.85497>, %{delay: 0, idle: #Reference<0.4040485667.3719561220.85548>, idle_interval: 1000, interval: 1000, next: -576460750509, poll: #Reference<0.4040485667.3719561220.85547>, slow: false, target: 50}}

@fishcakez (Member Author)

Oh, there was a subtle bug in the old code that didn't surface until that change.

@michalmuskala (Member)

Here are some benchmarks with raw db_connection, without actually connecting to a database:

https://gist.github.com/michalmuskala/e503d650718d122fbdb478372b83c46a

@michalmuskala (Member)

And here are some with postgres connecting to a real database:

https://gist.github.com/michalmuskala/00d163ddd0c7eff1770b27cd28ad8dfb

@redink commented Apr 3, 2018

Any updates?

@fishcakez (Member Author)

I've moved to a raw message instead of a GenServer call. This should be ready to merge; work on changes to support the raw message, instead of an option, can follow in later PRs.

def handle_info({:db_connection, from, {:checkout, now, queue?}}, {:busy, queue, _} = busy) do
  case queue? do
    true ->
      :ets.insert(queue, {{now, System.unique_integer(), from}})
Member

What if, instead of unique integers, we held something like a logical clock in the server? Wouldn't it be slightly less overhead?

Member Author

Why would it be less overhead? I think the code is simpler without changing the server's state.

Member

I'm not sure how expensive the unique_integer() call is - that's all

Member Author

The zero-arity one is very efficient, so we should be good. Perhaps if it produces bignums it might make other parts slower. {monotonic_time, unique_integer} is the preferred way to create a unique timestamp; we can move this to the caller if it proves to be cheaper.
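
For illustration, the pattern being referred to (generic names, not the PR's exact code):

# A monotonic timestamp plus System.unique_integer/0 yields a key that is both
# time-ordered and unique, even if two checkouts read the same monotonic time.
now = System.monotonic_time()
key = {now, System.unique_integer()}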

      checkout_holder(holder, from, queue) and :ets.delete(queue, key)
      {:noreply, ready}
    :"$end_of_table" ->
      handle_info(checkout, put_elem(ready, 0, :busy))
Member

I think it would be clearer if we had handle_info(checkout, {:busy, queue, codel}) here. I think the put_elem call is a bit distracting

Member Author

put_elem is faster; we could have helpers for updating the state and inline them.

Member

I don't think the perf difference will be noticeable here 😛

Member Author

Will refactor this part for clarity and avoid some pattern matching.

for {sent, from} <- :ets.select(queue, select_slow) do
  drop(time - sent, from)
end
:ets.select_delete(queue, [{match, guards, [true]}])
Member

I wonder if it would be better to select the full key in the previous pass and then do the deletes by key instead.

Member Author

By better do you mean faster or clearer? I think ETS should be able to optimize this so it's faster: since it's an ordered set, the select_delete becomes O(n), where n is the number of entries to delete, whereas deleting each key would be O(n log(m)). However, the delete-by-key code is likely clearer.
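
For reference, a rough sketch of the delete-by-key alternative discussed here (the match spec and helper names are assumptions, not the PR's code):

# Hypothetical: select the full {sent, unique, from} keys of the slow entries,
# notify each caller, then delete each entry by its key (O(log m) per delete).
for {sent, _unique, from} = key <- :ets.select(queue, select_slow_keys) do
  drop(time - sent, from)
  :ets.delete(queue, key)
end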

@josevalim (Member) commented Apr 25, 2018 via email

@fishcakez (Member Author)

Merging to unblock work on moving ownership to use this pool. We can improve this pool further later on.
