LinkedHash(Map|Set) have weird complexity #2727
While writing up vavr-io#2727 I noticed that `LinkedHashSet.head()` is implemented as `iterator().head()`. This is inefficient because `queue.iterator().next()` makes a call to `.head()`, but also to `.tail()` so as to prepare for the next `head()` call. Given the structure of the underlying `Queue`, the `head()` call is worst case `O(1)` but the `tail()` call is worst case `O(n)`. That worst case is reached if there have never been overwrites or removals from the set, which is probably a fairly common case.
I made a draft implementation of a CHAMP-trie-based Vavr collection that performs all operations in constant time.
My pull request #2745 is intended to address this problem for LinkedHashSet and LinkedHashMap.
I have now made a release that is binary compatible with vavr 0.10.5 but is implemented with CHAMP-based collections. You can get it here: https://github.com/wrandelshofer/vavr/releases/tag/v0.10.5
Mostly leaving this as a note. LinkedHashMap and LinkedHashSet are, as implemented in the Java standard library, pretty cool structures. Each map entry carries links to the next and previous entries, forming a doubly linked list, and because the structure is mutable, the cost of maintaining that list is always O(1). Effectively, you have a structure which can be treated as both a list and a set without the downsides in most practical cases.
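As a small illustration of the standard library behaviour described here, the following sketch uses only `java.util.LinkedHashMap`. The class and method names are made up for the example. It shows that overwriting a key keeps its position and that removal does not disturb the order of the remaining entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class JdkLinkedHashMapDemo {
    // Returns the key order after three inserts, one overwrite, and one removal.
    static List<String> keyOrderAfterUpdates() {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        map.put("a", 10); // overwrite: "a" keeps its original position
        map.remove("b");  // the entry is unlinked from the doubly linked list
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println(keyOrderAfterUpdates()); // prints [a, c]
    }
}
```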
In Vavr, these structures are implemented as a pair of a HashMap and a Queue. This pairing is rather uncool, because the complexity of various map operations becomes case-dependent.
For example (all complexities are listed as complexities on the queue, not on the paired HashMap):
- When doing a put, if the element does not exist, the put is O(1). However, doing a put on an element that already exists is O(n), because the element's node is replaced in the queue (and I think the entire queue is copied at this time).
- Likewise, doing a remove is O(n) because the queue must be rewritten.
- The queue is stored as a head and tail, with the tail possibly being reversed on dequeue. However, this is only efficient in the amortized sense when the reading operation actually dequeues (e.g. n enqs followed by n deqs will do one queue reversal, but n enqs followed by n iterations through the queue will do n queue reversals).
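The reversal behaviour in the last point can be reproduced with a simplified model of a persistent two-list queue. This is a sketch under assumptions, not Vavr's actual `Queue` source: `TwoListQueue`, `copiedNodes`, and the helper methods are hypothetical names, and the counter exists only to make the work visible. Draining the queue threads the reversed front through every step, so the reversal happens once. Calling `tail()` repeatedly on the same persistent value (which is what starting a fresh iteration amounts to) repeats the reversal every time.

```java
final class TwoListQueue<T> {
    // Minimal immutable cons cell; null stands for the empty list.
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    // Demonstration-only counter: nodes copied while reversing the rear list.
    static long copiedNodes = 0;

    private final Node<T> front; // elements in dequeue order
    private final Node<T> rear;  // recently enqueued elements, newest first

    // Assumed invariant: the front is empty only if the whole queue is empty.
    private TwoListQueue(Node<T> front, Node<T> rear) {
        if (front == null && rear != null) {
            this.front = reverse(rear);
            this.rear = null;
        } else {
            this.front = front;
            this.rear = rear;
        }
    }

    static <T> TwoListQueue<T> empty() { return new TwoListQueue<>(null, null); }

    boolean isEmpty() { return front == null; }

    TwoListQueue<T> enqueue(T value) { return new TwoListQueue<>(front, new Node<>(value, rear)); }

    // Both head() and tail() require a non-empty queue.
    T head() { return front.value; }

    TwoListQueue<T> tail() { return new TwoListQueue<>(front.next, rear); }

    private static <T> Node<T> reverse(Node<T> list) {
        Node<T> reversed = null;
        for (Node<T> n = list; n != null; n = n.next) {
            reversed = new Node<>(n.value, reversed);
            copiedNodes++;
        }
        return reversed;
    }

    // Builds a queue of n elements and resets the counter afterwards.
    private static TwoListQueue<Integer> filled(int n) {
        TwoListQueue<Integer> q = empty();
        for (int i = 0; i < n; i++) q = q.enqueue(i);
        copiedNodes = 0;
        return q;
    }

    // n dequeues, each continuing from the previous result: one reversal in total.
    static long copiesForDrain(int n) {
        TwoListQueue<Integer> q = filled(n);
        while (!q.isEmpty()) q = q.tail();
        return copiedNodes;
    }

    // n tail() calls on the same persistent queue: one reversal per call.
    static long copiesForRepeatedTail(int n) {
        TwoListQueue<Integer> q = filled(n);
        for (int i = 0; i < n; i++) q.tail();
        return copiedNodes;
    }

    public static void main(String[] args) {
        System.out.println(copiesForDrain(1000));        // prints 999
        System.out.println(copiesForRepeatedTail(1000)); // prints 999000
    }
}
```

In this model the drain costs a linear number of copies overall, while the repeated reads cost a quadratic number, which is the gap between the amortized and the worst-case behaviour.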
In practice this means that the data structure is genuinely much less efficient than the normal HashMap: someone converting a HashMap into a LinkedHashMap should expect a huge performance degradation, which is surprising to anyone coming from the standard Java maps.
An implementation with perhaps slightly poorer top-end performance but that would have much more forgiving worst case performance might be to do something more like:
This structure would be O(log n) on all operations, and would generally avoid the sharp edges of the current structure. It would use quite a bit more memory, although this could be minimized by e.g. avoiding boxing of longs or at least sharing them. Of course, it's possible that top-end performance would suffer, especially since inserting in sorted order is the worst case for a red-black tree.
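One structure that would fit this description, purely as an assumption, pairs a hash map from key to (sequence number, value) with a sorted map from sequence number to key. The sketch below uses the mutable `java.util.HashMap` and `java.util.TreeMap` for brevity; a Vavr version would presumably use persistent equivalents. `SequencedMapSketch`, `Slot`, and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

final class SequencedMapSketch<K, V> {
    // A value together with the sequence number assigned at first insertion.
    private static final class Slot<V> {
        final long seq;
        final V value;
        Slot(long seq, V value) { this.seq = seq; this.value = value; }
    }

    private final Map<K, Slot<V>> slots = new HashMap<>();
    // Red-black tree keyed by sequence number; ascending order is insertion order.
    private final TreeMap<Long, K> order = new TreeMap<>();
    private long nextSeq = 0;

    void put(K key, V value) {
        Slot<V> existing = slots.get(key);
        if (existing != null) {
            // Overwrite: reuse the old sequence number, so the tree is untouched.
            slots.put(key, new Slot<>(existing.seq, value));
        } else {
            long seq = nextSeq++;
            slots.put(key, new Slot<>(seq, value));
            order.put(seq, key); // O(log n)
        }
    }

    V get(K key) {
        Slot<V> slot = slots.get(key);
        return slot == null ? null : slot.value;
    }

    void remove(K key) {
        Slot<V> removed = slots.remove(key);
        if (removed != null) order.remove(removed.seq); // O(log n)
    }

    // Oldest key still present; O(log n) and independent of past removals.
    K head() {
        if (order.isEmpty()) throw new NoSuchElementException("empty map");
        return order.firstEntry().getValue();
    }

    // Keys in insertion order.
    List<K> keys() { return new ArrayList<>(order.values()); }
}
```

With this layout an overwrite never touches the ordering tree, and a removal costs one tree deletion rather than a rewrite of the whole sequence. The price is the extra tree node and boxed `Long` per entry mentioned above.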