Getting the biggest key smaller than an id #129
Hmm yes, I see the conundrum. The range API was the first thing my brain jumped to as well, but since you want the biggest value <= the target value, I don't see a way to make that work. It's been a while since I've been in the guts of this crate, but it seems to me like it should be possible to implement reverse streaming like `BTreeMap` does, although I expect that would be a lot of effort. For your specific case, though, I don't think you need a stream (which is also more costly to construct). I think you can just write a variant of key lookup (see lines 681 to 700 in 3676d3e).
But instead of […], I believe all of the routines for walking the automaton that […]
Fantastic answer! Indeed, I think I will need to use the lower-level API. Thank you for answering.
Yeah, the lower-level API isn't as convenient, but it gives you full access to the underlying automaton. So there's a lot you can do with it!
Alright, I think I have a working implementation. It was a bit more complicated than discussed above, due to the case where we select a branch whose key prefix is equal to the searched key but that ends up becoming larger than the searched key. We need to remember the latest byte of the key where we could have chosen a lower byte rather than an equal one, so that we can backtrack to it when it turns out that we're in a dead end. The implementation is available here. As discussed, it uses a linear scan as a first step. I didn't attempt to profile/optimize it. I share this implementation in the hope that it is correct and will be useful to someone, but do keep in mind that it is running late here, so it could definitely be faulty despite the few test cases 😅.
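For anyone following along, the semantics being implemented here (the greatest key less than or equal to a target) can be sketched without an automaton at all, over a plain sorted list of encoded keys. This is only an illustration of the contract, not of the automaton walk; `floor_key` is a hypothetical helper, not part of fst:

```rust
/// Greatest key <= `target` among byte-string keys sorted ascending.
/// Illustrative helper only; the fst-based version walks the
/// automaton instead of materializing the keys.
fn floor_key<'a>(sorted: &'a [Vec<u8>], target: &[u8]) -> Option<&'a [u8]> {
    // partition_point returns the index of the first key strictly
    // greater than `target`.
    let idx = sorted.partition_point(|k| k.as_slice() <= target);
    // The entry just before that index, if any, is the floor key.
    idx.checked_sub(1).map(|i| sorted[i].as_slice())
}

fn main() {
    // Big-endian encoded change points from the issue's example.
    let keys: Vec<Vec<u8>> = [0u64, 10, 1000, 1001]
        .iter()
        .map(|k| k.to_be_bytes().to_vec())
        .collect();
    // 500 is not a key; its floor is 10.
    assert_eq!(floor_key(&keys, &500u64.to_be_bytes()), Some(&10u64.to_be_bytes()[..]));
    // An exact hit returns the key itself.
    assert_eq!(floor_key(&keys, &10u64.to_be_bytes()), Some(&10u64.to_be_bytes()[..]));
    // No key is <= the empty key.
    assert_eq!(floor_key(&keys, b""), None);
}
```

The exact-prefix dead-end case described above is what makes the automaton version harder than this binary search: the sorted list gives global order for free, while the walk only sees one byte at a time.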
Ah I see, that makes sense. I'll leave this issue open for now, as I don't quite have the time to process this fully. It might be the case that this should lead to some API additions to make this simpler. But clearly, my intuition here was wrong, so perhaps my intuition about providing reverse streams is also wrong. If reverse streams are possible (without any sort of backtracking), then it should also be possible to resolve your query without any kind of backtracking either.
Hello, sorry for the delay in coming back to you. Life.
Admittedly, I don't know enough of the implementation to answer this, so I could be completely wrong, but in my mental model of the fst I don't see how to implement reverse streams without some kind of backtracking. Again, this is my superficial understanding, which might be tainted by the implementation of my use case using the existing API.

If it can help with the design of the API, my specific use case involves both retrieving the "biggest key smaller than an id" and also the first key bigger than that id, preferably by querying the map only once. To explain the use case: the "biggest key smaller than an id" allows fast random access to the value associated with any id, while the first key bigger than an id improves sequential access (by caching the returned key/value, we only need to query the map again when we request an id bigger than the returned key). A way to achieve that would maybe be to return an iterator rather than just the key/value bigger than the id. That way, for continued sequential access, we could use only the iterator.

With this change, a naive implementation (that walks the map twice) for this function could be:

```rust
pub fn get_from_before(&self, key: u64) -> (Option<u64>, fst::map::Stream<'_>) {
    let before = self.get_before(key);
    // FIXME: might be gt, need to check the behavior of get_before when key is in map
    let stream = self.range.ge(key);
    (before, stream)
}
```
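To make the proposed shape concrete, here is a std-only sketch of the same idea over a `std::BTreeMap`, where double-ended ranges make both halves cheap. The function name and return shape follow the sketch above; everything else (the map type, the tuple of values) is an assumption for illustration, not fst API:

```rust
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Unbounded};

/// Floor entry plus a forward iterator over strictly larger keys.
/// BTreeMap stand-in for the `get_from_before` sketch above; the
/// fst-based version would return a range stream instead.
fn get_from_before<'a>(
    map: &'a BTreeMap<u64, u64>,
    key: u64,
) -> (Option<(u64, u64)>, impl Iterator<Item = (u64, u64)> + 'a) {
    // Greatest key <= `key` (cheap because BTreeMap ranges are
    // double-ended).
    let before = map.range(..=key).next_back().map(|(&k, &v)| (k, v));
    // Everything strictly after `key`, for continued sequential access.
    let after = map.range((Excluded(key), Unbounded)).map(|(&k, &v)| (k, v));
    (before, after)
}

fn main() {
    let map = BTreeMap::from([(0u64, 42u64), (10, 28), (1000, 0), (1001, 1)]);
    let (before, mut after) = get_from_before(&map, 500);
    assert_eq!(before, Some((10, 28)));        // value in effect at 500
    assert_eq!(after.next(), Some((1000, 0))); // next change point
}
```

Caching the first item of the returned iterator gives exactly the sequential-access optimization described above: no new query is needed until the requested id reaches that cached key.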
I have also found myself needing to implement this kind of search. It is not trivial!
Hello! Thank you for making fst!
I read your blog post and thought that fst might be suitable for my purpose: I'm observing some datum that changes value over discrete time intervals, and want to get the value associated with the datum at any point in time as efficiently as possible.
Currently I'm storing the successive values for that datum in an `fst::Map`, and as the value doesn't change on each unit of time interval, I'm storing the value (with the time interval as key) only when it changes. For example, possible pairs of (time, value) could be:

(0, 42), (10, 28), (1000, 0), (1001, 1), ... meaning the value was initially 42 at the origin of time, then changed to 28 at time interval 10, then stayed at 28 until it reached time interval 1000 where it was reset to 0, ...

The time unit can be expressed as a `u64`, stored in its big-endian `[u8]` representation (for insertions in lexicographical order). The value is a simple `u64`.

Everything is well: I can store good amounts of data in a map. Now, I would like to efficiently recover the value associated with a point in time. The small difficulty is that the point in time may or may not actually be a key in the map, since the map only contains keys where a change of value occurred. Whenever a point in time that is not in the map is requested, the value associated with the biggest key that is smaller than the requested point in time should be returned.
Here are the solutions I tried and why I didn't choose them:

- `fst::Map::range` with `.le`: unfortunately, the returned stream can only be iterated in lexicographic order, which means the full stream must be consumed to get the greatest key that is less than or equal to the requested time point. As I may store billions of time points in the map, recovering the late time points would iterate over almost the entire map, which would take too much time.

If using a `std::BTreeMap`, the `range` function actually returns a type implementing `DoubleEndedIterator`, which allows efficiently recovering the biggest smaller key.

Am I missing something, or is there a missing piece in the Map API to cover that use case? Or maybe I'm mistaken in my understanding of the data structure, and that particular use case cannot be served well by fsts?
Thank you for your time reading this!
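For concreteness, the `BTreeMap` behaviour referenced above, applied to the example data from this issue (a std-only sketch of the desired floor lookup, not fst code):

```rust
use std::collections::BTreeMap;

/// Value in effect at time `t`: the value of the greatest key <= `t`.
fn value_at(map: &BTreeMap<u64, u64>, t: u64) -> Option<u64> {
    // `range(..=t)` yields a DoubleEndedIterator, so `next_back()`
    // returns the greatest key <= t without scanning the whole range.
    map.range(..=t).next_back().map(|(_, &v)| v)
}

fn main() {
    // The (time, value) change points from the example above.
    let map = BTreeMap::from([(0u64, 42u64), (10, 28), (1000, 0), (1001, 1)]);
    assert_eq!(value_at(&map, 5), Some(42));
    assert_eq!(value_at(&map, 500), Some(28));
    assert_eq!(value_at(&map, 1000), Some(0));
}
```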