Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Question: efficiency when only a few keys are used #29

Open
sanmai-NL opened this issue Dec 19, 2017 · 6 comments
Open

Question: efficiency when only a few keys are used #29

sanmai-NL opened this issue Dec 19, 2017 · 6 comments

Comments

@sanmai-NL
Copy link

I suppose the full mapping will be included in the binary when only a few keys are used?

@abonander
Copy link
Owner

Can you elaborate? Because if you can statically prove that you'll only need certain keys, then this library will be kind of redundant won't it? You can just hardcode those in.

@sanmai-NL
Copy link
Author

sanmai-NL commented Dec 19, 2017

Suppose this crate would support 1 000 different filename extensions as keys. Of those, I need 50. My thinking is ... Manually defining 50 media types isn’t very maintainable. Doing that also wouldn’t allow comparing for equality with proper RFC 6838 media type strings, since their grammar is more complex than just a type and subtype.

Currently mime has evolved to one item per Mime value. Problem is of course, the limited number of media types defined in there and the question how many should even be added (whole IANA registry?). How does your current solution based on PHF maps compare in terms of efficiency to a match on a bunch of extensions that evaluates to mime Mime const items?

Update: manually declaring mappings of extensions to Mime values also isn’t very maintainable. So I’m speaking about the keys as well. I hope to reuse a standardized mapping like yours. I see, however, that this crate still parses the media type strings that the keys were mapped to at runtime. Doesn’t that defeat the point (or let’s say limit the usefulness) of the static mapping?

@abonander
Copy link
Owner

Are we talking storage or search time constraints? It kind of seems like you're asking about both at once.

PHF maps are guaranteed to be O(1) in the worst-case: they don't deal with collisions, the hash function is tuned so that each key maps directly to the expected index slot, and if it doesn't match then the key isn't in the map.

I don't know when match cases go from jump tables to binary searches in Rust/LLVM but that's entirely implementation-defined, so it could be O(1) with so many mappings and suddenly switch to O(n log n) (or some super-constant hybrid approach) because of heuristics in the optimizer. LLVM could statically eliminate unused cases but this library is primarily intended for arbitrary user input where removing parts of the mapping is counterproductive.

We could structure the mappings differently so there's a separate map for every top-level type, so two lookups per search but searching fewer elements overall. It'd be relatively trivial at that point to disable certain top-level types based on the intended application, but all that would reduce is binary size (since the O-factor on the top-level search is negligible).

@sanmai-NL
Copy link
Author

I’m discussing both space and time yeah. I suspect though that that the media type parsing time cost dwarfs the things you mention.

@abonander
Copy link
Owner

We could structure the mappings differently so there's a separate map for every top-level type, so two lookups per search but searching fewer elements overall.

I forgot that we're actually already doing this.

@abonander
Copy link
Owner

Yeah, when Rust someday gets better const-eval we could potentially run the Mime constructors at compile-time. With cache misses it's hard to say whether the search or the mime parsing will be slower. I guess I should profile it but I haven't heard any complaints thus far. I have benchmarks implemented but they're rather naive and I haven't run them in forever anyway.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants