Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

pkgfile -b is slow #62

Open
wpyoga opened this issue Sep 13, 2022 · 1 comment
Open

pkgfile -b is slow #62

wpyoga opened this issue Sep 13, 2022 · 1 comment

Comments

@wpyoga
Copy link

wpyoga commented Sep 13, 2022

A single invocation of pkgfile -b takes around 230 ms on my laptop (Ryzen 5 4650U, 40GB DDR4, 512GB NVMe SSD).

After playing around with different binary names to search, with binary names at the start of the data files, and at the end of the data files, the invocation time doesn't differ by much. Therefore I'm guessing that most of the time is spent on loading the cache files.

I would like to propose a faster storage format, maybe using a hash index, to bring down the search time.

$ time pkgfile -b abiword
extra/abiword

real	0m0.237s
user	0m0.335s
sys	0m0.036s
$ time for i in {1..10}; do pkgfile -b abiword >/dev/null; done
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword
extra/abiword

real	0m2.280s
user	0m3.185s
sys	0m0.294s

This is my installation:

NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
LOGO=archlinux-logo

These are the pkgfile cache files:

$ ls -lh
total 361M
-rw-r--r-- 1 root root 258M Sep 13 19:45 community.files
-rw-r--r-- 1 root root 8.9M Sep 13 19:45 core.files
-rw-r--r-- 1 root root  95M Sep 13 19:45 extra.files
@falconindy
Copy link
Owner

I should do some proper profiling here, but what you're asking for only benefits a small fraction of potential queries. Yes, a hash table could benefit queries like 'pacman' or '/usr/bin/pacman', but you're back to a full scan for anything involving globbing or regex. It does, admittedly, also help with things like #27.

An embedded DB like SQLite might work here to provide a more flexible query language with indexing capabilities. That would be generally useful, but also presents a pretty substantial rewrite of the existing code. I guess that's sort of true of any storage format change...

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

No branches or pull requests

2 participants