RFC: use `git ls-files` to crawl all files in the project #367
Conversation
This PR makes use of `git ls-files` to speed up the startup of the fuzzy finder for large repos, in the cases where it can produce results similar to the previous crawler's. This basically happens when the option to follow symlinks is disabled (since `git ls-files` does not follow them).
If I understand this correctly, it would also not improve things if the project is not being tracked by Git?
Exactly, but currently the biggest perf bottleneck in the fuzzy finder happens only on Git projects, due to the checks to see whether a file is ignored by Git, which take more than 80% of the time (more info here).
If we want to invest some significant time into making an even greater Fuzzy Finder, it would make more sense to address the architectural issues by creating some sort of global data structure that holds all the project files in memory and watches them for changes.
That "virtual filesystem" could probably be built on top of @atom/watcher, and could be used by other packages like the Tree View. It's important to note that this alternative is way more complex to implement and would take much longer to ship.
Interesting! @nathansobo and @as-cii started to take an approach similar to this over in xray. You're correct that it takes a pretty substantial amount of engineering to get "right". Filesystems are bags of holding for edge cases. That's a good argument for the benefits of centralizing it, but I'd guess it would take a solid few months to design and put in place.
I wonder if there's some intermediate space we could explore that might address some of the tradeoffs you've identified without dedicating that amount of effort. What if we built our own native Node module that implemented `.gitignore`-compatible path filtering, but traversed symlinks and into submodules?
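To give a flavor of what ".gitignore-compatible path filtering" involves, here is a deliberately simplified sketch in plain JavaScript (not the proposed native module): it handles `*`, `?`, and trailing-`/` directory patterns, but not negation (`!`), `**`, or git's anchoring rules, which is exactly why a real implementation is non-trivial:

```javascript
// Convert one simplified .gitignore pattern into a RegExp.
// '*' matches anything except '/', '?' matches one non-'/' char.
// Real .gitignore semantics (negation, '**', anchoring) are more involved.
function gitignorePatternToRegExp (pattern) {
  if (pattern.endsWith('/')) pattern = pattern.slice(0, -1)
  const escaped = pattern
    .replace(/[.+^${}()|[\]\\]/g, '\\$&')
    .replace(/\*/g, '[^/]*')
    .replace(/\?/g, '[^/]')
  // Match the pattern at any path-segment boundary.
  return new RegExp(`(^|/)${escaped}(/|$)`)
}

function isIgnored (path, patterns) {
  return patterns.some(pattern => gitignorePatternToRegExp(pattern).test(path))
}
```

For example, `isIgnored('build/output.log', ['*.log'])` is true, while `isIgnored('src/main.js', ['*.log'])` is false.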
The review comment below is attached to these lines from the diff:

```js
let output = ''
// TODO: do this via a call to GitRepository (needs to be implemented).
```
We have quite a bit of infrastructure within atom/github to interact with external git processes - with queueing, caching, and a sidecar worker process to fix a sporadic performance issue with `child_process` calls... see atom/github#386 and atom/github#688 for some background on that particular mess 😆. We also bundle a git executable, so we don't need to rely on a pre-existing user installation. Rather than calling `GitRepository`, we should publish a package service that allows other packages like this one to make git calls on demand. See atom/github#1089 for some thoughts on this - this might be a good motivating case 😄
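For context, Atom packages advertise services through `providedServices` in `package.json`; a git-exec service from atom/github might be declared along these lines (the service name, version, and method name here are hypothetical):

```json
{
  "providedServices": {
    "git-exec": {
      "description": "Run git commands through atom/github's git infrastructure",
      "versions": {
        "0.1.0": "provideGitExec"
      }
    }
  }
}
```

A consumer like fuzzy-finder would declare a matching `consumedServices` entry and receive the service object through the named callback when both packages are activated.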
Another possibility: one of the "holy grails" for find-and-replace has been to integrate with ripgrep as a backend. ripgrep is blazing fast and solves a lot of long-standing annoyances around find-and-replace. The downside is that using ripgrep as a backend would probably involve a near-total rewrite of find-and-replace. It's likely still less effort than designing a virtual filesystem model, though?
Thanks for the comments, @smashwilson!
Yup, completely agree. I would implement the virtual filesystem as an abstraction layer on the actual filesystem.
I'd try to avoid doing this: reimplementing a file tree traverser that takes care of […]
I really like this solution!
Super excited about this. I like the ripgrep approach because it would help both find-and-replace and fuzzy-finder. You can also still use it if the user turns off gitignore mode and wants to see all files (I forget what the setting is called; it's different in the two packages). VS Code also already has code for parsing ripgrep's output, I believe.
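To sketch how ripgrep could serve both modes: `rg --files` lists files while honoring `.gitignore`, and passing `--no-ignore` (plus `--hidden` to include dotfiles) disables that filtering. A hypothetical helper that maps the package setting to ripgrep arguments might look like this:

```javascript
// Build ripgrep arguments for file crawling. `rg --files` lists files
// while honoring .gitignore; when the user disables ignore handling we
// add --no-ignore (and --hidden so dotfiles are included too).
function ripgrepCrawlArgs ({ respectIgnoreFiles = true } = {}) {
  const args = ['--files']
  if (!respectIgnoreFiles) args.push('--no-ignore', '--hidden')
  return args
}
```

The resulting array would then be handed to `child_process.spawn('rg', args, {cwd: projectPath})`.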
I've created #369 as a follow-up to this discussion, thanks everyone for the suggestions! ✨
I proposed this exact change 2 years ago in pull request #301. It was rejected because it didn't handle symlinks. I don't think this change does that either.
Ah, I just saw that in the tradeoffs. nm!
Summary
This is some code to demonstrate the usage of `git ls-files` in the fuzzy finder, so we can discuss whether it makes sense to implement this solution.
In terms of performance, this PR provides huge benefits for large repositories: for example, it makes the crawling 5x faster when opening the `gecko-dev` repository (it goes from 57s in the current implementation after merging #366 to 11s).
Tradeoffs
Using `git ls-files` has two important tradeoffs:
- It only outputs the same results as the current crawler when `traverseSymlinkDirectories` is `false` (which unfortunately would not happen often, since it's enabled by default).
- It does not traverse `git submodules`. This means that on a repo that has submodules, the files inside these won't be accessible by the fuzzy finder.

The second tradeoff is quite a big deal, so if we ever want to enable this crawler we would want to have it under some kind of feature flag that could be enabled in the settings (e.g. "Enable fast mode in fuzzy finder"), showing some messaging around the tradeoffs of this mode.
Additionally, in order to make this mode more discoverable, we could show some kind of prompt in the Atom UI whenever somebody opens a large project: if on the first crawl we detect that the repo has more than e.g. 20K files, we show a message suggesting enabling the fast mode.
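The discoverability check described above is simple enough to sketch directly (the function name and the exact threshold handling are hypothetical; 20K comes from the proposal and would presumably be tunable):

```javascript
// Hypothetical check to run after the first (slow) crawl finishes:
// suggest the git-ls-files-based fast mode for large projects.
const FAST_MODE_SUGGESTION_THRESHOLD = 20000

function shouldSuggestFastMode (crawledFileCount, fastModeEnabled) {
  return !fastModeEnabled && crawledFileCount > FAST_MODE_SUGGESTION_THRESHOLD
}
```

A `true` result would trigger a notification explaining the fast mode and its tradeoffs rather than silently flipping the setting.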
Alternative solutions
This is a very ad-hoc solution for a very specific problem: it does not fix the current architectural issues in the Fuzzy Finder around file watching, recrawling, etc.
If we want to invest some significant time into making an even greater Fuzzy Finder, it would make more sense to address the architectural issues by creating some sort of global data structure that holds all the project files in memory and watches them for changes.
That "virtual filesystem" could probably be built on top of @atom/watcher, and could be used by other packages like the Tree View. It's important to note that this alternative is way more complex to implement and would take much longer to ship.
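The core of such a "virtual filesystem" could be an in-memory path index kept fresh by watcher events. Assuming events shaped like @atom/watcher's (`{action, path, oldPath}`, with actions such as `created`, `modified`, `deleted`, `renamed`), a minimal sketch might be:

```javascript
// A minimal in-memory project file index, kept up to date by filesystem
// events shaped like @atom/watcher's ({action, path, oldPath}).
class ProjectFileIndex {
  constructor (initialPaths = []) {
    this.paths = new Set(initialPaths)
  }

  applyEvents (events) {
    for (const { action, path, oldPath } of events) {
      if (action === 'created') this.paths.add(path)
      else if (action === 'deleted') this.paths.delete(path)
      else if (action === 'renamed') {
        this.paths.delete(oldPath)
        this.paths.add(path)
      }
      // 'modified' events don't change the set of files.
    }
  }

  allPaths () {
    return [...this.paths].sort()
  }
}
```

Packages like fuzzy-finder or the Tree View would query `allPaths()` instead of re-crawling the disk, which is what makes the upfront complexity potentially worthwhile.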
Next potential steps
This is just WIP code to get some signal about whether we want to invest more in this path. If we want to move forward, there are a few things (some of them not trivial) that we'd need to do to be able to ship this:
- Use `git-utils` and `libgit2` to avoid spawning a `git` process.