
RFC: use git ls to crawl all files in the project #367

Closed
wants to merge 1 commit

Conversation

rafeca
Contributor

@rafeca rafeca commented Feb 27, 2019

Summary

This is some code to demonstrate the usage of `git ls-files` in the fuzzy finder, so we can discuss whether it makes sense to implement this solution.

In terms of performance, this PR provides huge benefits for large repositories: for example it makes the crawling 5X faster when opening the gecko-dev repository (it goes from 57s in the current implementation after merging #366 to 11s).
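For illustration, here is a minimal sketch of this kind of crawler (the function name and buffer size are mine, not the actual PR code): `git ls-files` is spawned once with flags that return both tracked and untracked-but-not-ignored files, and the NUL-delimited output is split into absolute paths.

```js
// Minimal sketch, not the actual PR code: list tracked files plus
// untracked-but-not-ignored files with a single git invocation.
const {execFile} = require('child_process')
const path = require('path')

function crawlWithGit (rootPath) {
  return new Promise((resolve, reject) => {
    execFile(
      'git',
      // -z makes the output NUL-delimited, which is safe for unusual paths.
      ['ls-files', '--cached', '--others', '--exclude-standard', '-z'],
      {cwd: rootPath, maxBuffer: 100 * 1024 * 1024},
      (error, stdout) => {
        if (error) return reject(error)
        resolve(
          stdout.split('\0').filter(Boolean)
            .map(relativePath => path.join(rootPath, relativePath))
        )
      }
    )
  })
}
```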

Tradeoffs

Using `git ls-files` has two important tradeoffs:

  1. It does not support symlinks (as stated in Use git ls-files for loading paths when available. #301). To work around this I've decided to only enable this codepath when traverseSymlinkDirectories is false (which unfortunately would not happen often, since it's enabled by default); see the gating sketch after this list.
  2. It does not support git submodules. This means that on a repo that has submodules, the files inside them won't be accessible from the fuzzy finder.
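Here is how that gating could look (helper names are hypothetical; the real check lives in the PR's crawler code):

```js
// Sketch of the gating described in tradeoff 1 (helper names are
// hypothetical): use the git codepath only when symlink traversal is
// off and the project root is actually a git repository.
async function getLoadPathsTask (rootPath, options) {
  const useGitCrawler =
    !options.traverseSymlinkDirectories &&  // git ls-files skips symlinked dirs
    (await isGitRepository(rootPath))       // non-git projects keep the old crawler
  return useGitCrawler
    ? crawlWithGit(rootPath)
    : crawlWithFilesystem(rootPath, options) // the existing recursive crawler
}
```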

The second tradeoff is quite a big deal, so if we ever want to enable this crawler we would want to put it behind some kind of feature flag that could be enabled in the settings (e.g. "Enable fast mode in fuzzy finder"), showing some messaging around the tradeoffs of this mode.

Additionally, in order to make this mode more discoverable, we could show some kind of prompt in the Atom UI whenever somebody opens a large project: if on the first crawl we detect that the repo has more than e.g. 20K files, we show a message suggesting enabling fast mode.
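A rough sketch of that prompt using Atom's notifications API (the 20K threshold, the setting key, and the wording are placeholders):

```js
// Rough sketch of the suggested prompt (the 20K threshold and the
// setting key are placeholders, not real config entries).
const LARGE_PROJECT_THRESHOLD = 20000

function maybeSuggestFastMode (fileCount) {
  if (fileCount <= LARGE_PROJECT_THRESHOLD) return
  if (atom.config.get('fuzzy-finder.useFastGitCrawler')) return
  atom.notifications.addInfo('This looks like a large project', {
    dismissable: true,
    description:
      'Enabling fast mode makes crawling much faster, but symlinks and ' +
      'files inside git submodules will not be listed.',
    buttons: [{
      text: 'Enable fast mode',
      onDidClick: () => atom.config.set('fuzzy-finder.useFastGitCrawler', true)
    }]
  })
}
```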

Alternative solutions

This is a very ad-hoc solution for a very specific problem: it does not fix the current architectural issues in the Fuzzy Finder around file watching, recrawling, etc.

If we want to invest some significant time into making an even greater Fuzzy Finder it would make more sense to address the architectural issues by creating some sort of global data structure that holds in memory all the project files and watches them for changes.

That "virtual filesystem" could probably be built on top of @atom/watcher, and could be used by other packages like the Tree View.

It's important to note that this alternative is way more complex to implement and would take much longer to ship.

Next potential steps

This is just WIP code to get some signal about whether we want to invest more in this path. If we want to move forward, there are a few things (some of them non-trivial) that we'd need to do to be able to ship this:

  • Implement the missing APIs on git-utils and libgit2 to avoid spawning a git process.
  • Add some kind of setting to enable this fast mode.
  • Make the setting discoverable by adding a prompt on large projects.

This PR makes use of git ls-files to speed up the startup of the fuzzy
finder for large repos when it can output similar results to the
previous crawler.

This basically happens when the option to follow symlinks is disabled
(since `git ls-files` does not follow them).
@rafeca rafeca changed the title WIP: use git ls to crawl all files in the project RFC: use git ls to crawl all files in the project Feb 27, 2019
@lee-dohm
Contributor

If I understand this correctly, it would also not improve things if the project is not being tracked by Git?

@rafeca
Contributor Author

rafeca commented Feb 28, 2019

> If I understand this correctly, it would also not improve things if the project is not being tracked by Git?

Exactly, but currently the biggest perf bottleneck in the fuzzy finder happens only on Git projects, due to the checks to see if a file is ignored by Git, which take more than 80% of the time (more info here).

Contributor

@smashwilson smashwilson left a comment


> If we want to invest some significant time into making an even greater Fuzzy Finder it would make more sense to address the architectural issues by creating some sort of global data structure that holds in memory all the project files and watches them for changes.
>
> That "virtual filesystem" could probably be built on top of @atom/watcher, and could be used by other packages like the Tree View.
>
> It's important to note that this alternative is way more complex to implement and would take much longer to ship.

Interesting! @nathansobo and @as-cii started to take an approach similar to this over in xray. You're correct that it takes a pretty substantial amount of engineering to get "right". Filesystems are bags of holding for edge cases. That's a good argument for the benefits of centralizing it, but I'd guess it would take a solid few months to design and put in place.

I wonder if there's some intermediate space we could explore that might address some of the tradeoffs you've identified without dedicating that amount of effort. What if we built our own native Node module that implemented .gitignore-compatible path filtering, but traversed symlinks and into submodules?
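As a sketch of the filtering half of that idea, the `ignore` npm package implements .gitignore matching in pure JavaScript (a native module, as suggested above, would be the faster equivalent; nested .gitignore files and global excludes are omitted here):

```js
// Sketch of .gitignore-compatible filtering using the `ignore` npm
// package (pure JS; a native module, as suggested above, would be the
// faster equivalent). Nested .gitignore files, the user's global
// excludes, and .git/info/exclude are omitted for brevity.
const fs = require('fs')
const path = require('path')
const ignore = require('ignore')

function loadIgnoreFilter (rootPath) {
  const ig = ignore()
  const gitignorePath = path.join(rootPath, '.gitignore')
  if (fs.existsSync(gitignorePath)) {
    ig.add(fs.readFileSync(gitignorePath, 'utf8'))
  }
  // `ignores` expects paths relative to where the rules were loaded from.
  return relativePath => !ig.ignores(relativePath)
}
```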


On this part of the diff:

```js
let output = ''

// TODO: do this via a call to GitRepository (needs to be implemented).
```

We have quite a bit of infrastructure within atom/github to interact with external git processes - with queueing, caching, and a sidecar worker process to fix a sporadic performance issue with child_process calls; see atom/github#386 and atom/github#688 for some background on that particular mess 😆. We also bundle a git executable, so we don't need to rely on a pre-existing user installation. Rather than calling GitRepository, we should publish a package service that allows other packages like this one to make git calls on demand. See atom/github#1089 for some thoughts on this - this might be a good motivating case 😄
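A hypothetical shape for such a service (the names are illustrative, not a published atom/github API):

```js
// Hypothetical sketch of a git-exec package service (names are
// illustrative, not a published atom/github API). The provider would
// declare in its package.json:
//
//   "providedServices": {
//     "git-exec": {"versions": {"0.1.0": "provideGitExec"}}
//   }
//
// and its main module would hand consumers a minimal interface backed
// by atom/github's queued, cached worker infrastructure:
module.exports = {
  provideGitExec () {
    return {
      exec (args, {cwd}) {
        // `workerStrategy` stands in for atom/github's internal git
        // execution machinery.
        return workerStrategy.exec(args, {cwd})
      }
    }
  }
}
```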

@smashwilson
Contributor

Another possibility: one of the "holy grails" for find-and-replace has been to integrate with ripgrep as a backend. ripgrep is blazing fast and solves a lot of long-standing annoyances around find-and-replace. The downside is that using ripgrep as a backend would probably involve a near-total rewrite of find-and-replace. It's likely still less effort than designing a virtual filesystem model, though?

@rafeca
Contributor Author

rafeca commented Feb 28, 2019

Thanks for the comments @smashwilson !

> Interesting! @nathansobo and @as-cii started to take an approach similar to this over in xray. You're correct that it takes a pretty substantial amount of engineering to get "right". Filesystems are bags of holding for edge cases. That's a good argument for the benefits of centralizing it, but I'd guess it would take a solid few months to design and put in place.

Yup, completely agree. I would implement the virtual filesystem as an abstraction layer on the actual fs, with just a subset of the available APIs. Still, designing something like that takes a substantial amount of time and the user benefits (apart from the architectural ones) are not clear.

> I wonder if there's some intermediate space we could explore that might address some of the tradeoffs you've identified without dedicating that amount of effort. What if we built our own native Node module that implemented .gitignore-compatible path filtering, but traversed symlinks and into submodules?

I'd try to avoid doing this: reimplementing a file-tree traverser that takes care of .gitignore rules would be quite a bit of effort as well (we would have to replicate git's behaviour around nested .gitignore files, user-defined rules, etc), and we won't really know what perf gains this will give us until it's done (we would probably have to put a lot of effort into optimizations to reach git's performance).

> Another possibility: one of the "holy grails" for find-and-replace has been to integrate with ripgrep as a backend. ripgrep is blazing fast and solves a lot of long-standing annoyances around find-and-replace. The downside is that using ripgrep as a backend would probably involve a near-total rewrite of find-and-replace. It's likely still less effort than designing a virtual filesystem model, though?

I really like this solution! ripgrep should give us a lot of performance out of the box, and it should not be hard to hack on a prototype to get actual improvement numbers (so we can then decide if it's worth it).
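A prototype could be as small as shelling out to `rg --files`, which lists the files ripgrep would search while honoring .gitignore rules; adding `--follow` makes it traverse symlinked directories, which `git ls-files` can't do (the function name and buffer size below are mine):

```js
// Quick prototype sketch: list candidate files with ripgrep. `rg --files`
// honors .gitignore rules, and `--follow` traverses symlinked
// directories, which `git ls-files` can't do.
const {execFile} = require('child_process')

function crawlWithRipgrep (rootPath) {
  return new Promise((resolve, reject) => {
    execFile(
      'rg',
      ['--files', '--follow'],
      {cwd: rootPath, maxBuffer: 100 * 1024 * 1024},
      (error, stdout) => {
        if (error) return reject(error)
        resolve(stdout.split('\n').filter(Boolean))
      }
    )
  })
}
```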

@rafeca
Contributor Author

rafeca commented Feb 28, 2019

fwiw, `git ls-files` has an option to traverse submodules (`--recurse-submodules`), but it's currently incompatible with the option that returns untracked files (`--others`).

@maxbrunsfeld
Contributor

maxbrunsfeld commented Mar 1, 2019

Super excited about this. I like the ripgrep approach because it would help both find-and-replace and fuzzy-finder. You can also still use it if the user turns off gitignore mode and wants to see all files (I forget what the setting is called; it's different in the two packages). VS Code also already has code for parsing ripgrep's output, I believe.

@rafeca
Contributor Author

rafeca commented Mar 12, 2019

I've created #369 as a follow-up of this discussion, thanks everyone for the suggestions! ✨

@rafeca rafeca closed this Mar 12, 2019
@eeejay

eeejay commented Mar 14, 2019

I proposed this exact change 2 years ago in pull request #301. It was rejected because it didn't handle symlinks. I don't think this change does that either.

@eeejay

eeejay commented Mar 14, 2019

Ah, I just saw that in the tradeoffs. nm!

@rafeca rafeca added the FY2019Q5 atom perf More information: https://github.com/github/pe-atom-log/issues/728 label Mar 19, 2019
@rafeca rafeca deleted the wip-use-git-ls branch March 19, 2019 11:28