
RFC: use git ls to crawl all files in the project #367

Closed
wants to merge 1 commit

Conversation

rafeca
Contributor

@rafeca rafeca commented Feb 27, 2019

Summary

This is some code to demonstrate the usage of `git ls-files` in the fuzzy finder, so we can discuss whether it makes sense to implement this solution.

In terms of performance, this PR provides huge benefits for large repositories: for example it makes the crawling 5X faster when opening the gecko-dev repository (it goes from 57s in the current implementation after merging #366 to 11s).
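For illustration, here is a minimal sketch of this kind of crawler (the function name and buffer size are mine, not the actual PR code): `git ls-files` is spawned once with flags that return both tracked and untracked-but-not-ignored files, and the NUL-delimited output is split into absolute paths.

```js
// Minimal sketch, not the actual PR code: list tracked files plus
// untracked-but-not-ignored files with a single git invocation.
const {execFile} = require('child_process')
const path = require('path')

function crawlWithGit (rootPath) {
  return new Promise((resolve, reject) => {
    execFile(
      'git',
      // -z makes the output NUL-delimited, which is safe for unusual paths.
      ['ls-files', '--cached', '--others', '--exclude-standard', '-z'],
      {cwd: rootPath, maxBuffer: 100 * 1024 * 1024},
      (error, stdout) => {
        if (error) return reject(error)
        resolve(
          stdout.split('\0').filter(Boolean)
            .map(relativePath => path.join(rootPath, relativePath))
        )
      }
    )
  })
}
```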

Tradeoffs

Using `git ls-files` has two important tradeoffs:

  1. It does not support symlinks (as stated in Use git ls-files for loading paths when available. #301). To work around this I've decided to only enable this codepath when traverseSymlinkDirectories is false (which unfortunately would not happen often, since it's enabled by default); see the gating sketch after this list.
  2. It does not support git submodules. This means that on a repo that has submodules, the files inside them won't be accessible from the fuzzy finder.
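Here is how that gating could look (helper names are hypothetical; the real check lives in the PR's crawler code):

```js
// Sketch of the gating described in tradeoff 1 (helper names are
// hypothetical): use the git codepath only when symlink traversal is
// off and the project root is actually a git repository.
async function getLoadPathsTask (rootPath, options) {
  const useGitCrawler =
    !options.traverseSymlinkDirectories &&  // git ls-files skips symlinked dirs
    (await isGitRepository(rootPath))       // non-git projects keep the old crawler
  return useGitCrawler
    ? crawlWithGit(rootPath)
    : crawlWithFilesystem(rootPath, options) // the existing recursive crawler
}
```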

The second tradeoff is quite a big deal, so if we ever want to enable this crawler we would want to put it behind some kind of feature flag that could be enabled in the settings (e.g. "Enable fast mode in fuzzy finder"), showing some messaging around the tradeoffs of this mode.

Additionally, in order to make this mode more discoverable, we could show some kind of prompt in the Atom UI whenever somebody opens a large project: if on the first crawl we detect that the repo has more than e.g. 20K files, we show a message suggesting enabling fast mode.
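A rough sketch of that prompt using Atom's notifications API (the 20K threshold, the setting key, and the wording are placeholders):

```js
// Rough sketch of the suggested prompt (the 20K threshold and the
// setting key are placeholders, not real config entries).
const LARGE_PROJECT_THRESHOLD = 20000

function maybeSuggestFastMode (fileCount) {
  if (fileCount <= LARGE_PROJECT_THRESHOLD) return
  if (atom.config.get('fuzzy-finder.useFastGitCrawler')) return
  atom.notifications.addInfo('This looks like a large project', {
    dismissable: true,
    description:
      'Enabling fast mode makes crawling much faster, but symlinks and ' +
      'files inside git submodules will not be listed.',
    buttons: [{
      text: 'Enable fast mode',
      onDidClick: () => atom.config.set('fuzzy-finder.useFastGitCrawler', true)
    }]
  })
}
```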

Alternative solutions

This is a very ad-hoc solution for a very specific problem: it does not fix the current architectural issues in the Fuzzy Finder around file watching, recrawling, etc.

If we want to invest some significant time into making an even greater Fuzzy Finder it would make more sense to address the architectural issues by creating some sort of global data structure that holds in memory all the project files and watches them for changes.

That "virtual filesystem" could probably be built on top of @atom/watcher, and could be used by other packages like the Tree View.

It's important to note that this alternative is way more complex to implement and would take much longer to ship.

Next potential steps

This is just WIP code to get some signal about whether we want to invest more in this path. If we want to move forward, there are a few things (some of them non-trivial) that we'd need to do to be able to ship this:

  • Implement the missing APIs on git-utils and libgit2 to avoid spawning a git process.
  • Add some kind of setting to enable this fast mode.
  • Make the setting discoverable by adding a prompt on large projects.

This PR makes use of git ls-files to speed up the startup of the fuzzy
finder for large repos when it can output similar results to the
previous crawler.

This basically happens when the option to follow symlinks is disabled
(since `git ls-files` does not follow them).
@rafeca rafeca changed the title WIP: use git ls to crawl all files in the project RFC: use git ls to crawl all files in the project Feb 27, 2019
@lee-dohm
Contributor

If I understand this correctly, it would also not improve things if the project is not being tracked by Git?

@rafeca
Contributor Author

rafeca commented Feb 28, 2019

> If I understand this correctly, it would also not improve things if the project is not being tracked by Git?

Exactly, but currently the biggest perf bottleneck in the fuzzy finder happens only on Git projects, due to the checks to see if a file is ignored by Git, which take more than 80% of the time (more info here).

Contributor

@smashwilson smashwilson left a comment


> If we want to invest some significant time into making an even greater Fuzzy Finder it would make more sense to address the architectural issues by creating some sort of global data structure that holds in memory all the project files and watches them for changes.
>
> That "virtual filesystem" could probably be built on top of @atom/watcher, and could be used by other packages like the Tree View.
>
> It's important to note that this alternative is way more complex to implement and would take much longer to ship.

Interesting! @nathansobo and @as-cii started to take an approach similar to this over in xray. You're correct that it takes a pretty substantial amount of engineering to get "right". Filesystems are bags of holding for edge cases. That's a good argument for the benefits of centralizing it, but I'd guess it would take a solid few months to design and put in place.

I wonder if there's some intermediate space we could explore that might address some of the tradeoffs you've identified without dedicating that amount of effort. What if we built our own native Node module that implemented .gitignore-compatible path filtering, but traversed symlinks and into submodules?
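As a sketch of the filtering half of that idea, the `ignore` npm package implements .gitignore matching in pure JavaScript (a native module, as suggested above, would be the faster equivalent; nested .gitignore files and global excludes are omitted here):

```js
// Sketch of .gitignore-compatible filtering using the `ignore` npm
// package (pure JS; a native module, as suggested above, would be the
// faster equivalent). Nested .gitignore files, the user's global
// excludes, and .git/info/exclude are omitted for brevity.
const fs = require('fs')
const path = require('path')
const ignore = require('ignore')

function loadIgnoreFilter (rootPath) {
  const ig = ignore()
  const gitignorePath = path.join(rootPath, '.gitignore')
  if (fs.existsSync(gitignorePath)) {
    ig.add(fs.readFileSync(gitignorePath, 'utf8'))
  }
  // `ignores` expects paths relative to where the rules were loaded from.
  return relativePath => !ig.ignores(relativePath)
}
```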


On this part of the diff:

```js
let output = ''

// TODO: do this via a call to GitRepository (needs to be implemented).
```

We have quite a bit of infrastructure within atom/github to interact with external git processes - with queueing, caching, and a sidecar worker process to fix a sporadic performance issue with child_process calls; see atom/github#386 and atom/github#688 for some background on that particular mess 😆. We also bundle a git executable, so we don't need to rely on a pre-existing user installation. Rather than calling GitRepository, we should publish a package service that allows other packages like this one to make git calls on demand. See atom/github#1089 for some thoughts on this - this might be a good motivating case 😄
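A hypothetical shape for such a service (the names are illustrative, not a published atom/github API):

```js
// Hypothetical sketch of a git-exec package service (names are
// illustrative, not a published atom/github API). The provider would
// declare in its package.json:
//
//   "providedServices": {
//     "git-exec": {"versions": {"0.1.0": "provideGitExec"}}
//   }
//
// and its main module would hand consumers a minimal interface backed
// by atom/github's queued, cached worker infrastructure:
module.exports = {
  provideGitExec () {
    return {
      exec (args, {cwd}) {
        // `workerStrategy` stands in for atom/github's internal git
        // execution machinery.
        return workerStrategy.exec(args, {cwd})
      }
    }
  }
}
```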

@smashwilson
Contributor

Another possibility: one of the "holy grails" for find-and-replace has been to integrate with ripgrep as a backend. ripgrep is blazing fast and solves a lot of long-standing annoyances around find-and-replace. The downside is that using ripgrep as a backend would probably involve a near-total rewrite of find-and-replace. It's likely still less effort than designing a virtual filesystem model, though?

@rafeca
Contributor Author

rafeca commented Feb 28, 2019

Thanks for the comments @smashwilson !

> Interesting! @nathansobo and @as-cii started to take an approach similar to this over in xray. You're correct that it takes a pretty substantial amount of engineering to get "right". Filesystems are bags of holding for edge cases. That's a good argument for the benefits of centralizing it, but I'd guess it would take a solid few months to design and put in place.

Yup, completely agree. I would implement the virtual filesystem as an abstraction layer on the actual fs, with just a subset of the available APIs. Still, designing something like that takes a substantial amount of time and the user benefits (apart from the architectural ones) are not clear.

> I wonder if there's some intermediate space we could explore that might address some of the tradeoffs you've identified without dedicating that amount of effort. What if we built our own native Node module that implemented .gitignore-compatible path filtering, but traversed symlinks and into submodules?

I'd try to avoid doing this: reimplementing a file-tree traverser that takes care of .gitignore rules would be quite a bit of effort as well (we would have to replicate git's behaviour around nested .gitignore files, user-defined rules, etc), and we won't really know what perf gains this will give us until it's done (we would probably have to put a lot of effort into optimizations to reach git's performance).

> Another possibility: one of the "holy grails" for find-and-replace has been to integrate with ripgrep as a backend. ripgrep is blazing fast and solves a lot of long-standing annoyances around find-and-replace. The downside is that using ripgrep as a backend would probably involve a near-total rewrite of find-and-replace. It's likely still less effort than designing a virtual filesystem model, though?

I really like this solution! ripgrep should give us a lot of performance out of the box, and it should not be hard to hack on a prototype to get actual improvement numbers (so we can then decide if it's worth it).
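A prototype could be as small as shelling out to `rg --files`, which lists the files ripgrep would search while honoring .gitignore rules; adding `--follow` makes it traverse symlinked directories, which `git ls-files` can't do (the function name and buffer size below are mine):

```js
// Quick prototype sketch: list candidate files with ripgrep. `rg --files`
// honors .gitignore rules, and `--follow` traverses symlinked
// directories, which `git ls-files` can't do.
const {execFile} = require('child_process')

function crawlWithRipgrep (rootPath) {
  return new Promise((resolve, reject) => {
    execFile(
      'rg',
      ['--files', '--follow'],
      {cwd: rootPath, maxBuffer: 100 * 1024 * 1024},
      (error, stdout) => {
        if (error) return reject(error)
        resolve(stdout.split('\n').filter(Boolean))
      }
    )
  })
}
```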

@rafeca
Contributor Author

rafeca commented Feb 28, 2019

fwiw, `git ls-files` has an option to traverse submodules (`--recurse-submodules`), but it's currently incompatible with the option that returns untracked files (`--others`).

@maxbrunsfeld
Contributor

maxbrunsfeld commented Mar 1, 2019

Super excited about this. I like the ripgrep approach because it would help both find-and-replace and fuzzy-finder. You can also still use it if the user turns off gitignore mode and wants to see all files (I forget what the setting is called; it's different in the two packages). VS Code also already has code for parsing ripgrep's output, I believe.

@rafeca
Contributor Author

rafeca commented Mar 12, 2019

I've created #369 as a follow-up of this discussion, thanks everyone for the suggestions! ✨

@rafeca rafeca closed this Mar 12, 2019
@eeejay

eeejay commented Mar 14, 2019

I proposed this exact change 2 years ago in pull request #301. It was rejected because it didn't handle symlinks. I don't think this change does that either.

@eeejay

eeejay commented Mar 14, 2019

Ah, I just saw that in the tradeoffs. nm!

@rafeca rafeca added the FY2019Q5 atom perf More information: https://github.com/github/pe-atom-log/issues/728 label Mar 19, 2019
@rafeca rafeca deleted the wip-use-git-ls branch March 19, 2019 11:28