Add option to use `ripgrep` for crawling the list of files #369

Conversation
Small update: I just realized that when using […] this is quite slow (especially for large repos), so it gives us even greater performance improvements (we can reduce the times on large repos by another 50% 🚀). I'm gonna update the results table with the new data.
I consider this a huge improvement. 😍
Wow! 🏎💨
Thanks for listing these out. This seems promising and very much worth pursuing. ⚡
Especially when using `ripgrep`, we don't want to spawn a lot of processes in parallel if the user has many projects open.
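The idea above can be sketched with a small concurrency limiter. This is illustrative only (not the PR's actual code, and `createLimiter`/`crawlWithRipgrep` are hypothetical names); it just shows how crawls for many open projects could be queued so that only a couple of `rg` processes run at once:

```javascript
// Hypothetical sketch: cap the number of crawler processes spawned at once.
function createLimiter(maxConcurrent) {
  let active = 0;
  const queue = [];

  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    const {task, resolve, reject} = queue.shift();
    // Promise.resolve().then(task) also catches synchronous throws
    Promise.resolve().then(task).then(resolve, reject).finally(() => {
      active--;
      next();
    });
  };

  // Returns a promise for the task's result; the task starts only when a slot frees up
  return function run(task) {
    return new Promise((resolve, reject) => {
      queue.push({task, resolve, reject});
      next();
    });
  };
}

// e.g. crawl every open project, but never more than two rg processes at a time:
// const limit = createLimiter(2);
// projectPaths.map(p => limit(() => crawlWithRipgrep(p)));
```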
The modified test was relying on the order in which two crawlers running in parallel start returning results. This has probably caused flakiness and unintended failures. To fix it, I'm forcing the crawling system to run sequentially by faking that the system has a single CPU.
Since the ripgrep binary has to be unpacked from the asar archive (we need to remember to add it to https://github.com/atom/atom/blob/master/script/lib/package-application.js#L101-L111 once we incorporate ripgrep into Atom), we need to patch the path that gets resolved by Electron's built-in require system.
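The patch boils down to pointing the spawn call at Electron's `app.asar.unpacked` directory instead of the archive itself. A sketch under that assumption (`patchAsarPath` and the exact regexp are illustrative, not necessarily what the PR does):

```javascript
// Binaries inside an asar archive can't be executed directly; Electron
// unpacks them next to the archive in `app.asar.unpacked`, so rewrite the
// resolved path to point there before spawning ripgrep.
function patchAsarPath(resolvedPath) {
  return resolvedPath.replace(/\bapp\.asar\b/, 'app.asar.unpacked');
}

// e.g.
// patchAsarPath('/Applications/Atom.app/.../app.asar/node_modules/vscode-ripgrep/bin/rg')
// returns the same path with `app.asar` swapped for `app.asar.unpacked`
```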
The testing so far seems to be going well:
A couple of drawbacks that I've discovered today:
@jasonrudolph any thoughts about these drawbacks?
Nice work, @rafeca!
This seems like a worthwhile tradeoff to me. We're trading a relatively tiny bit of additional disk space and bandwidth, and in return we spend less time waiting for fuzzy-finder results. 👍
We can generate a scopeless API token for our build account and use that token for atom/fuzzy-finder builds and atom/atom builds. Let's coordinate via Slack to make this happen.
Awesome! I agree it's a worthwhile tradeoff 😃 I've been able to verify that Windows builds are packaged correctly with […]
Oh, there's something else: this feature is behind a config param which is off by default. Since it's a fuzzy-finder config param, the setting is quite hidden and really hard to discover (you have to go to Settings -> Packages -> search for fuzzy-finder and click on its settings). This is how the setting looks: [screenshot] I have mixed feelings about this:
A couple of things that we could do:
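For reference, an Atom package setting like the one described above is declared via a config schema. A minimal sketch of how it might look in the fuzzy-finder's `package.json` (the `useRipGrep` key, default, and description are assumptions for illustration, not necessarily the PR's actual setting name):

```json
{
  "configSchema": {
    "useRipGrep": {
      "type": "boolean",
      "default": false,
      "description": "Use the ripgrep binary to crawl the list of project files (experimental)."
    }
  }
}
```

With a schema like this, the setting shows up in the package's settings pane as a checkbox, which matches the discoverability concern above.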
Another time we implemented an experimental setting was when we started using the new scoring method. Whenever people would open issues noting odd fuzzy-finding behavior, we would ask them to try turning on the setting and report back. I think we can do the same thing here: keep the setting off by default, but advertise that it exists and will become the default in the near future (as we did for tree-sitter).

Looking back, what I think we could have done better is set a timeline for when to enable the new scoring method by default. As you can see, it's been three or four years, and I'm still not sure if it's enabled by default yet on all the packages that have it as an option. That was something we handled much better with tree-sitter. I would opt towards making it off by default, advertising it when we can, and determining a target version to have it on by default. If by that target version it seems like it's still not ready for prime time, then we can re-evaluate.
@50Wliu that sounds good to me. If I understand our release cadence correctly, if this PR lands now it'll be included in v1.37.0, which will be released around May. We could aim to enable it by default in v1.38.0 or v1.39.0 depending on the feedback/issues reported by users (this can be decided later, though).
I love this idea! It doesn't need to be part of this PR, but I think it's worth trying in a follow-up PR. I'd be happy to help. 😄 For what it's worth, I think it will be fairly straightforward to implement using the notification API. |
Fantastic work, @rafeca.
I noted one super minor phrasing suggestion below.
I can't wait to start using this. 👏⚡
Awesome! We can implement it as a follow-up 😄
Co-Authored-By: rafeca <rafeca@gmail.com>
I read the release notes, followed the path to this PR, followed your guide on where to find the new setting, and hit the checkbox. I think there will be more people going down this route :)
Summary
This PR contains some minimal changes to demonstrate the usage of `ripgrep` in the fuzzy finder, so we can discuss the tradeoffs and the steps needed to be able to ship this.

This work comes as a follow-up to #367, after the suggestions of @smashwilson.
Benefits
Using `ripgrep` drastically speeds up the time to crawl medium and large repositories; we're talking about up to 14x faster crawl times. (The measurements were done the same way as in #366.)
Possible Drawbacks
The crawling behaviour with `ripgrep` is slightly different than the one currently implemented. I don't think that the changes are important enough (or even bad), but it's important to list them:

1. `ripgrep` also returns symlink destination files (e.g. if there's a symlink `./foo.js` which points to `./bar.js`, with `ripgrep` `foo.js` can also be opened). I think this is an improvement.
2. `ripgrep` returns all results alphabetically ordered.
3. Currently, `.gitignore` files from the sub-folders are ignored; with `ripgrep` they are taken into account. I think that this is an improvement.

Regarding 2., this change is quite noticeable, since the fuzzy finder currently displays the first 10 files returned by the crawler when it gets opened. This means that the first shown files will be slightly different than before (IMO we should change the default list of files that are shown when opening the fuzzy finder to show e.g. recently opened files or currently open files).
Alternate Designs
Other potential designs were presented here, and it seems like `ripgrep` is the best tool to use at this moment.

Regarding the current implementation, I've decided to use `vscode-ripgrep`, which is just a package that handles the download of the `ripgrep` binaries for several platforms. This has saved me a ton of time, but I'm not familiar enough with the process of installing bundled extensions in Atom to know if this package is suitable for the job.

Next steps
- […] `ripgrep` crawling.
- […] `ripgrep` works well on different scenarios.
- […] the `ripgrep` binary can be used correctly from a production build (e.g. does the asar packaging cause any issue?).

Applicable Issues
#271