drive_find() does not always return exactly the same files #288

Closed
jennybc opened this issue Jan 15, 2020 · 13 comments

@jennybc
Member

jennybc commented Jan 15, 2020

This is the more difficult inverse of the problem addressed by e56b3f5. In that commit, I check for and eliminate any results (file IDs) that appear in more than one page. But I fear the opposite can also happen: there can be files that appear in no page.

This is basically a re-opening of #277.

I have seen this myself now. It's hard to make a clean reprex, as the results are stochastic.

Summary: I can do drive_find() repeatedly and get a different number of files. Most recently, with my personal Google identity, I saw 972, 988, 988, and 988 files. (Note to self: In 3 out of 4 requests, I saw higher numbers reported in "Items so far", so the filter for duplicates is actually eliminating replicated IDs.)

The "missing" 988 - 972 = 16 files are a subset of the 22 files in one specific folder.

I will attempt to create a clean example of this using curl alone (no R) and, if successful, report/ask upstream.

@jennybc
Member Author

jennybc commented Jan 15, 2020

If the exhaustiveness of paginated results is not 100% guaranteed (I believe this may be true), here is one user behaviour that will at least help:

Use the most precise method possible for specifying a target file or for finding files.

If targeting 1 file, this means specifying the file by ID (or anything that contains the ID), as opposed to by name.

If using drive_find() to find several files, use the q clause to trigger server-side filtering and make that clause as specific as possible. Minimize use of the pattern argument, which filters local results after a mass retrieval.

These measures dramatically reduce the reliance on combining a large number of paged results.
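In googledrive terms, that hierarchy looks something like this (the ID and the q string below are placeholders, not a definitive recipe):

```r
library(googledrive)

# most precise: target one file directly by its ID
f <- drive_get(id = "FILE_ID_GOES_HERE")  # placeholder ID

# next best: let the server do the filtering via a q clause
csvs <- drive_find(q = "mimeType = 'text/csv' and name contains 'report'")

# least reliable: retrieve everything, then filter locally by name
csvs <- drive_find(pattern = "report")
```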

@tpbarrette

tpbarrette commented Jan 27, 2020

I came across this issue while attempting to use Google Drive as a collaborative folder for a project I am working on. I am providing some details about my experience with the hopes that they may help someone with more coding expertise than I possess to address the root cause of the issue.

I am using drive_find() to create a list of files whose names fit a criterion, via the pattern argument. Based on the above comment, I realize this is not ideal, but I have not had much luck with the q clause, since the part of the file name I am interested in is at the end of the file name. Using this approach to search through files that have been shared with me, I get a tibble of files that varies in length nearly every time I try it. However, when I try this approach on files where I am the owner, I get consistent results.

The biggest caveat to my results so far is that there were many more files to sift through in the shared-file search than in the owned-file search. I will be examining this issue further.

@jennybc
Member Author

jennybc commented Jan 27, 2020

@tpbarrette You can use pattern and a q clause at the same time. So do as much as you can with the q clause (narrowing the search as much as possible), and then use pattern to finish the specification, e.g. to require a certain pattern at the end of the filename.
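For example, a sketch (here sharedWithMe narrows the server-side search to files shared with you, and pattern anchors the match to the end of the name):

```r
files <- drive_find(
  q = "sharedWithMe",    # server side: narrow the corpus first
  pattern = "\\.att$"    # local: keep only names ending in ".att"
)
```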

@tpbarrette

tpbarrette commented Jan 27, 2020

@jennybc Thanks for the tip. Unfortunately, the combination of pattern and q still doesn't seem to provide consistent results. I also tried a fullText q clause, but the file extension (.att) and drive_find() do not seem to work well together for this particular use (Client Error: (403) Forbidden). As a little extra information: once I download these .att files from Google Drive, I am able to use read.csv() to bring them into R and manipulate them.

In regards to the latter part of my previous comment, have you noticed any difference in the drive_find() results between owned files and files that have been shared with you?

@jennybc
Member Author

jennybc commented Jan 28, 2020

have you noticed any difference in the drive_find() results between owned files and files that have been shared with you?

I haven't studied that directly. But it's true that the files that went missing for me in drive_find() are shared with me (vs owned by me).

@bshor

bshor commented Feb 3, 2020

I used q to narrow files to Google Sheets, and then pattern to search among them. With ~500 files, it only took a few seconds to complete.
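e.g. something like this (the pattern is a placeholder; type = "spreadsheet" is how drive_find() narrows to Google Sheets server-side):

```r
sheets <- drive_find(type = "spreadsheet", pattern = "budget")
```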

@jennybc
Member Author

jennybc commented Feb 3, 2020

I used q to narrow files to Google Sheets

Just FYI, googlesheets4 has a wrapper that does exactly this:

https://googlesheets4.tidyverse.org/reference/sheets_find.html

@VectorPosse

VectorPosse commented Sep 2, 2020

I think my situation may be related to this issue. (If not, let me know and I'll open a separate issue.)

I have a directory of files that I search with drive_find using a pattern that is unique to all the files in this one directory. (They're all named using a specific convention.) In a single R session, I do not see the number varying as described above. However, across separate R sessions separated by a week (the schedule I'm using to grab the files), I can see that not all the files that qualify to be found are actually found.

Here's the weird part: if I go to Google Drive and open up a file that is missing from the drive_find results, close it again, and then run drive_find again, the file magically appears in the output.

I wish I could create some kind of reprex here. I can reliably get files to become "found" by opening them and closing them. But I can't predict which files will be found on the first go in any given week.

UPDATE: Wait, I can get the number to change. The reason? Because the same set of files is getting found, but some files are found more than once!

@jennybc
Member Author

jennybc commented Sep 2, 2020

I'm basically convinced that unfiltered file listing is always going to be vulnerable to the results not being exhaustive. I think this is just a property of the server, i.e. Google's back end. And yeah, it seems to have something to do with whether you own the file or it's shared with you, and how recently anyone has created/modified/viewed the file.

So if it's very, very important to get an exact set of files, you really need a workflow based on file IDs or on a precise q clause search. Note that you can list a directory with q = "'1234567' in parents", which is essentially what drive_ls() does.

@VectorPosse if you want to find the files in a specific folder, I advise that you use drive_ls() instead of a file name (pattern) based search.
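For example (FOLDER_ID below is a placeholder):

```r
# list a folder directly by its ID ...
files <- drive_ls(as_id("FOLDER_ID"))

# ... which is essentially the same as
files <- drive_find(q = "'FOLDER_ID' in parents")
```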

UPDATE: Wait, I can get the number to change. The reason? Because the same set of files is getting found, but some files are found more than once!

In the most recent version of googledrive (v1.0.1), we should be eliminating those duplicates before returning results, thanks to this commit:

e56b3f5

So maybe you need to update googledrive?

Otherwise, I would be very interested to see a case of duplicated file IDs in your result. Remember, it is possible to have multiple distinct files with the same name.

@VectorPosse

Thanks, @jennybc. I ended up sticking with drive_find() after all, since specifying q = "name contains '[stuff here]'" worked like a charm!

@jennybc
Member Author

jennybc commented May 26, 2021

I feel like we've ruminated on this as much as we can and there's really nothing to be done about it. The main advice is to make one's queries as specific as possible, in ways that will route through the q search parameter or request specific file IDs, as opposed to vague, unconstrained searches.

I'm also not hearing any complaints about this lately, so maybe the vexing behaviour has diminished or been eliminated on the Google side.

@aratikrish

FYI, this issue still persists when using drive_ls() with recursive = TRUE on a shared drive with multiple layers of subfolders and/or files that are not owned by me. I could not get it to give consistent results even by using a q clause.
The best way to access specific files seems to be to get the subfolder IDs via repeated drive_ls() calls (this gives consistent results as long as recursive = TRUE is not used), and then to use drive_ls() at the lowest folder level to get details of all the files contained within each. The name column of the resulting dribble can then be filtered with dplyr to get files whose names match a specific pattern.
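A sketch of that workflow (the ID, folder name, and pattern below are placeholders):

```r
library(googledrive)
library(dplyr)

# walk down one level at a time with non-recursive drive_ls() calls ...
level1 <- drive_ls(as_id("TOP_FOLDER_ID"))
level2 <- drive_ls(level1[level1$name == "subfolder_of_interest", ])

# ... then filter the names at the lowest level with dplyr
level2 %>% filter(grepl("pattern_of_interest", name))
```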

@eamcvey

eamcvey commented May 1, 2024

I have also experienced this with a large number of subfolders and the recursive option. My solution has been to run drive_ls multiple times and take the superset of the results. This is very slow, however -- just running drive_ls once in my case is extremely slow.
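i.e. something along these lines (three repeats is arbitrary; FOLDER_ID is a placeholder):

```r
library(googledrive)
library(dplyr)

# run the recursive listing several times and keep the union,
# de-duplicated by file ID
runs <- lapply(1:3, function(i) drive_ls(as_id("FOLDER_ID"), recursive = TRUE))
superset <- bind_rows(runs) %>% distinct(id, .keep_all = TRUE)
```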

I have not been able to speed things up by using q to specify queries, so I suspect it's the scope and recursiveness that make this a problem. The only thing that works is to eliminate the recursion. That is much, much faster, but it would require adding a lot of complexity to my code and would eliminate the ability to do something important in my particular use case (checking for true duplicates or duplicately named files that could be in other folders).

It's frustrating that something I'd expect to be really fast in a normal file system is so slow with Google Drive!
