drive_find() does not always return exactly the same files #288

Closed
jennybc opened this issue Jan 15, 2020 · 13 comments

@jennybc
Member

jennybc commented Jan 15, 2020

This is the more difficult inverse of the problem addressed by e56b3f5. In that commit, I check for and eliminate any results (file IDs) that appear in more than one page. But I fear the opposite can also happen: there can be files that appear in no page.

This is basically a re-opening of #277.

I have seen this myself now. It's hard to make a clean reprex, as the results are stochastic.

Summary: I can do drive_find() repeatedly and get a different number of files. Most recently, with my personal Google identity, I saw 972, 988, 988, and 988 files. (Note to self: In 3 out of 4 requests, I saw higher numbers reported in "Items so far", so the filter for duplicates is actually eliminating replicated IDs.)

The "missing" 988 - 972 = 16 files are a subset of the 22 files in one specific folder.

I will attempt to create a clean example of this using curl alone (no R) and, if successful, report/ask upstream.

@jennybc
Member Author

jennybc commented Jan 15, 2020

If the exhaustiveness of paginated results is not 100% guaranteed (I believe this may be true), here is one user behaviour that will at least help:

Use the most precise method possible for specifying a target file or for finding files.

If targeting 1 file, this means specifying the file by ID (or anything that contains the ID), as opposed to by name.

If using drive_find() to find several files, use the q clause to trigger server-side filtering and make that clause as specific as possible. Minimize use of the pattern argument, which filters local results after a mass retrieval.

These measures dramatically reduce the reliance on combining a large number of paged results.
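In googledrive terms, that hierarchy looks something like this (the ID and the q string below are placeholders, not a definitive recipe):

```r
library(googledrive)

# most precise: target one file directly by its ID
f <- drive_get(id = "FILE_ID_GOES_HERE")  # placeholder ID

# next best: let the server do the filtering via a q clause
csvs <- drive_find(q = "mimeType = 'text/csv' and name contains 'report'")

# least reliable: retrieve everything, then filter locally by name
csvs <- drive_find(pattern = "report")
```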

@tpbarrette

tpbarrette commented Jan 27, 2020

I came across this issue while attempting to use Google Drive as a collaborative folder for a project I am working on. I am providing some details about my experience with the hopes that they may help someone with more coding expertise than I possess to address the root cause of the issue.

I am using drive_find() to create a list of files whose names fit a criterion, via the pattern argument. Based on the above comment, I realize this is not ideal, but I have not had much luck with the q clause, since the part of the file name I am interested in is at the end of the file name. Using this approach to search through files that have been shared with me, I get a tibble of files that varies in length nearly every time I try it. However, when I try this approach on files where I am the owner, I get consistent results.

The biggest caveat to my results so far is that there were many more files to sift through in the shared-file search than in the owned-file search. I will be examining this issue further.

@jennybc
Member Author

jennybc commented Jan 27, 2020

@tpbarrette You can use pattern and a q clause at the same time. So do as much as you can with the q clause (narrowing the search as much as possible), and then use pattern to finish the specification, e.g. to require a certain pattern at the end of the filename.
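For example, a sketch (here sharedWithMe narrows the server-side search to files shared with you, and pattern anchors the match to the end of the name):

```r
files <- drive_find(
  q = "sharedWithMe",    # server side: narrow the corpus first
  pattern = "\\.att$"    # local: keep only names ending in ".att"
)
```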

@tpbarrette

tpbarrette commented Jan 27, 2020

@jennybc Thanks for the tip. Unfortunately, the combination of pattern and q still doesn't seem to provide consistent results. I also tried a fullText q clause, but the file extension (.att) and drive_find() do not seem to work well together for this particular use (Client Error: (403) Forbidden). As a little extra information: once I download these .att files from Google Drive, I am able to use read.csv() to bring them into R and manipulate them.

In regards to the latter part of my previous comment, have you noticed any difference in the drive_find() results between owned files and files that have been shared with you?

@jennybc
Member Author

jennybc commented Jan 28, 2020

have you noticed any difference in the drive_find() results between owned files and files that have been shared with you?

I haven't studied that directly. But it's true that the files that went missing for me in drive_find() are shared with me (vs owned by me).

@bshor

bshor commented Feb 3, 2020

I used q to narrow files to Google Sheets, and then pattern to search among them. With ~500 files, it only took a few seconds to complete.
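e.g. something like this (the pattern is a placeholder; type = "spreadsheet" is how drive_find() narrows to Google Sheets server-side):

```r
sheets <- drive_find(type = "spreadsheet", pattern = "budget")
```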

@jennybc
Member Author

jennybc commented Feb 3, 2020

I used q to narrow files to Google Sheets

Just FYI, googlesheets4 has a wrapper that does exactly this:

https://googlesheets4.tidyverse.org/reference/sheets_find.html

@VectorPosse

VectorPosse commented Sep 2, 2020

I think my situation may be related to this issue. (If not, let me know and I'll open a separate issue.)

I have a directory of files that I search with drive_find using a pattern that is unique to all the files in this one directory. (They're all named using a specific convention.) In a single R session, I do not see the number varying as described above. However, across separate R sessions separated by a week (the schedule I'm using to grab the files), I can see that not all the files that qualify to be found are actually found.

Here's the weird part: if I go to Google Drive and open up a file that is missing from the drive_find results, close it again, and then run drive_find again, the file magically appears in the output.

I wish I could create some kind of reprex here. I can reliably get files to become "found" by opening them and closing them. But I can't predict which files will be found on the first go in any given week.

UPDATE: Wait, I can get the number to change. The reason? Because the same set of files is getting found, but some files are found more than once!

@jennybc
Member Author

jennybc commented Sep 2, 2020

I'm basically convinced that unfiltered file listing is always going to be vulnerable to the results not being exhaustive. I think this is just a property of the server, i.e. Google's back end. And yeah, it seems to have something to do with whether you own the file or it's shared with you, and how recently anyone has created/modified/viewed the file.

So if it's very, very important to get an exact set of files, you really need a workflow based on file IDs or on a precise q clause search. Note that you can list a directory with q = "'1234567' in parents", which is essentially what drive_ls() does.

@VectorPosse if you want to find the files in a specific folder, I advise that you use drive_ls() instead of a file name (pattern) based search.
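For example (FOLDER_ID below is a placeholder):

```r
# list a folder directly by its ID ...
files <- drive_ls(as_id("FOLDER_ID"))

# ... which is essentially the same as
files <- drive_find(q = "'FOLDER_ID' in parents")
```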

UPDATE: Wait, I can get the number to change. The reason? Because the same set of files is getting found, but some files are found more than once!

In the most recent version of googledrive (v1.0.1), we should be eliminating those duplicates before returning results, thanks to this commit:

e56b3f5

So maybe you need to update googledrive?

Otherwise, I would be very interested to see a case of duplicated file IDs in your result. Remember, it is possible to have multiple distinct files with the same name.

@VectorPosse

Thanks, @jennybc. I ended up sticking with drive_find() after all, since specifying q = "name contains '[stuff here]'" worked like a charm!

@jennybc
Member Author

jennybc commented May 26, 2021

I feel like we've ruminated on this as much as we can and there's really nothing to be done about it. The main advice is to make one's queries as specific as possible, in ways that will route through the q search parameter or request specific file IDs, as opposed to vague, unconstrained searches.

I'm also not hearing any complaints about this lately, so maybe the vexing behaviour has diminished or been eliminated on the Google side.

@aratikrish

FYI, this issue still persists when using drive_ls() with recursive = TRUE on a shared drive with multiple layers of subfolders and/or files that are not owned by me. I could not get it to give consistent results even by using a q clause.
The best way to access specific files seems to be to get the subfolder IDs via repeated drive_ls() calls (this gives consistent results as long as recursive = TRUE is not used), and then to use drive_ls() at the lowest folder level to get details of all the files contained within each. The name column of the resulting dribble can then be filtered with dplyr to get files whose names match a specific pattern.
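A sketch of that workflow (the ID, folder name, and pattern below are placeholders):

```r
library(googledrive)
library(dplyr)

# walk down one level at a time with non-recursive drive_ls() calls ...
level1 <- drive_ls(as_id("TOP_FOLDER_ID"))
level2 <- drive_ls(level1[level1$name == "subfolder_of_interest", ])

# ... then filter the names at the lowest level with dplyr
level2 %>% filter(grepl("pattern_of_interest", name))
```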

@eamcvey

eamcvey commented May 1, 2024

I have also experienced this with a large number of subfolders and the recursive option. My solution has been to run drive_ls multiple times and take the superset of the results. This is very slow, however -- just running drive_ls once in my case is extremely slow.
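i.e. something along these lines (three repeats is arbitrary; FOLDER_ID is a placeholder):

```r
library(googledrive)
library(dplyr)

# run the recursive listing several times and keep the union,
# de-duplicated by file ID
runs <- lapply(1:3, function(i) drive_ls(as_id("FOLDER_ID"), recursive = TRUE))
superset <- bind_rows(runs) %>% distinct(id, .keep_all = TRUE)
```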

I have not been able to speed things up by using q to specify queries, so I suspect it's the scope and recursiveness that make this a problem. The only thing that works is to eliminate the recursion. That is much, much faster, but it would require adding a lot of complexity to my code and would eliminate the ability to do something important in my particular use case (checking for true duplicates or duplicately named files that could be in other folders).

It's frustrating that something I'd expect to be really fast in a normal file system is so slow with Google Drive!
