Google Photos takeout sometimes missing media extension in supplemental-metadata.json file #674

Closed
jfly opened this issue Feb 4, 2025 · 8 comments

jfly commented Feb 4, 2025

For example, see:

$ unzip -l ../takeout-20250203T030726Z-001.zip | grep "Peanut Butter Balls"
  7622466  02-02-2025 19:09   Takeout/Google Photos/Photos from 2016/Peanut Butter Balls.jpg
      710  02-02-2025 19:09   Takeout/Google Photos/Photos from 2016/Peanut Butter Balls.supplemental-metadata.json

Note how Peanut Butter Balls.supplemental-metadata.json does not include the jpg extension, which confounds all our matchers.

I'm not sure if there's a good fix for this. Would it be too risky to also try removing the media extension to look for a match?
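One way to picture the fallback asked about above is to probe for two sidecar names per media file: one with the media extension kept, one with it removed. This is only a sketch. `candidateSidecarNames` and `sidecarSuffix` are hypothetical names for illustration, not immich-go's actual matcher API.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// sidecarSuffix is the suffix seen in the listing above.
const sidecarSuffix = ".supplemental-metadata.json"

// candidateSidecarNames is a hypothetical helper, not part of immich-go.
// It returns the JSON names to probe for a media file: first with the
// media extension kept, then with the media extension removed.
func candidateSidecarNames(mediaName string) []string {
	withExt := mediaName + sidecarSuffix
	withoutExt := strings.TrimSuffix(mediaName, path.Ext(mediaName)) + sidecarSuffix
	if withExt == withoutExt {
		// The media name had no extension, so there is only one candidate.
		return []string{withExt}
	}
	return []string{withExt, withoutExt}
}

func main() {
	for _, name := range candidateSidecarNames("Peanut Butter Balls.jpg") {
		fmt.Println(name)
	}
}
```

The risk is that the shorter candidate is less specific: two media files that differ only by extension would both map to the same extension-less JSON name, so this fallback would probably need to run only after the stricter match fails.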

Author

jfly commented Feb 4, 2025

More examples:

$ unzip -l ../takeout-20250203T030726Z-001.zip | grep 2234089303984509579
   820428  02-02-2025 19:10   Takeout/Google Photos/Photos from 2014/2234089303984509579.jpg
   826415  02-02-2025 19:10   Takeout/Google Photos/Photos from 2014/2234089303984509579-edited.jpg
      703  02-02-2025 19:10   Takeout/Google Photos/Photos from 2014/2234089303984509579.supplemental-metadata.json
$ unzip -l ../takeout-20250203T030726Z-001.zip | grep 1190642159672506179
  1371895  02-02-2025 19:11   Takeout/Google Photos/Photos from 2014/1190642159672506179.jpg
      656  02-02-2025 19:11   Takeout/Google Photos/Photos from 2014/1190642159672506179.supplemental-metadata.json
  1419260  02-02-2025 19:11   Takeout/Google Photos/Photos from 2014/1190642159672506179-edited.jpg

I appear to have 15 examples of this in my full Google Photos Takeout archive.

Author

jfly commented Feb 4, 2025

Edit: never mind, this approach doesn't work. It causes issues for files that shouldn't match because of indices (such as DSC_0103.JPG(1) in our tests).

For anyone watching, I worked around this locally by tweaking matchFastTrack to just keep trimming extensions until it either runs out of extensions to trim or finds a match:

func matchFastTrack(jsonName string, fileName string, _ filetypes.SupportedMedia) bool {
	// fast track: if the file name is the same as the JSON name
	jsonName = strings.TrimSuffix(jsonName, path.Ext(jsonName))
	return jsonName == fileName
}
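The snippet above trims only one extension, so the "keep trimming" loop described in the comment is not visible in it. A minimal sketch of what such a loop might look like follows. `matchByTrimming` is a hypothetical stand-in rather than the author's actual patch, and comparing against the media file's stem is an assumption made here so that the `Peanut Butter Balls` case can match at all.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchByTrimming is a hypothetical illustration, not immich-go code.
// It strips one extension at a time from the JSON name and reports whether
// the remainder ever equals the media file name or the media file's stem
// (the stem comparison is an assumption of this sketch).
func matchByTrimming(jsonName, fileName string) bool {
	fileStem := strings.TrimSuffix(fileName, path.Ext(fileName))
	for {
		ext := path.Ext(jsonName)
		if ext == "" {
			return false // ran out of extensions to trim
		}
		jsonName = strings.TrimSuffix(jsonName, ext)
		if jsonName == fileName || jsonName == fileStem {
			return true
		}
	}
}

func main() {
	// The case from this issue: the JSON name lacks the media extension.
	fmt.Println(matchByTrimming("Peanut Butter Balls.supplemental-metadata.json", "Peanut Butter Balls.jpg"))
	// An indexed name: path.Ext treats ".JPG(1)" as an extension, so the
	// loop also pairs this JSON with the un-indexed DSC_0103.JPG.
	fmt.Println(matchByTrimming("DSC_0103.JPG(1).json", "DSC_0103.JPG"))
}
```

Both calls report a match. The second one is presumably the kind of false positive mentioned in the edit note: the `(1)` index is swallowed as part of an "extension", so the loop is too permissive for indexed names.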

Owner

simulot commented Feb 4, 2025

Interesting

The takeout system already has difficulty handling long file names, which makes pairing a JSON file with its image tricky.
Now they add unnecessarily long suffixes, making the pairing harder...

The latest RC addresses this problem, but your sample is new to me, so I have to refactor the matcher routines.

Owner

simulot commented Feb 4, 2025

You can help by providing the list of the files as delivered in the takeout files.
https://github.com/simulot/immich-go/blob/main/docs/how-to-send-debug-data.md

For privacy, you can DM the list to my Discord account, @simulot.

Author

jfly commented Feb 5, 2025

I'm sorry, but I don't feel comfortable sharing the full list, even privately over Discord. I do want to help out, though. Are any of these alternatives acceptable?

  1. Would it be OK if I share the list, but with the album names redacted?
  2. Is there something in particular you're looking for besides the snippets I've included above? Happy to look for that and share what I can find.

Owner

simulot commented Feb 5, 2025

Sure, your feeling is perfectly valid.

What I need is a realistic set of photo and related JSON file names as delivered by the takeout procedure. I just need the file names, not their content. Apparently the length of the file's base name matters, not the path.

I have found some strange results with names having UTF-8 chars.

So far, I have encountered:

20161105_170829.jpg.supplemental-metadata.json
Screenshot_20231027_123303_Facebook.jpg.supple.json
Screenshot_20231027_123303_Facebook.jpg.supple(1).json

Your example shows another way to mangle long file names:

Peanut Butter Balls.supplemental-metadata.json
2234089303984509579.supplemental-metadata.json
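The name shapes listed above can be recognized with one pattern if we assume that `supple` is a truncated form of `supplemental-metadata` and that `(1)` is a duplicate index. Both are assumptions drawn from these samples only. `parseSidecar` is a hypothetical helper for illustration, not the matcher that was eventually committed.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fullSuffix is the untruncated suffix observed in the samples.
const fullSuffix = "supplemental-metadata"

// sidecarRE splits "<base>.<suffix>[(n)].json" into its parts.
var sidecarRE = regexp.MustCompile(`^(.+)\.([a-z-]+)(\(\d+\))?\.json$`)

// parseSidecar is a hypothetical helper, not immich-go code. It returns the
// part of the JSON name before the suffix and the optional "(n)" index.
// Any prefix of fullSuffix is accepted (an assumption), which is lenient:
// even a one-letter suffix such as ".s" would pass.
func parseSidecar(jsonName string) (base, index string, ok bool) {
	m := sidecarRE.FindStringSubmatch(jsonName)
	if m == nil || !strings.HasPrefix(fullSuffix, m[2]) {
		return "", "", false
	}
	return m[1], m[3], true
}

func main() {
	for _, name := range []string{
		"20161105_170829.jpg.supplemental-metadata.json",
		"Screenshot_20231027_123303_Facebook.jpg.supple.json",
		"Screenshot_20231027_123303_Facebook.jpg.supple(1).json",
		"Peanut Butter Balls.supplemental-metadata.json",
	} {
		base, index, ok := parseSidecar(name)
		fmt.Printf("%q -> base=%q index=%q ok=%v\n", name, base, index, ok)
	}
}
```

The returned base is either a full media name (`20161105_170829.jpg`) or a bare stem (`Peanut Butter Balls`), so a caller would still have to decide how to pair a stem with a media file. That pairing is the part this issue is about.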

The JSON file structure seems to be unchanged.

Thank you for your input.

Author

jfly commented Feb 5, 2025

Sorry, I'm not quite following. Would it be useful for me to send you a full list of files, but with the album names redacted? Or are you saying that's not useful because even if I preserve the number of characters in the album name, it might not be the same number of bytes in a UTF-8 encoding, and that's what you really need to know?

Owner

simulot commented Feb 5, 2025

The album name doesn't matter AFAIK. You can redact them.

It's better not to redact file names. You can just omit any file names that contain private information.

The more samples you give, the better my chances of finding a rule that works for your case without breaking the existing ones.

simulot added a commit that referenced this issue Feb 14, 2025
* test: add cases for Google Photos supplemental metadata files

* refactor: update matching logic for Google Photos supplemental metadata files

Fixes #698, #673, #674, #652

* refactor: fix linter message

* refactor: update linter directive for supplemental metadata matching

simulot closed this as completed Feb 15, 2025