Alternative to regular expression in -tagsfromfile argument (Google Takeout)

Started by Buju, August 21, 2020, 04:21:41 PM


Buju

Hello.  I have downloaded my wife's Google Photos data from Google Takeout.  There are 60542 images and videos, each with a corresponding json file which contains the real date metadata.

My goal is to put all these images and videos into 1 big folder with the correct creation date in their file names and date modified.  Preserving other dates would be ideal but aren't necessary.

I have this command line I've constructed in a Windows batch file with a lot of research and testing, which gets me most of the way toward my goal:

C:\tools\exiftool\exiftool -r -tagsfromfile "%%d%%F.json" "-GPSAltitude<GeoDataAltitude" "-GPSLatitude<GeoDataLatitude" "-GPSLatitudeRef<GeoDataLatitude" "-GPSLongitude<GeoDataLongitude" "-GPSLongitudeRef<GeoDataLongitude" "-Keywords<Tags" "-Subject<Tags" "-Caption-Abstract<Description" "-ImageDescription<Description" -d "%%s" "-DateTimeOriginal<PhotoTakenTimeTimestamp" "-AllDates<PhotoTakenTimeTimestamp" "-FileModifyDate<PhotoTakenTimeTimestamp" "-FileCreateDate<CreationTimeTimestamp" "-FileName<${phototakentimetimestamp;s/(\d{3})\*//;$_=$self->InverseDateTime($_);DateFmt(qq(%%Y-%%m-%%d %%H-%%M-%%S))}%%-c - %%f.%%le" -o "C:/media/image/mami/googlephotos/" "C:/media/image/mami/Google Photos/"

However, I seem to have hit a particularly nasty snag.  Many file names in the Google Takeout data set are in a format like IMG_3684(2).JPG or IMG_3684(2).MOV, and the corresponding json files for these are named like IMG_3684.JPG(2).json or IMG_3684.MOV(2).json.  This naming makes exiftool unable to find the json file for these files with my command.

I was thinking I could use a regular expression as an argument to -tagsfromfile, but Phil Harvey explicitly says that can't be done in https://exiftool.org/forum/index.php?topic=10967.msg58579#msg58579.

What alternative to a regular expression (regex, included here for thread searchability) as an argument to -tagsfromfile do I have for matching both IMG_3684.JPG(2).json for IMG_3684(2).JPG and IMG_3684.JPG.json for IMG_3684.JPG when looking for the corresponding json file?  I would prefer that the final solution not require editing the original data set in any way; I want it preserved exactly as extracted from the Google Takeout archives.  I would also prefer a single command line, not multiple commands with hardlink creation.

Thank you for any help you can provide.

Once this issue is resolved, I do have a couple others I need to tackle as well, but I'll avoid mentioning those for now to prevent confusion in this thread.

Phil Harvey

Is the file name always 8 characters before the number in brackets is added?

If so, you can do this:  -tagsfromfile "%%d%%8f.%%e%%.8f.json"

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).
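To illustrate what that name format expands to, here is a small Python sketch (my own illustration of the documented substring semantics, not something exiftool runs): %d is the directory, %8f takes the first 8 characters of the file name, %e is the extension, and %.8f is the file name with the first 8 characters skipped.

```python
from pathlib import Path

def json_sibling_8(image_path: str) -> str:
    """Emulate exiftool's "%d%8f.%e%.8f.json" for an 8-character base name:
    %d = directory, %8f = first 8 chars of the name, %e = extension,
    %.8f = the name with its first 8 chars skipped (e.g. "(2)")."""
    p = Path(image_path)
    name, ext = p.stem, p.suffix.lstrip(".")
    return str(p.parent / f"{name[:8]}.{ext}{name[8:]}.json")

print(json_sibling_8("IMG_3684(2).JPG"))  # IMG_3684.JPG(2).json
print(json_sibling_8("IMG_3684.JPG"))     # IMG_3684.JPG.json
```

Note that this only works when the base name is exactly 8 characters, which is why Phil asks about the name length first.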

Buju

Unfortunately, it's not.  There are many files that are like 923F775D-9A50-4430-A825-42C672379661.jpeg too.

Phil Harvey

OK then, is the number in the brackets always 1 digit?  If so, you can duplicate the -tagsfromfile and use "%%d%%F%%-.3f.json" for the second one.  But with two -tagsfromfile options, you will also have to duplicate all of the tags that you want to copy, so your command line will be twice as long.  Also, you will get one warning for each file with this method.

- Phil

Buju

Would that make IMG_3684(2).JPG match IMG_3684.JPG(2).json?  It looks to me like it would match IMG_3684(2).JPG(2).json, but I don't have json file names with that structure in my Google Takeout data set.

StarGeek

There's one very important fact you need to take note of, and that is that Google did not remove any metadata from the images.  If you didn't change any information on the Google Photos website, then you do not need to change anything.

Another thing is the time stamp that's in the json files is in UTC, so you need to adjust the time in order to be accurate.  Also, Google doesn't always get the right time zone when the file is uploaded.  For example, even if you're on EST, Google might think that the file was supposed to be PST.  I could never figure out what Google used to decide the time zone, though if it is uploaded with GPS data, it usually gets it correct.
* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
 
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).

Buju

Quote from: StarGeek on August 21, 2020, 11:43:12 PM
There's one very important fact you need to take note of, and that is that Google did not remove any metadata from the images.  If you didn't change any information on the Google Photos website, then you do not need to change anything.

Unfortunately, that's not true.  Google did indeed change the file creation and modification dates, as they are all set to 2020, even inside the archive files downloaded from Google Takeout, even though the photos are mostly from 2014 through 2020, with a scattered few from 2007 through 2013.  Every single one of them is set to 2020, so I must use the json data, where the correct dates are stored, to correct the image file dates.  I wish what you were saying was true though.  I was expecting it to be, but Google let me down hard there.

These dates seem to be the dates the files were uploaded to Google Photos, and my wife did indeed start using Google Photos for the first time this year.  Previously, she stored the photos on her phone and backed them up to her MacBook Pro periodically.  The reason I even have to do this is that Google Backup and Sync uploaded 150 GB of photos to her free account, which has only 15 GB of storage, disabling all her Google services until we deleted them (which we did; it was a UI nightmare, since the only effective way was the poorly-programmed web page).  We don't want to pay for a pointless subscription service when we have plenty of storage across multiple devices for easy backup redundancy.  Google Backup and Sync should have stopped uploading when it reached the limit, not uploaded endlessly and then demanded a ransom.  It's nuts.  We're definitely never using Google Photos again.

By the way, the reason I'm going through this Google thing instead of using the original backups is that Google Photos data is the most complete set of photos for us right now.  After my wife started using it, she deleted many photos from her phone, iCloud, and even her computer, because the expectation was that Google Photos would synchronize well without any issues.  Neither of us expected Google Backup And Sync to upload way over the storage limit automatically in the background.

Quote from: StarGeek on August 21, 2020, 11:43:12 PMAnother thing is the time stamp that's in the json files is in UTC, so you need to adjust the time in order to be accurate.  Also, Google doesn't always get the right time zone when the file is uploaded.  For example, even if you're on EST, Google might think that the file was supposed to be PST.  I could never figure out what Google used to decide the time zone, though if it is uploaded with GPS data, it usually gets it correct.

I'm Canadian and my wife is Japanese, and she spent almost a quarter of her time from 2014 through 2020 in Japan with her family, with the rest of her time in Canada, so the file timestamps' time zones could be all over the place.  I'm not concerned about fixing those, as my goal is only to use the metadata that's in the json files directly, whatever it is, to roughly sort the photos by date and relative time within each day.  From a cursory look, though, the timestamps in the json files do seem to be correct: there is a photo that was clearly taken at night in Japan, with the sun down, that shows Mar 8, 2014, 11:02:42 AM UTC in its PhotoTakenTimeFormatted field.  It wouldn't make sense for that to be local time in Japan when that photo was taken.

StarGeek

Quote from: Buju on August 21, 2020, 11:59:31 PM
Unfortunately, that's not true.  Google did indeed change the file creation and modification dates, as they are all set to 2020, even inside the archive files downloaded from Google Takeout, even though the photos are mostly from 2014 through 2020, with a scattered few from 2007 through 2013.  Every single one of them is set to 2020, so I must use the json data, where the correct dates are stored, to correct the image file dates.  I wish what you were saying was true though.  I was expecting it to be, but Google let me down hard there.

FileCreateDate and FileModifyDate are properties of the underlying file system, not embedded data. That data was lost when the file was uploaded. Those timestamps are relatively fragile to begin with. Check the embedded time stamps with
exiftool -G1 -a -s -Time:all /path/to/file/
The actual data embedded by the camera, such as DateTimeOriginal and CreateDate, is still there if it wasn't removed before the upload.  If you feel the need, you can copy that data to the file system timestamps with something like
exiftool "-FileCreateDate<DateTimeOriginal" "-FileModifyDate<DateTimeOriginal" /path/to/files/


Buju

That's great info!  Most of the files do indeed have "CreateDate" intact, and some have way more than that!

However, it still seems like there are issues with that.  I ran C:\tools\exiftool\exiftool -r -G1 -a -s -Time:all --ext json "C:/media/image/mami/Google Photos/" > datetimes.txt to get a text file with ALL the photos' datetimes and found some tricky ones that will still require reading the json because the metadata in the file was somehow removed.  For example:

======== C:/media/image/mami/Google Photos/2014-02-25/1385048_454911877955840_1644321018_n(1).jpg
[System]        FileModifyDate                  : 2020:07:18 12:54:00-07:00
[System]        FileAccessDate                  : 2020:08:19 14:36:07-07:00
[System]        FileCreateDate                  : 2020:08:19 14:36:07-07:00
[ICC-header]    ProfileDateTime                 : 1998:02:09 06:49:00


======== C:/media/image/mami/Google Photos/2014-05-20 #2/IMG_6278(1).jpg
[System]        FileModifyDate                  : 2020:07:18 12:55:56-07:00
[System]        FileAccessDate                  : 2020:08:19 14:26:00-07:00
[System]        FileCreateDate                  : 2020:08:19 14:26:00-07:00
[ICC-header]    ProfileDateTime                 : 1998:02:09 06:49:00


======== C:/media/image/mami/Google Photos/2014-05-24 #2/IMG_6322(1).JPG
[System]        FileModifyDate                  : 2020:07:14 06:23:06-07:00
[System]        FileAccessDate                  : 2020:08:19 14:51:25-07:00
[System]        FileCreateDate                  : 2020:08:19 14:51:25-07:00


======== C:/media/image/mami/Google Photos/2014-05-24 #3/IMG_6320.JPG
[System]        FileModifyDate                  : 2020:07:14 06:23:06-07:00
[System]        FileAccessDate                  : 2020:08:19 14:52:09-07:00
[System]        FileCreateDate                  : 2020:08:19 14:52:09-07:00


This is just a small sample out of the >60000 photos.  With that many photos, I really need to handle all of them together.  Most of these also have the (1) naming problem and require reading the json, so I still need to match the json files somehow.

The good news is, you're right about many of the files.  I see plenty that are like this:

======== C:/media/image/mami/Google Photos/2014-05-30/IMG_6378(1).JPG
[System]        FileModifyDate                  : 2020:07:16 13:46:44-07:00
[System]        FileAccessDate                  : 2020:08:19 14:53:43-07:00
[System]        FileCreateDate                  : 2020:08:19 14:53:43-07:00
[IFD0]          ModifyDate                      : 2014:05:30 08:49:07
[ExifIFD]       DateTimeOriginal                : 2014:05:30 08:49:07
[ExifIFD]       CreateDate                      : 2014:05:30 08:49:07
[ExifIFD]       SubSecTimeOriginal              : 264
[ExifIFD]       SubSecTimeDigitized             : 264
[Composite]     SubSecCreateDate                : 2014:05:30 08:49:07.264
[Composite]     SubSecDateTimeOriginal          : 2014:05:30 08:49:07.264


Unfortunately it looks like with the existence of the other ones which have had their metadata stripped somehow, I'll have to figure out something different still, so I seem to be back at square one.

Phil Harvey

Quote from: Buju on August 21, 2020, 09:58:27 PM
Would that make IMG_3684(2).JPG match IMG_3684.JPG(2).json?  It looks to me like it would match IMG_3684(2).JPG(2).json

You're right.  It should be "%%d%%-.3f.%%e%%-3f.json"

- Phil
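Again as an illustration only (not part of exiftool), the corrected format can be checked with a Python sketch of the documented substring semantics: %-.3f is the file name with the last 3 characters dropped, %e is the extension, and %-3f is the last 3 characters of the name, i.e. the "(2)"-style suffix.  Unlike the fixed-width %8f form, this works for any base-name length, including the long UUID-style names.

```python
from pathlib import Path

def json_sibling_paren(image_path: str) -> str:
    """Emulate exiftool's "%d%-.3f.%e%-3f.json":
    %-.3f = name with the last 3 chars dropped, %e = extension,
    %-3f = the last 3 chars of the name (the "(2)" suffix)."""
    p = Path(image_path)
    name, ext = p.stem, p.suffix.lstrip(".")
    return str(p.parent / f"{name[:-3]}.{ext}{name[-3:]}.json")

print(json_sibling_paren("IMG_3684(2).JPG"))
# IMG_3684.JPG(2).json
print(json_sibling_paren("923F775D-9A50-4430-A825-42C672379661(1).jpeg"))
# 923F775D-9A50-4430-A825-42C672379661.jpeg(1).json
```

This format only fits files whose names end in a one-digit "(n)" suffix, which is why it is used as a second -tagsfromfile alongside the plain "%d%F.json" one.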

Buju

Oh, that did get me 1 step further!  Thanks, Phil (genuinely)!  That actually solved the original issue, it seems.  To be honest, I don't understand the format of the new -tagsfromfile argument though.  It's not a regular expression, but it seems to be able to "skip" characters or something.  Is there any documentation about it?

Now that that's solved though, I did another test and found another crazy situation where the corresponding json file is named abnormally.  Thanks, Google (sarcastically)!

For this file: C73731DA-567D-42D1-A1EB-55E118E05770_1_201_a.jpeg, the json file is named C73731DA-567D-42D1-A1EB-55E118E05770_1_201_a.j.json.  It's a real WTF moment, but it looks like there are many like this.  I'd guess it's about 2%-5% of the >60000 files.

My current command looks like this:

C:\tools\exiftool\exiftool -r -tagsfromfile "%%d%%F.json" "-GPSAltitude<GeoDataAltitude" "-GPSLatitude<GeoDataLatitude" "-GPSLatitudeRef<GeoDataLatitude" "-GPSLongitude<GeoDataLongitude" "-GPSLongitudeRef<GeoDataLongitude" "-GPSPosition#<$GeoDataLatitude $GeoDataLongitude $GeoDataAltitude" "-description" "-title" "-Keywords<Tags" "-Subject<Tags" "-Caption-Abstract<Description" "-ImageDescription<Description" -d "%%s" "-DateTimeOriginal<PhotoTakenTimeTimestamp" "-AllDates<PhotoTakenTimeTimestamp" "-FileModifyDate<PhotoTakenTimeTimestamp" "-FileCreateDate<CreationTimeTimestamp" "-FileName<${phototakentimetimestamp;s/(\d{3})\*//;$_=$self->InverseDateTime($_);DateFmt(qq(%%Y-%%m-%%d %%H-%%M-%%S))}%%-c - %%f.%%le" -tagsfromfile "%%d%%-.3f.%%e%%-3f.json" "-GPSAltitude<GeoDataAltitude" "-GPSLatitude<GeoDataLatitude" "-GPSLatitudeRef<GeoDataLatitude" "-GPSLongitude<GeoDataLongitude" "-GPSLongitudeRef<GeoDataLongitude" "-GPSPosition#<$GeoDataLatitude $GeoDataLongitude $GeoDataAltitude" "-description" "-title" "-Keywords<Tags" "-Subject<Tags" "-Caption-Abstract<Description" "-ImageDescription<Description" -d "%%s" "-DateTimeOriginal<PhotoTakenTimeTimestamp" "-AllDates<PhotoTakenTimeTimestamp" "-FileModifyDate<PhotoTakenTimeTimestamp" "-FileCreateDate<CreationTimeTimestamp" "-FileName<${phototakentimetimestamp;s/(\d{3})\*//;$_=$self->InverseDateTime($_);DateFmt(qq(%%Y-%%m-%%d %%H-%%M-%%S))}%%-c - %%f.%%le" -o "C:/media/image/mami/googlephotos/" "C:/media/image/mami/Google Photos/"

It's so long because, as Phil said, I had to duplicate all the tag-copy arguments for the second -tagsfromfile.  Should I just add another -tagsfromfile argument, and if so, what should it look like to match the same thing but with just the first character of the file extension?

StarGeek

Quote from: Buju on August 24, 2020, 02:13:24 PM
It's not a regular expression, but it seems to be able to "skip" characters or something.  Is there any documentation about it?

See the -w (textout) option, especially the Advanced features section.

Phil Harvey

Quote from: Buju on August 24, 2020, 02:13:24 PM
Should I just do another -tagsfromfile argument, and if so, what should it look like to match the same thing but with just the first character of the file extension?

Yes.  "%%d%%f.%%1e.json"

(you should understand this if you read the link StarGeek gave)

At this point your command is so long that you should be using the -@ option (read the docs for details).

- Phil
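With -@, the whole command moves into a plain-text args file, one argument per line, with no shell quoting and single % signs (the %% doubling is only needed inside a batch file).  A hedged sketch of a fragment of such a file (the name google.args is my own choice, and only a few of the thread's tag assignments are shown; each -tagsfromfile needs its own copy of the assignments, as discussed above):

```text
# google.args -- invoke as: exiftool -@ google.args "C:/media/image/mami/Google Photos/"
-r
-o
C:/media/image/mami/googlephotos/
-tagsfromfile
%d%F.json
-GPSAltitude<GeoDataAltitude
-Keywords<Tags
-DateTimeOriginal<PhotoTakenTimeTimestamp
-tagsfromfile
%d%-.3f.%e%-3f.json
-DateTimeOriginal<PhotoTakenTimeTimestamp
-tagsfromfile
%d%f.%1e.json
-DateTimeOriginal<PhotoTakenTimeTimestamp
```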