json in and json out for working with databases

Started by Skippy, September 17, 2015, 06:51:47 PM


Skippy

I am looking for a fast way of updating a database with exif information.  I have tried a lot of ideas and know exiftool reasonably well, but I can't easily solve the issues I have with the current version of exiftool.  Here are the scenarios:

  • Database application scans storage media and makes a list of all photos that it has not seen before; or
  • Database application scans storage media and makes a list of all photos that have been modified since they were last checked.
The two scenarios are different as new photos will tend to be in the same folder and have the same creation date, whereas modified photos could be scattered throughout the whole folder structure and are likely to be surrounded by photos that have not been modified.
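
As a rough illustration of the two scenarios, here is a minimal Python sketch of the scanning step (the thread itself uses ms-access, so this is only an analogy).  The function name, the extension list and the idea of passing in the set of already-known paths are all assumptions made for the example.

```python
import os

# Assumed list of photo extensions; adjust to whatever the database tracks.
PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".nef", ".cr2"}


def find_changed_photos(root, last_checked, known_paths=frozenset()):
    """Return photos under root that are new or modified since last_checked.

    last_checked is a timestamp in seconds since the epoch.
    known_paths is the set of paths the database has already seen.
    """
    changed = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if os.path.splitext(name)[1].lower() not in PHOTO_EXTENSIONS:
                continue  # not a photo
            path = os.path.join(folder, name)
            is_new = path not in known_paths
            is_modified = os.path.getmtime(path) > last_checked
            if is_new or is_modified:
                changed.append(path)
    return sorted(changed)
```

The returned list is what would be handed to exiftool, so exiftool never has to open the unchanged photos.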

Finding photos which need their exif data updated is not hard using Windows file system functions, and making a json file that includes all the file names is not difficult either.  The list of tags read by exiftool could be specified by command line arguments and does not need to be in the input json file.  However exiftool does not currently consume a json file that identifies photos that need to be read and then output another json file with the results of reading the photos.  Unfortunately, the alternatives are not really pretty.

One alternative is to recursively scan the folder structure starting with the root folder of the search process.  Exiftool supports expressions that can be used to filter the results, and I used the expression for finding photos taken on or after a particular creation date to find new photos.  There are a few issues with this process.  Even with the -fast2 switch set, the process is slow, as exiftool has to read and process some of the exif data to decide whether or not to include the file in the results.  Exiftool also has to read all the photos in every folder in the folder tree, even folders that contain no new or modified photos.  There is an ignore [folder] option, but I have not used this so far.  One reason is that if I am scanning an SD card for new photos, the search root folder DCIM\ can contain hundreds of folders.  Some cameras make a new folder for each day (I like this), so you have as many folders as days that you have used the camera.  My SD cards are only used once as they become my archive of last resort once they are full.  Consequently, a nearly full SD card commonly has over 8000 photos on it, all in subdirectories of the DCIM folder.  Constructing an ignore [folder] list would create a very long command line.  It also involves a lot more code development on the database side.  The exiftool documentation is also not very clear on whether the whole path for the ignore folder is required or just the relative path starting with the root folder of the recursive search.  For the record, exiftool can scan a 32 GB SD card with about 8000 photos in about 3.5 minutes, but during this time I can't provide any progress reports to the user.

The other alternative is to invoke exiftool separately for each folder, which incurs the start-up overhead.  On my test SD card, there were 1500 new photos scattered over more than 100 folders.  I would also have to write a process for reading the results from all the json files produced.  This is probably the way I have to go at the moment.     

Because I am using ms-access as an application development platform, the -stay_open approach and piping do not seem to be open to me.  To be honest, I still do not fully understand this approach.

I did try making a json input file but later realised that a json input file is associated with writing tags to photos and cannot be used to specify a file list for reading photo tags that are output in the form of another json file.  For me, json in and json out would be the ideal solution.  Currently there is a single -json switch and this might have to be changed to something like injson, outjson.  Replacing the > output redirection would also work for me as the shelling process does not recognise the output redirection symbol.

The final alternative would be to have a dll which is acceptable to ms-access which calls exiftool via piping.  This might work but error handling starts to become difficult, hence my interest in the potential of a json in, json out workflow.  Can I put that at the top of my wishlist for exiftool.

Cheers,

Skippy


Skippy

In the post above, I have covered using the json option but would also be happy with json in/XML out or any other permutation of json and XML.  I want to avoid using txt or csv for export as there could be commas in the data.
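
To show why commas are harmless in json, here is a small Python sketch that parses output shaped like what exiftool -json produces (an array with one object per file, each with a SourceFile key).  The sample record, including the Description value, is made up for the illustration.

```python
import json

# Made-up sample in the shape of exiftool -json output.
sample_output = """[
  {
    "SourceFile": "C:/Temp/exif/141_0604/DSCN6400.JPG",
    "FileName": "DSCN6400.JPG",
    "CreateDate": "2015:06:04 10:15:30",
    "Description": "Beach, late afternoon, low tide"
  }
]"""

records = json.loads(sample_output)

# One tuple per photo, ready to be written to a database table.
# .get() returns None when a tag is missing from a particular file.
rows = [
    (r["SourceFile"], r.get("CreateDate"), r.get("Description"))
    for r in records
]
```

The description survives as a single field even though it contains commas, which a naive csv split would not guarantee.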

Phil Harvey

Hi Skippy,

Quote from: Skippy on September 17, 2015, 06:51:47 PM
However exiftool does not currently consume a json file that identifies photos that need to be read and then output another json file with the results of reading the photos.

No, but it does consume a .txt file of file names that identify the files to be read.

Quote: Even with the -fast2 switch set, the process is slow

Yes.  If you need to parse the file it will be.  But if you are looking only at the system date/time tags, then -fast3 will definitely help.

Quote: The exiftool documentation is also not very clear on whether the whole path for the ignore folder is required or just the relative path starting with the root folder of the recursive search.

The argument of -ignore is a single folder name.  Like "Pictures" for example.  NOT "C:\Pictures" or "Pictures/Friends" or any compound names like this.  Also, it is case sensitive.

Quote: The final alternative would be to have a dll which is acceptable to ms-access which calls exiftool via piping.  This might work but error handling starts to become difficult, hence my interest in the potential of a json in, json out workflow.  Can I put that at the top of my wishlist for exiftool.

This isn't something that I would do, but other ExifTool developers may already have done this.

- Phil

Skippy

The final solution to the issue of how to get exiftool to read tags from a subset of files from a directory or set of directories then saving the results to a json file is likely to be Phil's suggestion of using a txt file to list the photos to be scanned. 

An example command line is:  exiftool -@ filelist.txt -filename -createdate -json > C:\Temp\exif\out.json.  In the documentation the pattern is  exiftool -@ ARGFILE -filename -createdate -json > C:\Temp\exif\out.json.  The -filename is needed to put the file name in the output file.

The contents of the ARGFILE look like this:
C:\Temp\exif\141_0604\DSCN6400.JPG
C:\Temp\exif\141_0604\DSCN6401.JPG
C:\Temp\exif\141_0604\DSCN6402.JPG
C:\Temp\exif\141_0604\DSCN6404.JPG


There are no quotes around the file names.

In the ms-access environment the full command would be shell("cmd /c exiftool -@ filelist.txt -filename -createdate -json > C:\Temp\exif\out.json").  If you call exiftool directly, i.e. shell("exiftool args > out.json"), the output will not get redirected to the json file, because the > redirection is handled by cmd and not by the shell function.  If you want to read the tags for a list of photos and then process the json file as soon as it is available, use the shell wait function mentioned in another one of my posts.  That is all the necessary information brought together in one place.
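
For anyone doing the same thing outside ms-access, the steps above can be sketched in Python.  This is only a sketch under a few assumptions: exiftool is on the PATH, the paths are plain ASCII, and the function names are invented for the example.  Here the output file is opened by the calling program, so no cmd /c or > redirection is needed.

```python
import subprocess


def write_argfile(photo_paths, argfile_path):
    """Write one unquoted file name per line, as the -@ option expects."""
    with open(argfile_path, "w", encoding="utf-8") as f:
        for path in photo_paths:
            f.write(path + "\n")


def build_command(argfile_path, tags=("filename", "createdate")):
    """Build: exiftool -@ ARGFILE -filename -createdate -json"""
    return ["exiftool", "-@", argfile_path] + ["-" + t for t in tags] + ["-json"]


def read_tags_to_json(photo_paths, argfile_path, json_path):
    """Run exiftool and save its standard output to json_path.

    subprocess.run blocks until exiftool exits, so the json file is
    complete when this function returns (the same role as shell wait).
    """
    write_argfile(photo_paths, argfile_path)
    with open(json_path, "wb") as out:
        subprocess.run(build_command(argfile_path), stdout=out, check=False)
```

A call such as read_tags_to_json(paths, "filelist.txt", r"C:\Temp\exif\out.json") would then mirror the command line shown above.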

Using the above process on Win 8.1/core i7 laptop and a photo collection on an external USB drive, I can get a json file for 4000 photos in about 70 seconds, which is approximately 50 photos per second.  I am happy with that. 

TSM

Investigate the -stay_open flag.  That way you can start exiftool once and just pass it each file; it will return the metadata and then move on to the next one.  The problem with working on a large list, I would have thought, is that you will only get the results when it's all finished.
I wrote a PHP library that handles.
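
For reference, a minimal Python sketch of the -stay_open conversation (this is not the PHP library mentioned above, and the helper names are invented for the example).  It assumes exiftool is on the PATH and relies on the documented behaviour that each set of arguments is ended with -execute and each reply is ended with a {ready} line.

```python
import json
import subprocess


def open_exiftool():
    """Start one long-running exiftool that reads arguments from stdin."""
    return subprocess.Popen(
        ["exiftool", "-stay_open", "True", "-@", "-"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def format_request(photo_path, tags=("filename", "createdate")):
    """Arguments for one file, one per line, terminated by -execute."""
    lines = ["-json"] + ["-" + t for t in tags] + [photo_path, "-execute"]
    return "\n".join(lines) + "\n"


def read_response(stream):
    """Collect output lines up to the {ready} marker and decode the json."""
    collected = []
    for line in stream:
        if line.strip() == "{ready}":
            break
        collected.append(line)
    text = "".join(collected).strip()
    return json.loads(text) if text else []


def close_exiftool(proc):
    """Tell exiftool to exit, then wait for it."""
    proc.stdin.write("-stay_open\nFalse\n")
    proc.stdin.flush()
    proc.wait()
```

Typical use would be: proc = open_exiftool(), then for each file write format_request(path) to proc.stdin, flush, and call read_response(proc.stdout), which gives a natural point for a progress update after every photo.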

Skippy

If I am writing to thousands of files, I can start exiftool as an asynchronous process that opens in a DOS window and that people can cancel by pressing Ctrl-C.  I make sure the -progress option is used so people get some visual feedback on how things are going.  For reading, I can read in batches of a thousand and just start up a new instance of exiftool each time (usually I need to use shellwait to make reading from exiftool a synchronous process).  The 0.25 second start-up overhead is not noticeable when more than about 50 files are processed at once.  Anyway, those are my current solutions.  But you are certainly right about the issue of feedback to users if a long process has been started.  Usually I can put up a dialog telling them how long things are likely to take before starting the process, and I will be able to calculate that based on the number of images to be processed.
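
The batching and the time estimate can be sketched in Python as follows.  The default figures (1000 files per batch, about 50 photos per second, 0.25 seconds start-up per exiftool instance) are the ones quoted in this thread for one particular laptop and USB drive, so treat them as assumptions to be measured again on other hardware; the function names are invented for the example.

```python
def batches(paths, size=1000):
    """Yield successive slices of at most `size` file names."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def estimate_seconds(photo_count, photos_per_second=50.0,
                     startup_seconds=0.25, batch_size=1000):
    """Rough duration: reading time plus one start-up cost per batch."""
    batch_count = -(-photo_count // batch_size)  # ceiling division
    return photo_count / photos_per_second + batch_count * startup_seconds
```

The estimate can be shown in the dialog before the run starts, and a progress message can be updated after each batch returns.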