CSV resource usage detecting common tags

Started by TropicBeau, September 10, 2013, 11:21:50 PM

TropicBeau

This is my first post to this forum, so the first order of business is to say "Thank you, Phil" for this extraordinary piece of software. What a labor of love!

My quibble/question is the limitation of -csv commands to a relatively limited number of files, as noted in the documentation. It seems to me that the reason for this is that all the files are initially scanned to determine what tags they have, from which a common tag set is developed and used to write the header and build a template for each row. Fair enough in the absence of any other information. But if the user has specified a set of tags that are of interest, I don't see why this is necessary - just use that set and go.

That was, in fact, the case in an earlier post to the Forum entitled "Using to many resources" where Emilio complained that the command
   'exiftool.exe -r -FacesDetected -AFAreaMode "&DIR"  > resultado.csv'   
ran out of memory processing a large image set, even though there were at most two tags of interest.

Not only is this prescan unnecessary if the user has specified the tags he's interested in, it's actually a problem. If the user has specified, say, 6 tags of interest but the files he scans only happen to contain 4 of them, the prescan drops the two tags that don't occur. This means that the CSV file has an unpredictable number of columns and unpredictable content. When importing it into other software, one sometimes has to massage it back into the table one was expecting.

So, might I respectfully suggest that, if the user has specified a tag set of interest, there be some way to say "skip the prescan and just give me a table with these tags, present or not". That would make the -csv command very useful for collecting just a few tags (in a tabular format readable by other programs) from a very large number of files.
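To make the request concrete, here is a minimal Python sketch of "a table with these tags, present or not". It is not how ExifTool works internally. The function name `write_fixed_table`, the sample records, and the "-" placeholder for a missing tag are all assumptions made for illustration. The point is that the header can be written before any file is read, so rows can be streamed one at a time.

```python
import csv
import io

def write_fixed_table(records, tags, out, missing="-"):
    """Write one CSV row per record, using exactly the columns in `tags`.

    `records` may be any iterable (including a generator), so nothing
    has to be held in memory beyond the current row.
    """
    writer = csv.DictWriter(
        out,
        fieldnames=["SourceFile"] + list(tags),
        restval=missing,        # value written when a requested tag is absent
        extrasaction="ignore",  # silently drop tags that were not requested
    )
    writer.writeheader()        # header depends only on `tags`, not on the files
    for record in records:
        writer.writerow(record)

# Hypothetical per-file metadata, standing in for whatever extracted the tags.
records = [
    {"SourceFile": "a.jpg", "FacesDetected": 2, "AFAreaMode": "Wide"},
    {"SourceFile": "b.jpg", "AFAreaMode": "Spot", "ISO": 100},
]

buf = io.StringIO()
write_fixed_table(records, ["FacesDetected", "AFAreaMode"], buf)
print(buf.getvalue())
```

Every row has the same three columns regardless of which tags a given file happens to contain.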

- Beau


Phil Harvey

Hi Beau,

Quote from: TropicBeau on September 10, 2013, 11:21:50 PM
But if the user has specified a set of tags that are of interest, I don't see why this is necessary - just use that set and go.

Not so.  If you specify -a or -g, or don't specify -f, then exiftool still needs to read all of the files.  There could be other options that would require this as well.

You could use the -T option, or even the -p option if you have a set of fixed tags.  The only difference would be that special characters (comma and quote for example) could be a problem.
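The special-character problem can be handled after the fact. Below is a hedged Python sketch that converts tab-delimited lines into properly quoted CSV. It assumes one line per file, values in the order the tags were requested, no header row in the input, and no tabs or newlines inside the values; all of these are worth checking against real output. The function name `tabs_to_csv` and the sample lines are made up for illustration.

```python
import csv
import io

def tabs_to_csv(lines, header, out):
    """Convert tab-delimited lines to CSV, quoting commas and quotes."""
    writer = csv.writer(out)   # default dialect quotes fields only when needed
    writer.writerow(header)    # column names are supplied by the caller
    for line in lines:
        writer.writerow(line.rstrip("\r\n").split("\t"))

# Made-up sample lines; the first value contains both a comma and quotes.
sample = ['a.jpg\t2\tWide, "Zone"\n', "b.jpg\t-\tSpot\n"]

buf = io.StringIO()
tabs_to_csv(sample, ["FileName", "FacesDetected", "AFAreaMode"], buf)
print(buf.getvalue())
```

Because `lines` can be an open file object, the conversion runs in constant memory however many files were scanned.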

Quote
If the user has specified, say, 6 tags of interest but the files he scans only happen to contain 4 of them, the prescan drops the two tags that don't occur.

Add the -f option if you don't like this behaviour.

Quote
So, might I respectfully suggest that, if the user has specified a tag set of interest, there be some way to say "skip the prescan and just give me a table with these tags, present or not". That would make the -csv command very useful for collecting just a few tags (in a tabular format readable by other programs) from a very large number of files.

I'll have to think about this more.  Perhaps I could do this with a very specific set of options (-f, but not -a or -g, etc...), but I'm sure that the documentation for this would be a bit confusing.

- Phil

Edit:  Hmmm.  Then someone would complain: "why can't I use -g in this mode if I specify all of the group names for the tags".  This would be a problem because ExifTool doesn't know what family the groups are in, so the extracted group names could be different.
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Phil Harvey

Ah.  I just found a deal breaker, trivial though it is:

ExifTool has no way to tell the case for the tag-name headings until it has scanned the files to extract the tag names.  Currently, ExifTool uses the proper case for tag names that exist in the file, regardless of the case specified on the command line.  This behaviour would need to be changed if your suggestion was implemented.

I agree it is a small problem, but it would make the interface less consistent, which I don't like.
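The inconsistency can be illustrated with a small Python sketch. This is not ExifTool code; `resolve_headings` and the sample names are hypothetical. It maps each requested tag name to the properly cased name seen in the files, and has nothing better than the user's own spelling for a tag that was never seen.

```python
def resolve_headings(requested, seen):
    """Return column headings for `requested` tag names.

    A name found in `seen` (compared case-insensitively) gets the case
    used there; a name never seen keeps the case the user typed.
    """
    canonical = {name.lower(): name for name in seen}
    return [canonical.get(name.lower(), name) for name in requested]

# With a scan, "facesdetected" can be corrected; "afareamode" cannot,
# because no file supplied its proper case.
print(resolve_headings(["facesdetected", "afareamode"], ["FacesDetected"]))
```

Without a scan, `seen` is empty and every heading falls back to whatever case was typed on the command line.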

- Phil

TropicBeau

Thanks for the detailed replies!

I suspected that the problem was interaction with other possible command line options (and case-insensitive tag names) - there are way more options than I've explored, for sure. But, I don't think that running out of memory on large file sets is a reasonable way of dealing with them. If you really have to compute the set of common tags by looking in each file before starting writing the CSV table, please do it in two passes - compute the common tag set and then write the table in a second pass. (And then expose that 2nd pass interface to me!  :-) )
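A rough Python sketch of the two-pass idea follows, under stated assumptions: `read_tags` is a hypothetical callable that returns a dict of tag names to values for one file, `paths` is a list that can be walked twice, and "-" marks a missing value. The first pass keeps only tag names in memory; the second pass re-reads each file and writes its row immediately.

```python
import csv
import io

def two_pass_csv(paths, read_tags, out, missing="-"):
    """Write a CSV of all tags found in `paths`, reading every file twice."""
    # Pass 1: collect tag names only, in first-seen order (values are discarded).
    columns = {}
    for path in paths:
        for name in read_tags(path):
            columns.setdefault(name, None)

    writer = csv.DictWriter(
        out, fieldnames=["SourceFile"] + list(columns), restval=missing
    )
    writer.writeheader()

    # Pass 2: re-read each file and write its row straight away.
    for path in paths:
        row = {"SourceFile": path}
        row.update(read_tags(path))
        writer.writerow(row)

# Fake metadata standing in for a real tag reader.
fake = {"a.jpg": {"FacesDetected": 2}, "b.jpg": {"AFAreaMode": "Spot"}}

buf = io.StringIO()
two_pass_csv(list(fake), fake.__getitem__, buf)
print(buf.getvalue())
```

Memory use grows with the number of distinct tag names rather than the number of files, at the cost of reading every file twice. If a file changes between passes and gains a new tag, `DictWriter` raises `ValueError`, which seems preferable to silently dropping a column.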

BTW, there is reference to a -common command line option in the examples, but I couldn't find it specified in my docs file (for version 9.32). It does seem to do what the csv tag filter does, so the code must be there.

As for the -T, -p and -f options, these all generate output that would have to be massaged into a format for import into a database, whereas the -csv option generates something which is already in a very convenient format for that. It's so convenient that I think it's worth making it work, even for large file sets.

That said, nothing urgent (at least for me) - I think I'm a few thousand files away from hitting that limit. But I'm adding files all the time - hopefully it will all work for me by the time I get there. :-)

Phil Harvey

Quote from: TropicBeau on September 11, 2013, 11:36:54 PM
If you really have to compute the set of common tags by looking in each file before starting writing the CSV table, please do it in two passes - compute the common tag set and then write the table in a second pass. (And then expose that 2nd pass interface to me!  :-) )

Wow.  That would take a long time.

What platform are you using?  I would think that the memory problem would only affect the Windows exe version since it is the only version with a memory restriction.  (Unless you are talking about tens or hundreds of millions of files.)

Quote
BTW, there is reference to a -common command line option in the examples, but I couldn't find it specified in my docs file (for version 9.32).

Look in the Shortcuts Tags documentation.

- Phil

TropicBeau

Quote
Quote
If you really have to (look) in each file before starting writing the CSV table, do it in two passes - compute the tag set (of interest) and then write the table in a second pass.
Wow.  That would take a long time.

Well, it would double the amount of time spent reading the files. Is that all that bad? Maybe there could be a -csv2 (two pass csv) for folks with large file sets who are willing to pay that price.

Quote
What platform are you using?  I would think that the memory problem would only affect the Windows exe version since it is the only version with a memory restriction.

Yes, it's Windows. As I mentioned, it's not biting me yet, but I don't want to build my workflow one way and then have to change it as my collection grows.

On another front, I completely misunderstood the -common tag. I thought it was a command line option for which I couldn't find documentation. A later example explains this for the -AllDates shortcut. Maybe that explanation might appear in other places where shortcuts are used, to head off newbie confusion.

Thanks again.

- Beau

Phil Harvey

Hi Beau,

I'll think about this problem.

Quote from: TropicBeau on September 12, 2013, 04:24:20 PM
On another front, I completely misunderstood the -common tag. I thought it was a command line option for which I couldn't find documentation. A later example explains this for the -AllDates shortcut. Maybe that explanation might appear in other places where shortcuts are used, to head off newbie confusion.

I use -common three times in the application documentation.  I'll reference the shortcuts tags from one of them.  Thanks for the suggestion.

- Phil