Concatenate fields and substrings

Started by Zoot, June 11, 2015, 05:29:45 AM


Zoot

Hi all,

First of all, sorry for my bad English; it is not my native language.
Thanks to Phil for his tool, and for the forum contributions I could scour to find some answers.


I am a librarian and have a collection of newspaper scans (40,000) to put online and reference and, alas, given the budget for this sort of work, it all has to be done in-house.
I'm more of a GUI kind of person, though knowing a nifty CLI phrase can spare thousands of mouse clicks.

Thanks to the ExifTool documentation (RTFM sort of thing), this forum, and trial and error, I managed to get good results by building args files to feed into ExifToolGUI and inject all required fields in one click for most of the scans, with some custom files for the rare bizarre exceptions.


But now I feel stuck: I lack the RegEx and coding skills to fill some fields, hence posting here instead of in the GUI thread.

All the scans files are named like this:

B_libraryID_docID_yyyymmdd_00x.ext
B_<9chars>_<5/9ch>_yyyymmdd_<3ch>.ext


In fact, there is always the same number of chunks separated by underscores.
Only the docID can go up to 9 characters.
The last part is always yyyymmdd_<3ch file number>.


I managed to get the docID into the TransmissionReference, and I use ${filename;$_=substr($_,-10,2)}/${filename;$_=substr($_,-12,2)}/${filename;$_=substr($_,-16,4)} to get the date read and reversed to display dd/mm/yyyy (which I'm proud of, see my level :o). I can shorten the phrase to deal with weekly or monthly news.
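
For readers less used to Perl's substr, here is a minimal Python sketch of the same idea: counting back from the end of the file name to pick out the day, month and year. The function name date_from_name is made up for this illustration, and the sample name simply follows the pattern described above. Like the substr version, it assumes a three-letter extension, since the offsets include the ".ext" part.

```python
def date_from_name(filename: str) -> str:
    """Return dd/mm/yyyy read from the fixed-width tail of the file name."""
    day = filename[-10:-8]     # same span as substr($_, -10, 2)
    month = filename[-12:-10]  # same span as substr($_, -12, 2)
    year = filename[-16:-12]   # same span as substr($_, -16, 4)
    return f"{day}/{month}/{year}"


print(date_from_name("B_593506101_Jx362_19150115_001.jpg"))
```

Because everything is counted from the end, the variable length of the docID does not matter here.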


I'd like the ObjectName to look like papers_title du dd mmmm yyyy, page x

- the title could be typed manually; filling it in for a daily paper over a bunch of years is not a big deal (1200 scans a year)

- for ease of use, and because it looks better, is there any way to get the mm translated into full text (mmmm: janvier, février etc.; yes, in French)?

- the file numbering is the same as the page number with leading zeros; is it possible to automatically remove the zeros as necessary to get page x?


Could anyone give me a push in the right direction? Thanks in advance.

--
Didier


Phil Harvey

Hi Didier,

You're doing well so far.

Quote from: Zoot on June 11, 2015, 05:29:45 AM
I'd like to get the ObjectName look like papers_title du dd mmmm yyyy, page x

I don't know what you mean by "du".  Translating the months is easy (but results in a very long expression).  It goes something like this:

'-objectname<papers_title ${filename;$_ = /(.*?)_(.*?)_(.*?)_(\d{4})(\d{2})(\d{2})_0*(\d+)/ ? qq( du $6 ).{"01"=>"Janvier","02"=>"Fevrier","03"=>"Mars"}->{$5}.qq( $4, page $7):undef}'

Here I have only added translations for the first 3 months.  I'll leave it up to you to add the rest.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Zoot

Hi Phil,

Thanks for the fast feedback. By the way, I'm on a good ol' wheelbarrow on Windows XP.

I should have been more specific about the French parts of the field content, sorry :
The files are, choosing a specific example:
B_593506101_Jx362_19150115_001.jpg (all jpg for the moment, later for PDF and side files)
B_593506101_Jx362_19150115_002.jpg
B_593506101_Jx362_19150115_003.jpg
B_593506101_Jx362_19150115_004.jpg

the resulting display should be:

La Gazette des Ardennes du 15 janvier 1915 (meaning "La Gazette des Ardennes, [from] January 15th, 1915")


I finished completing the month names and, because of French typographic rules, I added accents and removed the capitalization. Could the accents I added lead to errors?

'-objectname<papers_title ${filename;$_ = /(.*?)_(.*?)_(.*?)_(\d{4})(\d{2})(\d{2})_0*(\d+)/ ? qq( du $6 ).{"01"=>"janvier","02"=>"février","03"=>"mars","04"=>"avril","05"=>"mai","06"=>"juin","07"=>"juillet","08"=>"août","09"=>"septembre","10"=>"octobre","11"=>"novembre","12"=>"décembre"}->{$5}.qq( $4, page $7):undef}'

I understand you use the blocks separated by underscores instead of the position of the digits, right?
But I don't understand the last part, "undef".

Phil Harvey

The accents should be fine, but you should make sure you write the appropriate IPTC CodedCharacterSet.

The regular expression:  /(.*?)_(.*?)_(.*?)_(\d{4})(\d{2})(\d{2})_0*(\d+)/ returns true if it matches.  If it doesn't match, then I don't want to write ObjectName, so in this case I set $_ to undef.  (the "a ? x : y" operation returns "x" if "a" is true, otherwise returns "y")

If it does match, then $1, $2, etc are set from the strings captured by the brackets in the expression.  I use the "_" as the separator for the first few strings, allowing them to be variable length.
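
The logic described above can be sketched in Python for anyone who finds the Perl one-liner hard to read. This is only an illustration, not something ExifTool runs: the names MONTHS, PATTERN and object_name are invented here, the title is passed in by hand, and returning None stands in for setting $_ to undef.

```python
import re

# Month names as typed in the thread (lower case, with accents).
MONTHS = {
    "01": "janvier", "02": "février", "03": "mars", "04": "avril",
    "05": "mai", "06": "juin", "07": "juillet", "08": "août",
    "09": "septembre", "10": "octobre", "11": "novembre", "12": "décembre",
}

# Same pattern as in the thread: three variable-length chunks separated by "_",
# then yyyymmdd, then the page number with its leading zeros skipped by 0*.
PATTERN = re.compile(r"(.*?)_(.*?)_(.*?)_(\d{4})(\d{2})(\d{2})_0*(\d+)")


def object_name(title, filename):
    """Build 'title du dd mmmm yyyy, page x', or return None if the name does not match."""
    m = PATTERN.search(filename)
    if m is None:
        return None  # the "a ? x : y" fallback: nothing to write
    year, month, day, page = m.group(4, 5, 6, 7)
    # An unknown month number gives an empty string instead of an error.
    return f"{title} du {day} {MONTHS.get(month, '')} {year}, page {page}"


print(object_name("La Gazette des Ardennes", "B_593506101_Jx362_19150115_001.jpg"))
```

A file whose name does not follow the pattern (say, notes.txt) yields None, which corresponds to the case where ObjectName is simply not written.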

- Phil

Zoot

Good evening (GMT+2 CEST here)

Thanks for your explanations, Phil.

Even though I think I followed the instructions for installing ExifTool on Windows to the letter (and it worked until then), I got many error messages with your phrase or mine.

I tried to carefully shorten or simplify them to find where something was wrong.

I tried to deal with the " or ', no dice: either the complete papers_name followed by the original filename got tagged, or I got an error message that there was no field to fill, yada-yada (or so).

That was quite frustrating, actually, as by the end of the workday I was sure that either you had posted without any comment about errors in my last magic phrase or, as happens sometimes, you would come back saying sorry, that you had misspelled something or read an OP too fast.


Here at home, I have Linux Mint installed on its own SSD (beside my mostly daily ride, Win 7 Ultimate), so I could install the latest 9.97 Unix kit (I'm not that clumsy after all).
Having some samples of the files (or I could mimic the names on any JPG) for another batch thing I work on at home, I opened the folder in a shell, typed the juju and voilà! That worked fine.


So tomorrow, back on my workhorse on Win XP, I'll put all the sample directories at the root of C:\, with no spaces or silly characters (I usually avoid such things, but)... And later I'll find a way to check or clean up the path to the main server where the real deal sits (or I'll need to ask the city hall sysadmin for direct access to it).

The problem is that we are not allowed to boot anything other than the main HDD (BIOS blocked); I couldn't even boot from a live ISO sitting on my smartphone (DriveDroid for Android helps so much) or from a stick (bummer!) to use a Unix/GNU thing.


I'll come back later with a link to the real thing online to share with whoever cares about old newspapers. It's not only me you helped, it's the public I work for.

So many thanks (again), Phil, for this incredible toolbox you made and the help you provide to us all.

Phil Harvey

Sorry, I thought you had figured out the quoting (or at least got the hint from my signature).  On Windows, you need to use double quotes, not single quotes around arguments.

Zoot

No problem, Phil, I had it from the start, somewhere in the manual. :)
I probably neglected it somehow this afternoon, or it was something else.

Back on Win 7 at home, I tested the "clean path" way (everything at the root of a disk, samples in a "test" folder). Using args files avoids the quoting problem, and I call them from a shell opened in the folder containing the files.

The only problem now is when I need to dive into folders containing folders and so on, then the files: the "-r" option should deal with it, as in -r -@ args C:\\test\

The Windows shell is not very user-friendly when you want to edit a little part right in the middle of a long phrase; I remember a shell in Mac OS X, a long, long time ago, where you could double-click where you needed the insertion point/cursor to be.
Args files are handy to keep aside: they can be duplicated, then slightly or heavily modified and saved as something else.



The possibilities of the juju begin to be barely understandable.

I figured out how to remove the leading zero of the day number for a daily news: Un Quotidien du 5 novembre 1914, page 6

ObjectName<Un Quotidien${filename;$_ = /(.*?)_(.*?)_(.*?)_(\d{4})(\d{2})0*(\d+)_0*(\d+)/ ? qq( du $6 ).{"01"=>"janvier","02"=>"février","03"=>"mars","04"=>"avril","05"=>"mai","06"=>"juin","07"=>"juillet","08"=>"août","09"=>"septembre","10"=>"octobre","11"=>"novembre","12"=>"décembre"}->{$5}.qq( $4, page $7):undef}

Then how to deal with a monthly magazine: Un Mensuel de novembre 1914, page 6

ObjectName<Un Mensuel${filename;$_ = /(.*?)_(.*?)_(.*?)_(\d{4})(\d{2})0*(\d+)_0*(\d+)/ ? qq( de ).{"01"=>"janvier","02"=>"février","03"=>"mars","04"=>"avril","05"=>"mai","06"=>"juin","07"=>"juillet","08"=>"août","09"=>"septembre","10"=>"octobre","11"=>"novembre","12"=>"décembre"}->{$5}.qq( $4, page $7):undef}


Now, I'd better go sleeping. The rooster yells in 5 hrs.

Phil Harvey

One thing I noticed that should work but doesn't is the -userParam option when copying tags.  I will fix this in ExifTool 9.98 so that you won't have to change your argfile.  With this version, your argfile could be:

ObjectName<$myname ${filename;$_ = /(.*?)_(.*?)_(.*?)_(\d{4})(\d{2})0*(\d+)_0*(\d+)/ ? qq($6 ).{"01"=>"janvier","02"=>"février","03"=>"mars","04"=>"avril","05"=>"mai","06"=>"juin","07"=>"juillet","08"=>"août","09"=>"septembre","10"=>"octobre","11"=>"novembre","12"=>"décembre"}->{$5}.qq( $4, page $7):undef}

And the commands would then be:

exiftool -@ my.args -userparam myname="Un Quotidien du" DIR

exiftool -@ my.args -userparam myname="Un Mensuel de" DIR

although I notice you left out the day of the month in your second command for some reason, and this wouldn't take care of that difference.

- Phil

Zoot

Hey,

I'm glad to help you improve Exiftool a little.


In the "monthly" formula, the code could be stricter or lighter, but I didn't dare touch it too much: I left the entire RegEx in place and only removed the day-of-month part (the $6 chunk). It only took me a couple of tries, like the first time, when I worked out how to drop the leading zero.

Like someone who understands a foreign language without being able to speak or even write it, or who listens to a new jazz arrangement (kudos to Ornette Coleman) and follows the tune the first time, I wouldn't be able to generate a complete phrase from scratch, but I'm somehow pleased to understand a bit and not to be a complete drag for you.

Zoot

Good afternoon,

Some news on the workplace WinXP rig where I've simplified the path of the directories & files to tag.

All is working, the daily/monthly toggle, semi-auto filling other fields.

I wanted this done before tomorrow, other plans for the coming two weeks, production launch in July.

Thank you, Phil.  8)

Zoot

Hi,

While I can sometimes tweak things I have read to get what I need, such as getting "1er" (meaning "premier", like English "1st", "2nd", "3rd"), I'm struggling to understand what I have to do after reading the many posts about searching and replacing a few characters in one or many tags.

Actually, after tagging all my files, I see that I have double spaces [  ] somewhere that I don't want, needing only one space char (stupid me, I know, leaving one space after a word and the other before a variable in a formula).

-XMP-dc:Title<${tr/  / /} or -XMP-dc:Title<${tr/  / /} or variants give unwanted results.

Naturally, messing with CLI, made some mystiping (my bad) leading to (censored) even jumping to ctrl+C to stop the mess.

Thanks for any tip.




Phil Harvey

Quote from: Zoot on July 09, 2015, 04:34:09 AM
While I can sometimes tweak things I have read to get what I need, such as getting "1er" (meaning "premier", like English "1st", "2nd", "3rd"), I'm struggling to understand what I have to do after reading the many posts about searching and replacing a few characters in one or many tags.

I expect you need different things after different numbers, like English "st", "nd", "rd", "th"...  This gets complicated.  If you really want to do this, I would suggest moving away from the advanced formatting, and creating a Composite user-defined tag to do this for you.   It would simplify the command line and make things easier to understand.

Quote
Actually, after tagging all my files, I see that I have double spaces [  ] somewhere that I don't want, needing only one space char (stupid me, I know, leaving one space after a word and the other before a variable in a formula).

-XMP-dc:Title<${tr/  / /} or -XMP-dc:Title<${tr/  / /} or variants give unwanted results.

"tr" translates single characters.  Use "s" instead for string substitutions.  For "s", you must add a "g" afterwards if you want to replace more than one occurrence:  s/  / /g

Also, your syntax is incorrect.  You want to do this:  "-xmp-dc:title<${xmp-dc:title;s/  / /g}"
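
The difference between the two operators can be illustrated outside ExifTool with a small Python sketch. It only mirrors the behaviour: str.translate plays the role of "tr", re.sub the role of "s///g", and the sample title is invented for the illustration.

```python
import re

title = "Un Quotidien  du 5 novembre 1914"  # two spaces after "Quotidien"

# Character-for-character mapping, like tr/  / /: a space maps to a space,
# so the double space is still there afterwards.
mapped = title.translate(str.maketrans(" ", " "))

# Pattern substitution, like s/  / /g: each pair of spaces becomes one space.
fixed = re.sub("  ", " ", title)

# A run of three or more spaces needs a pattern such as " +" to collapse fully.
collapsed = re.sub(" +", " ", "a   b")

print(repr(mapped))
print(repr(fixed))
print(repr(collapsed))
```

In the ExifTool expression, the equivalent of the last variant would be s/ +/ /g, should longer runs of spaces ever turn up.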

- phil

Zoot

Thanks for your fast update, Phil.

The phrase begins to get complicated, indeed:
-XMP-dc:Description<édition du ${filename;$_ = /(.*?)_(.*?)_(.*?)_(\d{4})(\d{2})(\d{2})_0*(\d+)/ ? {"01"=>"1er","02"=>"2","03"=>"3","04"=>"4","05"=>"5", [snip!] "28"=>"28","29"=>"29","30"=>"30","31"=>"31"}->{$6}.qq( ) .{"01"=>"janvier","02"=>"février","03"=>"mars","04"=>"avril","05"=>"mai","06"=>"juin","07"=>"juillet","08"=>"août","09"=>"septembre","10"=>"octobre","11"=>"novembre","12"=>"décembre"}->{$5}.qq( $4, page $7):undef}

But in an .args file called by CLI, it is rather straightforward and fast.
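
As a side note, the long day table has only one special case (01 becomes "1er"; every other day just loses its leading zero). A minimal Python sketch of that rule, with the hypothetical helper name french_day, shows how little logic is actually involved; it is an illustration only and not a replacement for the args file:

```python
def french_day(dd: str) -> str:
    """Map a two-digit day to its French display form: '01' -> '1er', '05' -> '5'."""
    day = int(dd)  # int() drops the leading zero
    return "1er" if day == 1 else str(day)


for sample in ("01", "05", "31"):
    print(sample, "->", french_day(sample))
```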


And about search & replace, I tried "s" too.
That was the part between braces, I tried something like yours, here, but somehow missed something.  ::)


You saved my day: I have some papers from the mid-19th century (when the authorities wanted to catch their authors and seize their tools) that changed names three times a month or so, and it would have been a real PIA to crawl through them again.

Zoot

Hello Phil, hello all,


Tagging all the files was quite easy, and done by mid-August, compared to getting them online. The archive gallery tool we have is a bit sloppy and, most of all, abandoned, so there is no real support.
There were many crashes: not only the back office but also the public part went south so many times that we needed backup restorations. I really like to waste time with such things.

Since there is no real statistics tool to count the hits on the portal, we could measure how many people really look at it by their emails complaining about the tool being unavailable. Always look at the bright side of things, huh?


Browsing the pics is not very straightforward (except sometimes by directly modifying the address bar). We should have a more modern replacement next year.
You can use the joystick (or compass rose, as I say to older people) to navigate through the levels.

You can have a glance at a sample of our precious, here.

There is the principal and oldest daily newspaper (since 1819), l'Écho du Nord, of which only the WW1 days were scanned (the last / is important to see something), or the Liller Kriegszeitung, its replacement published during the occupation by and for German troops (printed with French types, so no ß or umlauts); a real magazine aimed at entertaining the soldiers. A great editorial performance and propaganda tool.
Many great (or soon to be great) German writers and illustrators worked on this paper and its illustrated supplement. With ExifTool, I could tag the latter specifically so it would be searchable.

Some others are more peaceful: La Flandre illustrée, Le Nouvel almanach de poche and, going back to the roots, there are plans, maps, manuscripts and whatnot.

Sooooo many thanks (again) to you, Phil for this great tool and the contributors giving tips to do more with it.