Removing BOM from tags

Started by mazeckenrode, September 11, 2020, 01:11:56 PM


mazeckenrode

Now revisiting the problem of dealing with BOMs (UTF byte order marks) at the beginning of various EXIF tags written by Directory Opus (previously discussed here). I often want to modify these tags or use them in additional metadata operations involving ExifTool, but the BOM interferes. I was hoping to incorporate some regex replacement code into my command lines to get rid of the BOM, but I'm having trouble devising something that works. If I load an ExifTool-exported JSON file that contains BOMs into Notepad++, I can successfully search for \x{FEFF} and replace it with nothing. Based on that, and on other examples of regex usage here in the forums, I came up with the following command line test to hopefully eliminate the BOM in one tag from the source PNG that was used to generate the JSON:


ExifTool "-EXIF:ImageDescription<${EXIF:ImageDescription;m/\x{FEFF}?(.*)/; $_=$1}" "Meta_BOM.png"


When I execute that, ExifTool reports no warnings or errors, and creates an updated version of the PNG with a handful of additional EXIF tags, but in a new JSON exported from the updated image, the BOM is still there. Do I need to modify my code, or is there possibly another problem somewhere?

Also, for future reference, what exactly determines whether a request for help should be posted here in Newbies, or in The "exiftool" Application? Apart from the implication that Newbies is intended for users with little or no ExifTool experience, both forums appear to be largely populated with requests for help. Or does it even matter to you guys? Just want to make sure I'm posting to the right forum, if there is one.

StarGeek

Maybe try it as separate bytes instead of a single wide character?
m/\xFE\xFF?(.*)/; $_=$1

But then, I have no clue.  As I mentioned I've occasionally come across BOMs midstring while scraping data.  In my notes, I have this regex to remove them
s/\xEF\xBB\xBF//g
So 3 bytes instead of 2.  No idea as to the difference.
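
On the 2-byte versus 3-byte question: both forms are the same code point, U+FEFF, written in different encodings. A minimal Python sketch (Python is used only for illustration here; it is not part of ExifTool, and the variable names are made up for the example) that prints the byte forms:

```python
# U+FEFF is the byte order mark code point.
# Its byte representation depends on the encoding used to store it.
BOM = "\ufeff"

utf8_bom = BOM.encode("utf-8")         # three bytes
utf16be_bom = BOM.encode("utf-16-be")  # two bytes, big-endian
utf16le_bom = BOM.encode("utf-16-le")  # the same two bytes, swapped

print(utf8_bom.hex())     # efbbbf
print(utf16be_bom.hex())  # feff
print(utf16le_bom.hex())  # fffe
```

So a regex written against raw UTF-8 bytes needs the 3-byte form, while \x{FEFF} only matches when the string is handled as decoded characters.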

Edit: As to the sub-forums, I rarely notice the location.  I always read through the Show unread posts since last visit near the top.  Most often I'll only worry when it's in Metadata and is more relevant to Newbies/Exiftool Application or when it's someplace else and should be in Developer or Bugs/Features.

Edit 2: Found the source from when I was dealing with the midstring BOMs.  This StackOverflow answer may have some other options to try.
* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
 
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).

Phil Harvey

Yes.  Newbies and the ExifTool application boards are for the same thing.  I added the Newbies forum only because some new users were intimidated to post on the application forum.

Regarding the regex, all values are a stream of bytes (not wide characters) in ExifTool.  So I would do this to remove a leading BOM:  ${EXIF:ImageDescription;s/^\xef\xbb\xbf// or $_=undef}

EF BB BF is a UTF-8 BOM (thanks StarGeek), which is what you should get unless you change the ExifTool charset.

Setting $_ to undef if nothing was changed prevents ImageDescription from being copied, so only the files with a BOM will get updated.
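
The "strip it, or skip the write" logic of that expression can be sketched outside ExifTool as well. The following is a rough Python analogue, assuming the value arrives as raw UTF-8 bytes; the function name strip_leading_bom is hypothetical and None stands in for Perl's undef:

```python
from typing import Optional

UTF8_BOM = b"\xef\xbb\xbf"


def strip_leading_bom(value: bytes) -> Optional[bytes]:
    """Return value without its leading UTF-8 BOM, or None if it had none.

    Returning None plays the role of "$_=undef" in the ExifTool expression:
    it signals that nothing changed, so the tag need not be rewritten.
    """
    if value.startswith(UTF8_BOM):
        return value[len(UTF8_BOM):]
    return None


print(strip_leading_bom(b"\xef\xbb\xbfSunset"))  # b'Sunset'
print(strip_leading_bom(b"Sunset"))              # None
```

Only a BOM at the very start is removed, matching the ^ anchor in the regex; a BOM in the middle of the value is left alone.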

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

mazeckenrode

Quote from: StarGeek on September 11, 2020, 02:01:01 PM
I have this regex to remove them
s/\xEF\xBB\xBF//g
...
This StackOverflow answer may have some other options to try.

So basically, I've now tried it all these ways, which were suggested in the Stack Overflow answer:

^\x{FEFF}?(.*)
^\N{U+FEFF}?(.*)
^\N{ZERO WIDTH NO-BREAK SPACE}?(.*)
^\N{BOM}?(.*)


None removed the BOM, and the latter two caused Warning: Can't locate _charnames.pm in @INC (you may need to install the _charnames module), which you guys might have already guessed.

Next I tried StarGeek's way, ^\xEF\xBB\xBF(.*), which worked for me. Interestingly, this is also the way that Signal15 (initiator of the Stack Overflow topic) had tried, though it had reportedly failed for him. He was working with Perl, and ExifTool is Perl, so I'm confused as to why it works with one but not the other.

I then tried putting my BOM-matching code inside a non-capture group, so I can get other operations to work on the same tag(s) in the same command line, whether a BOM is there or not:

ExifTool "-EXIF:ImageDescription<${EXIF:ImageDescription;m/^(?:\xEF\xBB\xBF)?(.*)/; $_=$1}" "Meta_BOM.png"

This seems to have done what I wanted.
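
For reference, the same optional non-capture group can be tried on raw bytes outside ExifTool. This is only a sketch in Python, with a hypothetical helper name (without_bom). One deliberate difference: it uses DOTALL so that multi-line values are kept whole, whereas in Perl a plain (.*) without the /s modifier stops at the first newline.

```python
import re

# Optional UTF-8 BOM in a non-capturing group, then capture the rest.
# re.DOTALL lets "." match newlines so multi-line values are not truncated.
PATTERN = re.compile(rb"^(?:\xef\xbb\xbf)?(.*)", re.DOTALL)


def without_bom(value: bytes) -> bytes:
    """Return value minus a leading UTF-8 BOM, whether or not one is present."""
    # The pattern always matches (both parts may match nothing),
    # so calling .group(1) directly is safe here.
    return PATTERN.match(value).group(1)


print(without_bom(b"\xef\xbb\xbfcaption"))  # b'caption'
print(without_bom(b"caption"))              # b'caption'
```

Because the group is optional, values without a BOM pass through unchanged, which is what lets other operations run on the same tag in the same command.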

Quote from: Phil Harvey on September 11, 2020, 03:44:22 PM
Newbies and the ExifTool application boards are for the same thing.
...
Setting $_ to undef if nothing was changed prevents ImageDescription from being copied, so only the files with a BOM will get updated.

Ok, I'll keep that in mind.

Thanks again to both of you for the help.

mazeckenrode

After removing a few BOMs (in the course of other metadata manipulations) and seeing that their removal didn't prevent Directory Opus from being able to display or edit the affected metadata, I started a help topic in their forum, asking if the BOMs are actually necessary, or could perhaps be phased out in a future update. One of their admins replied:

"BOMs remove ambiguity regarding which 8-bit codepage is in use, and I think are written by other tools as well. (AFAIK, we did not invent doing that for this type of metadata.)"

He went on to suggest that I instead ask if ExifTool could be made to handle BOMs differently, so I pose the question here: Is there any merit or interest, or lack thereof, in changing ExifTool's default handling of BOMs, and/or implementing an option to leave them in place as first characters during write operations when they already exist there, and leave them out of any string-value manipulations otherwise? (Or is there already a way to do this that I've missed?) I'm sure I can modify my prepend command line to take care of preserving the BOM in that location, but just wondering.

StarGeek

Quote from: mazeckenrode on September 12, 2020, 07:38:53 PM
"BOMs remove ambiguity regarding which 8-bit codepage is in use, and I think are written by other tools as well. (AFAIK, we did not invent doing that for this type of metadata.)"

I'd have to disagree with them about other tools doing so.  I've never seen a BOM in any image I've collected from the web, nor have any of the tools I've tested ever used a BOM.

Phil Harvey

Quote from: StarGeek on September 12, 2020, 09:05:45 PM
I've never seen a BOM in any image I've collected from the web, nor have any of the tools I've tested ever used a BOM.

I'd have to agree.

- Phil

mazeckenrode

More challenges with this...

Previously, I successfully used this in a Directory Opus custom command to present me with a string dialog ({dlgstring...), set a DOpus variable (@set descprfx=...) to that string, then use ExifTool to prepend the string to an existing EXIF:ImageDescription tag that was originally written by DOpus, eliminating the first-character BOM in the process:

@set descprfx={dlgstring|Enter string to prepend to Description/Subject/Title/Comment}$
ExifTool "-EXIF:ImageDescription<{$descprfx}{EXIF:ImageDescription;m/^(?:\xEF\xBB\xBF)?(.*)/; $_=$1}" .


Some explanation, in case needed: While the string returned by {dlgstring... can include a trailing space, @set descprfx=... strips any leading and trailing spaces from it. When I prepend to an existing tag value, my string to be prepended nearly always has a trailing space, for example "fiddlesticks; ". For that reason, I moved the $ from just before {EXIF:ImageDescription... to the end of my @set descprfx=... operation. DOpus then passes the string to ExifTool.

At this point, I've elected to attempt preserving the BOM as the first character in EXIF:ImageDescription, so I'm revising my command, but attempts so far are giving me errors, so I must be missing something. Headway, or lack thereof:

@set descprfx={dlgstring|Enter string to prepend to Description/Subject/Title/Comment}.$
ExifTool "-EXIF:ImageDescription<${EXIF:ImageDescription;m/^(\xEF\xBB\xBF)?(.*)/; $_=$1.{$descprfx}2}" .


Results in:


Warning: syntax error at (eval 89) line 1, near "; ."
Bareword "fiddlesticks" not allowed while "strict subs" in use for 'EXIF:ImageDescription' - ./DOpus_Meta_BOM.png


@set descprfx={dlgstring|Enter string to prepend to Description/Subject/Title/Comment}$
ExifTool "-EXIF:ImageDescription<${EXIF:ImageDescription;m/^(\xEF\xBB\xBF)?(.*)/; $_=$1{$descprfx}2}" .


Results in:


Warning: syntax error for 'EXIF:ImageDescription' - ./DOpus_Meta_BOM.png
1 directories scanned
1 image files updated


(But value of EXIF:ImageDescription did not change.)

How do I incorporate the string from DOpus between $1 and $2?

StarGeek

The first error tells you that you are passing a Bare Word, not a character string.  You need to make sure it's treated as a string, which means it needs to be quoted. So you're currently passing something like
$_=$1.fiddlesticks
when you need to be passing
$_=$1.'fiddlesticks'
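
One caveat with simply wrapping the text in single quotes: a prefix that itself contains an apostrophe or a backslash would break the expression again. Below is a small Python sketch of how text could be escaped before being dropped into a Perl single-quoted string. The helper name perl_single_quote is hypothetical, and it only covers the Perl side; whatever quoting the shell or Directory Opus applies is assumed to be handled separately.

```python
def perl_single_quote(text: str) -> str:
    """Wrap text as a Perl single-quoted string literal.

    Inside Perl single quotes only the backslash and the single quote
    need escaping, so those are the two characters handled here.
    """
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


print(perl_single_quote("fiddlesticks; "))  # 'fiddlesticks; '
print(perl_single_quote("Bob's photo: "))   # 'Bob\'s photo: '
```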

mazeckenrode

Quote from: StarGeek on September 16, 2020, 01:02:24 PM
You need to make sure it's treated as a string, which means it needs to be quoted.

Bingo. Working revised command:

@set descprfx='{dlgstring|Enter string to prepend to Description/Subject/Title/Comment}'
ExifTool "-EXIF:ImageDescription<${EXIF:ImageDescription;m/^(\xEF\xBB\xBF)?(.*)/; $_=$1.{$descprfx}.$2}" .
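
For anyone who wants to check the intended result outside ExifTool, the "keep the BOM first, then the prefix, then the old value" logic looks roughly like this in Python. This is a sketch only, assuming UTF-8 byte values; the function name prepend_keeping_bom is made up for the example.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def prepend_keeping_bom(value: bytes, prefix: bytes) -> bytes:
    """Insert prefix before value, but after a leading UTF-8 BOM if present."""
    if value.startswith(UTF8_BOM):
        # Same idea as $1 . prefix . $2 when the BOM group matched.
        return UTF8_BOM + prefix + value[len(UTF8_BOM):]
    # No BOM: the optional group matched nothing, so just prepend.
    return prefix + value


print(prepend_keeping_bom(b"\xef\xbb\xbfold text", b"fiddlesticks; "))
# b'\xef\xbb\xbffiddlesticks; old text'
print(prepend_keeping_bom(b"old text", b"fiddlesticks; "))
# b'fiddlesticks; old text'
```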


Thanks yet again!