Issue when truncating UTF8-string - malformed last symbol

Started by VlCTOR, October 12, 2017, 05:13:21 PM

VlCTOR

Hello, Phil.
Many thanks for your work.
I am using Windows 7 x64 with ExifTool 10.63

I'd like to point out an issue with truncating UTF-8 encoded strings for some tags. It happens when the length of the string in bytes exceeds the maximum length of the corresponding metadata tag. Since UTF-8 uses up to four bytes per symbol, the string is often truncated in the middle of a multi-byte symbol (the last symbol doesn't fit entirely). As a result, the tag stores the string with a malformed last symbol. See the attached collage. This looks wrong and causes failures in automatic photo cataloging by metadata.
I'm aware that the problem can be partly avoided with the -m option, but this is not a good solution because it doesn't follow the IPTC standard.

It would be great if ExifTool could replace this last symbol with a dot, which is a common abbreviation mark.

I have a lot of photos processed by the GeoTag application (http://geotag.sourceforge.net/). GeoTag uses your program to write metadata. Most of the photos contain damaged IPTC:Province-State and some other tags. Please advise how this problem can be solved at this stage.

Phil Harvey

This is a good point, but the situation is more complicated than one might think at first glance.  I will look into fixing this for the next release, but I will likely be changing malformed UTF-8 bytes to "?" and not ".".  Once this is done, you should be able to fix existing files by rewriting the IPTC with ExifTool.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

VlCTOR

Thank you for your attention to the problem and an immediate answer, Phil.

I understand that you need to analyze the bits of the last few bytes of the truncated string to find the character boundary. Alternatively, start from the beginning of the string and sum the variable byte lengths symbol by symbol until the limit is reached.
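
The bit analysis described above can be sketched in Python as follows. This is only an illustration of the idea, not ExifTool's code: the function name truncate_utf8 is hypothetical, and the input is assumed to be valid UTF-8 before the cut.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut UTF-8 bytes to at most `limit` bytes without splitting a character.

    Assumes `data` is valid UTF-8 (hypothetical helper for illustration).
    """
    if len(data) <= limit:
        return data
    end = limit
    # Continuation bytes have the bit pattern 10xxxxxx. If the first dropped
    # byte is one, the cut falls inside a character, so back up to its lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


name = "Санкт-Петербург"      # each Cyrillic letter takes 2 bytes in UTF-8
raw = name.encode("utf-8")

naive = raw[:7]               # plain byte cut: splits the fourth letter
safe = truncate_utf8(raw, 7)  # backs up to the previous character boundary

print(naive.decode("utf-8", errors="replace"))  # ends in a replacement character
print(safe.decode("utf-8"))                     # three intact letters
```

Backing up from the cut point only inspects a few bytes at most, so it is cheaper than walking the string from the beginning, although both approaches give the same result.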

Many people whose native languages do not have a word order imposed by grammar (e.g. Russian, Finnish, Ukrainian, Hungarian and many others) will be misled by a question mark at the end of the truncated string. They will apply the question mark to the entire string instead of only the last character. It is better to use the dot character, which is generally accepted for abbreviations. Or better still, add nothing at all.

What can I do now with strings that already contain malformed UTF-8 bytes?
Can I use a function like this in the config file:
$mystring = iconv('utf-8', 'utf-8//IGNORE', $mystring);
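
For readers unfamiliar with it, the iconv call above appears to use PHP-style syntax, and its //IGNORE suffix asks the converter to drop byte sequences that are not valid in the target encoding. The Python sketch below shows the same effect; it is only an analogy, the function name drop_malformed_utf8 is hypothetical, and nothing here implies that this call can be used inside an ExifTool config file.

```python
def drop_malformed_utf8(data: bytes) -> bytes:
    """Return `data` with byte sequences that are not valid UTF-8 removed."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")


# "Москва" is 12 bytes in UTF-8; cutting at 11 bytes leaves half of the last letter.
damaged = "Москва".encode("utf-8")[:11]
cleaned = drop_malformed_utf8(damaged)

print(cleaned.decode("utf-8"))  # the five intact letters, nothing appended
```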

Phil Harvey

The problem is not finding the boundary characters.  The problem is due to the fact that ExifTool doesn't know the final encoding until it processes each file, but currently ExifTool validates the value and truncates if necessary before the first file is processed -- I'll have to change this to do it as each file is processed because the same string may be encoded differently in different files depending on the value of CodedCharacterSet in each file.
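
To illustrate why the encoding matters, the same string can fit within a byte limit in one encoding and exceed it in another. The Python sketch below assumes a 32-byte limit for IPTC:Province-State and uses cp1251 purely as an example of a single-byte encoding; both are assumptions made for the illustration, and the helper name fits is hypothetical.

```python
LIMIT = 32  # assumed byte limit for IPTC:Province-State


def fits(value: str, encoding: str, limit: int = LIMIT) -> bool:
    """True if `value` needs no truncation when stored in `encoding`."""
    return len(value.encode(encoding)) <= limit


value = "Ленинградская область"

for encoding in ("cp1251", "utf-8"):
    size = len(value.encode(encoding))
    print(encoding, size, "bytes, fits:", fits(value, encoding))
```

Under these assumptions the string fits when written with a single-byte Cyrillic code page but needs truncation as UTF-8, so the decision cannot be made before the target file's character set is known.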

ExifTool already has a "FixUTF8" function which replaces bad bytes with question marks, so it was most convenient for me to use this, but I understand your objection and will think about this.
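
The general idea of replacing bad bytes with a marker can be sketched in Python as below. This is not ExifTool's FixUTF8 code; the function name replace_malformed_utf8 and its mark parameter are hypothetical, and the example only shows how the choice of marker ("?", "." or nothing) changes the result.

```python
def replace_malformed_utf8(data: bytes, mark: str = "?") -> str:
    """Decode `data` as UTF-8, substituting `mark` for each malformed sequence."""
    # errors="replace" inserts U+FFFD for every malformed sequence.
    # Note: a genuine U+FFFD already present in the input is replaced as well.
    text = data.decode("utf-8", errors="replace")
    return text.replace("\ufffd", mark)


# "Київська" is 16 bytes in UTF-8; a 15-byte cut splits the last letter.
damaged = "Київська".encode("utf-8")[:15]

print(replace_malformed_utf8(damaged))        # marker "?"
print(replace_malformed_utf8(damaged, "."))   # marker "."
print(replace_malformed_utf8(damaged, ""))    # no marker
```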

I don't know what iconv does, so I can't help there.

- Phil

VlCTOR

I understand, Phil. I hope you have the time and energy to implement it.
Once again, I want to express my gratitude to you for doing a great job.

ydobemos

Quote from: Phil Harvey on October 13, 2017, 07:37:33 PM
ExifTool already has a "FixUTF8" function which replaces bad bytes with question marks, so it was most convenient for me to use this, but I understand your objection and will think about this.

Hi there. First of all, thank you for creating such a popular and useful tool. Second, Happy New Year!

Sorry for bumping this topic, but my question is directly related to this sentence. Is there a way to disable this FixUTF8 functionality and have ExifTool just directly copy whatever data is there, bit by bit? I am using a 360 camera (Mi Sphere) and it writes a rotation matrix from its gyroscope data to the UserComment field, which is basically a bunch of pure 32 bit floats. And no matter what I try ExifTool wrecks that data by converting what it thinks are malformed characters to question marks. Which makes the applications that stitch the images from the camera not work, so I cannot really edit the images successfully.

So basically if there was a way to just directly copy that data from an original file to an edited one (I use Affinity Photo, which also uses ExifTool, so also wrecks the data) that would be great.

Not sure if it will work here, but here's what that data looks like before: "M�.�}�.:��z<...���.�~.7:��z<��4:L�.?"  and after: "Mø.¿}è.:ìöz<...ºùÿ.¿~.7:Œøz<ìÉ4:Lø.?"
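
A short Python sketch shows why packed floats do not survive being treated as text. The matrix values and the little-endian 3x3 layout (nine 32-bit floats) are assumptions for illustration only, not the camera's documented format.

```python
import struct

# Hypothetical 3x3 rotation matrix stored as nine 32-bit floats
matrix = [0.0, -1.0, 0.0,
          1.0, 0.0, 0.0,
          0.0, 0.0, 1.0]
raw = struct.pack("<9f", *matrix)   # 9 floats * 4 bytes

# Treating the bytes as UTF-8 text replaces every invalid byte...
as_text = raw.decode("utf-8", errors="replace")
# ...and writing that text back no longer yields the original bytes.
back = as_text.encode("utf-8")

print(len(raw), len(back), raw == back)
print(struct.unpack("<9f", raw))    # the original values are intact only in `raw`
```

Any byte in the range 0x80 to 0xFF that does not form a valid UTF-8 sequence is lost in the round trip, which is why the field has to be copied as binary data.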

Phil Harvey

You can override the definition of EXIF:UserComment to defeat the character decoding.  A config file like this should allow you to read/write it as binary data:

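# Override the built-in definition of EXIF:UserComment (tag ID 0x9286)
%Image::ExifTool::UserDefined = (
    'Image::ExifTool::Exif::Main' => {
        0x9286 => {
            Name => 'UserComment',
            Format => 'undef',      # read the value as raw bytes
            Writable => 'undef',    # write the value as raw bytes
            Binary => 1,            # treat as binary data (extract with -b)
        },
    },
);
1; #end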


- Phil

ydobemos

Cool, thanks for the super-fast reply. Will give it a try asap... ...figured out the config...

And there may be a problem. For some reason it inflates the data to twice the size:

D:\MI SPHERE\20171230>exiftool -UserComment IMG_20171230_130628.DNG RAW_20171230_130628.jpg
======== IMG_20171230_130628.DNG
User Comment                    : (Binary data 36 bytes, use -b option to extract)
======== RAW_20171230_130628.jpg
User Comment                    : (Binary data 80 bytes, use -b option to extract)
    2 image files read


Any ideas of what may be causing this?

Phil Harvey

Use the -v3 option to see what the data looks like.

- Phil

ydobemos

Thanks. Looks like that came up when copying from the DNG file, but the camera produces a pair of DNG and JPG, so when copying from a JPG the size values are the same.

But the editing tools still don't accept the file... so I started digging and reading and found out by comparing the output that the Exif Byte Order is different between the files. When comparing the main difference is that the original image uses Little-endian (Intel, II) and the edited one uses Big-endian (Motorola, MM).
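
As a rough illustration of what the byte order difference means, the Python sketch below reads the byte order mark from a TIFF/EXIF header and packs the same float both ways. The helper name tiff_byte_order is hypothetical; the header values are the standard "II" and "MM" markers followed by the number 42.

```python
import struct


def tiff_byte_order(header: bytes) -> str:
    """Report the byte order declared in the first 4 bytes of a TIFF/EXIF header."""
    if header[:4] == b"II*\x00":
        return "little-endian (Intel, II)"
    if header[:4] == b"MM\x00*":
        return "big-endian (Motorola, MM)"
    raise ValueError("not a TIFF/EXIF header")


print(tiff_byte_order(b"II*\x00\x08\x00\x00\x00"))
print(tiff_byte_order(b"MM\x00*\x00\x00\x00\x08"))

# The same 32-bit float has its bytes reversed between the two orders.
little = struct.pack("<f", 0.5)
big = struct.pack(">f", 0.5)
print(little.hex(), big.hex())
```

An application that reads raw float bytes without checking the declared byte order would get different values from the two files, which may be one reason the stitching software rejects the edited image.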

I dug through the forum and tried this:
exiftool -tagsfromfile IMG_20180101_202950.JPG -all:all -unsafe -exifbyteorder=little-endian TOAST.jpg

But the result was the same. Also, as you can see in the attached image there's a decent amount of differences in the files. Is there some way to just directly transplant the Exif data from one file to another? No error checking, no byte conversions?

Phil Harvey

A direct transplant is done like this:

exiftool -tagsfromfile IMG_20180101_202950.JPG -exif TOAST.jpg

- Phil

ydobemos

Fantastic, that works! Thanks again.

However, it works for "clean" files, that do not have any additional metadata added. If I have edited the files and the editing program has added anything extra the "native" 360 camera app does not recognize them. But if I completely remove all metadata using ExifTool and then use the command above everything works out.

I tried adding the command to remove all tags to the destination file, like this:
exiftool -tagsfromfile IMG_20180101_202950.JPG -exif TOAST.jpg -all=

...but that seems to remove all the data AFTER copying the tags from the donor file. Is there a way to clear the receiving file's metadata and do -tagsfromfile in a single line? If not that's no biggie, I will look into automating this process anyway. But it's a big step forward already.

Phil Harvey

Quote from: ydobemos on January 02, 2018, 09:06:31 AM
exiftool -tagsfromfile IMG_20180101_202950.JPG -exif TOAST.jpg -all=

...but that seems to remove all the data AFTER copying the tags from the donor file. Is there a way to clear the receiving file's metadata and do -tagsfromfile in a single line?

Yes:

exiftool -all= -tagsfromfile IMG_20180101_202950.JPG -exif TOAST.jpg

See FAQ 22 for more information.
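
For anyone automating this, a minimal Python sketch of building the command with the arguments in the order shown above might look like this. The function name transplant_args is hypothetical, and the sketch assumes exiftool is available on the PATH; it only builds the argument list and leaves running it to the caller.

```python
from typing import List


def transplant_args(donor: str, target: str) -> List[str]:
    """Build the command line: clear the target's metadata, then copy EXIF from the donor."""
    # Order matters: -all= must come before -tagsfromfile, as in the command above.
    return ["exiftool", "-all=", "-tagsfromfile", donor, "-exif", target]


args = transplant_args("IMG_20180101_202950.JPG", "TOAST.jpg")
print(" ".join(args))

# To actually run it (requires exiftool to be installed):
# import subprocess
# subprocess.run(args, check=True)
```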

- Phil
