IPTC data alt-lang question

Started by ScannerBoy, October 10, 2024, 11:52:31 AM

ScannerBoy

Some time ago, I came across the attached jpg file.
As far as I can tell, it contains only IPTC data. In the ExifTool output (run under Mint 21.3, i.e. without code page issues), much of the text appears to be in Spanish, but it does not display as expected.
For instance:
Headline                        : Fórum Mitos & Fatos â€" Jovem Pan Discute: A São Paulo do Futuro


My hope is that this is an instance where the alt-lang text is poorly/wrongly encoded or recorded, but, for my own education, I would like to get some more expert opinions.


StarGeek

This usually happens when a (usually Windows) program that doesn't properly deal with characters beyond simple ASCII tries to write UTF-8 characters to the IPTC data.

See Mojibake, Other Western European languages.
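
As a rough illustration of that failure mode (a Python sketch, not anything recovered from the file itself): if UTF-8 bytes are read back as if they were Windows-1252 and the result is then written out as UTF-8 again, each accented character turns into two or three odd-looking ones. That cp1252 was the wrong code page involved is an assumption.

```python
# Sketch of how this kind of mojibake can arise. The code page (cp1252) is an
# assumption; the program that wrote the file may have used a different one.
original = "Fórum Mitos & Fatos – Jovem Pan Discute: A São Paulo do Futuro"

utf8_bytes = original.encode("utf-8")   # what a UTF-8 aware writer produces
garbled = utf8_bytes.decode("cp1252")   # how a cp1252-minded program reads those bytes
stored = garbled.encode("utf-8")        # what it then writes back, now double-encoded

print(garbled)
print(len(utf8_bytes), "bytes became", len(stored), "bytes")
```

The text is still valid UTF-8 after this, which is why a reader that trusts the declared character set shows the garbled form rather than an error.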

"It didn't work" isn't helpful. What was the exact command used and the output.
Read FAQ #3 and use that cmd
Please use the Code button for exiftool output

Please include your OS/Exiftool version/filetype

ScannerBoy

That is an interesting reference. But when I returned to one of my earlier efforts to understand and read image metadata, I came across some interesting notes I had left for myself. FWIW, at the time I was mainly using the Exiv2 libraries, and the old utility built on them rendered the text as expected, with a conversion from UTF-8.
I am no longer sure where these comments came from, but they seem to hold true just the same:

Photoshop takes liberties with the interpretation of ASCII and ISO 646. Through version 7, it uses local OS single-byte (8-bit) encoding where TIFF/Exif and IPTC specify 7-bit ASCII or ISO 646. There is no indication in the file of what this encoding is. Photoshop 9 continues to use local OS single-byte (8-bit) encoding for IPTC, but uses UTF-8 in TIFF/Exif for tags of type ASCII.
Photoshop also ignores the Exif admonition to use the UserComment tag instead of ImageDescription when the value contains non-ASCII characters. Photoshop always writes ImageDescription, never UserComment
But note: I have found one file with only IPTC data, obviously using Spanish UTF-8 characters.
The even more interesting thing is that Exiv2 returns the character set (Iptc.Envelope.CharacterSet) for the chunk of IPTC data as the first item in its IPTC metadata.

Checking the hex data for the aaa.jpg file, the tag 005a - Iptc.Envelope.CharacterSet is indeed the last item in the APP13 Photoshop data segment, but when I retrieve the IPTC data for that file, it is the first entry returned. Even though I am not aware of any recommendations about this tag or the sequence of tags within the data chunk, it seems convenient to have it returned at the start, which allows using the information for the rest of the batch.
While I would readily agree that this may well be an 'old' file, it would still be convenient if one could handle the data in a way that makes life easier for the users.
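
For anyone who wants to check the order of the datasets themselves, here is a minimal Python sketch of walking raw IPTC-IIM datasets. It assumes the standard layout (tag marker 0x1C, record number, dataset number, 2-byte big-endian length), does not handle the extended-length form, and uses made-up sample bytes rather than the data from aaa.jpg. The names parse_iim and make_dataset are hypothetical, and falling back to latin-1 when 1:90 is absent is also an assumption.

```python
import struct

UTF8_ESCAPE = b"\x1b%G"  # ISO 2022 escape sequence that 1:90 uses to declare UTF-8


def make_dataset(record, dataset, value):
    """Build one IIM dataset (hypothetical helper, only for the sample below)."""
    return b"\x1c" + struct.pack(">BBH", record, dataset, len(value)) + value


def parse_iim(data):
    """Yield (record, dataset, value) for each standard-length dataset."""
    pos = 0
    while pos + 5 <= len(data) and data[pos] == 0x1C:
        record, dataset, length = struct.unpack_from(">BBH", data, pos + 1)
        pos += 5
        yield record, dataset, data[pos:pos + length]
        pos += length


# Made-up sample: Headline (2:105) written before CharacterSet (1:90, tag 0x5a).
sample = (make_dataset(2, 105, "Fórum".encode("utf-8"))
          + make_dataset(1, 90, UTF8_ESCAPE))

datasets = list(parse_iim(sample))

# Look for 1:90 anywhere in the block before decoding anything else.
charset = next((v for r, d, v in datasets if (r, d) == (1, 90)), None)
encoding = "utf-8" if charset == UTF8_ESCAPE else "latin-1"  # fallback is an assumption

# Record numbers are expected to be non-decreasing through the block.
records = [r for r, _, _ in datasets]
in_sequence = records == sorted(records)

for record, dataset, value in datasets:
    if record == 2:
        print(f"{record}:{dataset}", value.decode(encoding, errors="replace"))
print("records in sequence:", in_sequence)
```

Scanning for 1:90 first means the rest of the batch can be decoded with the declared character set regardless of where that dataset sits in the file.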

StarGeek

I really don't have anything I can help you with here, as I never could make sense of character coding issues. I always just either fixed the problem data manually or replaced it.

This file does have an additional warning. If you run
exiftool -g1 -a -s -warning -validate file.jpg
you'll see that one of the warnings is
IPTC doesn't conform to spec: Records out of sequence

It should also be noted that your note on ImageDescription is outdated as of EXIF 3.0. But that revision is only a little over a year old.

FrankB

I do agree with Stargeek that these files are best fixed, to conform to standards. But....

You can get results using:

chcp 65001
-charset latin

C:\Foto\ScannerBoy>chcp 65001
Active code page: 65001

C:\Foto\ScannerBoy>exiftool -charset latin -iptc:all aaa.jpg
Object Name                     : FORUM MITOS E FATOS-JOVEM PAN DISCUTE A SÃO PAULO DO FUTURO
Keywords                        : brasil
Special Instructions            : Fórum Mitos & Fatos – Jovem Pan Discute: A São Paulo do Futuro
By-line                         : Marcello Fim/Ofotográfico/Agência O Globo
By-line Title                   : Fotógrafo
City                            : São Paulo
Sub-location                    : Hotel Tivoli Mofarrej
Province-State                  : São Paulo
Country-Primary Location Code   :
Country-Primary Location Name   :
Original Transmission Reference :
Headline                        : Fórum Mitos & Fatos – Jovem Pan Discute: A São Paulo do Futuro
Credit                          : Marcello Fim
Source                          :
Copyright Notice                : Marcello Fim/Ofotográfico/Agência O Globo
Caption-Abstract                : São Paulo, SP, 11.03.2019: Fórum Mitos & Fatos – Jovem Pan Discute: A São Paulo do Futuro. Henrique Meirelles, Secretário da Fazenda do Estado de São Paulo participa do Fórum Mitos & Fatos - Jovem Pan Discute: "A São Paulo do Futuro", no Hotel Tivoli Mofarrej na zona central da capital paulista, nesta segunda-feira (11). O evento reuniu representantes do governo estadual, especialistas e empresários que se destacam em São Paulo para discutir o maior estado do país e os caminhos para a manutenção de seu crescimento. (Foto: Marcello Fim/Ofotográfico/Agência O Globo) Política
Writer-Editor                   : 30916
Coded Character Set             : UTF8

C:\Foto\ScannerBoy>

Using ExifToolGui, it works by specifying this option via 'Options/Custom options'.
(Other files may not display correctly)

<off_topic>
I don't believe it is Spanish, but Portuguese (Brasil)
</off_topic>

StarGeek

Quote from: FrankB on October 10, 2024, 06:38:19 PM
I do agree with Stargeek that these files are best fixed, to conform to standards. But....

You can get results using:

chcp 65001
-charset latin

Ah, that did it.

I was trying to use
-charset IPTC=Latin
but that didn't work.

I did find a website that could properly reverse it, but that wouldn't have been directly useful without some code.
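
For what it's worth, the kind of reversal such a site performs can be sketched in a few lines of Python. This assumes the damage came from UTF-8 bytes being misread as cp1252 and then re-encoded, which matches the sample in this thread but is not guaranteed for other files. The name undo_double_encoding is hypothetical.

```python
from typing import Optional


def undo_double_encoding(text: str) -> Optional[str]:
    """Return a repaired string, or None if the text does not look double-encoded.

    Assumes the wrong code page was cp1252; other code pages need other codecs.
    """
    try:
        # Map each character back to the single byte it came from, then
        # read the resulting byte string as the UTF-8 it originally was.
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # not representable in cp1252, or not valid UTF-8 afterwards
    return repaired if repaired != text else None


for value in ["S\u00c3\u00a3o Paulo", "São Paulo", "brasil"]:
    print(repr(value), "->", repr(undo_double_encoding(value)))
```

Correctly encoded accented text normally fails the round trip and comes back as None, so the same function can be used to flag suspect values and offer a suggested repair without touching the file.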

ScannerBoy

Thank you all for the prompt and helpful comments and information.

Portuguese it is; all I was sure of was that it was not English ;-)

Part of my old utility does run ET to validate the file's content. Here is what it reports:

--- Output From ""C:\exiftool\exiftool.exe" -validate -warning -a "D:\TestImages\IPTC\aaa.jpg""
Validate                        : 3 Warnings (all minor)
Warning                        : [minor] IPTC By-line too long (43 bytes; should be 32 max)
Warning                        : [minor] IPTC Country-PrimaryLocationCode too short (0 bytes; should be 3)
Warning                        : [minor] IPTC doesn't conform to spec: Records out of sequence
--- End of output ---

So yes, the output has problems; it is apparently out of sequence, though the error message does not identify which records are out of sequence - FWIW, I have not been able to confirm that the Charset spec MUST be first - though that would be expected and make sense.
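
A utility that wants to report these findings, rather than just dump them, could split the output into severity and message. Here is a minimal Python sketch, assuming the output keeps the "Warning : [minor] ..." layout shown above. The name parse_warnings is hypothetical, and the sample text is copied from the output above.

```python
import re

# Matches "Warning   : [minor] message" as well as "Warning   : message".
WARNING_LINE = re.compile(r"^Warning\s*:\s*(?:\[(\w+)\]\s*)?(.*)$")


def parse_warnings(output):
    """Return a list of (severity, message) tuples; severity is '' if untagged."""
    warnings = []
    for line in output.splitlines():
        match = WARNING_LINE.match(line.strip())
        if match:
            warnings.append((match.group(1) or "", match.group(2)))
    return warnings


sample_output = """\
Validate                        : 3 Warnings (all minor)
Warning                        : [minor] IPTC By-line too long (43 bytes; should be 32 max)
Warning                        : [minor] IPTC Country-PrimaryLocationCode too short (0 bytes; should be 3)
Warning                        : [minor] IPTC doesn't conform to spec: Records out of sequence
"""

warnings = parse_warnings(sample_output)
for severity, message in warnings:
    print(f"{severity or 'warning'}: {message}")
```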

From experience, and from reading many a comment on metadata and the utilities that read and write it, my biggest takeaway is: don't depend on it too much, and don't expect all, most, or even many of the details to meet all of the specs. The specs, after all, are huge and open to interpretation :-(

@FrankB: I very much appreciate the help on how to 'fix' the issue.
One of my difficulties with this approach is my hope and desire to report/interpret the metadata as found, rather than modifying the data, if at all possible. Similar to Phil's argument regarding the Software used to update any items.
So I suppose, the 'ideal' option (IMO) would be to report the problem and give the user the option to 'fix' or leave as is.
The same argument would necessarily go for the several synonyms/replacement for old tags in more recent name spaces. Or duplicating/copying/moving the old data to the new tags/name spaces.

As for getting the 'correct' data on Windows, I understand your suggestion and get the same results.

As it is, I have changed the default code page on my Win 11 PC to 65001.

A second point that makes me slow to mark that suggestion as a solution is the idea that any utility meant to help an average user inspect and understand the metadata in a file/image (which he may have inherited) should not expect that user to be an expert in the finer points of the long and colorful history of metadata.

If I seem pedantic about this point, it is only because, over many years, I have been surprised by, and have struggled with, what seems a severe lack of attention among mainly English-speaking programmers, even some Adobe programmers, to the possibility of any language other than English.
Having found Exiftool eventually, the awareness by the developer and the community of different code pages and languages, is what had me leave Exiv2 on the back burner for quite a while.

The reason I chose to run this example under Linux, was exactly to see if all of these issues would go away. Quite evidently, some still need addressing :-)




ScannerBoy

After some more thinking about this issue, a number of other questions came to mind, but I'd best start other threads for each.