[Originally posted by alchemy on 2009-01-13 15:02:43-08]Hello,
With some 'old' JPG files coming from Photoshop 5, the XML generated with -X is not entirely valid UTF-8:
...
<File:Comment>File written by Adobe Photoshop? 5.0</File:Comment>
...
<ICC_Profile:ProfileDescription>RVB personnalis?</ICC_Profile:ProfileDescription>
...
The two bad bytes (0xA8 and 0x8E) have been replaced by '?' in this text.
Thanks for this great program. Please let me know if you want one of these old files for testing.
[Originally posted by exiftool on 2009-01-13 16:03:17-08]
I was just doing some research on this. I'm not sure of the best way
to proceed. Unfortunately, the XML specification is overly restrictive,
and doesn't allow for the binary data that exiftool returns. This
causes two problems:
1) ExifTool cannot translate all values to UTF-8 because the encoding
for some values is not known (e.g. the Comment value). So exiftool writes
the raw character data, which makes the XML not valid UTF-8.
for two reasons:
2) Since exiftool will output some values without translating, it may
also currently output characters which are not valid in XML (#0-#8,
#b, #c and #e-#1f). I suppose I should probably filter these out,
and I may add this in the future, but it still leaves problem number
1) unsolved.
If anyone has any suggestions, I'd like to hear them.
- Phil
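The filtering mentioned in point 2) could look roughly like the sketch below. This is only an illustration in Python, not ExifTool's own code; the name `strip_invalid_xml_chars` is made up for the example. It removes exactly the code points listed above (#0-#8, #b, #c and #e-#1f) and leaves tab, LF and CR alone, since XML 1.0 permits those.

```python
import re

# Control characters that XML 1.0 does not allow: #x0-#x8, #xB, #xC, #xE-#x1F.
# Tab (#x9), LF (#xA) and CR (#xD) are legal and are left untouched.
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with the XML-invalid control characters removed."""
    return INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    print(repr(strip_invalid_xml_chars("line1\r\nbell\x07 here\x00")))
```

Note that this only addresses problem 2): a value in an unknown 8-bit encoding would still not be valid UTF-8 after filtering.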
[Originally posted by exiftool on 2009-01-13 17:19:36-08]
I wish this forum had an edit feature, the text
"for two reasons:" should be deleted from
my last post.
I have thought about this some more, and will add a filter to
remove the invalid XML control characters, which will solve
problem number 2), but that doesn't address your original
problem of invalid UTF-8.
- Phil
[Originally posted by alchemy on 2009-01-13 18:14:01-08]
Thank you very much for such a fast answer!
I understand the problem with those 'unknown encoding' fields (I've encountered some files with mixed ANSI/MacRoman charsets in IPTC fields...).
There is no reliable way to 'guess' the charset of binary text, but:
- I 'know' my binary string is readable Western text encoded in 8-bit ANSI (Windows) OR 8-bit MacRoman (Mac) OR UTF-8.
- an 'invalid' XML file can still be loaded (at least in PHP) by setting $dom->recover=true (see DOMDocument)
- if a (binary-text) field value IS valid UTF-8, I assume it is UTF-8 (the risk that two consecutive characters in 8-bit text happen to form a valid 2-byte UTF-8 sequence is really low)
- if not (so I assume it is an 8-bit charset), I try to distinguish between ANSI and MacRoman by searching for a few discriminating chars (e.g. chars from 0x80 to 0xA0 are 'characters' in MacRoman, but 'graphics' in ANSI).
Like every empirical method, this is not perfect:
The main problem is not being unable to guess the charset, but the risk of guessing the wrong charset and thus destroying the original information.
I think that filtering (deleting or replacing) the non-UTF-8 bytes (in fact the non-UTF-8 sequences...) in order to build valid UTF-8 XML is not a good idea, as it will result in a loss of information.
Encoding (escaping) those bytes in a reversible way might be a better solution.
thanks again
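The heuristic described in this post can be sketched in a few lines of Python. This is an illustration under assumptions, not the poster's actual PHP code: the function name `guess_decode` is invented, 'ANSI' is taken to mean Windows-1252 (`cp1252`), and the only discriminating test is whether any byte falls in 0x80-0x9F.

```python
from typing import Tuple


def guess_decode(raw: bytes) -> Tuple[str, str]:
    """Guess among UTF-8, MacRoman and Windows-1252; return (text, charset)."""
    # Step 1: if the bytes are valid UTF-8, assume they are UTF-8.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 2: bytes 0x80-0x9F are accented letters in MacRoman, so their
    # presence is treated as a hint for MacRoman (assumption of this sketch).
    if any(0x80 <= b <= 0x9F for b in raw):
        return raw.decode("mac_roman"), "mac_roman"
    # Step 3: otherwise fall back to Windows-1252.
    return raw.decode("cp1252", errors="replace"), "cp1252"


if __name__ == "__main__":
    print(guess_decode(b"RVB personnalis\x8e"))
    print(guess_decode(b"Adobe Photoshop\xa8 5.0"))
```

The second call shows the risk mentioned above: byte 0xA8 is outside 0x80-0x9F, so the sketch picks cp1252 and yields '¨', whereas MacRoman would have given '®'. A wrong guess silently changes the text, which is why a reversible encoding is the safer option.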
[Originally posted by exiftool on 2009-01-13 19:42:04-08]What would you think if malformed UTF-8 values (as well as values
containing control characters that are invalid in XML) were encoded like this:
<File:Comment et:encoding='base64'>RmlsZSB3cml0dGVuIGJ5IEFkb2JlIFBob3Rvc2hvcKggNS4w</File:Comment>
I am now thinking that this may be a good alternative.
- Phil
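A minimal sketch of this proposal in Python, for illustration only (ExifTool itself is written in Perl, and this is not its implementation). The helper name `xml_element` is hypothetical; the rule it applies is the one stated above: a value is written as text only if it is well-formed UTF-8 and free of XML-invalid control characters, otherwise it is base64-encoded and flagged with et:encoding='base64'.

```python
import base64
import re
from xml.sax.saxutils import escape

# Control bytes that are not allowed in XML 1.0 text.
INVALID_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def xml_element(tag: str, raw: bytes) -> str:
    """Serialize one value, falling back to base64 when it is not XML-safe."""
    try:
        text = raw.decode("utf-8")
        safe = INVALID_XML_BYTES.search(raw) is None
    except UnicodeDecodeError:
        safe = False
    if safe:
        # escape() handles &, < and > in ordinary text values.
        return "<%s>%s</%s>" % (tag, escape(text), tag)
    encoded = base64.b64encode(raw).decode("ascii")
    return "<%s et:encoding='base64'>%s</%s>" % (tag, encoded, tag)


if __name__ == "__main__":
    print(xml_element("File:Comment", b"File written by Adobe Photoshop\xa8 5.0"))
```

Because base64 is reversible, the original bytes (including 0xA8) survive, and the reader can decide later which charset to apply.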
[Originally posted by alchemy on 2009-01-19 16:12:58-08]
hello,
Your idea of encoding 'unknown charset' fields in base64 might be a good (the only?) solution.
I spent a long time playing with many programs/versions/OSes, writing metadata to JPG files.
I have found a file(*) generating 3 different charsets (yes, from SAME file) when dumped with -X option.
It's almost impossible to load the exiftool output in php:dom, even after 'hard-changing' the encoding in the xml text before loading (tried 'windows-1252', 'macintosh', 'iso-8859-1'...).
Since the xml can't be loaded, there is no chance to 'repair' a bad-encoded field.
by the way...
I tried the -X option to work around a problem with the -t option, where some important control chars (CR/LF...) are replaced by '.' in the output (not reversible).
So exporting values in base64 could be useful in many situations.
Another solution could be a 'C-like' escaping method ('\r', '\n', '\xxx',... '\\') for special chars.
PS: I will send you the file by mail. It contains French metadata, where the character 'ê' breaks the encoding.
thanks again for your help.
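The 'C-like' escaping alternative mentioned above could be sketched as follows. This is a hypothetical illustration, not an ExifTool feature: the function names are invented, and the '\xxx' form is assumed here to mean a two-digit hexadecimal escape ('\x8e'). The point is that, unlike replacing bytes with '.', the transformation can be undone exactly.

```python
def c_escape(raw: bytes) -> str:
    """Escape bytes into printable ASCII using \\r, \\n, \\\\ and \\xHH."""
    out = []
    for b in raw:
        if b == 0x5C:                # backslash must be escaped itself
            out.append("\\\\")
        elif b == 0x0D:
            out.append("\\r")
        elif b == 0x0A:
            out.append("\\n")
        elif 0x20 <= b <= 0x7E:      # printable ASCII passes through
            out.append(chr(b))
        else:                        # everything else becomes \xHH
            out.append("\\x%02x" % b)
    return "".join(out)


def c_unescape(text: str) -> bytes:
    """Invert c_escape(); expects input produced by c_escape()."""
    simple = {"r": 0x0D, "n": 0x0A, "\\": 0x5C}
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(ord(text[i]))
            i += 1
        elif text[i + 1] == "x":
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:
            out.append(simple[text[i + 1]])
            i += 2
    return bytes(out)


if __name__ == "__main__":
    original = b"line1\r\nRVB personnalis\x8e"
    escaped = c_escape(original)
    print(escaped)
    print(c_unescape(escaped) == original)
```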
[Originally posted by exiftool on 2009-01-19 16:23:28-08]
ExifTool 7.62 (already released) implements this base64 encoding of
all non-UTF8 strings. This should fix all problems with invalid XML.
- Phil
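On the reading side, a consumer of the -X output then has to decode the flagged values itself. The sketch below shows one way to do that with Python's standard library. The XML snippet is a hand-written minimal sample rather than real exiftool output, and `read_values` is a made-up helper name; the et namespace URI is the one that appears in the output quoted later in this thread.

```python
import base64
import xml.etree.ElementTree as ET

# ElementTree spells a namespaced attribute as "{namespace-uri}localname".
ET_ENCODING = "{http://ns.exiftool.ca/1.0/}encoding"

# Hand-written sample for illustration, not actual exiftool output.
SAMPLE = (
    "<rdf:Description xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' "
    "xmlns:et='http://ns.exiftool.ca/1.0/' "
    "xmlns:File='http://ns.exiftool.ca/File/1.0/'>"
    "<File:Comment et:encoding='base64'>"
    "RmlsZSB3cml0dGVuIGJ5IEFkb2JlIFBob3Rvc2hvcKggNS4w"
    "</File:Comment>"
    "<File:FileType>JPEG</File:FileType>"
    "</rdf:Description>"
)


def read_values(xml_text: str) -> dict:
    """Map each child tag to str, or to bytes when it was base64-encoded."""
    values = {}
    for element in ET.fromstring(xml_text):
        text = element.text or ""
        if element.get(ET_ENCODING) == "base64":
            values[element.tag] = base64.b64decode(text)
        else:
            values[element.tag] = text
    return values


if __name__ == "__main__":
    for tag, value in read_values(SAMPLE).items():
        print(tag, repr(value))
```

The decoded Comment comes back as raw bytes, so the charset guess (MacRoman, Windows-1252, ...) is left to the application instead of being baked into the XML.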
[Originally posted by crowleym on 2009-02-06 11:33:24-08]Hi
Further to this, I have a file that is generating invalid XML, because the TAG NAME contains a '#'.
Output is:
<rdf:Description rdf:about='../../videos/00 - Here We Go Again.mp4'
xmlns:et='http://ns.exiftool.ca/1.0/' et:toolkit='Image::ExifTool 7.63'
xmlns:File='http://ns.exiftool.ca/File/1.0/'
xmlns:QuickTime='http://ns.exiftool.ca/QuickTime/QuickTime/1.0/'
xmlns:Track#='http://ns.exiftool.ca/QuickTime/Track#/1.0/'
xmlns:Track1='http://ns.exiftool.ca/QuickTime/Track1/1.0/'
xmlns:Track2='http://ns.exiftool.ca/QuickTime/Track2/1.0/'
xmlns:Composite='http://ns.exiftool.ca/Composite/1.0/'>
.
.
.
<QuickTime:NextTrackID
et:desc='Next Track ID'>3</QuickTime:NextTrackID>
<Track#:CreateDate
et:desc='Create Date'>0000:00:00 00:00:00</Track#:CreateDate>
<Track#:ModifyDate
et:desc='Modify Date'>0000:00:00 00:00:00</Track#:ModifyDate>
<Track1:TrackVersion
et:desc='Track Version'>0</Track1:TrackVersion>
Note: Invalid XML Element Namespace "Track#".
Presumably the metadata contains this value, but it does mean any file can potentially cause an issue.
I am using 7.63, but have also tested with 7.63.
Any ideas? Happy to supply the file for testing...
Thanks.
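A quick way to spot this kind of problem before handing the output to an XML parser is to check the declared namespace prefixes against the XML name rules. The Python sketch below is a simplified diagnostic, not part of ExifTool: `invalid_prefixes` is an invented name, and the pattern only covers the ASCII subset of legal prefix characters (a letter or underscore first, then letters, digits, '_', '.' or '-').

```python
import re

# Simplified, ASCII-only approximation of a valid namespace prefix.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")
# Captures PREFIX from every xmlns:PREFIX=... declaration.
XMLNS_DECL = re.compile(r"xmlns:([^\s=]+)\s*=")


def invalid_prefixes(xml_text: str) -> list:
    """Return declared namespace prefixes that are not valid XML names."""
    return [p for p in XMLNS_DECL.findall(xml_text)
            if not NCNAME_ASCII.match(p)]


if __name__ == "__main__":
    header = (
        "xmlns:QuickTime='http://ns.exiftool.ca/QuickTime/QuickTime/1.0/'\n"
        "xmlns:Track#='http://ns.exiftool.ca/QuickTime/Track#/1.0/'\n"
        "xmlns:Track1='http://ns.exiftool.ca/QuickTime/Track1/1.0/'\n"
    )
    print(invalid_prefixes(header))
```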
[Originally posted by exiftool on 2009-02-06 12:11:31-08]
This is not supposed to happen, the # should be replaced
with the track number. If you could send the file to
philharvey66 at gmail.com I will figure out what is happening.
Thanks.
[Originally posted by exiftool on 2009-02-06 16:52:53-08]
Thanks for the sample. I have located the problem and it
will be fixed in version 7.66.
- Phil