Duplicate Carriage Return (0x0D) when extracting to XML with -X

Started by Mac2, August 08, 2011, 11:51:54 AM


Mac2

Hello, I would be grateful for some help with this:

I'm using ET 8.61 under Windows, extracting metadata from image files by driving the ET command line app via -@ argfiles.

When I extract -xmp:Description from a JPEG file via

exiftool -X -xmp:description "1.jpg" > 1.xml

and the description contains carriage-return/linefeed pairs (0x0D,0x0A), the resulting XML file contains 0x0D,0x0D,0x0A for each of these pairs, duplicating the 0x0D.
This also happens when I use -b. It does not happen when I export to JSON format or to plain text using -b.

Why does the 0x0D come out duplicated?


Phil Harvey

ExifTool is writing 0x0d+0x0a and the 0x0a is getting translated to 0x0d+0x0a by the Windows write library function.  But this will happen only for text-mode output files.  The console output should be set to binary mode with the -b option, so I don't understand why this is happening with -b.  From your tests, it looks like something specific to the XMP writer, but offhand I can't think of what could cause this.
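The doubling described above can be reproduced without a Windows machine. This is only a minimal sketch in Python, assuming a text-mode write simply rewrites every 0x0a as 0x0d+0x0a; the function name is illustrative and not part of ExifTool:

```python
def windows_text_mode(data: bytes) -> bytes:
    """Simulate a Windows text-mode write: every LF becomes CR+LF."""
    return data.replace(b"\n", b"\r\n")


# A tag value that already contains a CR+LF pair
value = b"line one\r\nline two"

# The existing CR is left alone and a second CR is added in front of the LF
written = windows_text_mode(value)
print(written)  # b'line one\r\r\nline two'
```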

It will be a while before I have access to a Windows machine to test this, but I will put this problem on my list and get to it as soon as I can.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Mac2

Thanks Phil. Not that urgent, I can easily filter out the extra 0x0D. Enjoy your vacation  :D

I made some additional tests to confirm my findings.
The source image has single 0x0d,0x0a pairs for XMP description and IPTC caption/abstract tags.
I created the metadata in this file via ExifTool.

When reading the file with -X every 0x0d is duplicated in the resulting XML file, even if -b is added (which I don't want).
JSON output is correct and properly emits \r\n for these CRLF pairs.
Text output (redirected from the command line in Windows via > ) using -b also has the correct 0x0d,0x0a pairs.

Phil Harvey

Quote from: Mac2 on August 09, 2011, 04:38:57 AM
When reading the file with -X every 0x0d is duplicated in the resulting XML file, even if -b is added (which I don't want).

You don't want to use the -b option?  I would understand this, but exiftool really must write to console in text mode by default, so I don't see a way around it.

This is a bit of a tricky problem.

- Phil

Mac2

I'm not against using -b on principle  :)

But using -b creates tons of extra data in the XML file which I usually don't want. (And the CRLF pairs in xmp-dc:description or IPTC:caption/abstract are not base64-encoded anyway). I use only XML as the output/transfer format, I never dump to the Windows console.

I work with RAW files of all sorts. Extracting everything from a .NEF into XML with -b, for example, pulls in a lot of binary data that ExifTool otherwise just replaces with a "use -b to extract" notice. Those notices do little harm, but the full binary payload does.

My simple "give me all tags you know as XML" is a neat solution and it works now and in the future.

With -b the XML files become much larger.
For example, the ExifTool output for a NEF without -b gives a 50 KB file. With -b the XML file becomes over 2 MB.

Handling this would require me to maintain some sort of exclusion list of tags I don't want in the XML output, for many different current and future RAW and other formats. ExifTool cannot know which tags I'm not interested in when I specify -b, so I would need to feed in --TAG arguments to skip the unwanted binary payload. I'm not sure that maintaining such lists for the many ever-changing formats out there would be ideal.
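To make the maintenance burden concrete, here is a rough sketch of what generating such an argfile might look like in Python. The helper name is hypothetical, and the tag names are only examples of large binary tags that one might choose to skip; they are an assumption, not a complete or recommended list:

```python
def build_argfile(image_path: str, excluded_tags: list[str]) -> str:
    """Return text for a -@ argfile: one argument per line."""
    args = ["-X", "-b"]
    # --TAG excludes a tag from the extracted output
    args += [f"--{tag}" for tag in excluded_tags]
    args.append(image_path)
    return "\n".join(args) + "\n"


# Example exclusion list (assumed for illustration); it would have to grow with every new format
print(build_argfile("1.nef", ["PreviewImage", "JpgFromRaw", "ThumbnailImage"]))
```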

That's why I said I don't want to use -b  :o


Phil Harvey

I understand.

To solve this properly may require a new exiftool option which allows you to specify a file for the console output.  Of course, as soon as I make this a binary file then Windows people will complain about the newlines not being CR/LF.

But I'll think about this.

- Phil

Mac2

Hi, Phil

I'm on Windows, so I may complain too  ::)

Nah, frankly I only want to get the CR/LF correct. Or if ExifTool always emits LF only, I can live with that too. I can then translate to CRLF for Windows on import.

I just made ExifTool accept CR/LF and " in text data in argfiles by HTML-encoding them and adding the -ex option.
They show up in the image, all correct now.

When I now get them back in the XML output I'll be a happy camper.
Or, if you define that 0x0d,0x0a will always come out as 0x0d,0x0d,0x0a in XML, I can add a cleanup routine and be done.
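Such a cleanup routine could be as small as the following Python sketch. The function name is hypothetical, and it assumes the XML file is read as raw bytes. One caveat: a value that genuinely contained a lone 0x0d directly before a 0x0d,0x0a pair would be collapsed as well.

```python
def collapse_doubled_cr(xml_bytes: bytes) -> bytes:
    """Turn every CR,CR,LF sequence back into a single CR,LF pair."""
    return xml_bytes.replace(b"\r\r\n", b"\r\n")


raw = b"first line\r\r\nsecond line\r\r\n"
print(collapse_doubled_cr(raw))  # b'first line\r\nsecond line\r\n'
```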

Or I look into the JSON route. I have all the XML import already working, but JSON is not hard to parse and it emits the CRLF nicely as \r\n without a chance for the Windows console / redirection mechanisms to interfere. May have other drawbacks, though. I dunno yet.
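For comparison, a minimal sketch of the JSON route in Python. The sample string is hand-written for illustration and only assumed to resemble what a JSON export of a single description tag might contain. Because the line break travels as the escape sequence \r\n, text-mode output has no raw 0x0a inside the value to rewrite:

```python
import json

# Illustrative sample: an array with one object per file (assumed shape)
sample = '[{"SourceFile": "1.jpg", "Description": "first line\\r\\nsecond line"}]'

records = json.loads(sample)
description = records[0]["Description"]

# The escapes decode to one real CR and one real LF
print(repr(description))  # 'first line\r\nsecond line'
```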


Phil Harvey

Sorry for the delay, but I needed to wait until I had access to a Windows machine for testing.

There is a simple patch I can apply to fix this in the -X output in Windows.  I will do this, and ExifTool 8.62 will contain this patch when it is released.  The patch involves simply converting 0x0d+0x0a to 0x0a in tag values before writing them to the console (on Windows systems only).  Windows will then convert them back to 0x0d+0x0a.

- Phil

Mac2

Thanks, Phil  :)

The Windows console is always special. Sigh.  :-\

Perhaps a later version of ExifTool should consider an option to directly write output to a file, instead of using the console.
This would give you full control over all aspects of the output data, including binary data and non-printable characters. I don't know how much work this would cause for you, though.

Phil Harvey

Actually, I have the same control over console output, but since XML is text based I leave the console in text mode for XML output even when the -b option is used.  (I didn't remember this before, but it makes sense that -b doesn't help here.)

And as you may know, the -w option already allows you to send the console output to a file.  But this option makes a separate file for each input file, and what we need is one output file for all input files.  This is actually easy to do, but would require adding a new option, which I don't like to do.  If I could re-design the -w option, I could easily add this feature, but I would have to break the backward compatibility, which I also don't like to do (even less than adding a new option).

- Phil

Mac2

Quote from: Phil Harvey
which I also don't like to do (even less than adding a new option).

I understand. ExifTool has quite a lot of options already. That's coming from someone who just had to read all the docs and command line stuff  ::)
I still would see a case for getting around all the console mechanics, e.g. by adding a -wa option (like -w all in one file).

I think that ExifTool is one of the best inventions since sliced bread. Utilizing all that power via the command line just sometimes makes things a bit more difficult - because of how the console works in Windows. A direct parameter input and result output via files would work around that. But I'm sure you're busy enough as it is. Thanks for giving us ExifTool. The more apps and services use it, the better the metadata quality will be. Which benefits us all.

pb

Quote from: Phil Harvey on August 19, 2011, 09:53:58 AM
Sorry for the delay, but I needed to wait until I had access to a Windows machine for testing.

There is a simple patch I can apply to fix this in the -X output in Windows.  I will do this, and ExifTool 8.62 will contain this patch when it is released.  The patch involves simply converting 0x0d+0x0a to 0x0a in tag values before writing them to the console (on Windows systems only).  Windows will then convert them back to 0x0d+0x0a.

- Phil

What happens if the tag value contains only 0x0a in the first place?  Seems like Windows will also convert that, and someone else will complain?  (Which is why I have a general policy to steer clear of any unrequested "favors" Windows does for me.)

Phil Harvey

ExifTool reads and writes all source files in binary mode, which means that the Windows system doesn't mess with the newline characters.  ExifTool will preserve the original newline byte sequence when reading.

The problem is only in the console output, which is text mode by default, so Windows tinkers with these newlines.  But this is unfortunately necessary to be able to display the text properly on Windows systems.

With version 8.62, here is what will happen with the -X option output:

0x0d+0x0a will be written as 0x0d+0x0a

0x0d will be written as 0x0d

0x0a will be written as 0x0d+0x0a <-- this is the standard Windows text conversion
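The three cases above can be checked with a small Python simulation. It is a sketch, not the actual patch: it assumes the fix amounts to replacing 0x0d+0x0a with 0x0a in the value before a text-mode write that turns each 0x0a into 0x0d+0x0a, and both function names are illustrative.

```python
def windows_text_mode(data: bytes) -> bytes:
    """Simulate a Windows text-mode write: every LF becomes CR+LF."""
    return data.replace(b"\n", b"\r\n")


def patched_write(value: bytes) -> bytes:
    """Collapse CR+LF to LF first, then let text mode expand it again."""
    normalized = value.replace(b"\r\n", b"\n")
    return windows_text_mode(normalized)


for value in (b"a\r\nb", b"a\rb", b"a\nb"):
    print(value, "->", patched_write(value))
# b'a\r\nb' -> b'a\r\nb'
# b'a\rb' -> b'a\rb'
# b'a\nb' -> b'a\r\nb'
```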

- Phil
