Getting Malformed URL Characters Running with Code Page 65001

Started by Kenneth Evans, July 16, 2018, 01:41:29 PM


Kenneth Evans

On Windows 10 64-bit I am running ExifTool with the following batch script:

chcp 1252
set EXIFTOOL=c:\bin\EXIFTool\exiftool.exe
set SRC="Coons 2018.orig.jpg"
set DEST="Coons 2018.exiftool3.jpg"
set COPYRIGHT=Copyright © 2018 Kenneth Evans All Rights Reserved
copy %SRC% %DEST%
%EXIFTOOL% -charset utf8 -artist="Kenneth Evans" -copyright="%COPYRIGHT%" -copyrightnotice="%COPYRIGHT%" -rights="%COPYRIGHT%" -UsageTerms="All Rights Reserved" -Marked="true" %DEST%
chcp 65001
%EXIFTOOL% -filename -artist -copyright -copyrightnotice -rights %DEST%


The BAT file is UTF-8 according to Notepad++.  On doing a hex dump, the characters are single-byte except the copyright symbol (C2 A9).

It runs as is and gives the ExifTool output I would like (actual c-in-circle copyright symbols) both in the output with code page 65001 and in ExifToolGui.  However, the code page is Latin and the copyright symbol shows up as Â© in the echo statements from the script.  (My understanding is that ExifTool will be assuming the input is Latin, not UTF-8.)
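
For readers following along, here is a minimal sketch of how the same two bytes can be read as one character or two, depending on the code page. It assumes Python 3 is available, which is not part of the original script, and the variable names are only illustrative.

```python
# UTF-8 bytes for the copyright sign, as seen in the hex dump of the BAT file.
raw = bytes([0xC2, 0xA9])

# Read as UTF-8 (code page 65001): a single character, U+00A9.
as_utf8 = raw.decode("utf-8")

# Read as Windows-1252 (code page 1252): two characters, U+00C2 then U+00A9.
as_cp1252 = raw.decode("cp1252")

# Print code points rather than glyphs so the output does not depend on the console.
print(len(as_utf8), [hex(ord(c)) for c in as_utf8])
print(len(as_cp1252), [hex(ord(c)) for c in as_cp1252])
```

This only models how the bytes are interpreted for display; it says nothing about what the console does internally.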

If I change the first line to chcp 65001 (UTF-8), then the echo output is as you would expect (copyright symbol is a ©), but I get:

Warning: Malformed UTF-8 character(s) - Coons 2018.exiftool3.jpg

and the output is:

Artist                          : Kenneth Evans
Copyright                       : Copyright  2018 Kenneth Evans All Rights Reserved
Copyright Notice                : Copyright
Rights                          : Copyright ? 2018 Kenneth Evans All Rights Reserved

So I get the wrong results when everything is UTF-8 and the right results when the code page is 1252 (Latin).  What am I doing wrong or failing to understand?

Thanks.

Phil Harvey

Your console is set to cp1252 but you have told ExifTool that you are entering characters in UTF8.

It looks like you should use -charset cp1252 instead of -charset utf8 when writing.

- Phil

Edit:  But you say your bat file is UTF8.  Then why are you setting cp1252 and not cp65001 at the start?  I must admit, I haven't tried doing this in a .bat file.
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Kenneth Evans

Thanks for the fast reply.  You would think so but that doesn't work.  (EXIF Copyright has a bad character, other two ok.)  What I wrote is what works.

I've done a lot of reading and trial & error by now.  ;)

I would like to use chcp 65001 and not specify -charset utf8.  Why doesn't that work?

Phil Harvey

I have no idea.  This is really more of a Windows question.

But specifying -charset utf8 should have no effect since this is the default.

- Phil

Kenneth Evans

Removing -charset utf8 doesn't change anything with the script as is (using chcp 1252).  The only real issue is that the command output has Â©, whereas with chcp 65001, it is ©.

It may be a Windows question, but ExifTool is doing something different in the two cases.  With 65001, I see © and get a malformed character.  With 1252 I see a malformed character Â© and get ©.

I would expect it to be the other way around.  It would be nice to understand what is happening, so I don't get bit down the road by doing something that doesn't make sense.

Added later: How does ExifTool determine the input charset?  In both of my cases the bytes it is getting are C2 A9 for ©.

Phil Harvey

The -charset option is how ExifTool determines the input character set.

I'll have to try this in Windows to be able to comment more intelligently on what is happening, but it may be a while before I can do that.

- Phil

Kenneth Evans

Quote from: Phil Harvey on July 16, 2018, 04:21:09 PM
I'll have to try this in Windows to be able to comment more intelligently on what is happening, but it may be a while before I can do that.
Thanks.

Kenneth Evans

I worked on this further.  The BAT file can all be chcp 65001 (UTF-8) except for setting the copyright:

This works:

@chcp 1252 > nul
set COPYRIGHT=Copyright © %YEAR% Kenneth Evans All Rights Reserved
@chcp 65001 > nul


The © in the code excerpt is a 2-byte UTF-8 © and the BAT file itself is UTF-8, done in Notepad++.

So you are right, it seems to be a Windows thing.  I think the console is able to display things in a particular code page and language, but does its own thing under the covers.  I am not an expert and avoid it if I can.

It would be interesting to know what Exiftool gets as input, that is, what causes it to report malformed characters when the script has chcp 65001 at the top and does not use the chcp 1252 from the excerpt.

Phil Harvey

I'm glad you figured this out.  The exiftool -echo option may be useful to see what ExifTool sees.  Try something like this:

exiftool -echo "Copyright ©" > out.txt

You should be able to do the same thing with the built-in "echo" command.

- Phil

Kenneth Evans

I did this:

set YEAR=2018

chcp 1252
set COPYRIGHT=Copyright © %YEAR% Kenneth Evans All Rights Reserved
echo %COPYRIGHT% | hexdump -C
exiftool -echo "Copyright ©" > test1252.txt
hexdump -C test1252.txt

chcp 65001
set COPYRIGHT=Copyright © %YEAR% Kenneth Evans All Rights Reserved
echo %COPYRIGHT% | hexdump -C
exiftool -echo "Copyright ©" > test65001.txt
hexdump -C test65001.txt


These are the results:

C:\bin\EXIFTool>TestChcp.bat

C:\bin\EXIFTool>set YEAR=2018

C:\bin\EXIFTool>chcp 1252
Active code page: 1252

C:\bin\EXIFTool>set COPYRIGHT=Copyright © 2018 Kenneth Evans All Rights Reserved

C:\bin\EXIFTool>echo Copyright © 2018 Kenneth Evans All Rights Reserved   | hexdump -C
00000000  43 6f 70 79 72 69 67 68  74 20 c2 a9 20 32 30 31  |Copyright .. 201|
00000010  38 20 4b 65 6e 6e 65 74  68 20 45 76 61 6e 73 20  |8 Kenneth Evans |
00000020  41 6c 6c 20 52 69 67 68  74 73 20 52 65 73 65 72  |All Rights Reser|
00000030  76 65 64 20 0d 0a                                 |ved ..|
00000036

C:\bin\EXIFTool>exiftool -echo "Copyright ©"  1>test1252.txt

C:\bin\EXIFTool>hexdump -C test1252.txt
00000000  43 6f 70 79 72 69 67 68  74 20 c2 a9 0d 0a        |Copyright ....|
0000000e

C:\bin\EXIFTool>chcp 65001
Active code page: 65001

C:\bin\EXIFTool>set COPYRIGHT=Copyright © 2018 Kenneth Evans All Rights Reserved

C:\bin\EXIFTool>echo Copyright © 2018 Kenneth Evans All Rights Reserved   | hexdump -C
00000000  43 6f 70 79 72 69 67 68  74 20 c2 a9 20 32 30 31  |Copyright .. 201|
00000010  38 20 4b 65 6e 6e 65 74  68 20 45 76 61 6e 73 20  |8 Kenneth Evans |
00000020  41 6c 6c 20 52 69 67 68  74 73 20 52 65 73 65 72  |All Rights Reser|
00000030  76 65 64 20 0d 0a                                 |ved ..|
00000036

C:\bin\EXIFTool>exiftool -echo "Copyright ©"  1>test65001.txt

C:\bin\EXIFTool>hexdump -C test65001.txt
00000000  43 6f 70 79 72 69 67 68  74 20 a9 0d 0a           |Copyright ...|
0000000d

C:\bin\EXIFTool>


So it looks like Exiftool is getting the same bytes either way, but the results of your suggested test are different.  In chcp 65001 it is losing the c2 byte.
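
A quick way to see why the second file would trigger a "Malformed UTF-8" warning is to check both hex dumps for UTF-8 validity. This is a sketch only, assuming Python 3; the helper name is_valid_utf8 is hypothetical, and the bytes are copied from the two dumps above.

```python
# Bytes copied from the hexdump output of test1252.txt and test65001.txt.
bytes_1252 = bytes.fromhex("43 6f 70 79 72 69 67 68 74 20 c2 a9 0d 0a")
bytes_65001 = bytes.fromhex("43 6f 70 79 72 69 67 68 74 20 a9 0d 0a")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# C2 A9 is a complete two-byte sequence; a lone A9 is a continuation byte
# with no lead byte, so it cannot be decoded as UTF-8.
print(is_valid_utf8(bytes_1252))
print(is_valid_utf8(bytes_65001))
```

The lone a9 byte happens to be the copyright sign in Windows-1252, which is consistent with something re-encoding the argument before ExifTool sees it.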

Phil Harvey

Interesting.  Thanks for running this test.

I can't explain the difference.  All I can tell you is that the exiftool -echo command echoes back exactly the characters that exiftool gets from the command line without any recoding (by exiftool that is -- I can't speak for the shell).  Obviously this is somehow different from what the built-in echo command is doing.  I must admit that I really don't understand how the Windows command shell handles character encoding.

- Phil

Kenneth Evans

I also don't understand how the Windows command shell works internally, but I am not seeing anything anomalous from what I would expect in the shell part, just in what Exiftool does.

It looks like Exiftool is getting both bytes of © in either case, based on the shell output lines.  It doesn't make sense that it is dropping the first byte when in chcp 65001, the code page you would expect to work right.

Phil Harvey

Note that FAQ 18 mentions this:

Note that Windows will recode arguments on the command line from the current console code page to the system code page

Which may explain why the c2 is dropped when you chcp 65001.

- Phil
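
The recoding described in FAQ 18 can be modelled in a few lines. This is a sketch under stated assumptions, not a trace of what Windows actually does: it assumes Python 3, assumes the system code page is 1252, and uses a hypothetical function recode_argument to stand in for the console-to-system conversion.

```python
def recode_argument(arg_bytes: bytes, console_cp: str, system_cp: str = "cp1252") -> bytes:
    """Hypothetical model of the FAQ 18 recoding.

    The bytes from the BAT file are interpreted in the console code page,
    then re-encoded in the system code page (assumed here to be cp1252)
    before the program receives them.
    """
    return arg_bytes.decode(console_cp).encode(system_cp)


# The copyright sign as stored in the UTF-8 BAT file.
copyright_utf8 = bytes([0xC2, 0xA9])

# chcp 1252: each byte is treated as its own character and survives the round trip.
under_1252 = recode_argument(copyright_utf8, "cp1252")

# chcp 65001: the two bytes are one character, which is a single byte in cp1252.
under_65001 = recode_argument(copyright_utf8, "utf-8")

print(under_1252.hex())
print(under_65001.hex())
```

Under this model the argument arrives as c2 a9 with chcp 1252 and as a9 with chcp 65001, which matches the two hex dumps from the -echo test.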

Kenneth Evans

Yes, I saw FAQ 18.  It essentially says to use chcp 65001.  ;)

I could be wrong, but I have heard Windows uses wide characters internally.  In any case it should be doing the same thing both ways.  It is my guess that Perl is doing it, but that's just a guess.

In any event I have a workaround.

This is the first I have used Exiftool more than superficially.  I am impressed.  Thanks.