[Originally posted by exiftool on 2007-04-16 18:06:35-07] I already use binmode() for all binary files, so this isn't the problem.
The Perl UTF support is a real headache for me, and has broken my code a number of times in the past since ExifTool deals with binary data and not characters.
As I understand it, "use encoding" has global scope. It therefore affects the encoding of character strings, etc. in the ExifTool module, which is very bad and could break many things.
I've only looked into this briefly, but the following excerpt from the documentation is very worrying:
use encoding "iso 8859-7";
# \xDF in ISO 8859-7 (Greek) is \x{3af} in Unicode.
$a = "\xDF";
printf "%#x\n", ord($a); # will print 0x3af, not 0xdf
ExifTool is written with the assumption that if $a = "\xdf" then "print $a"
will print "\xdf". If not, all hell will break loose.
But I will look into this to see if I can find a solution.
- Phil
[Originally posted by exiftool on 2007-04-18 13:15:14-07] To answer this question definitively, I consulted a higher authority:
It is our good fortune that Tom Christiansen (co-author of "Programming Perl")
is an ExifTool user and contributor, and he was generous enough to explain
the situation in some detail. The bottom line is that adding a "use bytes"
statement in ExifTool.pm should fix things for us.
This update will appear in ExifTool 6.87 when it is released.
He goes on to point out that this doesn't always fix everything, but if I
understand things correctly the remaining problems are due to the fact that
the encoding still applies in the calling script and to STDIN/STDOUT, so the
calling script must be careful and set binmode as required on these
filehandles.
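As a rough illustration of why the layer on a filehandle matters for binary data, the sketch below writes the same two bytes through an in-memory filehandle twice, once with :raw and once with a UTF-8 encoding layer. The helper name write_through is made up for this example and is not part of ExifTool, and the in-memory handle only stands in for STDOUT.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ExifTool): write $data through an
# in-memory filehandle opened with the given PerlIO layer and return
# the bytes that actually arrive in the buffer.
sub write_through {
    my ($layer, $data) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer)
        or die "Can't open in-memory file: $!";
    print {$fh} $data;
    close($fh) or die "Can't close in-memory file: $!";
    return $buffer;
}

my $marker = "\xff\x00";    # two bytes of binary data

my $raw     = write_through(':raw', $marker);
my $encoded = write_through(':encoding(UTF-8)', $marker);

printf "raw layer:   %d bytes\n", length($raw);      # 2
printf "utf-8 layer: %d bytes\n", length($encoded);  # 3 (\xff became \xc3\xbf)
```

In a real script the equivalent step would presumably be a plain binmode(STDOUT) before printing binary data.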
Below I attach the full text of the email from Tom Christiansen for those
who may be interested in the details:
- Phil
------------------------------------------------------------
[Tom writes]
The short answer is: put "use bytes;" in Image/ExifTool.pm, right below
where you pull in File::RandomAccess and before the rest of the file.
The long answer is, well, longer.
What's happening is that your
"\xff" . chr($marker)
code (amongst other places) is being shamelessly "promoted" into
a Unicode-encoded string, starting from an assumed ISO-8859-1.
This will produce now a 4-byte string that is "\357\277\275\0".
This is of course completely nuts.
The bug IMHO is that use encoding is not lexically scoped. It
affects code everywhere, and you should not think that what it
says about "no encoding" actually doing you much good, or placing
the "use encoding" after the modules are sucked in. That's not
good enough. For example
There's a CPAN module called "encoding::warnings" that will
find these for you. Add "use encoding::warnings" instead of use
bytes, and you will find this:
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 748
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 749
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2178
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2179
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2317
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2320
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2320
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2489
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2522
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2529
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2579
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2607
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2608
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2658
I believe you can do something at all those points to get it to behave
better involving calls to specific encode or decode routines from Encode,
but it's easiest to just say use bytes.
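For what the explicit route might look like, here is a minimal sketch that keeps binary data and text apart by encoding the text with Encode before the two are joined. It does not use "use encoding" at all (so it also runs on Perl versions where that pragma is no longer supported), and build_record is a hypothetical helper, not ExifTool code.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not ExifTool code): join a binary marker and a
# text value into one byte string, encoding the text explicitly so that
# Perl never has to guess how to combine the two.
sub build_record {
    my ($marker, $text) = @_;
    return $marker . encode('UTF-8', $text);
}

my $marker = "\xff\xd8";          # binary data, 2 bytes
my $title  = "\x{263A} smile";    # text containing a wide character

# Naive concatenation turns the whole result into a character string.
my $naive  = $marker . $title;
my $record = build_record($marker, $title);

printf "naive:  utf8 flag %s\n", utf8::is_utf8($naive) ? 'on' : 'off';
printf "record: utf8 flag %s, %d bytes\n",
    utf8::is_utf8($record) ? 'on' : 'off', length($record);
```

The record stays a plain byte string, so writing it to a :raw filehandle should produce exactly the bytes that were built.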
Now, it's not *always* enough to just place a use bytes in your own module
code. For example
# Module BadEnc.pm
use bytes;
my $i = 0;
sub main::func { return "\xff" . chr($i) }
1;
BTW, you get a different answer writing
sub main::func { return "\xff" . chr(0) }
than the chr($i), because it gets optimized into what is effectively
sub main::func { return "\xff\x00" }
and so gets encoded up differently in the implicit conversion.
You can then run this:
use BadEnc;
use encoding "utf8";
$word = func();
for $i ( 0 .. (bytes::length($word)-1) ) {
    printf "char #%d has code point %d\n", $i, bytes::ord (bytes::substr($word, $i, 1))
}
for $i ( 0 .. (length($word)-1) ) {
    printf "char #%d has code point %d\n", $i, ord (substr($word, $i, 1))
}
print $word;
And you'll see that you still have a problem
char #0 has code point 255
char #1 has code point 0
char #0 has code point 65533
char #1 has code point 0
I've omitted the last line of output, but it is what gave me the
string that I ran through "od -c" to find that it's "\357\277\275\0".
BTW, placing use encoding::warnings into BadEnc.pm gives this now as output:
Bytes implicitly upgraded into wide characters as iso-8859-1 at BadEnc.pm line 4
char #0 has code point 195
char #1 has code point 191
char #2 has code point 0
char #0 has code point 255
char #1 has code point 0
Which shows you would still have a problem.
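The difference between the character view and the byte view of an upgraded string can be reproduced without "use encoding". The sketch below uses utf8::upgrade as a stand-in for the implicit promotion (an assumption made for illustration, not what the pragma itself does internally), and code_points is a hypothetical helper name.

```perl
use strict;
use warnings;
use bytes ();    # load the bytes:: functions without enabling byte semantics

# Hypothetical helper: return the code points of a string as seen with
# character semantics, and the values of its internal bytes.
sub code_points {
    my ($str) = @_;
    my @chars = map { ord(substr($str, $_, 1)) } 0 .. length($str) - 1;
    my @bytes = map { bytes::ord(bytes::substr($str, $_, 1)) }
                0 .. bytes::length($str) - 1;
    return (\@chars, \@bytes);
}

my $word = "\xff" . chr(0);
utf8::upgrade($word);    # stand-in for the implicit promotion

my ($chars, $bytes) = code_points($word);
print "characters: @$chars\n";    # 255 0
print "bytes:      @$bytes\n";    # 195 191 0
```

The bytes:: functions are meant mainly for debugging like this; they expose Perl's internal representation rather than anything a program should normally rely on.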
I did test the "use bytes" in Image/ExifTool.pm, and it makes you suddenly
able to read JPEGs correctly again.
Here's my little test program:
#!/usr/bin/perl
use strict;
use warnings;
use Image::ExifTool qw(ImageInfo);
use encoding 'utf8'; # breaks JPEG (etc) reading
if (!@ARGV) {
    die "usage: $0 filename ...\n";
}
for my $filename (@ARGV) {
    my $info = ImageInfo($filename);
    # ExifTool reports problems under the "Error" and "Warning" tags
    if (my $error = $info->{Error}) {
        warn "Can't parse image info on file $filename: $error\n";
        next;
    }
    if (my $oops = $info->{Warning}) {
        warn "WARNING: Can't parse image info on file $filename: $oops\n";
        # fallthrough
    }
    printf "%s is size %s\n", $filename, $info->{ImageSize};
}
If you add "use bytes" to your module, the code above will now work fine,
even in the presence of the noxious use encodings.
Below is my somewhat provocative message to p5p about it, where I don't
come right out and say what the trouble is. I'm trying to see how many
folks see the problem right away.
Supposedly there are plans to make use encoding a lexically scoped pragma
in 5.9, but I don't know how far along that is.
--tom
------- Forwarded Message
=for your consideration,
The surrounding module, Smack.pm, runs perfectly well, as demonstrated by
% perl -MSmack -e 'smack && snarf && print "hurray!\n" '
after clipping this message body and placing it in the obviously-named
file: the expected "hurray!" is indeed correctly emitted. But it is
a false cheer, for this innocent module nevertheless has a bug or two
lurking in it.
Can you* see the problem? If so, is it really glaringly obvious to
everyone but me? Has awareness of this niggling nasty passed into
general understanding?
I don't think so, but could of course be wrong. Yet even if I *am*
mistaken (and so more of you will say "Duh, Tom!" than who will say
"D'oh, Perl!"), what are the poor module writers realistically supposed
to do about this? Must they retroactively insulate themselves from this
strange-action-at-a-distance bug? This vexing matter may well not
even have existed back when they wrote their innocent module.
Why must module writers understand this? I really can't see how it's
their fault. Anything that forces everybody else all to go off and
change their existing module code can't be a good thing. I would argue
therefore that the fault lies not in these modules, but elsewhere entirely--
pragmatically speaking, that is.
No?
--tom
=cut
package Smack;
use strict;
use warnings;
use Exporter;
our @ISA = 'Exporter';
our @EXPORT = qw(smack snarf);
my $DEFNAME = "bindata";
my $NULL = chr(my $bye = 0);
my $RECSEP = "\xff" . $NULL;
sub smack {
    my $file = @_ ? shift : $DEFNAME;
    open(BINDATA, "> :raw", $file) || die "Can't smack > $file: $!";
    {
        local $\ = $RECSEP;
        print BINDATA "line one";
        print BINDATA "line two";
    }
    close(BINDATA) || die "can't close $file: $!";
    printf "%s is size %d (should be 20)\n" , $file, -s $file;
    return 1;
}
sub snarf {
    my $file = @_ ? shift : $DEFNAME;
    open(BINDATA, "< :raw", $file) || die "Can't snarf < $file: $!";
    local $/ = $RECSEP;
    local $_;
    while (<BINDATA>) {
        chomp;
        printf "line %d of %s: %s\n", $., $file, $_;
    }
    close BINDATA;
    return 1;
}
1;
# * where you != Audrey
------- End of Forwarded Message
[Originally posted by exiftool on 2007-04-19 15:01:58-07] This problem has resulted in an interesting thread in the perl5 developers
mailing list (see here).
Two interesting comments from this exchange are:
[Tom Christiansen wrote:]
I have a hunch that the reporting user's trouble is partially that he may
misunderstand the purpose of "use encoding 'utf8'", which he reports using
because:
In a Webapplication I have to translate/normalize all incoming text to UTF-8.
I think he probably should simply be setting the encoding on the stream,
not on the script, whether via binmoding STDOUT => utf8 or any of the
equivalent mechanisms.
[Yves wrote:]
[...] as a result of your probing this at this point id say that
encoding.pm is sufficiently broken that its not worth using. IMO we
should split it into two new packages, fix the bugs at the same time
and advertise encoding.pm as deprecated in favour of the new
functionality.
So at the very least you should exercise caution with "use encoding",
and it may be a good idea to investigate the alternatives as Tom has
suggested, but regardless I believe I have fixed the associated problems
wrt ExifTool.
- Phil
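For anyone wanting to try the stream-based alternative Tom describes, here is a minimal sketch of normalizing incoming text to UTF-8 without "use encoding": the input bytes are decoded explicitly, and the encoding is applied by a layer on the output handle. The helper normalize_to_utf8 and the ISO-8859-1 input are assumptions made for the example, and an in-memory handle stands in for STDOUT.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode incoming bytes from a known source
# encoding, then let an output layer re-encode the text as UTF-8.
sub normalize_to_utf8 {
    my ($bytes, $source_encoding) = @_;
    my $text = decode($source_encoding, $bytes);    # bytes -> characters
    my $out  = '';
    open(my $fh, '>:encoding(UTF-8)', \$out)
        or die "Can't open in-memory file: $!";
    print {$fh} $text;                              # characters -> UTF-8 bytes
    close($fh) or die "Can't close in-memory file: $!";
    return $out;
}

my $latin1 = "caf\xe9";    # "cafe" with an acute accent, as ISO-8859-1 bytes
my $utf8   = normalize_to_utf8($latin1, 'ISO-8859-1');

printf "%d bytes in, %d bytes out\n", length($latin1), length($utf8);
```

In a real script the same effect on standard output would come from binmode(STDOUT, ':encoding(UTF-8)'), which leaves string literals and binary data elsewhere in the program untouched.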