ExifTool Forum

ExifTool => Archives => Topic started by: Archive on May 12, 2010, 08:54:03 AM

Title: Warning: jpeg format error
Post by: Archive on May 12, 2010, 08:54:03 AM
[Originally posted by eck on 2007-04-16 17:34:48-07]

Hello,

I'm running into a problem with ExifTool and UTF-8:

given: ActiveState Perl 5.8.6 Win2000, ExifToolVersion = 6.76

In a web application I have to translate/normalize all incoming text to UTF-8. This works fine, but not with ExifTool.

The necessary "use encoding 'utf8'" at the beginning makes ExifTool issue "JPEG format error" warnings on, you won't believe it, JPEGs :-), so I can't get the image width and height. With a GIF, everything works fine.

If I comment out the "use encoding 'utf8'", the JPEG warnings are gone, but the rest of my app now dumps ugly hieroglyphs.

Patching ExifTool.pm with some binmode()s doesn't really help.

Does anyone have experience using utf8 with ExifTool?
Title: Re: Warning: jpeg format error
Post by: Archive on May 12, 2010, 08:54:03 AM
[Originally posted by exiftool on 2007-04-16 18:06:35-07]

I already use binmode() for all binary files, so this isn't the problem.

The Perl UTF support is a real headache for me, and has broken my code a number of times in the past since ExifTool deals with binary data and not characters.

As I understand it, "use encoding" has global scope.  It therefore affects encoding of character strings, etc in the ExifTool module, which is very bad and could break many things.

I've just looked into this briefly, but the following excerpt from the documentation is very worrying:

Code:
use encoding "iso 8859-7";
# \xDF in ISO 8859-7 (Greek) is \x{3af} in Unicode.
$a = "\xDF";
printf "%#x\n", ord($a); # will print 0x3af, not 0xdf

ExifTool is written with the assumption that if $a = "\xdf" then "print $a"
will print "\xdf".  If not, all hell will break loose.

But I will look into this to see if I can find a solution.

- Phil
Title: Re: Warning: jpeg format error
Post by: Archive on May 12, 2010, 08:54:03 AM
[Originally posted by exiftool on 2007-04-16 18:36:00-07]

I did a bit of playing around, and find that if I put a
"no encoding;" statement at the end of my script then
things work again (and the encoding is still valid for the
script itself).  I'm not really clear on why this happens,
but maybe you could try it.

- Phil
Title: Re: Warning: jpeg format error
Post by: Archive on May 12, 2010, 08:54:03 AM
[Originally posted by eck on 2007-04-17 08:48:15-07]

I also played a little with 'no encoding'. I wrapped the "use Image::ExifTool" and the ImageInfo call with "no encoding" before and "use encoding 'utf8'" afterwards, but got strange results.

And about Unicode: if everyone changed to UTF-8, we would get rid of all that charset rubbish. I'm from Europe, and UTF-8 is a pleasure compared with ISO-8859-15 (we need the euro sign, which is not present in ISO-8859-1), ISO-8859-16 (Polish characters) or ISO-8859-9 (Turkish characters). So the first thing I do is convert everything to UTF-8. Unicode is simply a must.

US residents, with ASCII only and no special characters or umlauts, have a really easy life...

I will try the "no encoding;" statement at the end of my script in the next few hours...
Title: Re: Warning: jpeg format error
Post by: Archive on May 12, 2010, 08:54:03 AM
[Originally posted by eck on 2007-04-17 15:20:47-07]

OK, "no encoding" at the end of a stand-alone or CGI script works fine,
but not inside Apache with mod_perl.

So this is not the final solution ;-)
Title: Re: Encoding problems (was Warning: jpeg format error)
Post by: Archive on May 12, 2010, 08:54:03 AM
[Originally posted by exiftool on 2007-04-18 13:15:14-07]

To answer this question definitively, I consulted a higher authority:

It is our good fortune that Tom Christiansen (co-author of "Programming Perl")
is an ExifTool user and contributor, and he was generous enough to explain
the situation in some detail.  The bottom line is that adding a "use bytes"
statement in ExifTool.pm should fix things for us.

This update will appear in ExifTool 6.87 when it is released.

He goes on to point out that this doesn't always fix everything, but if I
understand things correctly the remaining problems are due to the fact that
the encoding still applies in the calling script and to STDIN/STDOUT, so the
calling script must be careful and set binmode as required on these
filehandles.
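
The same precaution applies to any handle that carries binary data, not only STDIN/STDOUT. As a minimal sketch (not ExifTool's own code; the helper name starts_with_jpeg_soi and the temporary file standing in for a JPEG are assumptions made for illustration), a calling script might read the first two bytes of a file in binary mode and compare them with the JPEG start-of-image marker FF D8:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: true if the file starts with the JPEG SOI marker (FF D8).
sub starts_with_jpeg_soi {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    binmode($fh);                       # raw bytes: no CRLF or encoding layers
    my $got = read($fh, my $buf, 2);
    close($fh);
    return defined $got && $got == 2 && $buf eq "\xff\xd8";
}

# Self-contained demonstration: a temporary file stands in for a real JPEG.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} "\xff\xd8\xff\xe1";
close($tmp_fh) or die "Can't close $tmp_path: $!";

print starts_with_jpeg_soi($tmp_path) ? "looks like a JPEG\n" : "not a JPEG\n";
```

The comparison only works if both sides are plain byte strings, which is exactly the assumption that a global encoding setting can violate.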

Below I attach the full text of the email from Tom Christiansen for those
who may be interested in the details:

- Phil

------------------------------------------------------------

[Tom writes]

The short answer is put "use bytes;" in Image/ExifTool.pm, right below
where you pull in File::RandomAccess and before the rest of the file.

The long answer is, well, longer.

What's happening is that your

    "\xff" . chr($marker)

code (amongst other places) is being shamelessly "promoted" into
a Unicode-encoded string, starting from an assumed ISO-8859-1.
This will now produce a 4-byte string that is "\357\277\275\0".
This is of course completely nuts.  
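
[Editorial illustration, not part of Tom's message.] One way to see where bytes like "\357\277\275\0" can come from is to misread the marker as UTF-8 by hand with the Encode module. This is only a sketch of the effect: it does not claim to follow the exact path perl takes under "use encoding", and the helper octal_dump is a made-up name for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show each byte of a byte string in octal, od-style.
sub octal_dump {
    my ($bytes) = @_;
    return join '', map { sprintf '\\%o', ord } split //, $bytes;
}

# A JPEG-style marker as raw bytes: 0xFF followed by a zero byte.
my $marker = "\xff" . chr(0);

# Misread those bytes as UTF-8. 0xFF can never occur in valid UTF-8,
# so the decoder substitutes U+FFFD (the replacement character).
my $copy  = $marker;
my $chars = decode('UTF-8', $copy);

# Write the damaged character string back out as UTF-8 bytes.
my $damaged = encode('UTF-8', $chars);

printf "before: %s\n", octal_dump($marker);
printf "after:  %s\n", octal_dump($damaged);
```

The two-byte marker comes back as four bytes, and the original 0xFF is unrecoverable.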

The bug IMHO is that use encoding is not lexically scoped.  It
affects code everywhere, and you should not expect that what its
documentation says about "no encoding" will actually do you much
good, nor will placing the "use encoding" after the modules are
pulled in.  That's not good enough.

There's a CPAN module called "encoding::warnings" that will
find these for you.  Add "use encoding::warnings" instead of use
bytes, and you will find this:

Code:
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 748
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 749
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2178
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2179
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2317
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2320
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2320
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2489
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2522
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2529
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2579
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2607
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2608
Bytes implicitly upgraded into wide characters as iso-8859-1 at lib/Image/ExifTool.pm line 2658

I believe you can do something at all those points to get it to behave
better involving calls to specific encode or decode routines from Encode,
but it's easiest to just say use bytes.

Now, it's not *always* enough to just place a use bytes in your own module
code.  For example:

Code:
    # Module BadEnc.pm
    use bytes;
    my $i = 0;
    sub main::func { return "\xff" . chr($i) }
    1;

BTW, you get a different answer writing

Code:
    sub main::func { return "\xff" . chr(0) }

than the chr($i), because it gets optimized into what is effectively

Code:
    sub main::func { return "\xff\x00" }

and so gets encoded up differently in the implicit conversion.

You can then run this:

Code:
    use BadEnc;
    use encoding "utf8";
    $word = func();
    for $i ( 0 .. (bytes::length($word)-1) ) {
        printf "char #%d has code point %d\n", $i, bytes::ord (bytes::substr($word, $i, 1))
    }
    for $i ( 0 .. (length($word)-1) ) {
        printf "char #%d has code point %d\n", $i, ord (substr($word, $i, 1))
    }
    print $word;

And you'll see that you still have a problem

Code:
    char #0 has code point 255
    char #1 has code point 0
    char #0 has code point 65533
    char #1 has code point 0

I've omitted the last line of output, but it is what gave me the
string that I ran through "od -c" to find that it's "\357\277\275\0".

BTW, placing use encoding::warnings into BadEnc.pm gives this now as output:

Code:
Bytes implicitly upgraded into wide characters as iso-8859-1 at BadEnc.pm line 4

    char #0 has code point 195
    char #1 has code point 191
    char #2 has code point 0
    char #0 has code point 255
    char #1 has code point 0

Which shows you would still have a problem.

I did test the "use bytes" in Image/ExifTool.pm, and it makes you suddenly
able to read JPEGs correctly again.

Here's my little test program:

Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Image::ExifTool qw(ImageInfo);
    use encoding 'utf8';  # breaks JPEG (etc) reading
    if (!@ARGV) {
        die "usage: $0 filename ...\n";
    }

    for my $filename (@ARGV) {
        my $info = ImageInfo($filename);

        # ExifTool reports problems under the capitalized Error and Warning tags
        if (my $error = $info->{Error}) {
            warn "Can't parse image info on file $filename: $error\n";
            next;
        }

        if (my $oops = $info->{Warning}) {
            warn "WARNING: Can't parse image info on file $filename: $oops\n";
            # fallthrough
        }

        printf "%s is size %s\n", $filename, $info->{ImageSize};
    }

If you add "use bytes" to your module, the code above will now work fine,
even in the presence of the noxious use encodings.

Below is my somewhat provocative message to p5p about it, where I don't
come right out and say what the trouble is.  I'm trying to see how many
folks see the problem right away.

Supposedly there are plans to make use encoding a lexically scoped pragma
in 5.9, but I don't know how far along that is.

--tom

------- Forwarded Message

=for your consideration,

The surrounding module, Smack.pm, runs perfectly well, as demonstrated by

Code:
   % perl -MSmack -e 'smack && snarf && print "hurray!\n" '

after clipping this message body and placing it in the obviously-named
file: the expected "hurray!" is indeed correctly emitted.  But it is
a false cheer, for this innocent module nevertheless has a bug or two
lurking in it.

Can you* see the problem?  If so, is it really glaringly obvious to
everyone but me?  Has awareness of this niggling nasty passed into
general understanding?

I don't think so, but could of course be wrong.  Yet even if I *am*
mistaken (and so more of you will say "Duh, Tom!" than who will say
"D'oh, Perl!"), what are the poor module writers realistically supposed
to do about this? Must they retroactively insulate themselves from this
strange-action-at-a-distance bug?  This vexing matter may well not
even have existed back when they wrote their innocent module.

Why must module writers understand this?  I really can't see how it's
their fault.  Anything that forces everybody else all to go off and
change their existing module code can't be a good thing.  I would argue
therefore that the fault lies not in these modules, but elsewhere entirely--
pragmatically speaking, that is.

No?

- --tom

=cut

Code:
package Smack;
use strict;
use warnings;

use Exporter;
our @ISA = 'Exporter';
our @EXPORT = qw(smack snarf);

my $DEFNAME = "bindata";
my $NULL = chr(my $bye = 0);
my $RECSEP = "\xff" . $NULL;

sub smack {
    my $file = @_ ? shift : $DEFNAME;
    open(BINDATA, "> :raw", $file) || die "Can't smack > $file: $!";
    {
        local $\ = $RECSEP;
        print BINDATA "line one";
        print BINDATA "line two";
    }
    close(BINDATA) || die "can't close $file: $!";
    printf "%s is size %d (should be 20)\n" , $file, -s $file;
    return 1;
}

sub snarf {
    my $file = @_ ? shift : $DEFNAME;
    open(BINDATA, "< :raw", $file) || die "Can't snarf < $file: $!";
    local $/ = $RECSEP;
    local $_;
    while (<BINDATA>) {
        chomp;
        printf "line %d of %s: %s\n", $., $file, $_;
    }
    close BINDATA;
    return 1;
}

1;
# * where you != Audrey

------- End of Forwarded Message
Title: Re: Encoding problems (was Warning: jpeg format error)
Post by: Archive on May 12, 2010, 08:54:03 AM
[Originally posted by eck on 2007-04-18 15:49:03-07]

Hi Phil,

Your (and Tom's) proposal with "use bytes" works fine. I just put it into ExifTool.pm right after the "use strict", and now everything looks good, even in Apache with mod_perl. Of course some more testing is required, but this looks really good to me.

Now to the hard part for me: I'll try to take a few moments to understand what happens :-)
Your post with Tom's quote is something for this evening, I hope.

Thanks for the fast help. And for ExifTool of course. :-)
Title: Re: Encoding problems (was Warning: jpeg format error)
Post by: Archive on May 12, 2010, 08:54:03 AM
[Originally posted by exiftool on 2007-04-18 16:45:33-07]

Yes, more testing is required.  In my testing so far I have determined
that the "use bytes" must be added to all ExifTool files, not just the
main module.  Also, there are a couple of places in the code where
UTF8 strings are converted and "no bytes" must be added to
prevent problems.
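
The difference between the two modes can be seen in a small sketch. This is an illustration only, not code from ExifTool; the helper names char_length and byte_length are made up for the example. Under "use bytes", length() counts the bytes of the string's internal representation, while outside it (or under "no bytes") it counts characters:

```perl
use strict;
use warnings;

# One character (U+263A) that occupies three bytes when stored as UTF-8.
my $smiley = "\x{263A}";

# Hypothetical helper: length with normal character semantics.
sub char_length {
    my ($s) = @_;
    return length $s;
}

# Hypothetical helper: length with byte semantics.
sub byte_length {
    my ($s) = @_;
    use bytes;              # lexically scoped: affects only this sub body
    return length $s;
}

printf "characters: %d, bytes: %d\n", char_length($smiley), byte_length($smiley);
# prints "characters: 1, bytes: 3"
```

Unlike "use encoding", the bytes pragma is lexically scoped, so it can be switched on per file or per block and switched off again where real character handling is needed.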

I have made these changes and updated the 6.87 pre-release from my
current test version.  Please use this version if you are running
tests.  I will continue to test it here as well.

Thanks.

- Phil
Title: Re: Encoding problems (was Warning: jpeg format error)
Post by: Archive on May 12, 2010, 08:54:03 AM
[Originally posted by exiftool on 2007-04-19 15:01:58-07]

This problem has resulted in an interesting thread in the perl5 developers
mailing list (see here).

Two interesting comments from this exchange are:

[Tom Christiansen wrote:]
 
Code:
 I have a hunch that the reporting user's trouble is partially that he may
  misunderstand the purpose of "use encoding 'utf8'", which he reports using
  because:

      In a Webapplication I have to translate/normalize all incoming text to UTF-8.

  I think he probably should simply be setting the encoding on the stream,
  not on the script, whether via binmoding  STDOUT => utf8 or any of the
  equivalent mechanisms.

[Yves wrote:]
Code:
 [...] as a result of your probing this at this point id say that
  encoding.pm is sufficiently broken that its not worth using. IMO we
  should split it into two new packages, fix the bugs at the same time
  and advertise  encoding.pm as deprecated in favour of the new
  functionality.

So at the very least you should exercise caution with "use encoding",
and it may be a good idea to investigate the alternatives as Tom has
suggested, but regardless I believe I have fixed the associated problems
wrt ExifTool.

- Phil
Title: Re: Encoding problems (was Warning: jpeg format error)
Post by: Archive on May 12, 2010, 08:54:04 AM
[Originally posted by exiftool on 2007-04-21 23:54:32-07]

So it looks like the solution is: Don't use encoding.pm!
(It has very nasty side-effects that will break many other modules.)

After a flurry of mails (over 100! including side-branches) on this topic
in the Perl5 porters mailing list, the solution that everyone agrees on is
that encoding.pm must go, and it looks like it will soon be deprecated.

Also, I have been advised not to add "use bytes" in an attempt to patch this
since this may also have undesirable side-effects.  So I will not try to
implement a fix in an official version of ExifTool.  Instead, you should fix
your scripts so they don't need to "use encoding 'utf8'".

You should be able to get the same effect by setting UTF-8 encoding on
STDIN/STDOUT as Tom mentioned.  And if you also want your scripts
themselves to be UTF-8 encoded, you should "use utf8".
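
A minimal sketch of that approach, assuming a script saved as UTF-8 that reads and writes UTF-8 text (the sample string is invented for illustration):

```perl
use strict;
use warnings;
use utf8;                              # this source file is saved as UTF-8

# Decode input and encode output at the filehandle boundary only,
# instead of changing string semantics for every loaded module.
binmode(STDIN,  ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

# Sample text (an assumption for illustration): 11 characters.
my $text = "Grüße, 10 €";

printf "%s has %d characters\n", $text, length $text;
```

Because nothing here alters how other modules build their strings, binary-oriented code loaded by the same script should keep seeing plain bytes.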

Let me know how it goes.

Below is the conclusion of this thread in the Perl5 porters mailing list:

On Apr 21, 2007 at 03:03:24 , Dan Kogai wrote:
Folks,

Sorry for being silent.  As a maintainer, I should have said 
*something*.

On Apr 21, 2007, at 01:03 , Rafael Garcia-Suarez wrote:
> On 20/04/07, Juerd Waalboer wrote:
>> Phil Harvey wrote 2007-04-20 11:40 (-0400):
>> > So encoding.pm should DEFINITELY BE FIXED, OR REPLACED.
>>
>> We know. But it's not that easy. And use of encoding.pm isn't necessary
>> either: everyone can do without. So I can understand that this gets very
>> little priority.
>>
>> It should, IMO, be deprecated at the first opportunity. Even if there
>> isn't a replacement.
>
> I agree with this.
>
> encoding should be replaced by two separate modules.

To be honest encoding.pm is so painful I would love to see it  deprecated.
There are primarily two reasons why encoding.pm got so kludgy.

1)  Lack of lexical scope.  As of now ${^ENCODING} is not lexical.   
This makes encoding.pm unsafe to use in modules.

2)  Doing two things in one module.  It sets ${^ENCODING} AND binmode 
STD(IN|OUT).  Part of the reasons it was so made was to replace jperl 
for good.  Most jperl scripts automagically worked just by adding 
'use encoding "shiftjis";' and such.  It did its jobs and jperl is 
now hardly ever seen.  What I did not expect was so many people try 
to use it in NEW programs.

I tell you folks, encoding.pm's role is over.  It's been 5 years
since 5.8.0 was released.  We should move on to something better.  Let
encoding.pm rest in peace; that will definitely bring me peace.


- Phil