ExifTool Forum

ExifTool => Bug Reports / Feature Requests => Topic started by: Mac2 on February 25, 2014, 03:44:10 PM

Title: Glitch - HTM files in UTF8/UNICODE encoding always return "File Format Error"
Post by: Mac2 on February 25, 2014, 03:44:10 PM
ExifTool handles HTML files in ANSI/ASCII encoding and produces some basic data.
If the same file is encoded in UTF8 or 16-bit UNICODE, ExifTool always returns "File Format Error".
Title: Re: Glitch - HTM files in UTF8/UNICODE encoding always return "File Format Error"
Post by: Phil Harvey on February 25, 2014, 07:28:40 PM
I have never seen this.  Can you post a sample?

- Phil
Title: Re: Glitch - HTM files in UTF8/UNICODE encoding always return "File Format Error"
Post by: Mac2 on February 26, 2014, 02:59:13 AM
Hi, Phil

thanks for looking into this.
I have prepared two sample HTML files and attached them.
The ASCII version is processed correctly. The same file saved in UTF8 produces the "File Format Error".
Title: Re: Glitch - HTM files in UTF8/UNICODE encoding always return "File Format Error"
Post by: Phil Harvey on February 26, 2014, 07:21:36 AM
Ah, OK.  This file only has a UTF-8 BOM at the start.  In your first post you mentioned 16-bit Unicode, so I was thinking UTF-16, which I have never seen.

Adding support for a leading UTF-8 BOM is easy.  ExifTool 9.54 will allow this.
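
For illustration only, a minimal Python sketch of the idea, not ExifTool's actual (Perl) code: skip the three-byte UTF-8 BOM, if present, before testing whether the data looks like HTML. The function names, the rough HTML test, and the sample bytes are all hypothetical.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def looks_like_html(data: bytes) -> bool:
    """Very rough HTML check (an assumption, not ExifTool's real test)."""
    head = strip_utf8_bom(data).lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))

# Hypothetical sample: the same markup a UTF-8 editor might save with a BOM.
sample = codecs.BOM_UTF8 + b"<html><head><title>Test</title></head></html>"
print(looks_like_html(sample))
```

Without the `strip_utf8_bom` step, the first bytes of `sample` would be the BOM rather than `<html`, which is the kind of mismatch that could lead a format check to reject the file.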

Thanks for pointing out this problem.

- Phil
Title: Re: Glitch - HTM files in UTF8/UNICODE encoding always return "File Format Error"
Post by: Mac2 on February 26, 2014, 02:06:47 PM
Hi, Phil

Great. I've attached the same file in 16-bit Unicode (the Windows default) and in 16-bit big-endian Unicode. These are rarely used in the wild, though the Windows default format is often used in corporate environments that process and emit data in the Windows standard 16-bit Unicode format without converting to UTF8.
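
As a rough sketch of how the three variants could be told apart by their BOM (plain Python for illustration, not anything ExifTool does; the function name `decode_with_bom` and the `latin-1` fallback are assumptions, and UTF-32 is not handled):

```python
import codecs

# BOM byte sequences paired with the Python codec that decodes the text after them.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE (little-endian, the Windows default)
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF (big-endian)
]

def decode_with_bom(data: bytes, default: str = "latin-1") -> str:
    """Decode data according to its leading BOM, or with `default` if there is none."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)

# Hypothetical sample: the same markup saved as little-endian UTF-16 with a BOM.
text = "<html><head><title>Test</title></head></html>"
utf16_le_file = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(decode_with_bom(utf16_le_file) == text)
```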
Title: Re: Glitch - HTM files in UTF8/UNICODE encoding always return "File Format Error"
Post by: Phil Harvey on February 26, 2014, 06:59:25 PM
Thanks.  I think I'll hold off implementing support for UTF-16 HTML files until there is actually a need (you don't have a need for this, do you?), because it would be a bit ugly to implement.

- Phil
Title: Re: Glitch - HTM files in UTF8/UNICODE encoding always return "File Format Error"
Post by: Mac2 on February 27, 2014, 07:52:49 AM
I doubt that 16-Bit Unicode is in wide use, if at all.

When future ExifTool versions handle UTF8 it should cover most real-world files.