ExifTool Forum

ExifTool => Bug Reports / Feature Requests => Topic started by: francois on December 20, 2015, 02:01:16 PM

Title: Wrong character encoding reading PostScript tags in Illustrator file
Post by: francois on December 20, 2015, 02:01:16 PM
Setup:
ExifTool 10.07
Mac OS X 10.10.5
Adobe Illustrator CC 19.2.0

From sample file attached, raw data (from hex editor):

... 19.2.0 %%For: (Fran\615ois Bonzon \616\617) () %%Title: ...

Actual in ExifTool:

[PostScript]    For                             : Fran?ois Bonzon ??,

Expected in ExifTool:

[PostScript]    For                             : François Bonzon éè

I don't know about the specification for these tags, but note the following:

Values seen:

octal hex    character
\615  0x18D  ç
\616  0x18E  é
\617  0x18F  è

Mac OS Roman encoding:

0x8D ç
0x8E é
0x8F è
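
The relationship between the two tables can be checked with a few lines of Python. This is only a sketch of the arithmetic: `macroman_from_escape` is a made-up helper name, and the 0x100 offset is taken from the values observed in this one file, not from any specification.

```python
def macroman_from_escape(octal_digits):
    """Map an observed escape such as '615' to a character via MacRoman."""
    value = int(octal_digits, 8)           # '615' -> 0x18D
    byte = value - 0x100                   # assumed offset: 0x18D -> 0x8D
    return bytes([byte]).decode("mac_roman")

for digits in ("615", "616", "617"):
    print(digits, hex(int(digits, 8)), macroman_from_escape(digits))
```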

For info, this "For" tag is automatically populated from the user account full name, as specified in Mac OS preferences.

Second, why is a comma added after the name? It should be removed.
Title: Re: Wrong character encoding reading PostScript tags in Illustrator file
Post by: Phil Harvey on December 21, 2015, 09:00:23 AM
The "For" information is stored as a PostScript comment inside this PDF, and the raw data looks like this:

%%For: (Fran\615ois Bonzon \616\617) ()

Which explains both why the characters don't show up properly (ExifTool currently does no conversion for PostScript strings), and why there is a trailing comma (there are 2 items in the "For" list, and the second one is empty).
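
To illustrate the list structure, here is a minimal Python sketch (not ExifTool's actual code) that splits the raw value into its parenthesized strings. `split_ps_strings` is a hypothetical helper, and joining the items with ", " is an assumption about how a list gets displayed.

```python
def split_ps_strings(value):
    """Return the bodies of the top-level ( ... ) strings in a raw value."""
    items, current, depth = [], [], 0
    i = 0
    while i < len(value):
        ch = value[i]
        if depth > 0 and ch == "\\" and i + 1 < len(value):
            # Keep a backslash escape (e.g. \( or \6...) intact for later decoding.
            current.append(value[i:i + 2])
            i += 2
            continue
        if ch == "(":
            if depth > 0:
                current.append(ch)   # nested parenthesis belongs to the string
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                items.append("".join(current))
                current = []
            else:
                current.append(ch)
        elif depth > 0:
            current.append(ch)
        i += 1
    return items

raw = r"(Fran\615ois Bonzon \616\617) ()"
items = split_ps_strings(raw)
print(", ".join(items))                   # the empty second item leaves a trailing ", "
print(", ".join(s for s in items if s))   # skipping empty items avoids it
```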

I did some more reading of the PostScript specification, and I can find no reference for the character set used for PostScript comments.  Without knowing this, it is difficult for ExifTool to deal with special characters in this type of information.

- Phil
Title: Re: Wrong character encoding reading PostScript tags in Illustrator file
Post by: francois on January 07, 2016, 10:00:36 PM
I did some reading of the PostScript specs too, and I didn't find an explanation either for these octal values above \377 (above 255 decimal).

It may be an encoding specific to Adobe or Apple. In case you want to handle it, I can confirm it is based on MacRoman: I tested all MacRoman characters. The encoding is as follows:

- Printable ASCII characters left as is, except parentheses and backslash are escaped as they have special meaning in PostScript language: \(  \)  \\

- Characters in MacRoman but not in ASCII are written as an octal escape sequence whose value is 0x100 (\400 in octal) above the character's true MacRoman value:
\600  Ä
\601  Å
\602  Ç
... down to
\775  ˝
\776  ˛
\777  ˇ

- All other characters are discarded, e.g. Unicode characters that have no MacRoman equivalent
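
The rules above could be turned into a decoder along these lines. This is a sketch in Python under the assumptions described in this post (MacRoman source, 0x100 offset), not anything taken from a specification or from ExifTool. `decode_illustrator_ps_string` is a hypothetical name, and treating escapes below \400 as plain MacRoman bytes is my own guess.

```python
import re

# Matches a backslash followed by 1-3 octal digits, or an escaped ( ) or backslash.
_ESCAPE = re.compile(r"\\([0-7]{1,3}|[()\\])")

def decode_illustrator_ps_string(body):
    """Decode the body of a PostScript string (outer parentheses removed)."""
    def replace(match):
        esc = match.group(1)
        if esc in ("(", ")", "\\"):
            return esc                      # \(  \)  \\  stand for themselves
        value = int(esc, 8)
        if value >= 0x100:
            value -= 0x100                  # assumed offset observed in these files
        # Assumption: the remaining byte is interpreted as MacRoman.
        return bytes([value]).decode("mac_roman")
    return _ESCAPE.sub(replace, body)

print(decode_illustrator_ps_string(r"Fran\615ois Bonzon \616\617"))
```

On a file written under a different platform encoding, the `"mac_roman"` codec name would have to be replaced, which is exactly the piece of information the file does not carry.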
Title: Re: Wrong character encoding reading PostScript tags in Illustrator file
Post by: Phil Harvey on January 07, 2016, 10:46:37 PM
I would guess that if you did the same test on Windows the encoding would be the native Windows encoding.  If true, this is just fine for using the files on a local computer, but it would make it impossible for ExifTool to decode these strings without additional user input.
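
A small Python illustration of why the platform matters, assuming (and this is only my guess above, not something tested) that a Windows build would write Windows-1252 bytes. `decode_byte` is a hypothetical helper. The same byte value gives a different character depending on which encoding is assumed:

```python
def decode_byte(value, encoding):
    """Decode a single byte value under the given (assumed) encoding."""
    return bytes([value]).decode(encoding)

# Byte 0x8E, i.e. the \616 escape minus the 0x100 offset seen in the sample file.
print(decode_byte(0x8E, "mac_roman"))   # é
print(decode_byte(0x8E, "cp1252"))      # Ž
```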

- Phil