Issues converting tag values from Windows character set to Linux UTF-8

Started by polaris6262, July 09, 2016, 05:04:06 PM


polaris6262

This is my first posting on the ExifTool forum, so I humbly ask for your understanding if I haven't done this properly.

I use a media streaming device to play digital files on my stereo. The proprietary server software uses my Wi-Fi network to send the music file data to the player. Unfortunately the server software is only made for Windows, so although I use Linux for all of my computing needs I am forced to maintain a Windows 7 installation for playing music through the media streaming device. A secondary issue is that all the tags in my digital music files have to be compatible with Windows, so they are coded as ISO-8859-1, which I believe is essentially what is known as "Latin1" by another name. The server software has a simplistic and fairly atrocious interface for selecting music for playing but fortunately also supports playlists, so I usually pre-define those when I want to play music. I've written a fairly comprehensive script to run under Linux to easily build playlists and it works well. It uses ExifTool to extract the necessary tags from the digital music files in order to build the playlists. My problem began when I acquired some French music files. If I select them from the server software interface they play as well as any other files, but for some reason when I use my script to include them in a playlist the server software doesn't "see" those music files at all. I eventually found out that the problem was a character set conversion issue.

This is the ExifTool output when I extract the required tags from one problematic music file:

/opt/Image-ExifTool-10.22/exiftool -s -Directory -FileName -Artist -Product -Title -Duration "/Shared/Media/Music/Français/Mes Aïeux - Dégénération - Le Reel Du Fossé.wav"

Directory                       : /Shared/Media/Music/Français
FileName                        : Mes Aïeux - Dégénération - Le Reel Du Fossé.wav
Artist                          : Mes A�eux
Product                         : En Famille
Title                           : D�g�n�ration / Le Reel Du Foss�
Duration                        : 0:05:24

The issue is obviously that the music file tags are internally stored in a Windows character set while Linux operates in UTF-8. I tried to resolve the issue using the Linux iconv character encoding tool:

/opt/Image-ExifTool-10.22/exiftool -s -Directory -FileName -Artist -Product -Title -Duration "/Shared/Media/Music/Français/Mes Aïeux - Dégénération - Le Reel Du Fossé.wav" | iconv -f ISO-8859-1 -t UTF-8

Directory                       : /Shared/Media/Music/FranÃ§ais
FileName                        : Mes AÃ¯eux - DÃ©gÃ©nÃ©ration - Le Reel Du FossÃ©.wav
Artist                          : Mes Aïeux
Product                         : En Famille
Title                           : Dégénération / Le Reel Du Fossé
Duration                        : 0:05:24
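
The pipe above pushes every line through iconv, including the Directory and FileName lines, which are already UTF-8. A minimal sketch of what happens to such text, using only `printf` and `iconv` (the variable names are illustrative, not part of the original script):

```shell
# "ç" is stored as the two bytes C3 A7 in UTF-8 (octal 303 247).
utf8_name=$(printf 'Fran\303\247ais')

# Telling iconv the input is ISO-8859-1 makes it treat C3 and A7 as two
# separate Latin-1 characters ("Ã" and "§") and encode each one again.
double_encoded=$(printf '%s' "$utf8_name" | iconv -f ISO-8859-1 -t UTF-8)

printf '%s\n' "$double_encoded"   # prints: FranÃ§ais
```

Text that is genuinely ISO-8859-1, such as the RIFF tag values, comes out correct from the same command, which is why only part of the output needs converting.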

While iconv does properly convert the problematic characters, it unfortunately also mangles the filename and directory names, which I also need. I consulted the ExifTool documentation and from that I tried the "charset" parameter, which didn't solve my problem:

/opt/Image-ExifTool-10.22/exiftool -s -charset Latin1 -Directory -FileName -Artist -Product -Title -Duration "/Shared/Media/Music/Français/Mes Aïeux - Dégénération - Le Reel Du Fossé.wav"

Directory                       : /Shared/Media/Music/Français
FileName                        : Mes Aïeux - Dégénération - Le Reel Du Fossé.wav
Artist                          : Mes A�eux
Product                         : En Famille
Title                           : D�g�n�ration / Le Reel Du Foss�
Duration                        : 0:05:24

The ExifTool FAQ mentions that the above "charset" specification is intended to address issues with the external character set for tag values, so I tried the specification for the internal character set, which would seem to be more in line with what I was trying to accomplish:

/opt/Image-ExifTool-10.22/exiftool -s -charset exif=Latin1 -Directory -FileName -Artist -Product -Title -Duration "/Shared/Media/Music/Français/Mes Aïeux - Dégénération - Le Reel Du Fossé.wav"

Directory                       : /Shared/Media/Music/Français
FileName                        : Mes Aïeux - Dégénération - Le Reel Du Fossé.wav
Artist                          : Mes A�eux
Product                         : En Famille
Title                           : D�g�n�ration / Le Reel Du Foss�
Duration                        : 0:05:24

Again, this didn't help. Considering that ExifTool seems to be able to do most anything, I'm almost positive that there must be a way to do this and I'm just not able to see it.

I would greatly appreciate a bit of help with this. Many thanks in advance!

Phil Harvey

The question is: What metadata type are you working with?  Try adding -G to your command so I can see where these tags are coming from.   I'm hoping it is not RIFF because ExifTool doesn't convert special characters in RIFF.  (I don't know if I've seen any specification for how to deal with these.)

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

polaris6262

Thank you kindly for the prompt reply. Unfortunately it turns out the tags are indeed RIFF:

/opt/Image-ExifTool-10.22/exiftool -s -G -Directory -FileName -Artist -Product -Title -Duration "/Shared/Media/Music/Français/Mes Aïeux - Dégénération - Le Reel Du Fossé.wav"

[File]          Directory                       : /Shared/Media/Music/Français
[File]          FileName                        : Mes Aïeux - Dégénération - Le Reel Du Fossé.wav
[RIFF]          Artist                          : Mes A�eux
[RIFF]          Product                         : En Famille
[RIFF]          Title                           : D�g�n�ration / Le Reel Du Foss�
[Composite]     Duration                        : 0:05:24

This means I'll probably have to somehow split the data stream in half and push all the RIFF data through iconv and merge it back with the FILE data. I was hoping to avoid having to do something like that, but it may turn out to be the only solution. Too bad ExifTool can't use iconv - or something like it - to do this internally.
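
The split-and-merge idea can be sketched as a small shell filter. This is only an illustration: the function name `convert_riff_lines` is hypothetical, and it assumes the input is `exiftool -G` output (so every RIFF value sits on a line beginning with `[RIFF]`) and that the RIFF text really is ISO-8859-1.

```shell
# Read "exiftool -G" output on stdin and re-encode only the [RIFF] lines
# from ISO-8859-1 to UTF-8; all other lines pass through unchanged.
convert_riff_lines() (
    # Byte-oriented locale so the shell does not trip over non-UTF-8 bytes.
    LC_ALL=C
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            '[RIFF]'*) printf '%s\n' "$line" | iconv -f ISO-8859-1 -t UTF-8 ;;
            *)         printf '%s\n' "$line" ;;
        esac
    done
)

# Usage sketch (needs exiftool on PATH):
#   exiftool -s -G -Directory -FileName -Artist -Title file.wav | convert_riff_lines
```

Starting one iconv process per RIFF line is slow for a large library, but it keeps the lines in their original order without a separate merge step.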

Phil Harvey

Do you have a small sample you can send me (philharvey66 at gmail.com)?  I can add support for this if I can figure out how the RIFF special characters should be handled.

- Phil

polaris6262

I did a bit of reading on RIFF; it's a generic file container format for storing data in tagged chunks with ASCII headers and lengths in little-endian multi-byte integers. As you mentioned yourself, I haven't found any information on how special characters should be handled.
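
To make the chunk layout concrete, here is a rough sketch that reads the outermost chunk header of a file: a 4-byte ASCII ID followed by a 4-byte little-endian size. The function name `riff_header` is hypothetical, and the sketch assumes an `od` that supports the `-j` and `-N` options.

```shell
# Print the ID and declared size of the first chunk in a RIFF file.
riff_header() (
    file=$1
    # Bytes 0-3: the chunk ID, plain ASCII ("RIFF" for a WAV file).
    id=$(dd if="$file" bs=1 count=4 2>/dev/null)
    # Bytes 4-7: the chunk size as four unsigned bytes, least significant first.
    set -- $(od -An -tu1 -j4 -N4 "$file")
    size=$(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
    printf '%s %s\n' "$id" "$size"
)

# Usage sketch:
#   riff_header song.wav
```

Nothing in this header says how text inside the chunks is encoded, which matches the observation that the format itself gives no hint about special characters.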

As for a sample, what would that consist of? The entire WAV file or an extract of some sort? If the latter, I'd need to know how to go about it.

ExifTool has been extremely useful for my playlist generation needs. Currently, I am using it very inappropriately by invoking it repeatedly for every music file, which you recommend against as it does add considerable overhead. I can greatly speed up processing by having ExifTool extract all the data tags I need from every music file in one pass by having it recursively read all the music files from every directory.
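
A single-pass run might look like the following sketch. It relies on ExifTool's `-r` (recurse into subdirectories), `-ext` (restrict by file extension) and `-T` (tab-separated table output, with `-` for missing tags) options. The extended-M3U playlist format is a guess, since the thread doesn't say which format the server reads, and `tsv_to_m3u` is a hypothetical name.

```shell
# Turn tab-separated "Directory FileName Artist Title" lines into an
# extended-M3U playlist. A duration of -1 means "unknown".
tsv_to_m3u() {
    printf '#EXTM3U\n'
    while IFS="$(printf '\t')" read -r dir file artist title; do
        printf '#EXTINF:-1,%s - %s\n%s/%s\n' "$artist" "$title" "$dir" "$file"
    done
}

# Usage sketch (one exiftool process for the whole library):
#   exiftool -r -ext wav -T -Directory -FileName -Artist -Title /Shared/Media/Music \
#       | tsv_to_m3u > all.m3u
```

Because `-T` prints `-` for a missing tag, every line keeps four fields and the `read` stays aligned even for untagged files.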

Phil Harvey

The sample would be a whole wav file.  If you can write metadata yourself, take the smallest WAV file you have and write something with special characters to it.

I took a look at one of my RIFF format specification references, and there is mention of a "CSET" chunk which contains the character set information, but I don't have this in any one of my RIFF samples here.  But lacking this, the best alternative may be to add a -charset RIFF option that would allow you to specify the RIFF character set yourself.  In most cases this is likely ISO-8859-1, but I suspect that this could depend on the specific Windows system settings when the file was written.

- Phil


polaris6262

All right, I've produced a small sample of the original audio file but with the same data tags and I've just sent it to the email address you provided. I use Roxio Creator 2011 Pro to edit the tags under Windows. I don't really like the program as I prefer what's available under Linux, but it does produce tag data that my proprietary music server can read. If only I could also process those using ExifTool under Linux then everything would be fine.

Phil Harvey

OK, thanks.  I'll post back here after I get the file and have had a chance to work on this.

- Phil

Phil Harvey

I got the file, thanks.

I couldn't find any indication of the character set, so I will add a -charset RIFF option, and change ExifTool 10.23 to assume Latin by default.

- Phil

polaris6262

Excellent! You know, I never knew that the Waveform Audio File Format (WAVE, more commonly known as WAV) was a Microsoft implementation, a subset of Microsoft's RIFF specification for the storage of multimedia files. I mostly use it because it's very nearly ubiquitous (like FAT, another almost-universal Microsoft implementation) and most importantly because it's lossless. It resembles AIFF (Audio Interchange File Format), which is its counterpart from Apple for use with its Macintosh operating system.

So, knowing this, I'm not very surprised that it would not contain any indication of a character set since WAV was essentially made to be used under Windows, and the latter's preferred character set is almost always Latin. Still, in your ExifTool -charset RIFF option, it might not be a bad idea (if you feel so inclined) to give the user the option of overriding the default Latin character set if said user happens to know the actual character set used in the WAV file under consideration.

Phil Harvey

Yes.  The Latin will be the default, but the new option will allow you to set it to any supported character set.  The new version will be available within about a week.

- Phil
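
Once the new release is out, a script could choose between the native option and an external conversion based on the installed version. A minimal sketch, assuming the option ends up spelled `-charset riff=CHARSET` (by analogy with the `-charset exif=Latin1` form tried earlier; the thread only calls it "a -charset RIFF option") and that it first appears in 10.23 as stated above. The helper name `supports_riff_charset` is hypothetical.

```shell
# Succeed if the given ExifTool version string (e.g. "10.22") is 10.23 or newer.
supports_riff_charset() (
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # drop any further ".x" suffix
    [ "$major" -gt 10 ] || { [ "$major" -eq 10 ] && [ "$minor" -ge 23 ]; }
)

# Usage sketch (needs exiftool on PATH; "exiftool -ver" prints the version):
#   if supports_riff_charset "$(exiftool -ver)"; then
#       exiftool -charset riff=Latin -s -G -Artist -Title file.wav
#   else
#       exiftool -s -G -Artist -Title file.wav   # convert RIFF values externally
#   fi
```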

polaris6262

That really sounds awesome, Phil. I'm very grateful to you for tackling this issue and I'm looking forward to the new functionality. Of course it will mean a major upgrade to my own playlist generation script, but the improvement in performance and flexibility will more than make this worthwhile.

Again, many thanks for your commitment. I'm sure I'm only joining others in expressing my appreciation for such a fine tool.