Really Odd Suggestion - But Could be Helpful for Many

Started by asjones, October 31, 2019, 11:43:39 AM

asjones

I know that standard text files don't have any metadata. However, I often have to live in them (log files, CSV, XML, JSON...).

When I run a text file through ExifTool, it gives the file name and date/times, and says "Error : Unknown file type"

Would you consider pulling more info for text files and the contents?
Reporting things like:
- Report that the file is a text file rather than an error
- Line endings and the number of lines by line-ending type: DOS (CR/LF), UNIX (LF), Mac (CR)... I have seen files with multiple types
- Whether the file is encoded as ASCII or UTF (and if UTF-8 or UTF-16, with or without a Byte Order Mark)
- I know there is controversy over whether the code page can truly be detected.
- I thought there was something else, but can't remember.
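
For readers who want to see what "number of lines by line ending type" could look like, here is a minimal Python sketch. It is not ExifTool code; the function name count_line_endings is made up for illustration, and it assumes an 8-bit encoding (ASCII, UTF-8, Latin-1), since the byte patterns differ in UTF-16/32.

```python
import re
from collections import Counter

def count_line_endings(data: bytes) -> Counter:
    """Count CRLF, LF and CR terminators in a byte string (8-bit encodings only)."""
    names = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR"}
    counts = Counter()
    # The alternation tries CRLF first, so a "\r\n" pair is counted once,
    # not as one CR plus one LF.
    for match in re.finditer(rb"\r\n|\r|\n", data):
        counts[names[match.group()]] += 1
    return counts

# A file with mixed line endings, like the one mentioned in the list above.
sample = b"one\r\ntwo\nthree\rfour\r\n"
print(count_line_endings(sample))
```

A file that reports more than one non-zero count has mixed line endings.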

Some text editors like to "convert on open" or report funny stuff.

This would give new power and features to ExifTool.

thanks for the consideration.

Alan



Phil Harvey

Hi Alan,

This would be sort of a hybrid feature, where ExifTool reports some information even though it doesn't fully recognize the type of file.  The "recognized files" feature could be expanded to accommodate this I think.  But ExifTool would still ignore files with these extensions when scanning a directory (you would have to specify other files explicitly).  For performance reasons, there would have to be a limit on how far into the file ExifTool scanned... 64 kB perhaps?  How does all of this sound?

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

greybeard

Sounds a bit like a combination of the Unix/Linux "file" and "wc" commands

asjones

I could see not reviewing text files by default (not usually needed). I was thinking more of when text files are explicitly referenced. For quick checks of images and other files, I keep a link to EXIFTOOL(-K) on my desktop so I can drag/drop files onto it for quick review. I would not want it to scan all my text files, but drag/drop and/or a specific reference in a script would be nice.

For any text file that is specified, knowing as much about it as possible can be helpful (especially the encoding type). I'm not sure how you could report the total number of lines without reading more than the first 64 kB. I know reading lines and such for a text file is different from other metadata.

I know it is an odd request, but it is something I have hit more times than I would have thought.

thanks for the thoughts on this

Alan

Phil Harvey

Hi Alan,

I've hacked this up for fun:

> exiftool tmp
    1 directories scanned
    0 image files read
> exiftool tmp/a.txt --system:all
ExifTool Version Number         : 11.75
File Type                       : TXT
File Type Extension             : txt
MIME Type                       : text/plain
Has BOM                         : No
Newline Type                    : Windows (CR/LF)
Line Count                      : 314
Word Count                      : 1256


I'll include it in the next release if people think this would be useful.

Currently I'm only looking at the first 1 kB of the file to decide whether or not it is plain text, but then the whole file is scanned to count lines/words if it is identified as a text file.
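
The two-stage approach described above (sniff a short prefix, then scan everything) might look roughly like the following Python sketch. This is not how ExifTool implements it: the looks_like_text heuristic, the whitespace set, and the rule for what counts as a line are all assumptions made for illustration.

```python
import io

SNIFF_BYTES = 1024                  # the 1 kB look-ahead mentioned above
WHITESPACE = b" \t\r\n\f\v"         # assumed word separators
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}  # tab, LF, FF, CR

def looks_like_text(head: bytes) -> bool:
    """Assumed heuristic: no control bytes other than tab/LF/FF/CR."""
    return all(b >= 0x20 or b in ALLOWED_CONTROLS for b in head)

def text_stats(stream, chunk_size=65536):
    """Return line/word counts, or None if the prefix does not look like text."""
    head = stream.read(SNIFF_BYTES)
    if not looks_like_text(head):
        return None
    lines = words = 0
    in_word = False
    prev = None
    chunk = head
    while chunk:
        for byte in chunk:
            if byte in WHITESPACE:
                in_word = False
            elif not in_word:
                in_word = True
                words += 1
            # Count every CR, and every LF that is not the second half of CRLF.
            if byte == 0x0D or (byte == 0x0A and prev != 0x0D):
                lines += 1
            prev = byte
        chunk = stream.read(chunk_size)  # state carries over chunk boundaries
    return {"lines": lines, "words": words}

print(text_stats(io.BytesIO(b"hello world\r\nfoo\n")))
```

Only the decision "is this text?" is limited to the prefix; the counting pass reads the rest of the stream in chunks so memory use stays flat.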

- Phil

asjones

I love what you have. I am unsure whether you are showing the encoding type (ASCII, UTF-8, UTF-16, UTF-32 which is not really used, EBCDIC which is not used much, etc.).

What are you counting as words (generic whitespace between stuff, or something else)? Does that vary with double-byte characters in UTF-16?

I am attaching a zip file of a few text file types; I think this will work.

Phil Harvey

Here's what I get.  Note that I'm not checking for UTF-16 yet:

> exiftool --system:all tmp/*
======== tmp/DOS-ASCII.txt
ExifTool Version Number         : 11.75
File Type                       : TXT
File Type Extension             : txt
MIME Type                       : text/plain
Has BOM                         : No
Newline Type                    : Windows (CR/LF)
Line Count                      : 1
Word Count                      : 2
======== tmp/EBCDIC.txt
ExifTool Version Number         : 11.75
File Type                       : TXT
File Type Extension             : txt
MIME Type                       : text/plain
Has BOM                         : No
Newline Type                    : Macintosh (CR)
Line Count                      : 2
Word Count                      : 2
======== tmp/UTF-16.txt
ExifTool Version Number         : 11.75
Error                           : Unknown file type
======== tmp/UTF-8.txt
ExifTool Version Number         : 11.75
File Type                       : TXT
File Type Extension             : txt
MIME Type                       : text/plain
Has BOM                         : No
Newline Type                    : Windows (CR/LF)
Line Count                      : 1
Word Count                      : 2
    4 image files read

asjones

Running in a sec, but that seemed to change the DOS/UNIX/Mac setting for the ASCII/UTF options.

They are unrelated, though... I was expecting a DOS/UNIX response and a line for ASCII, UTF-8, UTF-16, with/without BOM, etc.



https://superuser.com/questions/301552/how-to-auto-detect-text-file-encoding

https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding

https://softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file


Phil Harvey

Playing with this a bit more, I now have this:

> exiftool tmp -CharacterSet -linecount -wordcount
======== tmp/DOS-ASCII.txt
Character Set                   : ASCII, Windows CRLF
Line Count                      : 1
Word Count                      : 2
======== tmp/EBCDIC.txt
Character Set                   : Unknown 8-bit
Line Count                      : 2
Word Count                      : 2
======== tmp/UTF-16.txt
Character Set                   : UTF-16LE Unicode
======== tmp/UTF-8.txt
Character Set                   : ASCII, Windows CRLF
Line Count                      : 1
Word Count                      : 2
    1 directories scanned
    4 image files read


Note that I have now added .TXT to the list of supported file extensions, and only calculate the word count for 8-bit character sets.   Your UTF-8 file doesn't have any special characters, so it is in fact plain ASCII.   I'm not planning on supporting EBCDIC.
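
The point about the UTF-8 sample being "in fact plain ASCII" follows from UTF-8 being a superset of ASCII: text containing only ASCII characters produces identical bytes in both encodings. A small Python illustration (the sample strings are made up):

```python
plain = "Hello, world"
accented = "caf\u00e9"  # "café", with one non-ASCII character

# ASCII-only text encodes to the same bytes either way, so a detector
# cannot tell the two encodings apart for such a file.
same_bytes = plain.encode("utf-8") == plain.encode("ascii")
print(same_bytes)

# A non-ASCII character becomes a multi-byte sequence with bytes >= 0x80.
encoded = accented.encode("utf-8")
print(encoded, encoded.isascii())
```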

- Phil

Phil Harvey

...still playing with this.  Here is the current iteration:

> exiftool tmp -mimetype -byteordermark -newline -'*count'
======== tmp/DOS-ASCII.txt
MIME Type                       : text/plain; charset=us-ascii
Newline                         : Windows CRLF
Line Count                      : 1
Word Count                      : 2
======== tmp/EBCDIC.txt
MIME Type                       : text/plain; charset=unknown-8bit
Newline                         : Macintosh CR
Line Count                      : 2
Word Count                      : 2
======== tmp/UTF-16.txt
MIME Type                       : text/plain; charset=utf-16le
Byte Order Mark                 : Yes
Newline                         : Windows CRLF
======== tmp/UTF-8.txt
MIME Type                       : text/plain; charset=us-ascii
Newline                         : Windows CRLF
Line Count                      : 1
Word Count                      : 2
    1 directories scanned
    4 image files read


- Phil

asjones


Phil Harvey

I've just released ExifTool 11.75, which gives this output for the files you posted:

> exiftool -mime'*' -newlines -'*count' tmp
======== tmp/DOS-ASCII.txt
MIME Type                       : text/plain
MIME Encoding                   : us-ascii
Newlines                        : Windows CRLF
Line Count                      : 1
Word Count                      : 2
======== tmp/EBCDIC.txt
MIME Type                       : text/plain
MIME Encoding                   : unknown-8bit
Newlines                        : Macintosh CR
Line Count                      : 2
Word Count                      : 2
======== tmp/UTF-16.txt
MIME Type                       : text/plain
MIME Encoding                   : utf-16le
Newlines                        : Windows CRLF
======== tmp/UTF-8.txt
MIME Type                       : text/plain
MIME Encoding                   : us-ascii
Newlines                        : Windows CRLF
Line Count                      : 1
Word Count                      : 2
    1 directories scanned


- Phil

asjones



I really love this update; it will keep me in one tool more of the time. However, the first real file (not a sample) I tossed at ExifTool was 305 KB, and I knew it had some special UTF-8 characters in it. ExifTool said it was ASCII, so I did some digging and realized that, due to how the file was sorted in this case, the UTF-8 specific characters were at the 2010KB spot. When I edited out the junk from the beginning, it reported UTF-8.

So a few quick questions
1. Any chance of a -TXT_FILE or -LARGE parameter to read the whole file, and/or to read the whole file only for text files? Or could scanning the whole file be enabled with some other existing option?

2. Which encodings are supported: US-ASCII, UTF-8, UTF-16 (I assume no UTF-32)? Line endings of Win/DOS, UNIX, and Mac (old). No EBCDIC support... Is anything else detected, or known but not detected?

3. I saw the text tag documentation at https://www.exiftool.org/TagNames/Text.html
However, I did not see the 64KB limit or other details listed. If it's easy, it might be nice to document.

thanks for a great tool and new features!






Phil Harvey

I can maybe continue to check for non-ascii characters without a big performance hit if scanning the whole file for word/line count.  I'll look into this.
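
To illustrate why a prefix-only check misreports a file like the one described, and how the check could ride along with a full scan, here is a hedged Python sketch. The has_non_ascii helper and its limit parameter are hypothetical and are not ExifTool options.

```python
import io

def has_non_ascii(stream, limit=None, chunk_size=65536):
    """True if a byte >= 0x80 occurs in the first `limit` bytes (all bytes if None)."""
    remaining = limit
    while True:
        size = chunk_size if remaining is None else min(chunk_size, remaining)
        if size == 0:
            return False          # reached the look-ahead limit
        chunk = stream.read(size)
        if not chunk:
            return False          # reached end of file
        if not chunk.isascii():
            return True
        if remaining is not None:
            remaining -= len(chunk)

# 200,000 ASCII bytes followed by one UTF-8 encoded character.
data = b"a" * 200_000 + "\u00e9".encode("utf-8")

prefix_only = has_non_ascii(io.BytesIO(data), limit=1024)
whole_file = has_non_ascii(io.BytesIO(data))
print(prefix_only, whole_file)
```

With the 1 kB limit the file looks like ASCII; scanning to the end finds the late non-ASCII bytes. Because the line/word count already touches every chunk, the extra test per chunk is cheap.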

I had various versions of the documentation that mentioned that the encoding and newlines are determined from the first 1 kB (not 64 kB), but they didn't read very well.  I'll see about trying to add this again.  Also, maybe I should expand this limit to more than 1 kB.

It does check for UTF-32, but this and UTF-16 rely on an initial BOM.  So the encodings detected are:

utf-32le
utf-32be
utf-16le
utf-16be
us-ascii
utf-8
iso-8859-1
unknown-8bit
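
As a rough illustration of how a detector could arrive at the names in this list, here is a Python sketch. It is an assumption-laden sketch, not ExifTool's logic: in particular, the rule used here to separate iso-8859-1 from unknown-8bit (no bytes in the 0x80-0x9F range) is invented for the example, and guess_encoding is a made-up name.

```python
import codecs

# UTF-32 must be tested first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]

def guess_encoding(head: bytes) -> str:
    """Guess an encoding name from the leading bytes of a file."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    if head.isascii():
        return "us-ascii"
    try:
        # Caveat: a prefix cut in the middle of a multi-byte sequence
        # would fail here even if the file is valid UTF-8.
        head.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Assumed rule: Latin-1 text does not normally use the C1 control range.
    if not any(0x80 <= b <= 0x9F for b in head):
        return "iso-8859-1"
    return "unknown-8bit"

for sample in (b"\xff\xfeh\x00i\x00", b"hello", "caf\u00e9".encode("utf-8"), b"caf\xe9"):
    print(sample, "->", guess_encoding(sample))
```

Without a BOM, UTF-16 and UTF-32 data falls through to the 8-bit checks, which matches the note above that those two rely on an initial BOM.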

- Phil

asjones

Thanks for continuing to think about this. I always end up with the strange things in various systems and processes... In my "real" test file the first 700+ lines (200K) did not have a UTF-8 character, so it assumed ASCII.

If you are looking for line endings, hopefully you could also check for characters.

Thanks for your help

Alan