ExifTool Forum

ExifTool => Bug Reports / Feature Requests => Topic started by: asjones on October 31, 2019, 11:43:39 AM

Title: Really Odd Suggestion - But Could be Helpful for Many
Post by: asjones on October 31, 2019, 11:43:39 AM
I know that standard text files don't have any metadata. However, I often have to live in them (log files, CSV, XML, JSON...).

When I run a text file through ExifTool, it gives the file name and date/times, and says "Error : Unknown file type".

Would you consider pulling more info for text files and the contents?
Reporting things like:
- Report that the file is a text file rather than an error
- Line endings, and the number of lines by line-ending type: DOS (CR/LF), UNIX (LF), MAC (CR)... I have seen files with multiple types
- Whether the file is encoded as ASCII or UTF (and if UTF-8 or UTF-16, with or without a Byte Order Mark)
- I know there is controversy over whether the Code Page can truly be detected.
- I thought there was something else, but can't remember.
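
To make the line-ending item concrete, here is a minimal Python sketch of counting lines by ending type in raw 8-bit text. It is only an illustration of the request, not ExifTool code; the function name and the "Unix LF" label are made up for the example.

```python
import re
from collections import Counter

# Match CRLF first so a Windows line ending is not counted as CR plus LF.
_NEWLINE = re.compile(rb"\r\n|\r|\n")
_NAMES = {b"\r\n": "Windows CRLF", b"\n": "Unix LF", b"\r": "Macintosh CR"}


def count_line_endings(data: bytes) -> Counter:
    """Count each line-ending style found in raw 8-bit text (hypothetical helper)."""
    return Counter(_NAMES[m.group()] for m in _NEWLINE.finditer(data))


# A file with mixed line endings, as described in the list above.
sample = b"one\r\ntwo\nthree\rfour\r\n"
print(count_line_endings(sample))
```

A file with a single style gives a counter with one key; more than one key means mixed line endings.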

Some text editors like to "convert on open" or report funny stuff.

This would give new power and features to ExifTool.

thanks for the consideration.

Alan


Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: Phil Harvey on October 31, 2019, 12:28:07 PM
Hi Alan,

This would be sort of a hybrid feature, where ExifTool reports some information even though it doesn't fully recognize the type of file.  The "recognized files" feature could be expanded to accommodate this I think.  But ExifTool would still ignore files with these extensions when scanning a directory (you would have to specify other files explicitly).  For performance reasons, there would have to be a limit on how far into the file ExifTool scanned... 64 kB perhaps?  How does all of this sound?

- Phil
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: greybeard on October 31, 2019, 01:16:13 PM
Sounds a bit like a combination of the Unix/Linux "file" and "wc" commands
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: asjones on October 31, 2019, 01:32:57 PM
I could see not reviewing text files by default (not usually needed). I was thinking more of when text files are explicitly referenced. For quick checks of images and other files I keep a link to EXIFTOOL(-K) on my desktop so I can drag/drop files onto it for quick review. I would not want it to scan all my text files, but drag/drop and/or a specific reference in a script would be nice.

For any text file specified, knowing as much as possible about it can be helpful (especially encoding types). I'm not sure how you could report the total number of lines without reading more than the first 64 kB. I know reading lines and such for a text file is different from other metadata.

I know it is an odd request, but it is something I have hit more times than I would have thought.

thanks for the thoughts on this

Alan
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: Phil Harvey on October 31, 2019, 02:41:49 PM
Hi Alan,

I've hacked this up for fun:

> exiftool tmp
    1 directories scanned
    0 image files read
> exiftool tmp/a.txt --system:all
ExifTool Version Number         : 11.75
File Type                       : TXT
File Type Extension             : txt
MIME Type                       : text/plain
Has BOM                         : No
Newline Type                    : Windows (CR/LF)
Line Count                      : 314
Word Count                      : 1256


I'll include it in the next release if people think this would be useful.

Currently I'm only looking at the first 1 kB of the file to decide whether or not it is plain text, but then the whole file is scanned to count lines/words if it is identified as a text file.
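
The two-stage approach described above (sniff the first 1 kB, then scan everything for the counts) could be sketched in Python roughly as follows. This is not the ExifTool implementation: the "looks like text" rule (no control bytes other than tab, LF, form feed and CR) and the way lines and words are counted are assumptions chosen for the illustration.

```python
import io

SNIFF_BYTES = 1024  # "first 1 kB", per the post above

# Assumption: these are the only control bytes allowed in plain text.
_ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}  # tab, LF, form feed, CR


def looks_like_text(sample: bytes) -> bool:
    """Heuristic check on a short sample (hypothetical rule, not ExifTool's)."""
    return all(b >= 0x20 or b in _ALLOWED_CONTROLS for b in sample)


def text_stats(stream):
    """Return (line_count, word_count), or None if the start does not look like text."""
    head = stream.read(SNIFF_BYTES)
    if not looks_like_text(head):
        return None
    data = head + stream.read()  # only now is the whole file read
    # bytes.splitlines() splits on LF, CR and CRLF; bytes.split() splits on ASCII whitespace.
    return len(data.splitlines()), len(data.split())


print(text_stats(io.BytesIO(b"hello world\r\n")))
```

Reading the whole file into memory keeps the sketch short; a real tool would more likely stream it in chunks.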

- Phil
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: asjones on October 31, 2019, 02:59:39 PM
I love what you have. I can't tell whether you are showing the encoding type (ASCII, UTF-8, UTF-16, UTF-32 (not really used), EBCDIC (not used much), etc.).

What are you counting as words (generic whitespace between things, or something else)? Does that vary with double-byte characters in UTF-16?

I am attaching a zip file of a few text file types. I think this will work.
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: Phil Harvey on October 31, 2019, 03:02:04 PM
Here's what I get.  Note that I'm not checking for UTF-16 yet:

> exiftool --system:all tmp/*
======== tmp/DOS-ASCII.txt
ExifTool Version Number         : 11.75
File Type                       : TXT
File Type Extension             : txt
MIME Type                       : text/plain
Has BOM                         : No
Newline Type                    : Windows (CR/LF)
Line Count                      : 1
Word Count                      : 2
======== tmp/EBCDIC.txt
ExifTool Version Number         : 11.75
File Type                       : TXT
File Type Extension             : txt
MIME Type                       : text/plain
Has BOM                         : No
Newline Type                    : Macintosh (CR)
Line Count                      : 2
Word Count                      : 2
======== tmp/UTF-16.txt
ExifTool Version Number         : 11.75
Error                           : Unknown file type
======== tmp/UTF-8.txt
ExifTool Version Number         : 11.75
File Type                       : TXT
File Type Extension             : txt
MIME Type                       : text/plain
Has BOM                         : No
Newline Type                    : Windows (CR/LF)
Line Count                      : 1
Word Count                      : 2
    4 image files read
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: asjones on October 31, 2019, 03:56:32 PM
Running it in a sec, but that seemed to change the DOS/UNIX/Mac setting for the ASCII/UTF options.

They are unrelated though... I was expecting a DOS/UNIX response and a line for ASCII, UTF-8, UTF-16, with/without BOM, etc.



https://superuser.com/questions/301552/how-to-auto-detect-text-file-encoding

https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding

https://softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file

Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: Phil Harvey on November 01, 2019, 09:28:20 AM
Playing with this a bit more, I now have this:

> exiftool tmp -CharacterSet -linecount -wordcount
======== tmp/DOS-ASCII.txt
Character Set                   : ASCII, Windows CRLF
Line Count                      : 1
Word Count                      : 2
======== tmp/EBCDIC.txt
Character Set                   : Unknown 8-bit
Line Count                      : 2
Word Count                      : 2
======== tmp/UTF-16.txt
Character Set                   : UTF-16LE Unicode
======== tmp/UTF-8.txt
Character Set                   : ASCII, Windows CRLF
Line Count                      : 1
Word Count                      : 2
    1 directories scanned
    4 image files read


Note that I have now added .TXT to the list of supported file extensions, and only calculate the word count for 8-bit character sets.   Your UTF-8 file doesn't have any special characters, so it is in fact plain ASCII.   I'm not planning on supporting EBCDIC.

- Phil
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: Phil Harvey on November 01, 2019, 11:51:12 AM
...still playing with this.  Here is the current iteration:

> exiftool tmp -mimetype -byteordermark -newline -'*count'
======== tmp/DOS-ASCII.txt
MIME Type                       : text/plain; charset=us-ascii
Newline                         : Windows CRLF
Line Count                      : 1
Word Count                      : 2
======== tmp/EBCDIC.txt
MIME Type                       : text/plain; charset=unknown-8bit
Newline                         : Macintosh CR
Line Count                      : 2
Word Count                      : 2
======== tmp/UTF-16.txt
MIME Type                       : text/plain; charset=utf-16le
Byte Order Mark                 : Yes
Newline                         : Windows CRLF
======== tmp/UTF-8.txt
MIME Type                       : text/plain; charset=us-ascii
Newline                         : Windows CRLF
Line Count                      : 1
Word Count                      : 2
    1 directories scanned
    4 image files read


- Phil
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: asjones on November 01, 2019, 01:19:04 PM
Looking good

I will see if I can get more samples
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: Phil Harvey on November 04, 2019, 12:04:58 PM
I've just released ExifTool 11.76, which gives this output for the files you posted:

> exiftool -mime'*' -newlines -'*count' tmp
======== tmp/DOS-ASCII.txt
MIME Type                       : text/plain
MIME Encoding                   : us-ascii
Newlines                        : Windows CRLF
Line Count                      : 1
Word Count                      : 2
======== tmp/EBCDIC.txt
MIME Type                       : text/plain
MIME Encoding                   : unknown-8bit
Newlines                        : Macintosh CR
Line Count                      : 2
Word Count                      : 2
======== tmp/UTF-16.txt
MIME Type                       : text/plain
MIME Encoding                   : utf-16le
Newlines                        : Windows CRLF
======== tmp/UTF-8.txt
MIME Type                       : text/plain
MIME Encoding                   : us-ascii
Newlines                        : Windows CRLF
Line Count                      : 1
Word Count                      : 2
    1 directories scanned


- Phil
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: asjones on November 04, 2019, 04:15:49 PM


I really love this update; it will keep me in one tool more often. However, the first "real" file (not a sample) I tossed at ExifTool was 305 kB, and I knew it had some special UTF-8 characters in it. ExifTool said it was ASCII, so I did some digging and realized that, due to how it was sorted in this case, the UTF-8 specific characters were at around the 200 kB mark. When I edited out the junk at the beginning, it reported UTF-8.

So a few quick questions
1. Any chance of a -TXT_FILE or -LARGE parameter to read the whole file, and/or to read the whole file only for text files? Or could expanding to the whole file be enabled with some other existing option?

2. What encodings are supported: US-ASCII, UTF-8, UTF-16 (I assume no UTF-32)? Line endings of Win/DOS, UNIX, Mac (old). No EBCDIC support... Is anything else detected, or known but not detected?

3. I saw the text tag documentation at https://www.exiftool.org/TagNames/Text.html
However, I did not see the 64 kB limit or other details listed. If easy, it might be nice to document.

thanks for a great tool and new features!





Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: Phil Harvey on November 04, 2019, 06:47:00 PM
I can maybe continue to check for non-ascii characters without a big performance hit if scanning the whole file for word/line count.  I'll look into this.
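
As an illustration of the idea (not how ExifTool does it), a single pass over the file can keep checking for non-ASCII bytes while it counts newlines, so a UTF-8 character far past the sampled prefix is still noticed. In this Python sketch the function name, the chunk size and the "utf-8 or unknown-8bit" fallback are assumptions, and only LF characters are counted.

```python
import codecs
import io


def scan_whole_file(stream, chunk_size=65536):
    """Return (encoding_guess, lf_count) from one pass over the stream (hypothetical helper)."""
    # An incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    saw_non_ascii = False
    valid_utf8 = True
    lf_count = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lf_count += chunk.count(b"\n")
        if not chunk.isascii():
            saw_non_ascii = True
        if valid_utf8:
            try:
                decoder.decode(chunk)
            except UnicodeDecodeError:
                valid_utf8 = False
    if valid_utf8:
        try:
            decoder.decode(b"", final=True)  # flag a truncated final character
        except UnicodeDecodeError:
            valid_utf8 = False
    if not saw_non_ascii:
        return "us-ascii", lf_count
    return ("utf-8" if valid_utf8 else "unknown-8bit"), lf_count


# 200,000 ASCII bytes followed by one accented character, similar to the file described above.
data = b"x" * 200_000 + "\u00e9\n".encode("utf-8")
print(data[:1024].isascii(), scan_whole_file(io.BytesIO(data)))
```

The prefix alone looks like plain ASCII, while the full pass reports UTF-8; the extra cost is one `isascii()` call and one decode per chunk that is being read anyway.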

I had various versions of the documentation that mentioned that the encoding and newlines are determined from the first 1 kB (not 64 kB), but they didn't read very well.  I'll see about trying to add this again.  Also, maybe I should expand this limit to more than 1 kB.

It does check for UTF-32, but this and UTF-16 rely on an initial BOM.  So the encodings detected are:

utf-32le
utf-32be
utf-16le
utf-16be
us-ascii
utf-8
iso-8859-1
unknown-8bit
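
For readers who want to see how such a list can be produced, here is a hedged Python sketch of one possible detection order. Only the BOM requirement for UTF-16/UTF-32 comes from the post above; the rules used to tell utf-8, iso-8859-1 and unknown-8bit apart are assumptions for illustration, and the function name is hypothetical.

```python
import codecs

# Longer BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def guess_mime_encoding(sample: bytes) -> str:
    """Guess an encoding name from a byte sample (illustrative rules only)."""
    for bom, name in _BOMS:
        if sample.startswith(bom):
            return name
    if sample.isascii():
        return "us-ascii"
    try:
        sample.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Assumption: bytes 0x80-0x9F are C1 control codes and unlikely in ISO-8859-1 text.
    if not any(0x80 <= b <= 0x9F for b in sample):
        return "iso-8859-1"
    return "unknown-8bit"


print(guess_mime_encoding(b"\xff\xfeh\x00i\x00"))
```

Note that a sample cut in the middle of a multi-byte UTF-8 character would fail the strict decode here, so a prefix-based version would need to tolerate a truncated final character.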

- Phil
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: asjones on November 04, 2019, 08:15:57 PM
Thanks for continuing to think about this. I always end up with the strange things in various systems and processes... In my "real" test file the first 700+ lines (200K) did not have a UTF-8 character, so it assumed ASCII.

If you are scanning for line endings, hopefully you could also check for characters.

Thanks for your help.

Alan
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: Phil Harvey on November 05, 2019, 08:50:43 AM
Hi Alan,

How about something like this?:

        Although basic text files contain no metadata, the following tags are
        determined from a simple analysis of the text data.  LineCount and WordCount
        are generated only for 8-bit encodings, but the API FastScan option or
        command-line -fast option may be used to limit processing to the first 64 kB,
        in which case these two tags are not produced.

On top of this, ExifTool will issue a minor warning and process only the first 64 kB of any file larger than 20 MB (to avoid long processing delays), unless the -m option is used.
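
The rules above can be read as a small decision table. The Python sketch below spells it out under stated assumptions: kB and MB are taken as powers of 1024, the warning text is invented, and suppressing the line/word counts when a large file is truncated is a guess (the text above only says so for the -fast case). It also ignores the 8-bit-only condition for the counts.

```python
KB = 1024
MB = 1024 * KB
FAST_LIMIT = 64 * KB   # portion scanned with -fast, or for very large files
LARGE_FILE = 20 * MB   # size above which the minor warning applies


def plan_scan(file_size, fast=False, ignore_minor=False):
    """Return (bytes_to_scan, counts_produced, warning) for a text file (hypothetical helper)."""
    if fast:
        # -fast / FastScan: first 64 kB only, no LineCount or WordCount.
        return min(file_size, FAST_LIMIT), False, None
    if file_size > LARGE_FILE and not ignore_minor:
        # Minor warning unless -m is given (warning wording is made up here).
        return FAST_LIMIT, False, "large file: only the first 64 kB was processed"
    return file_size, True, None


print(plan_scan(305 * KB))
print(plan_scan(21 * MB))
print(plan_scan(21 * MB, ignore_minor=True))
```

With this reading, a 305 kB file is always scanned in full unless -fast is given, which addresses the case reported earlier in the thread.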

- Phil
Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: asjones on November 05, 2019, 01:12:45 PM
This sounds great. Part of this sounded like something you are planning on adding to the documentation about the file type with part of the post indented and colored

Wanted to make sure I am on the same page as you.

I love expanding FastScan to the first 64 kB. If it is enabled, the line/word counts are not displayed.

It sounds like ExifTool will process files up to 20 MB and anything larger than 20 MB will be limited to the first 64kB unless -m is used.

I looked up -m and think this is a great addition to that option.


I hope the docs/tag info at https://www.exiftool.org/TagNames/Text.html
Can include all the blue descriptive text you add plus the text "ExifTool will issue a minor warning and process only the first 64 kB of any file larger than 20 MB (to avoid long processing delays), unless the -m option is used".

This will be an exciting change and helpful for us.

thanks

Alan Jones


Title: Re: Really Odd Suggestion - But Could be Helpful for Many
Post by: Phil Harvey on November 05, 2019, 01:47:36 PM
Hi Alan,

Quote from: asjones on November 05, 2019, 01:12:45 PM
This sounds great. Part of this sounded like something you are planning on adding to the documentation about the file type with part of the post indented and colored

Exactly.

Quote: It sounds like ExifTool will process files up to 20 MB and anything larger than 20 MB will be limited to the first 64kB unless -m is used.

Yes.

Quote: I hope the docs/tag info at https://www.exiftool.org/TagNames/Text.html
Can include all the blue descriptive text you add plus the text "ExifTool will issue a minor warning and process only the first 64 kB of any file larger than 20 MB (to avoid long processing delays), unless the -m option is used".

Yes, that will be the home of the blue text.  And I'll add the note about the 20 MB limit.

Quote: This will be an exciting change and helpful for us.

Great.

- Phil