idea: getting tags from html files

Started by kapouer, January 30, 2016, 10:56:19 AM

kapouer

Every html file has a <title> tag, and many online files carry extra tags to ease link inspection.
Meta tags used to be useless (keywords and description have clearly been abused), but
schema.org, opengraph, twitter card and oembed all help represent a web page with
a title and a thumbnail.
See https://github.com/kapouer/url-inspector/blob/master/lib/inspector.js#L383
for a simple example of what could be interesting to return as tags.

Phil Harvey

Have you tried running ExifTool on an html file?  It should return tags from the header section, including Title.

- Phil

pjux

Hello, I'm using exiftool to analyze metadata as the OP described. While exiftool does pull metadata from HTML files, I noticed that it is a bit particular about what it pulls.

For example, on a page with the following tags in the head, exiftool will not pull the metadata that is presented as meta property=


<meta property="og:type" content="article" />
<meta property="og:site_name" content="HHS.gov" />
<meta property="og:url" content="http://www.hhs.gov/about/budget/fy2017/fy2015-summary-of-performance/goal-1/index.html" />
<meta property="og:title" content="FY 2015 Summary of Performance – Goal 1" />
<meta property="og:description" content="FY 2015 Summary of Performance – Goal One: Strengthen Health Care" />
<meta property="og:updated_time" content="2016-02-22 00:00:00" />
<meta name="dcterms.creator" content="Office of Budget (OB), Assistant Secretary for Financial Resources (ASFR)" />
<meta name="dcterms.title" content="FY 2015 Summary of Performance – Goal 1" />
<meta name="dcterms.description" content="FY 2015 Summary of Performance – Goal One: Strengthen Health Care" />
<meta name="dcterms.date" content="2016-02-19 11:45:00" />
<meta name="dcterms.modified" content="2016-02-22 00:00:00" />
<meta name="dcterms.type" content="Text" />
<meta name="dcterms.format" content="text/html" />


It would be awesome if it could.

Exiftool is an amazing program and I have found it really useful.
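
In the meantime, for anyone who needs the meta property= values, here is a minimal sketch in Python using only the standard library html.parser module. It has nothing to do with how exiftool itself reads HTML; the names MetaTagParser and extract_meta_tags are made up for this example. It collects the content of every meta tag keyed by its property= or name= attribute, plus the title.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Hypothetical helper: collects <title> and <meta name=/property=> values."""

    def __init__(self):
        super().__init__()
        self.tags = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here, because the
        # default handle_startendtag() calls handle_starttag().
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Open Graph uses property=, Dublin Core (dcterms.*) uses name=.
            key = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if key and content is not None:
                self.tags[key] = content

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.tags["title"] = "".join(self._title_parts).strip()


def extract_meta_tags(html_text):
    """Return a dict of tag name -> content for one HTML document."""
    parser = MetaTagParser()
    parser.feed(html_text)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    sample = (
        "<html><head><title>Demo page</title>"
        '<meta property="og:type" content="article" />'
        '<meta name="dcterms.format" content="text/html" />'
        "</head><body></body></html>"
    )
    for key, value in extract_meta_tags(sample).items():
        print(f"{key}: {value}")
```

If a page repeats the same key, this sketch keeps only the last value; storing a list per key would be the obvious change if that matters.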