Dealing with Calibre metadata in PDF files

Started by Jonz, July 27, 2014, 04:03:20 PM


Jonz

I am having a problem dealing with metadata from Calibre (http://calibre-ebook.com/), a library management program.

Calibre keeps its files in a structured directory system, and I have a lot of them. On my system, each branch has three files: the library file (usually a PDF, Word, or Excel file), an XML file (which is called metadata.opf), and a cover.jpg file.

I want to take tags from the opf file and write them back to the pdf file, and do this recursively through the whole directory system. The tags I'd like to write back are, at minimum, the keywords, the title, and the author.

Is this possible using ExifTool? Thanks!


Here is a sample opf file:

<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uuid_id">
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
        <dc:identifier opf:scheme="calibre" id="calibre_id">20</dc:identifier>
        <dc:identifier opf:scheme="uuid" id="uuid_id">5fdcd98f-656e-45a4-ad9e-1938a2fb253b</dc:identifier>
        <dc:title>History of Woodstock, Vermont</dc:title>
        <dc:creator opf:file-as="Henry Swan Dana" opf:role="aut">Henry Swan Dana</dc:creator>
        <dc:contributor opf:file-as="calibre" opf:role="bkp">calibre (0.9.6) [http://calibre-ebook.com]</dc:contributor>
        <dc:date>0101-01-01T00:00:00+00:00</dc:date>
        <dc:language>en</dc:language>
        <dc:subject>Woodstock</dc:subject>
        <dc:subject>Vermont</dc:subject>
        <meta content="{&quot;Henry Swan Dana&quot;: &quot;&quot;}" name="calibre:author_link_map"/>
        <meta content="0" name="calibre:rating"/>
        <meta content="2013-01-28T13:31:29+00:00" name="calibre:timestamp"/>
        <meta content="History of Woodstock, Vermont" name="calibre:title_sort"/>
    </metadata>
    <guide>
        <reference href="cover.jpg" type="cover" title="Cover"/>
    </guide>
</package>

Phil Harvey

This isn't too difficult as long as the files have the same name.  I don't know the directory hierarchy, but assuming the PDFs are rooted in a directory called "pdf", and the OPFs are in "opf", then the command would be something like this:

exiftool -addtagsfromfile opf/%:2d%f.opf -@ opf2xmp.args -ext pdf -r .

The %:2d removes the "./pdf" from the PDF directory path.  The "opf2xmp.args" file would look something like this:

-xmp:Title<PackageMetadataTitle
-xmp:Creator<PackageMetadataCreator
-xmp:CreateDate<PackageMetadataDate
-xmp:Subject<PackageMetadataSubject
...


Note that I used -addtagsfromfile instead of -tagsfromfile in the command so that multiple PackageMetadataSubject tags will be accumulated into xmp:Subject. 
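To illustrate why accumulation matters, here is a minimal Python sketch (not part of ExifTool) that reads a trimmed copy of the sample OPF from the first post. The function name read_opf_fields and the returned dictionary layout are made up for this illustration. It shows that dc:subject occurs more than once, so whatever copies the tags has to collect a list rather than keep only the last value:

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, as declared in the sample metadata.opf
DC = "{http://purl.org/dc/elements/1.1/}"

# Trimmed copy of the sample OPF posted above (illustration only)
SAMPLE_OPF = """<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uuid_id">
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
        <dc:title>History of Woodstock, Vermont</dc:title>
        <dc:creator opf:file-as="Henry Swan Dana" opf:role="aut">Henry Swan Dana</dc:creator>
        <dc:subject>Woodstock</dc:subject>
        <dc:subject>Vermont</dc:subject>
    </metadata>
</package>
"""


def read_opf_fields(opf_text):
    """Hypothetical helper: return title, creators and subjects from OPF text."""
    root = ET.fromstring(opf_text.encode("utf-8"))

    def texts(tag):
        # Collect the text of every matching Dublin Core element, in document order
        return [el.text.strip() for el in root.iter(DC + tag) if el.text]

    titles = texts("title")
    return {
        "title": titles[0] if titles else None,
        "creator": texts("creator"),
        "subject": texts("subject"),  # repeated element, so this is a list
    }


fields = read_opf_fields(SAMPLE_OPF)
```

With the sample above, fields["subject"] holds both "Woodstock" and "Vermont", which is the same effect -addtagsfromfile is meant to give for xmp:Subject.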

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

Jonz

Phil - thank you for your reply. Unfortunately, the pdf and the opf files do have different names. The opf file name stays constant (metadata.opf) but the Acrobat file name varies. If I batch rename the Acrobat files then I have many metadata.pdf files, which doesn't work of course. Probably I could use a batch file to rename the metadata.opf files but I don't know how to do that since it's not straightforward - they are all in different sub-directories and would need to pick up the pdf name.

So I think for now I'm going to try and sort out why Calibre doesn't work for me writing out pdfs with updated metadata. If I can get it to do that consistently then perhaps I can import the pdfs into IMatch and at least I have most of my data.

I use IMatch a lot and am in the process of upgrading to version 5, so in an indirect sort of way I use ExifTool frequently. It's obvious that it's the go-to utility for this work, it's just that I've always wimped out and used one gui or another. Time to bite the bullet? Thanks very much for the help.

Jonz

This is more information about how Calibre works from Mario Westphal, the author of IMatch.

Quote:
I have downloaded the sample, thanks.

I looked at the Hahnemuhle folder as an example. I renamed the OPF file to XMP.
ExifTool can read that file and extract the metadata.

The problem is, the file only contains the proprietary "Package Metadata" namespace, but ExifTool cannot copy/write this data. This is required because IMatch produces a temporary XML output file from the XMP metadata and/or the XMP metadata in the original file, plus data generated by merging IPTC/EXIF/GPS/PDF metadata. For the OPF file this does not work because ExifTool tells me "no writable tags found".

So even if I rename the OPF to XMP and configure IMatch to force/favor XMP data in external sidecar files for PDF files, the data cannot be imported because IMatch never gets to see it. This only Phil can solve.

I wonder why Calibre uses external XMP files instead of embedding the XMP data in the PDF, as is standard.

Also, with the PDF file for Hahnemühle: when IMatch asks Windows to produce a thumbnail for the PDF file, the Adobe Acrobat component installed with the latest Acrobat Reader goes totally nuts. It needs over 600 MB and almost one minute to produce a preview for that file. This is a bad performance hit for IMatch because it can only process the files as fast as it can pull the previews.

I'm quoting it for convenience but the whole thread is here: http://community.photoolsweb.com/index.php?topic=2975.0

Would it be possible to add the capability of reading what I understand to be the proprietary "Package Metadata" namespace that Calibre uses? I'm also attaching to this message the zip file containing the sample library from Calibre (it's the same file I sent to Mario Westphal).

Thanks very much.

Phil Harvey

Quote from: Jonz on July 28, 2014, 10:35:17 AM
Phil - thank you for your reply. Unfortunately, the pdf and the opf files do have different names. The opf file name stays constant (metadata.opf) but the Acrobat file name varies.

OK, so this limits it to one OPF per directory, and the command changes to this:

exiftool -addtagsfromfile opf/%:2dmetadata.opf -@ opf2xmp.args -ext pdf -r .

ExifTool already reads the OPF file but, as Mario points out, into a different namespace.  You deal with this by translating the tag names when copying, as in the example opf2xmp.args file I gave.
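If the layout is instead the one described in the first post, with metadata.opf sitting in the same folder as the PDF, a small script can pair the files up explicitly. The following Python sketch is only an illustration under that assumption: the function names pair_pdfs_with_opf and build_command are invented here, the command list reuses only the options already shown in this thread, and it has not been run against a real Calibre library. It builds the commands without executing them.

```python
import os


def pair_pdfs_with_opf(root_dir, opf_name="metadata.opf"):
    """Hypothetical helper: list (pdf_path, opf_path) pairs found under root_dir.

    Assumes each folder holds at most one metadata.opf next to its PDF(s).
    """
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        if opf_name not in filenames:
            continue  # no OPF here, nothing to copy from
        opf_path = os.path.join(dirpath, opf_name)
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                pairs.append((os.path.join(dirpath, name), opf_path))
    return pairs


def build_command(pdf_path, opf_path, args_file="opf2xmp.args"):
    """Hypothetical helper: build (but do not run) one exiftool call per PDF.

    Uses only options that appear in this thread: -addtagsfromfile and -@.
    """
    return ["exiftool", "-addtagsfromfile", opf_path, "-@", args_file, pdf_path]
```

Each returned list could be passed to subprocess.run once the output has been checked on a copy of a few files; by default ExifTool keeps a backup of each file it rewrites with "_original" appended to the name.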

- Phil