Analysis of PDF Modifications

Started by roygbiv, August 02, 2016, 03:21:11 PM


roygbiv

Am almost too embarrassed to admit that I am not quite a Luddite but might as well be in these hallowed chambers.  I came on here in a vain attempt to find the correct commands for exiftool to extract details of all modifications that have been made to a PDF file (text only, no images).  I have next to zero programming skills, so it's a little like learning to drive with one eye, one hand, one leg, in a foreign country, with no fuel.

I should add that I am only trying to understand the history of the file (and what may have been added or removed within the legacy data), rather than edit or modify a pdf file myself.  I require information about changes to specific text within a table rather than the overall document parameters itself.

The only thing that I have previously learnt to do that might be helpful now is . . . . ask.  Is anyone prepared to help?

Hayo Baan

Sure, we are all here to help :)

I am not sure that the modification history will be determinable from every PDF (whether or not that is written depends on the software used, of course), but we can certainly try.

Have you already run exiftool on one of your files? With only the file as parameter it will give you all information it can gather, e.g., try this first: exiftool FILEorDIR. You have to run this on the command line and replace FILEorDIR with the full name of the file (or directory) you want to test.

Let us know if you have more questions.
Hayo Baan – Photography
Web: www.hayobaan.nl

roygbiv

Hello Hayo.  Thanks for the prompt assistance.

Having seen a few YouTube postings, I've been able to rename the .exe file to include some of the commands, and simply drag the PDF file over the .exe to get the analysis.  I've played with various options.  My problem is that most of the online help videos and blogs refer to JPEG files.  I've been unable to find any that tell me the correct commands (?) for the detail of the PDF metadata.

I started by copying one suggestion, exiftool(-a -u -g1 -w txt), and have attempted half a dozen variations with mixed results.  Having read Phil's "manual" I have drawn the conclusion that it is probably a masterpiece, just as a Picasso is to an art lover, but if you're not an art lover you might not get it.  I don't get the jargon (and why should I? It's written for professionals).


Phil Harvey

There are 2 commands I would try.

1. the one you have already tried: exiftool(-a -u -g1 -w txt).exe

2. the verbose output to see the full structure of the PDF: exiftool(-v3 -w _verbose.txt).exe

The second command will produce files with names like "FILE_verbose.txt".  Look through the output file for anything that looks like a revision history.  Honestly I don't know if there could be one, but if there is the verbose output should show it.
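For anyone wondering what the renamed .exe actually does: on Windows, ExifTool treats the text inside the parentheses of its own file name as command-line options, so dragging a file onto the renamed program is equivalent to typing those options by hand. The Python sketch below only illustrates that equivalence; it is not ExifTool's own parsing code, and the helper names and the file name report.pdf are made up for the example.

```python
import re
import shlex


def options_from_exe_name(exe_name: str) -> list[str]:
    """Return the options embedded in parentheses in an executable name.

    Illustration only: this mimics the idea, not ExifTool's real logic.
    """
    match = re.search(r"\((.*)\)", exe_name)
    if match is None:
        return []  # plain "exiftool.exe" carries no embedded options
    return shlex.split(match.group(1))


def equivalent_command(exe_name: str, target: str) -> list[str]:
    """Build the command line that matches dragging `target` onto `exe_name`."""
    return ["exiftool", *options_from_exe_name(exe_name), target]


# "report.pdf" is a hypothetical file name.
print(" ".join(equivalent_command("exiftool(-a -u -g1 -w txt).exe", "report.pdf")))
print(" ".join(equivalent_command("exiftool(-v3 -w _verbose.txt).exe", "report.pdf")))
```

The two printed lines are the commands you would type in a command prompt to get the same result as the drag-and-drop approach.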

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

roygbiv

:D I salute the man himself.  Many thanks for being here . . . there . . . virtually there!

The output of your "verbose" suggestion looks far more promising. 

roygbiv

Having identified that the source file was modified, what instruction will access the original data (i.e. text) that may have been amended?  I understand it might still be stored in an XMP (?) within the document, although I confess that I may have misunderstood.

Phil Harvey

Are you talking about document text?  XMP is for metadata, not document data.  If the PDF was edited with an incremental update then the original data will exist, but ExifTool deals only with metadata and can't be used to access the document data.
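To make the incremental-update point concrete: a PDF saved incrementally has the new data appended after the old file, and each saved revision normally ends with its own %%EOF marker. The Python sketch below (not ExifTool, and only a rough heuristic) locates those markers and cuts the file back to an earlier revision. The function names are made up for this example. Treat the count as a hint rather than proof: linearized ("fast web view") PDFs can contain an extra %%EOF without any editing, and the marker could in principle also appear inside stream data.

```python
def find_revision_ends(pdf_bytes: bytes) -> list[int]:
    """Return the byte offset just past each %%EOF marker in the file."""
    marker = b"%%EOF"
    ends = []
    pos = pdf_bytes.find(marker)
    while pos != -1:
        ends.append(pos + len(marker))
        pos = pdf_bytes.find(marker, pos + len(marker))
    return ends


def extract_revision(pdf_bytes: bytes, revision_index: int) -> bytes:
    """Return the file truncated at the end of the given revision (0 = oldest).

    Assumes every %%EOF marker closes one saved revision, which is a
    heuristic and not guaranteed for every PDF.
    """
    ends = find_revision_ends(pdf_bytes)
    return pdf_bytes[: ends[revision_index]] + b"\n"
```

To try it, read the file with open(path, "rb").read(), check how many offsets find_revision_ends returns, and write the result of extract_revision(data, 0) to a new file. If the original really was updated incrementally, that new file should open in a PDF viewer as the earlier version of the document.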

- Phil

roygbiv

Phil

Thanks, again, for the prompt response.  Your simple question and statement of fact have accelerated the personal learning curve significantly. 

It appears that I've not been joining the dots up correctly.  Somewhere during the recent trawling for information, I have read several references to text being retained within pdf documents after modification (erasure) albeit "hidden".  Somehow, after experimenting with a number of the commands within your manual in a not very scientific way, I thought I had successfully generated a text output file that has extracted all of the final text (I certainly have such a file).  You are now causing me to question how that was actually achieved, and I honestly don't now know.  Prior to your last message, I have been trying to find the routes to the "archived" text, which I had (wrongly) thought might be in an xmp file embedded within the document.

I sincerely apologise for wasting your time.

But thanks, again in any event.  I've almost enjoyed myself! ;)

For what it's worth, I've found your forum very useful and helpful despite not being on an equal footing with the experienced heads around the room.  Having watched Jason Bourne last evening, I suspect you might be funded by the CIA after all . . .

Cheers, Phil and good luck in your endeavours. 

roygbiv

 . . . . or did your point go right over my head?  What I should be looking for is the metadata that indicates what data was modified?

Phil Harvey

Typically the metadata may contain things like the date/time when the document was modified, but not the details about what was modified.

- Phil

P.S. I can neither deny nor confirm allegations of CIA funding