Embedded MS Word data concerns

Started by bertalanimre, February 02, 2017, 06:37:28 AM


bertalanimre

Hey Forum,

I've just noticed that after cleaning my files with the combination of exiftool, pdftk and qpdf, I still have some very sensitive metadata in the embedded objects. 98% of these objects are MS Word and Excel files, and they contain information such as Filename, DocumentID and InstanceID. Do you know how to remove this information as well? I'm a bit afraid because I remember that when I tried to remove metadata from plain Word documents, it corrupted the file and it could not be read or used again. I hope this is not the same situation.

Phil Harvey

I can't answer this.  ExifTool doesn't write MS Word or Excel files.

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

bertalanimre

I know, Phil. I'm hoping someone else in the community has come across the same issue. :)

bertalanimre

Well, what I can state is that Adobe Reader can optimize the PDF, and doing so also removes the mentioned data without harming the table of contents. I wonder which library it uses and whether there is a Unix alternative for it. Does anyone know?

bertalanimre

Wuhu, found it!

It was Ghostscript that helped me out. Although it is not as good as ExifTool with PDFs, it does remove the embedded metadata without harming the table of contents. After this, I can simply remove all the remaining metadata with exiftool and make the removal permanent with a qpdf linearization.

Phil: If you are interested, I can give you the whole process in a bash script file I'm working on at the moment. It is not a big deal, but it might give you some good ideas. :)
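
For anyone finding this later, here is a minimal sketch of how the three steps above (Ghostscript, then exiftool, then qpdf) might be chained in bash. This is not the actual script mentioned above. The function names `run` and `clean_pdf`, the `DRY_RUN` switch, the temporary file name and the exact flags are all assumptions for illustration, so test it on copies of your files first.

```shell
#!/usr/bin/env bash
# Sketch only: the flags below are assumptions, not necessarily the ones
# used in the original workflow. Requires gs, exiftool and qpdf on PATH.
set -euo pipefail

# run (hypothetical helper): print the command when DRY_RUN=1,
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# clean_pdf (hypothetical helper): INPUT.pdf -> OUTPUT.pdf
clean_pdf() {
    local in="$1" out="$2"
    local tmp="${out%.pdf}.tmp.pdf"   # intermediate file next to the output

    # 1. Rewrite the PDF with Ghostscript's pdfwrite device. Per the post
    #    above, this is the step that dropped the embedded-object metadata.
    run gs -q -o "$tmp" -sDEVICE=pdfwrite "$in"

    # 2. Remove the remaining document-level metadata with exiftool.
    run exiftool -all:all= -overwrite_original "$tmp"

    # 3. Linearize with qpdf so the exiftool edit cannot simply be undone.
    run qpdf --linearize "$tmp" "$out"

    run rm -f "$tmp"
}

# Only act when called with exactly two arguments.
if [ "$#" -eq 2 ]; then
    clean_pdf "$1" "$2"
fi
```

Saved as, say, `clean_pdf.sh` (a made-up name), running `DRY_RUN=1 ./clean_pdf.sh in.pdf out.pdf` prints the commands without touching any file, which is a cheap way to check them before a real run.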

Phil Harvey
