Removing all MetaData from PDF files

Started by craisin, February 21, 2012, 11:29:54 AM


craisin

That's OK, Phil.

For anyone interested in how I succeeded in getting rid of the metadata by using ExifTool and then Acrobat, I wrote the following code:


Option Compare Database
Option Explicit
'
'This VBA code accepts filenames from the user via a pop-up File Dialog
'(multiple files can be selected at one time). It then processes each file,
'removing the metadata and saving the result either over the original file
'or (by default) to a new file with the extension ".nometa.pdf".
'
'Note! At the moment this code is designed to work with PDF files and would
'      require changes to accommodate non-PDF files.
'
'Note!:
'This module must have the following references included
'within the VBA Editor environment
'  Visual Basic for Applications
'  Microsoft Access 11.0 Object Library  (or equivalent/later)
'  Microsoft Office 14.0 Object Library  (or equivalent/later)
'  Adobe Acrobat 5.0 Type Library        (or equivalent/later)
'Include these by clicking on "Tools/References" within the editor

Public Sub RemoveMetaFromFiles()
Dim AcroPDDoc As Acrobat.CAcroPDDoc
Dim bOverwriteOriginal As Boolean
Dim cExt As String
Dim cFile As String
Dim cMsg As String
Dim cOutFile As String
Dim cRoot As String
Dim fs As Object
Dim nEndSize As Long
Dim nFile As Long
Dim nFiles As Long
Dim nStartSize As Long
Dim oDlg As FileDialog

'Change the following line to have value "True" if you DO
'want to overwrite the original file rather than writing to a new
'file with extension ".nometa.pdf"
bOverwriteOriginal = False

Set AcroPDDoc = CreateObject("AcroExch.PDDoc")
Set fs = CreateObject("Scripting.FileSystemObject")

Set oDlg = FileDialog(msoFileDialogFilePicker)
oDlg.AllowMultiSelect = True
oDlg.ButtonName = "Select"
oDlg.Title = "Select files from which MetaData is to be removed:"
oDlg.Filters.Add "PDF Files", "*.pdf", 1
oDlg.Filters.Add "All Files", "*.*", 2
oDlg.InitialView = msoFileDialogViewDetails
oDlg.InitialFileName = GetSetting(Application.Name, "Setup", "InitialFileName", "*.pdf")
If oDlg.Show = -1 Then
  SaveSetting Application.Name, "Setup", "InitialFileName", oDlg.InitialFileName
  cMsg = "MetaData removed:" & vbCrLf & vbCrLf
  nFiles = oDlg.SelectedItems.Count
  For nFile = 1 To nFiles
    'Fetch the filename first so that the status line shows the current file
    cFile = oDlg.SelectedItems(nFile)
    Call SysCmd(acSysCmdSetStatus, "Removing Metadata: " & nFile & "/" & nFiles & ": " & cFile)
    If fs.FileExists(cFile) Then
      'Record the size before stripping (the original may be overwritten)
      nStartSize = fs.GetFile(cFile).Size
      cExt = LCase(Mid(cFile, InStrRev(cFile, ".")))
      'Cut the extension off the end only, whatever its case
      cRoot = Left(cFile, Len(cFile) - Len(cExt))
      If bOverwriteOriginal Then cOutFile = cFile Else cOutFile = cRoot & ".nometa.pdf"

      StripMeta cFile, cExt, bOverwriteOriginal

      Select Case cExt
        Case ".pdf"
          'Open the newly stripped PDF file
          If AcroPDDoc.Open(cOutFile) Then
            'then save it, removing unreferenced objects
            AcroPDDoc.Save PDSaveCollectGarbage + PDSaveFull, cOutFile
            AcroPDDoc.Close
          End If
      End Select

      'Report the size of this file before and after processing
      nEndSize = 0
      If fs.FileExists(cOutFile) Then nEndSize = fs.GetFile(cOutFile).Size
      cMsg = cMsg & Space(4) & Format(nStartSize, "#,##0") & " => " & Format(nEndSize, "#,##0") & ": " & cOutFile & vbCrLf
    End If
  Next
  Call SysCmd(acSysCmdClearStatus)
  Set AcroPDDoc = Nothing
  MsgBox cMsg, vbInformation + vbOKOnly
End If
End Sub
'
'

Private Sub StripMeta(cFile As String, cExt As String, bOverwriteOriginal As Boolean)
   Dim cCmd As String
   Dim cExifTool As String
   Dim cOutFile As String
   Dim cQ As String
   
   'Note!: Comment out/amend the following to point to the exifTool Utility on your system
   cExifTool = "D:\exifTool\exifTool.exe"
   'cExifTool = "c:\Program Files\exifTool\exiftool.exe"
   
   'Double quote, so that paths containing spaces survive the command line
   cQ = Chr(34)
   
   Select Case cExt
     Case ".pdf"
       'The following lines remove the Metadata from the document dictionary only
       If bOverwriteOriginal Then
          cCmd = cQ & cExifTool & cQ & " -all= " & cQ & cFile & cQ
       Else
          'Note: ExifTool will not overwrite an output file that already exists
          cOutFile = Left(cFile, Len(cFile) - Len(cExt)) & ".nometa.pdf"
          cCmd = cQ & cExifTool & cQ & " -all= -o " & cQ & cOutFile & cQ & " " & cQ & cFile & cQ
       End If
       'Run ExifTool and wait for it to finish. Shell returns immediately,
       'so Acrobat could try to open the file before it has been written.
       CreateObject("WScript.Shell").Run cCmd, vbMinimizedFocus, True
   End Select
End Sub


Phil Harvey

Hi Chris,

Great.  Thanks for the code.

Out of interest, is the resulting PDF linearized?  (ExifTool will report "Linearized: Yes" if it is.)

- Phil
...where DIR is the name of a directory/folder containing the images.  On Mac/Linux/PowerShell, use single quotes (') instead of double quotes (") around arguments containing a dollar sign ($).

d4jna

Quote from: craisin on February 23, 2012, 11:10:54 AM
That's OK, Phil.

For anyone interested in how I succeeded in getting rid of the metadata by using ExifTool and then Acrobat, I wrote the following code:



Thank you so much. Just what I needed. Works brilliantly.

craisin

No worries... it was written so long ago I forgot all about it!

(I just LOVE programming, don't you? LOL   :)

metaclean

Hello,
Sorry for posting in this old thread, but I think I have found an easy way to clean metadata permanently.
Just uncompress and compress the PDF with pdftk.


pdftk file.pdf output file.pdf.tmp uncompress
exiftool -all= file.pdf.tmp
pdftk file.pdf.tmp output file.pdf compress


exiftool -PDF-update:all= file.pdf works until you compress the file. Once the file has been compressed, the metadata can't be recovered, even if you uncompress it again.
I checked the file with a hex editor (compressed and uncompressed) and can't find any metadata or XMP stream.
Maybe this is an easy and safe way to clean metadata? Any suggestions?

Greetings
(Sorry for my English, I'm not a native English speaker.)

PH Edit: Struck out the first command, the pdftk uncompress step (it is not necessary, see later post)
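
The hex-editor check described above can be roughly approximated from the command line. The following is only a sketch: the function name leftover_markers and the list of markers are assumptions made for illustration, not part of ExifTool or pdftk. It only sees text that is stored uncompressed, and a rewriting tool may add entries of its own, so a hit is a lead to follow up and an empty result is a hint rather than proof.

```shell
#!/bin/sh
# leftover_markers FILE
# Print each metadata-related marker that still occurs literally in FILE.
# The marker list is illustrative, not exhaustive.
leftover_markers() {
    for marker in '/Author' '/Creator' '/Producer' '/CreationDate' '<?xpacket' '<x:xmpmeta'; do
        # -a: treat binary data as text, -F: literal match, -q: exit status only
        if grep -a -F -q -e "$marker" "$1"; then
            printf '%s\n' "$marker"
        fi
    done
}
```

For example, `leftover_markers file.pdf` prints one line per marker it finds and nothing if none of them occur.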

Phil Harvey

Interesting.  But is the uncompress step really necessary?

- Phil

craisin

Thanks for your comment, Phil.

The advantage of the code I provided is that the original is retained (with the metadata) should it be required for another purpose.
Also, it runs with a single line of code.

Still, every approach is interesting.

Nice to see every successful resolution.  :-)

metaclean

Quote from: Phil Harvey on March 20, 2014, 07:14:54 PM
But is the uncompress step really necessary?

After some tests I can say that the first command is unnecessary; compressing at the end is enough.
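
With the first command dropped, the two remaining steps can be sketched as a small shell function. This is a sketch under assumptions, not a tested recipe: the function name clean_pdf, the EXIFTOOL and PDFTK override variables, and the intermediate file name are made up for illustration. The -all=, -o, output and compress arguments are the ones already used in this thread. Unlike the commands above, it uses -o so that the input file is left untouched.

```shell
#!/bin/sh
# clean_pdf IN OUT
# Strip metadata from IN with ExifTool, then let pdftk rewrite the result to OUT.
# EXIFTOOL and PDFTK may be set to override the program names (assumed variables).
clean_pdf() {
    in=$1
    out=$2
    tmp="$out.stripped.pdf"   # hypothetical name for the intermediate file

    # Step 1: strip the metadata, writing to a new file (-o) so IN is not modified.
    "${EXIFTOOL:-exiftool}" -all= -o "$tmp" "$in" || return 1

    # Step 2: have pdftk rewrite the stripped file with compression.
    "${PDFTK:-pdftk}" "$tmp" output "$out" compress
    status=$?

    # Remove the intermediate file and report how pdftk finished.
    rm -f "$tmp"
    return "$status"
}
```

Usage would look like `clean_pdf report.pdf report.clean.pdf`. Quoting the variables keeps file names with spaces intact.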

IWTA

Hi Phil!
It was an unpleasant surprise for me to find out that after using the "delete" function on the metadata in a PDF file, the metadata was not actually deleted. Not at all. :(
Why doesn't ExifTool remove all metadata from a PDF, instead of just hiding it? After all, the metadata can still be viewed with the simplest text editor. What is the point of this? If we are talking about "deleting" metadata, why does that mean "hiding" it?
I need to remove the metadata from a PDF file completely, but if I understood correctly, this is impossible with ExifTool? Or am I wrong?

Phil Harvey

I don't know what else I can say other than what has already been said in this topic.

- Phil

StarGeek

Quote from: IWTA on August 10, 2019, 04:35:47 AM
I need to remove the metadata from a PDF file completely, but if I understood correctly, this is impossible with ExifTool? Or am I wrong?

As mentioned elsewhere in this thread, it's not possible to remove metadata completely from a PDF with exiftool.  Make sure you read this whole thread and the leading paragraphs on the PDF Tags page, especially note #2.

* Did you read FAQ #3 and use the command listed there?
* Please use the Code button for exiftool code/output.
* Please include your OS, Exiftool version, and type of file you're processing (MP4, JPG, etc).

IWTA

Phil,
Is the reason for the current implementation of the "removal" (hiding) of metadata from PDF files the complexity of implementing a complete deletion? Or was it designed this way from the start? That is what I wanted to clarify.

StarGeek

Previous post by Phil on the subject
Quote from: Phil Harvey on November 18, 2012, 11:43:33 AM
There isn't much chance that I will add a permanent delete feature to ExifTool because the PDF structure is very complicated and doing this would be a lot of work.  Basically, the only reason I was able to add a write feature for PDF at all is because I was able to do an incremental update (which avoids the problem of having to rewrite the entire file).
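
To make the incremental update point concrete, here is a toy shell sketch. It does not produce a valid PDF and does not call ExifTool; the file contents and variable names are made up for illustration. It only shows the append-only idea: the update adds a replacement object at the end of the file, and the bytes of the old object stay where they were.

```shell
#!/bin/sh
# Toy stand-in for a PDF: one object holding an Author entry.
# (Not a valid PDF; it only mimics the append-only layout.)
demo=$(mktemp)
printf '%s\n' '%PDF-1.4' '1 0 obj << /Author (Alice) >> endobj' '%%EOF' > "$demo"
size_before=$(($(wc -c < "$demo")))

# An incremental update appends a replacement object and a new end marker;
# nothing already in the file is rewritten.
printf '%s\n' '1 0 obj << >> endobj' '%%EOF' >> "$demo"
size_after=$(($(wc -c < "$demo")))

# The old value has been superseded, but its bytes are still in the file.
old_hits=$(grep -c 'Alice' "$demo")
echo "lines still containing the old Author value: $old_hits"
echo "size before: $size_before bytes, after: $size_after bytes"

rm -f "$demo"
```

The file only grows, which is why a text editor can still show the "deleted" values until some other tool rewrites the whole file.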

IWTA

Thanks, StarGeek.
I had not seen that message. Now everything is clear to me!

StarGeek

I knew it existed, but it took me far too long to find it.  To the point it became less about quoting and more of a search and destroy mission.  :D