Scrub Word Documents
Feb 8th, 2008 by seanh420

I was asked the other day to write a program to scrub some 50,000 Microsoft Word 2003 documents of hidden data. I was surprised at first because I had no idea that Word quietly keeps track of changes in documents and who changed what. Apparently this is dangerous for companies in particular. Imagine writing something in a Word document that says, "Mike is an asshole." Later you realize Mike might see this document, so you delete it. If Mike gets hold of the document and knows what he is doing, he can see that you wrote it and then deleted it, and you would never know. It’s logged, but hidden.
I looked on Google for about an hour but came up with a lot of crap I didn’t like. I thought I had a winner when I landed on Doc Scrubber, but all that did was remove document properties. Then I landed on Kim Komando’s page, which suggested downloading something called the RHD tool from Microsoft (Validation Required). Upon further inspection, I found this tool to be a pain in the ass for scrubbing multiple documents. Whenever I ran the OFFRHD.EXE program that comes with the RHD tool, I kept getting the error, "Protected Document failed. Reason: This document contains protected information. If you are not the author of this document, contact the author to obtain permissions."
I grew impatient with these weak methods and decided to write my own program. The concept is simple: open a document, select all, copy, close the document, create a new document, paste, and save the new document in a different folder. I noticed that by doing this, some documents decreased quite a bit in size. In one test on a document that had been edited by dozens of people, this method shrank it from 3.03 MB to 2.01 MB. The people who asked for this seemed happy with it, so I just used VB6 and Word 2003 to automate it across a lot of documents. Mind you, I had to spend some time learning how to automate Office because I had never done this before. Why the hell does Word put all this bullshit in documents anyway?
My program is called "Word 2003 Document Bullshit removal." I welcome any and all comments or suggestions. This is not meant to be a commercial end-all solution, but rather something for other programmers to look at. Use at your own risk.
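If you want a rough way to see whether a document carries this kind of hidden baggage before and after scrubbing, something like the sketch below might help. It is not part of my program: the function name CountHiddenItems is made up for illustration, and it assumes the VB6 project references the Microsoft Word 11.0 Object Library. It only counts tracked revisions and comments, which is certainly not everything Word can hide in a file.

```vb
' Hypothetical helper (not in the original program).
' Assumes a project reference to the Microsoft Word 11.0 Object Library.
Private Function CountHiddenItems(ByVal WordApp As Word.Application, _
                                  ByVal docPath As String) As Long
    Dim doc As Word.Document
    ' open read-only so the check itself cannot modify the file
    Set doc = WordApp.Documents.Open(FileName:=docPath, ReadOnly:=True)
    ' Revisions holds tracked insertions and deletions,
    ' Comments holds reviewer comments
    CountHiddenItems = doc.Revisions.Count + doc.Comments.Count
    ' close without saving anything back to disk
    doc.Close wdDoNotSaveChanges
End Function
```

Calling it on the same file name in the source folder and in the scrubbed folder (for example c:\temp\1040.doc and c:\temp\scrubbed\1040.doc) gives two numbers to compare. A count of zero does not prove the file is clean, it only means these two collections are empty.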
Example Document before scrubbing

I had to censor the screenshot because too much data was exposed.
Example Document after scrubbing with my code

My Code:
Private Sub cmdRemove_Click()
    Me.MousePointer = 11 ' hourglass
    Dim WordApp As Word.Application
    Dim source As String, dest As String, dirtydocument As String
    Dim i As Integer
    Set WordApp = CreateObject("Word.Application")
    source = txtSource.Text ' path to original documents
    dest = txtDestination.Text ' path to new scrubbed docs
    ' make sure path ends in \ (eg. c:\temp\ )
    If Not (Right(source, 1) = "\") Then
        source = source & "\"
    End If
    If Not (Right(dest, 1) = "\") Then
        dest = dest & "\"
    End If
    ' delete all files in destination folder
    Dim File, Folder, FileCollection
    Dim fso
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set Folder = fso.GetFolder(dest)
    Set FileCollection = Folder.Files
    For Each File In FileCollection
        fso.DeleteFile File.Path
    Next
    ' fill file listbox with source folder
    File1.Path = source
    ' Set the Visible flag
    WordApp.Visible = True
    For i = 1 To File1.ListCount
        dirtydocument = File1.List(i - 1)
        ' open the document to be scrubbed clean
        ' (eg. c:\temp\1040.doc )
        WordApp.Documents.Open source & dirtydocument
        ' copy all the contents
        WordApp.ActiveDocument.Content.Copy
        ' close the document without saving anything back to the original
        WordApp.ActiveDocument.Close wdDoNotSaveChanges
        ' create a new document
        WordApp.Documents.Add
        ' paste the contents
        WordApp.ActiveDocument.Content.Paste
        ' clean up document properties BEFORE saving, otherwise the
        ' blanked values never make it into the saved file
        ' for more, look up WdBuiltInProperty constants
        WordApp.ActiveDocument.BuiltInDocumentProperties(wdPropertyAuthor) = ""
        WordApp.ActiveDocument.BuiltInDocumentProperties(wdPropertyCompany) = ""
        ' save as the new document using the same name in a different folder
        ' (eg. c:\temp\scrubbed\1040.doc) <-- file size may be smaller
        WordApp.ActiveDocument.SaveAs dest & dirtydocument
        ' close the 'scrubbed' version of the document
        WordApp.ActiveDocument.Close
    Next i
    ' * BUG ***********************
    ' * I get some "You placed a large amount of text on the Clipboard." message
    ' * dont know how to get rid of it, so I'm just leaving word open.
    'Call EmptyClipboard
    ' exit word
    'WordApp.Quit
    Me.MousePointer = 0 ' back to the default pointer
    MsgBox "Finished removing bullshit from all Word 2003 documents!"
End Sub
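One possible way around the Clipboard message in the BUG note is to not use the Clipboard at all. The sketch below is a hypothetical variant of the loop body, not what my program does: the names ScrubOne, srcDoc and newDoc are made up, and it assumes the same project reference to the Word object library. It copies the content through the Range.FormattedText property instead of Copy and Paste. I have not checked that it strips exactly the same hidden data as the copy and paste approach, so test it on your own documents first.

```vb
' Hypothetical variant of the loop body (not in the original program).
' Assumes a project reference to the Microsoft Word 11.0 Object Library.
Private Sub ScrubOne(ByVal WordApp As Word.Application, _
                     ByVal srcPath As String, ByVal destPath As String)
    Dim srcDoc As Word.Document, newDoc As Word.Document
    ' open the original read-only so it cannot be changed by accident
    Set srcDoc = WordApp.Documents.Open(FileName:=srcPath, ReadOnly:=True)
    Set newDoc = WordApp.Documents.Add
    ' transfer text and formatting directly between the two ranges,
    ' so nothing is placed on the Clipboard
    newDoc.Content.FormattedText = srcDoc.Content.FormattedText
    srcDoc.Close wdDoNotSaveChanges
    ' blank the properties before saving so they end up in the file
    newDoc.BuiltInDocumentProperties(wdPropertyAuthor) = ""
    newDoc.BuiltInDocumentProperties(wdPropertyCompany) = ""
    newDoc.SaveAs destPath
    newDoc.Close
End Sub
```

Inside the For loop this would be called as ScrubOne WordApp, source & dirtydocument, dest & dirtydocument. With nothing left on the Clipboard, the commented-out WordApp.Quit can probably be put back. The new document may end up with an extra empty paragraph at the end, since it already contains one paragraph mark before the content is assigned.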
Let me know what you think. It’s Friday and I’m going home.
Download the source code and program here
Any way you can get this to work with Excel?
Does this work with Access?
I only wrote this for Word 2003. It’s a pretty stupid program really, and I only did it cuz I was paid to do it. I have no intention of writing something for Excel, Access, or anything else… that is, unless someone pays me to.