Removing metadata from PDF files

Dealing with 'informational' risk penetration test findings, one at a time...

PDF files carry a small amount of metadata that, in the grand scheme of things, doesn't matter too much, but could give away more information than you really want it to.

Something like exiftool can extract this metadata easily enough.

[Image: PDF metadata example]
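To give a rough idea of what such a tool is reading: the classic metadata fields live in the document information dictionary, which the PDF trailer points at with an /Info reference. The following is a crude, standard-library-only sketch of that lookup, not how exiftool itself works. The function name peek_info and the sample bytes are made up for illustration, and the regular expressions assume an uncompressed Info dictionary whose values are simple literal strings.

```python
import re


def peek_info(pdf_bytes):
    """Return {key: value} for simple literal-string entries in the Info dictionary."""
    # The trailer references the Info dictionary indirectly, e.g. "/Info 4 0 R".
    refs = re.findall(rb'/Info\s+(\d+)\s+(\d+)\s+R', pdf_bytes)
    if not refs:
        return {}
    num, gen = refs[-1]  # after incremental updates, the last trailer is the current one

    # Locate the body of that object: "4 0 obj << ... >> endobj".
    pattern = rb'(?<!\d)' + num + rb'\s+' + gen + rb'\s+obj\s*<<(.*?)>>\s*endobj'
    bodies = re.findall(pattern, pdf_bytes, re.S)
    if not bodies:
        return {}  # e.g. the object is stored inside a compressed object stream

    # Pick out "/Key (value)" pairs; hex strings and nested parentheses are skipped.
    pairs = re.findall(rb'/(\w+)\s*\(([^()]*)\)', bodies[-1])
    return {key.decode('latin-1'): value.decode('latin-1') for key, value in pairs}


# Hand-written fragment standing in for a real file (hypothetical content).
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"4 0 obj\n<< /Author (Alice Example) /Producer (SomeTool 1.0) >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R /Info 4 0 R >>\n%%EOF\n"
)
print(peek_info(sample))  # {'Author': 'Alice Example', 'Producer': 'SomeTool 1.0'}
```

For a real file, pass in the raw bytes, for example the result of open(path, 'rb').read(). A proper parser such as exiftool or pdfrw is still the right tool for anything beyond a quick look.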

Removing that metadata from a PDF file can be as easy as something like this:

#!/usr/bin/env python

import os
from pdfrw import PdfReader, PdfWriter  # pip install pdfrw

destination_directory = 'clean'
directory_content = os.listdir(os.getcwd())

# Create the output directory on the first run.
if destination_directory not in directory_content:
    os.mkdir(destination_directory)

for item in directory_content:
    if item.endswith('.pdf'):
        print('+ Stripping metadata from file: {}'.format(item))
        pdf = PdfReader(item)
        # pdf.Info is None when the file has no Info dictionary at all.
        # Snapshot the keys first so the loop never deletes from what it is iterating.
        for metadata in list(pdf.Info or {}):
            del pdf.Info[metadata]
        PdfWriter(os.path.join(destination_directory, item), trailer=pdf).write()

Rerunning exiftool against the new .pdf shows that the metadata fields no longer exist.

For bonus points, make sure you're not redacting sensitive content by overlaying removable black rectangles.

Posted on: Fri 12 October 2018

Category: security – Tags: security