Dealing with 'informational' risk penetration test findings, one at a time...
PDF files carry a small amount of metadata that, in the grand scheme of things, doesn't matter too much, but it could give up more information than you really want it to.
Something like exiftool can extract this metadata easily enough.
Removing that metadata from a PDF file can be as easy as something like this:
#!/usr/bin/env python
import os

from pdfrw import PdfReader, PdfWriter  # pip install pdfrw

destination_directory = 'clean'

# Create the output directory if it does not already exist.
if not os.path.isdir(destination_directory):
    os.mkdir(destination_directory)

for item in os.listdir(os.getcwd()):
    if item.endswith('.pdf'):
        print('+ Stripping metadata from file: {}'.format(item))
        pdf = PdfReader(item)
        # pdf.Info is None when the file has no Info dictionary.
        if pdf.Info:
            # Copy the keys first; deleting while iterating a dict raises RuntimeError.
            for metadata in list(pdf.Info.keys()):
                del pdf.Info[metadata]
        PdfWriter(os.path.join(destination_directory, item), trailer=pdf).write()
Rerunning exiftool against the new .pdf shows that the metadata fields no longer exist.
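If exiftool isn't to hand, a rough second opinion is to scan the raw bytes of the cleaned file for the usual Info dictionary keys. The sketch below is an assumption-laden shortcut, not a parser: the function name find_metadata_hints and the list of keys are mine, it only sees objects that are stored uncompressed, and keys such as /CreationDate can legitimately appear elsewhere in a PDF (annotations, for example). It also flags an XMP packet, a separate metadata stream that deleting the Info entries does not touch. Treat a hit as a prompt to look closer, and an empty result as "nothing obvious", not as proof.

```python
import re

# Common Info dictionary keys. The negative lookahead stops "/Author"
# from matching inside a longer PDF name such as "/AuthorList".
INFO_KEY_PATTERN = re.compile(
    rb"/(Author|Creator|Producer|CreationDate|ModDate)(?![^\s()<>\[\]{}/%])"
)

# Marker that opens an XMP metadata packet.
XMP_MARKER = b"<x:xmpmeta"


def find_metadata_hints(pdf_bytes):
    """Return a list of metadata hints found in the raw bytes of a PDF.

    Hypothetical helper: it does a plain byte scan, so anything stored
    inside a compressed stream will not be seen.
    """
    keys = {m.group(1).decode("ascii") for m in INFO_KEY_PATTERN.finditer(pdf_bytes)}
    hints = sorted(keys)
    if XMP_MARKER in pdf_bytes:
        hints.append("XMP packet")
    return hints
```

To use it, read a cleaned file in binary mode, for example open('clean/report.pdf', 'rb').read() (the filename is a placeholder), pass the bytes to find_metadata_hints, and check that the returned list is empty.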
For bonus points, make sure you're not redacting sensitive content by overlaying removable black rectangles.