How do I get the difference between two PDF files in Python?

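pyPdf was not very robust in my tests. It crashes on PDFs created by Illustrator/InDesign and other vector drawing programs. It may be fine for simple PDF files from Office applications, though. Another, more reliable option is pdftotext from the xpdf toolkit. – fbuchinger 2009-08-21 09:34:49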

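I don't know your use case, but for regression testing of script-generated PDFs built with ReportLab, I diff PDFs by:

1. Converting each page of the PDF to an image using Ghostscript
2. Comparing each page image against the corresponding page image of a reference PDF, using PIL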
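
Step 1 could be scripted roughly as follows. This is only a sketch, assuming the `gs` executable is on the PATH; the function names, the default resolution and the `page-%03d.png` output pattern are illustrative choices, not part of the original answer.

```python
import subprocess


def build_gs_command(pdf_path, out_pattern, dpi=100):
    """Build a Ghostscript command line that renders each page to a PNG file.

    out_pattern is a hypothetical pattern such as 'page-%03d.png';
    Ghostscript replaces the %d field with the page number.
    """
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE", "-dSAFER", "-q",  # run non-interactively and quietly
        "-sDEVICE=png16m",                        # 24-bit RGB PNG output
        "-r%d" % dpi,                             # rendering resolution in DPI
        "-sOutputFile=%s" % out_pattern,
        pdf_path,
    ]


def render_pages(pdf_path, out_pattern, dpi=100):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_command(pdf_path, out_pattern, dpi), check=True)
```

Calling `render_pages('report.pdf', 'page-%03d.png')` would then be expected to leave one PNG per page in the working directory, ready for the image comparison.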

For example, for the comparison step:

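from PIL import Image, ImageChops  # requires PIL (Pillow)

# imagePath1 / imagePath2 are the rendered page images to compare
im1 = Image.open(imagePath1)
im2 = Image.open(imagePath2)

# Pixel-wise absolute difference; identical pages give an all-black image
imDiff = ImageChops.difference(im1, im2)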

This works well for flagging any changes introduced by code changes.
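
To turn the difference image into a pass/fail check in a test, one option is to look at its bounding box. The sketch below assumes Pillow is installed; `pages_differ` is a hypothetical helper name, and converting to RGB is a choice made here so that both images share a mode and no alpha channel is involved.

```python
from PIL import Image, ImageChops


def pages_differ(im1, im2):
    """Return True if two page images differ in size or in any pixel."""
    if im1.size != im2.size:
        return True
    # Convert both images to RGB so their modes match before differencing.
    diff = ImageChops.difference(im1.convert("RGB"), im2.convert("RGB"))
    # getbbox() returns None when the difference image is entirely black.
    return diff.getbbox() is not None


if __name__ == "__main__":
    # Synthetic stand-ins for two rendered pages.
    page_a = Image.new("RGB", (20, 20), "white")
    page_b = page_a.copy()
    page_b.putpixel((5, 5), (0, 0, 0))  # change a single pixel
    print(pages_differ(page_a, page_a.copy()))
    print(pages_differ(page_a, page_b))
```

An exact comparison like this is strict: any rendering difference counts as a change, including anti-aliasing differences between Ghostscript versions.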

2009-08-21 10:17:24

Is there a reference for step 1? – yucer 2016-09-09 07:12:34

I ran into the same problem in my unit tests with encrypted PDFs; neither pdfminer nor pyPdf worked for me.

Two commands (pdftocairo, pdftotext) worked perfectly in my tests. (Install on Ubuntu: apt-get install poppler-utils)

You can pass the PDF content through one of them:

from subprocess import Popen, PIPE 

def get_formatted_content(pdf_content):
    """Pipe raw PDF bytes through pdftocairo and return its output as bytes."""
    # Replace "pdftocairo -pdf" with "pdftotext" to get text suitable for a diff.
    cmd = 'pdftocairo -pdf - -'
    ps = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    stdout, stderr = ps.communicate(input=pdf_content)
    if ps.returncode != 0:
        raise OSError(ps.returncode, cmd, stderr)
    return stdout

It seems that pdftocairo can redraw the PDF file, while pdftotext can extract all of its text.

Then you can compare the two PDF files:

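# Read the files in binary mode so the raw bytes reach the subprocess unchanged
with open('f1.pdf', 'rb') as f1, open('f2.pdf', 'rb') as f2:
    c1 = get_formatted_content(f1.read())
    c2 = get_formatted_content(f2.read())
print(c1 == c2)  # binary compare; cmp() no longer exists in Python 3
# import difflib
# t1 = c1.decode('utf-8', 'replace').splitlines()
# t2 = c2.decode('utf-8', 'replace').splitlines()
# print('\n'.join(difflib.unified_diff(t1, t2, lineterm='')))  # text compare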
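
For the text comparison, a line-based diff is usually easier to read than a character-based one. The following is a minimal sketch using only the standard library; `text_diff` is a hypothetical helper, and it assumes the byte strings come from pdftotext and can be decoded as UTF-8 (undecodable bytes are replaced).

```python
import difflib


def text_diff(raw1, raw2, name1="f1.pdf", name2="f2.pdf"):
    """Return unified-diff lines for two byte strings of extracted text."""
    lines1 = raw1.decode("utf-8", "replace").splitlines()
    lines2 = raw2.decode("utf-8", "replace").splitlines()
    # lineterm="" because splitlines() has already dropped the newlines.
    return list(difflib.unified_diff(lines1, lines2,
                                     fromfile=name1, tofile=name2,
                                     lineterm=""))


if __name__ == "__main__":
    # Stand-ins for the output of pdftotext on two files.
    old = b"Invoice 42\nTotal: 10 EUR\n"
    new = b"Invoice 42\nTotal: 12 EUR\n"
    for line in text_diff(old, new):
        print(line)
```

An empty list means the extracted text is identical, which makes the helper easy to use in an assertion.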

2014-02-11 03:14:26 gzerone
