Parsing annotations from a PDF

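I want a Python function that takes a PDF and returns a list of the text of the note annotations in the document. I have looked at python-poppler (https://code.launchpad.net/~poppler-python/poppler-python/trunk), but I can't figure out how to get it to give me anything useful.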

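I found the get_annot_mapping method and modified the provided demo program to call it via self.current_page.get_annot_mapping(), but I don't know what to do with an AnnotMapping object. It doesn't seem to be fully implemented: it only provides a copy method.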

If any other library provides this functionality, that's fine too.

Source

2009-07-09 davidb

It turns out the bindings were incomplete. This has now been fixed: https://bugs.launchpad.net/poppler-python/+bug/397850

Source

2009-07-12 20:57:11 davidb

I have never used this kind of functionality, nor wanted to, but I found PDFMiner. This link has information on basic usage; maybe it's what you're looking for?

Source

2009-07-10 05:50:55 zeroDivisible

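While that might be useful if I wanted to extract all the text from a PDF, I only want to extract the annotations. I mentioned poppler because it does provide this functionality, and quite easily (http://cgit.freedesktop.org/poppler/poppler/tree/glib/poppler-annot.h). However, I want to use Python. I found the python-poppler bindings project, but it doesn't seem to provide full access to annotations. My question boils down to "Am I doing it wrong, or is the library incomplete?" and "Is there anything else that provides the same functionality?" – davidb 2009-07-10 13:54:08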

Just in case anyone is looking for some working code, here is the script I use.

# Python 2 script; relies on the python-poppler bindings.
import os
import sys
import urllib

import poppler


def main():
    input_filename = sys.argv[1]
    # poppler wants a file:// URI rather than a plain path
    # (see http://blog.hartwork.org/?p=612).
    uri = 'file://%s' % urllib.pathname2url(os.path.abspath(input_filename))
    document = poppler.document_new_from_file(uri, None)
    n_pages = document.get_n_pages()
    all_annots = 0

    for i in range(n_pages):
        page = document.get_page(i)
        for annot_mapping in page.get_annot_mapping():
            annot = annot_mapping.annot
            annot_type = annot.get_annot_type()
            # Skip hyperlinks, which are also stored as annotations.
            if annot_type.value_name != 'POPPLER_ANNOT_LINK':
                all_annots += 1
                print 'page: {0:3}, {1:10}, type: {2:10}, content: {3}'.format(
                    i + 1, annot.get_modified(), annot_type.value_nick,
                    annot.get_contents())

    if all_annots > 0:
        print str(all_annots) + " annotation(s) found"
    else:
        print "no annotations found"


if __name__ == "__main__":
    main()
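
The original question asked for a function that returns a list rather than a script that prints. A minimal sketch of that reshaping follows. It assumes only the python-poppler calls already used in the script above (get_annot_mapping, get_annot_type, get_contents); the function name annotation_texts and the choice to drop annotations with empty contents are illustrative assumptions, not part of the bindings.

```python
def annotation_texts(pages):
    """Return the non-empty contents of every non-link annotation.

    `pages` is any iterable of page objects exposing get_annot_mapping(),
    as the python-poppler pages in the script above do.
    """
    texts = []
    for page in pages:
        for mapping in page.get_annot_mapping():
            annot = mapping.annot
            # Hyperlinks are stored as annotations too; skip them.
            if annot.get_annot_type().value_name == 'POPPLER_ANNOT_LINK':
                continue
            contents = annot.get_contents()
            # Assumption: annotations without text are not interesting.
            if contents:
                texts.append(contents)
    return texts
```

With a document opened as in the script, it could be called as annotation_texts(document.get_page(i) for i in range(document.get_n_pages())).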

Source

2012-09-19 20:40:13

It might be worth putting this in a public git repo somewhere so that others can easily help improve it. – naught101 2017-08-29 03:09:56

Someone asked a similar question. I tried the code sample from there, but it didn't work for me until I made some functional and cosmetic changes.

#!/usr/bin/ruby

require 'pdf-reader'

ARGV.each do |filename|
  PDF::Reader.open(filename) do |reader|
    puts "file: #{filename}"
    puts "page\tcomment"
    reader.pages.each do |page|
      annots_ref = page.attributes[:Annots]
      if annots_ref
        actual_annots = annots_ref.map { |a| reader.objects[a] }
        actual_annots.each do |actual_annot|
          unless actual_annot[:Contents].nil?
            puts "#{page.number}\t#{actual_annot[:Contents]}"
          end
        end
      end
    end
  end
end

If you save it as pdfannot.rb, chmod +x it, and put it in a PATH directory of your choice, the usage is:

./pdfannot.rb <path>

This is my first time writing/editing/remixing Ruby code, so I'm very open to suggestions. HTH.

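On a side note, finding this question earlier would have saved me from duplicating work. I hope it gets more attention in the future so that it's easier to find.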
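
Since the question asked for Python, here is a rough Python counterpart of the Ruby script. It is a sketch under assumptions: it uses the third-party pypdf package (not mentioned anywhere in this thread) and follows the same idea of walking each page's /Annots array and keeping entries that have /Contents. The names contents_of and pdf_comments are made up for illustration.

```python
def contents_of(annotations):
    """Yield the /Contents text of each annotation dictionary that has one."""
    for annot in annotations:
        # pypdf returns indirect references here; plain dicts have no get_object().
        obj = annot.get_object() if hasattr(annot, "get_object") else annot
        text = obj.get("/Contents")
        if text is not None:
            yield str(text)


def pdf_comments(path):
    """Return (page_number, comment) pairs for one PDF file."""
    # Assumption: pypdf is installed; imported lazily so that
    # contents_of() can be used without it.
    from pypdf import PdfReader

    results = []
    for number, page in enumerate(PdfReader(path).pages, start=1):
        if "/Annots" in page:
            for text in contents_of(page["/Annots"]):
                results.append((number, text))
    return results
```

Printing "%d\t%s" % pair for each pair returned by pdf_comments(path) gives roughly the same tab-separated output as the Ruby version.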

Source

2018-01-14 22:25:14 creativecoding
