2014-04-06

I want to extract all the text boxes and their coordinates from a PDF file. How can I extract text and text coordinates from a PDF file?

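Many other Stack Overflow posts cover ways of extracting all the text in reading order, but it took me quite a while to figure out the intermediate step of getting each piece of text together with its location.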

So once I worked it out, I thought it would be worth posting here. Given a PDF file, the output should look like this:

489, 41, "Signature"
500, 52, "b"
630, 202, "a_g_i_r"

Answer

Newlines are converted to underscores in the final output. This is the minimal working solution I found.

from pdfminer.pdfparser import PDFParser 
from pdfminer.pdfdocument import PDFDocument 
from pdfminer.pdfpage import PDFPage 
from pdfminer.pdfpage import PDFTextExtractionNotAllowed 
from pdfminer.pdfinterp import PDFResourceManager 
from pdfminer.pdfinterp import PDFPageInterpreter 
from pdfminer.pdfdevice import PDFDevice 
from pdfminer.layout import LAParams 
from pdfminer.converter import PDFPageAggregator 
import pdfminer 

# Open a PDF file. 
fp = open('/Users/me/Downloads/test.pdf', 'rb') 

# Create a PDF parser object associated with the file object. 
parser = PDFParser(fp) 

# Create a PDF document object that stores the document structure.
# If the PDF is password-protected, pass the password as the 2nd argument.
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort. 
if not document.is_extractable: 
    raise PDFTextExtractionNotAllowed 

# Create a PDF resource manager object that stores shared resources. 
rsrcmgr = PDFResourceManager() 

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object. It serves as the device that
# collects the layout objects of each processed page.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object. 
interpreter = PDFPageInterpreter(rsrcmgr, device) 

def parse_obj(lt_objs):

    # loop over the object list
    for obj in lt_objs:

        # if it's a text box, print its lower-left corner (x0, y0) and its text
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_')))

        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document 
for page in PDFPage.create_pages(document): 

    # read the page into a layout object 
    interpreter.process_page(page) 
    layout = device.get_result() 

    # extract text from this object 
    parse_obj(layout._objs)
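
If you would rather collect the results than print them (for sorting or writing to a file, say), the same recursive walk can return a list of rows. The following is only a sketch of that idea, not part of the original solution: the function name `collect_text_boxes` and the stand-in classes `FakeTextBox` and `FakeFigure` are hypothetical, introduced so the example runs without a PDF. The text box and container types are passed in as arguments so the function itself does not depend on pdfminer.

```python
def collect_text_boxes(lt_objs, text_box_type, container_type):
    """Return a list of (x0, y0, text) tuples for every text box found.

    text_box_type and container_type are the classes to match with
    isinstance(); both names are assumptions made for this sketch.
    """
    rows = []
    for obj in lt_objs:
        if isinstance(obj, text_box_type):
            # bbox is (x0, y0, x1, y1); keep the lower-left corner.
            x0, y0 = obj.bbox[0], obj.bbox[1]
            text = obj.get_text().replace("\n", "_")
            rows.append((int(x0), int(y0), text))
        elif isinstance(obj, container_type):
            # Containers hold child objects, so recurse into them.
            rows.extend(
                collect_text_boxes(obj._objs, text_box_type, container_type)
            )
    return rows


# Hypothetical stand-ins that mimic only the attributes used above.
class FakeTextBox:
    def __init__(self, bbox, text):
        self.bbox = bbox
        self._text = text

    def get_text(self):
        return self._text


class FakeFigure:
    def __init__(self, objs):
        self._objs = objs


sample = [
    FakeTextBox((489.2, 41.7, 560.0, 55.0), "Signature"),
    FakeFigure([FakeTextBox((500.0, 52.0, 510.0, 60.0), "b")]),
    FakeTextBox((630.0, 202.0, 640.0, 260.0), "a\ng\ni\nr"),
]

rows = collect_text_boxes(sample, FakeTextBox, FakeFigure)
for x, y, text in rows:
    print("%6d, %6d, %s" % (x, y, text))
```

With pdfminer, the call inside the page loop would presumably be `collect_text_boxes(layout._objs, pdfminer.layout.LTTextBoxHorizontal, pdfminer.layout.LTFigure)`. Keep in mind that PDF coordinates have their origin at the bottom-left of the page, so a larger `y0` means the box sits higher up.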