2017-04-26

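How to retrieve document content with document structure using python-docx

I need to retrieve tables and the paragraphs before/after them from a docx file, but I can't figure out how to do this with python-docx.

I can get a list of paragraphs with document.paragraphs.

I can get a list of tables with document.tables.

How can I get an ordered list of document elements like this?

[
Paragraph1,
Paragraph2,
Table1,
Paragraph3,
Table3,
Paragraph4,
...
]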

Answer

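python-docx doesn't have API support for this yet; interestingly, neither does the Microsoft Word API.

But you can work around this with the following code. Note that it is somewhat brittle because it relies on python-docx internals that are subject to change, but I expect it to work just fine for the foreseeable future: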

#!/usr/bin/env python 
# encoding: utf-8 

""" 
Testing iter_block_items() 
""" 

from __future__ import (
    absolute_import, division, print_function, unicode_literals 
) 

from docx import Document 
from docx.document import Document as _Document 
from docx.oxml.text.paragraph import CT_P 
from docx.oxml.table import CT_Tbl 
from docx.table import _Cell, Table 
from docx.text.paragraph import Paragraph 


def iter_block_items(parent): 
    """ 
    Generate a reference to each paragraph and table child within *parent*, 
    in document order. Each returned value is an instance of either Table or 
    Paragraph. *parent* would most commonly be a reference to a main 
    Document object, but also works for a _Cell object, which itself can 
    contain paragraphs and tables. 
    """ 
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
        # print(parent_elm.xml)
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


document = Document('test.docx') 
for block in iter_block_items(document): 
    print('found one') 
    print(block.text if isinstance(block, Paragraph) else '<table>') 
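
Since the question asks for each table together with the paragraph before and after it, one way to build on an ordered block sequence is sketched below. The helper name tables_with_neighbors and its is_table parameter are hypothetical, not part of python-docx, and the demo uses plain strings as stand-ins for Paragraph and Table objects so that it runs without a .docx file.

```python
def tables_with_neighbors(blocks, is_table):
    """Return (previous, table, next) triples for every table in *blocks*.

    *previous* and *next* are the nearest non-table blocks on either side
    of the table, or None when there is no such neighbour.
    *is_table* is a caller-supplied predicate (hypothetical helper argument).
    """
    blocks = list(blocks)  # allow generators such as iter_block_items()
    result = []
    for i, block in enumerate(blocks):
        if not is_table(block):
            continue
        # Walk backwards / forwards to the closest block that is not a table
        prev_block = next(
            (b for b in reversed(blocks[:i]) if not is_table(b)), None
        )
        next_block = next(
            (b for b in blocks[i + 1:] if not is_table(b)), None
        )
        result.append((prev_block, block, next_block))
    return result


# Stand-in data: strings instead of real Paragraph/Table objects
demo = ["Paragraph1", "Paragraph2", "Table1", "Paragraph3", "Table2", "Paragraph4"]
triples = tables_with_neighbors(demo, lambda b: b.startswith("Table"))
for prev_block, table, next_block in triples:
    print(prev_block, table, next_block)
```

With a real document, the call would presumably look like tables_with_neighbors(iter_block_items(document), lambda b: isinstance(b, Table)).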

There is some more discussion of this here:
https://github.com/python-openxml/python-docx/issues/276
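
If the reliance on python-docx internals is a concern, a rough alternative is to read word/document.xml from the .docx archive directly with the standard library and classify the body's children by tag name. This is only a sketch under assumptions: iter_body_blocks is a hypothetical name, it looks only at top-level w:p and w:tbl elements, and it returns plain tuples rather than python-docx Paragraph or Table objects.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def iter_body_blocks(docx_file):
    """Yield ("paragraph", text) or ("table", None) for each top-level
    block in the document body, in document order.

    *docx_file* may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find("{%s}body" % W_NS)
    if body is None:
        return
    for child in body:
        if child.tag == "{%s}p" % W_NS:
            # Paragraph text is the concatenation of its w:t elements
            text = "".join(t.text or "" for t in child.iter("{%s}t" % W_NS))
            yield ("paragraph", text)
        elif child.tag == "{%s}tbl" % W_NS:
            yield ("table", None)
        # Other children (for example w:sectPr) are skipped


# Build a tiny in-memory archive to exercise the function
sample_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    "<w:p><w:r><w:t>one</w:t></w:r></w:p>"
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
    "<w:p><w:r><w:t>two</w:t></w:r></w:p>"
    "<w:sectPr/></w:body></w:document>"
) % W_NS
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", sample_xml)
buf.seek(0)
blocks = list(iter_body_blocks(buf))
print(blocks)
```

The trade-off is that table contents and formatting have to be parsed by hand, so this suits cases where only the order of blocks and their plain text are needed.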