Extracting highlighted words from a Word document (.docx) in Python

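I am working with some Word documents in which I have highlighted text (words) using color codes (e.g. yellow, blue, gray). Now I want to extract the highlighted words associated with each color. I am programming in Python. Here is what I have done so far: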

I open the Word document with python-docx and then go to the <w:r> tags, which contain the tokens (words) in the document. I use the following code:

#!/usr/bin/env python2.6 
# -*- coding: ascii -*- 
from docx import * 
document = opendocx('test.docx') 
words = document.xpath('//w:r', namespaces=document.nsmap) 
for word in words: 
    print word

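Now I am stuck at the part where I check, for each word, whether it has a <w:highlight> tag, extract the color code from it, and, if it matches yellow, print the text inside the <w:t> tag. I would really appreciate it if someone could give me some pointers on extracting the words from the parsed file.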

Source

2012-03-05 Shreyas Karnik

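I have never worked with python-docx before, but what helped was that I found a snippet online showing what the XML structure of a highlighted piece of text looks like: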

<w:r>
    <w:rPr>
        <w:highlight w:val="yellow"/>
    </w:rPr>
    <w:t>text that is highlighted</w:t>
</w:r>
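As a quick check of that structure, here is a minimal sketch that parses the fragment with the standard library's xml.etree.ElementTree and reads the highlight color and the text. The helper name highlight_of is made up for illustration, and the xmlns:w declaration is added only so the fragment can be parsed on its own (in a real document it is declared on the root element).

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The fragment needs the w: prefix declared to be parseable in isolation.
RUN_XML = (
    '<w:r xmlns:w="%s">'
    '<w:rPr><w:highlight w:val="yellow"/></w:rPr>'
    '<w:t>text that is highlighted</w:t>'
    '</w:r>' % W_NS
)


def highlight_of(run):
    """Return (color, text) for a <w:r> element; color is None if not highlighted."""
    # Namespaced tags and attributes are addressed as "{namespace-uri}name".
    highlight = run.find("{%s}rPr/{%s}highlight" % (W_NS, W_NS))
    color = highlight.get("{%s}val" % W_NS) if highlight is not None else None
    text = "".join(t.text or "" for t in run.findall("{%s}t" % W_NS))
    return color, text


print(highlight_of(ET.fromstring(RUN_XML)))
# → ('yellow', 'text that is highlighted')
```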

From there, it was relatively straightforward to come up with this:

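from docx import *
document = opendocx(r'test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)

# Namespaced tags and attributes are addressed as "{namespace-uri}name"
WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_rPr = WPML_URI + 'rPr'
tag_highlight = WPML_URI + 'highlight'
tag_val = WPML_URI + 'val'
tag_t = WPML_URI + 't'  # <w:t> holds the text of the run

for word in words:
    for rPr in word.findall(tag_rPr):
        highlight = rPr.find(tag_highlight)
        # Skip runs whose <w:rPr> has no <w:highlight> child
        if highlight is not None and highlight.attrib[tag_val] == 'yellow':
            print(word.find(tag_t).text)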
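Since the question asks for the words associated with each color rather than only yellow, the same idea can be extended to group text by highlight value. The following is a sketch under stated assumptions, not the code from this thread: it uses only the standard library (zipfile and xml.etree.ElementTree) instead of python-docx, the names read_document_xml and highlighted_by_color are made up for illustration, and SAMPLE_XML is a hand-written stand-in for a real word/document.xml. It only looks at <w:highlight> set directly on a run, so highlighting that comes from a style would not be picked up.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_document_xml(path):
    """Return the bytes of word/document.xml from a .docx file (a zip archive)."""
    with zipfile.ZipFile(path) as archive:
        return archive.read("word/document.xml")


def highlighted_by_color(document_xml):
    """Map each highlight color to the list of run texts that carry it."""
    root = ET.fromstring(document_xml)
    result = defaultdict(list)
    for run in root.iter(W + "r"):
        highlight = run.find(W + "rPr/" + W + "highlight")
        if highlight is None:
            continue  # this run is not highlighted
        # A run may contain several <w:t> elements; join them.
        text = "".join(t.text or "" for t in run.findall(W + "t"))
        if text:
            result[highlight.get(W + "val")].append(text)
    return dict(result)


# Hand-written sample standing in for a real document body (assumption).
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p>'
    '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>alpha</w:t></w:r>'
    '<w:r><w:t>plain</w:t></w:r>'
    '<w:r><w:rPr><w:highlight w:val="blue"/></w:rPr><w:t>beta</w:t></w:r>'
    '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>gamma</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

print(highlighted_by_color(SAMPLE_XML))
```

For a real file, something like highlighted_by_color(read_document_xml('test.docx')) should give the same kind of dictionary, with 'test.docx' being the file name used in the question.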

Source

2012-03-05 10:53:25 BioGeek

Thanks @BioGeek :) It works great! :) – 2012-03-05 15:09:17

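I made a few minor changes (the missing declaration of tag_t, and handling ASCII as UTF-8 characters). The modified code is available at https://gist.github.com/1982168. Thanks again, @BioGeek! – 2012-03-05 23:53:00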

You're welcome. It was a cool question, and I learned something new too. Greetings from a fellow bioinformatician! – BioGeek 2012-03-06 09:34:42
