Cleaning up text after html2txt

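I am using lxml to convert HTML to plain text. I have most of the parsing, conversion, and cleanup (tabs, spaces, empty lines) working the way I want, and I already have a program up and running.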

However, when I ran my code on about a hundred HTML files (all from different websites), I noticed some stray output, such as these lines:

#wrapper #PrimaryNav {margin:0;*overflow:hidden;} 
a.scbbtnred{background-position:right -44px;} 
a.scbbtnblack{background-position:right -176px;} 
.ghsearch{width:58px;height:21px;line-height:21px;background-position:0 -80px;} 
a.scbbtnred span span{background-color:#f00;background-position:0 -22px;}

I think this is CSS? Or some other web programming thing. I am completely unfamiliar with these.

Question: what are these? And any suggestions on how to get rid of these lines?

Edit: here is my code for this problem, for anyone who lands on this post in the future (I am new to Python, so a lot here could be improved, but it works for me):

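# Function for html2txt using lxml (Python 3)
# Author: 
# http://groups.google.com/group/cn.bbs.comp.lang.python/browse_thread/thread/781a357e2ce66ce8 
import lxml.etree

def html2text(html):
    # Accept raw markup (str or bytes) or an already-parsed tree
    if isinstance(html, (str, bytes)):
        tree = lxml.etree.fromstring(html, lxml.etree.HTMLParser())
    else:
        tree = html
    # Drop elements whose content is never visible text.
    # with_tail=False keeps the text that follows each removed element;
    # node.getparent().remove(node) would silently discard it.
    lxml.etree.strip_elements(tree, 'script', 'iframe', 'style', with_tail=False)
    # encoding='unicode' returns a str rather than bytes
    return lxml.etree.tostring(tree, encoding='unicode', method='text')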



# Function to clean up the text:
# 1: strip tabs and spaces, and drop empty lines
# 2: remove short lines
def textcleanup(text, min_length=35):
    text_list = []
    for s in text.splitlines():
        # Strip leading and trailing whitespace (spaces and tabs)
        s = s.strip()
        # Keep only lines longer than min_length; this also drops empty lines
        if len(s) > min_length:
            text_list.append(s)
    # Join with '\n'; writing to a text-mode file converts it to the platform line ending
    return '\n'.join(text_list)


2011-10-22 Flake

That is indeed CSS. You are fetching documents like this:

<style> 
#wrapper #PrimaryNav {margin:0;*overflow:hidden;} 
a.scbbtnred{background-position:right -44px;} 
a.scbbtnblack{background-position:right -176px;} 
.ghsearch{width:58px;height:21px;line-height:21px;background-position:0 -80px;} 
a.scbbtnred span span{background-color:#f00;background-position:0 -22px;} 
</style> 
<div> 
    <p>This bit is HTML</p> 
</div>

You need to remove all style tags before extracting the text.
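
To illustrate the idea independently of lxml, here is a minimal sketch that uses only the standard library's html.parser instead. The names SKIP_TAGS, VisibleTextExtractor, and visible_text are made up for this example, and the skipped tag set simply mirrors the one in the question's code. It assumes reasonably well-formed markup; an unclosed iframe, for instance, would swallow the text after it.

```python
from html.parser import HTMLParser

# Tags whose contents are never visible text (assumed set, mirroring the question).
SKIP_TAGS = {"script", "style", "iframe"}


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text, ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not nested inside a skipped element
        if self.skip_depth == 0:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = """<style>
#wrapper #PrimaryNav {margin:0;*overflow:hidden;}
a.scbbtnred{background-position:right -44px;}
</style>
<div>
    <p>This bit is HTML</p>
</div>"""

text = visible_text(sample)
print(text.strip())  # This bit is HTML
```

The CSS rules never reach the output because they are dropped while the parser is still inside the style element, so no after-the-fact line filtering is needed for them.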


2011-10-22 22:34:01 Eric

Hi Eric, this is exactly what I was looking for. Thanks! – Flake
