2016-07-06

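How to scrape all the raw HTML within a certain XPath from a local file in Python

I want to scrape the raw HTML from a bunch of local HTML files. I got some help with reading the raw file from this post: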

Get all text inside a tag lxml

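But my code currently outputs the entire file rather than a subset. I seem to be missing a line where I select the XPath I want to scrape.

Here is the code I currently have: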

import os
from itertools import chain

from lxml import html
from lxml.etree import tostring


def stringify_children(node):
    # Leading text of the node, then each child's markup followed by its tail.
    # with_tail=False stops tostring() from emitting the tail a second time.
    parts = ([node.text] +
             list(chain(*([tostring(c, encoding='unicode', with_tail=False), c.tail]
                          for c in node.getchildren()))) +
             [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))


for filename in os.listdir('../news/article/'):
    if filename.endswith('.html') and not filename.startswith('._'):
        print(filename)
        with open('../news/article/' + filename, "r") as f:
            page = f.read()
        tree = html.fromstring(page)
        # The whole tree is passed in here, which is why the whole file is printed.
        maincontent = stringify_children(tree)
        print(maincontent)

My end goal is to get just that section as a string and write it to a local file.

Here is a sample file:

<html> 

<head> 
    <title>Title</title> 
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css"> 
</head> 

<body> 
    <div class="container"> 
     <div class="row"> 
      <div class="col-xs-4"> 
       <div class="left-bar"></div> 
      </div> 
      <div class="col-xs-4"> 
       <div class="middle-bar"></div> 
      </div> 
      <div class="col-xs-4"> 
       <div class="right-bar"></div> 
      </div> 
     </div> 
     <div class="row"> 
      <div class="col-xs-3"> 
       <div class="navigation"></div> 
      </div> 
      <div class="col-xs-9"> 
       <div class="main-content"> 
        Hello 
        <br> 
        <br><a href="http://www.stackexchange.com">Click here to visit stack exchange</a> 
        <h1>This is an introduction</h1> 
        <h3>This is the third header</h3> 
        <p>Lorem ipsum dolor sit amet.....</p> 
        <p>Lorem ipsum dolor sit amet.....</p> 
        <p>Lorem ipsum dolor sit amet.....</p> 
        <ul> 
         <li>list text</li> 
         <li>list text</li> 
         <li>list text</li> 
         <li>list text</li> 
        </ul> 
        <div class="row"> 
         <div class="col-xs-4"><img src="#">More content 1</div> 
         <div class="col-xs-4"><img src="#">More content 2</div> 
         <div class="col-xs-4"><img src="#">More content 3</div> 
        </div> 

       </div> 
      </div> 
     </div> 
    </div> 

</body> 

</html> 

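I want to grab everything inside the element with the main-content class. Here is the XPath of that element in the file:

XPath: /html/body/div/div[2]/div[2]/div

The program should output the following: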

    Hello 
        <br> 
        <br><a href="http://www.stackexchange.com">Click here to visit stack exchange</a> 
        <h1>This is an introduction</h1> 
        <h3>This is the third header</h3> 
        <p>Lorem ipsum dolor sit amet.....</p> 
        <p>Lorem ipsum dolor sit amet.....</p> 
        <p>Lorem ipsum dolor sit amet.....</p> 
        <ul> 
         <li>list text</li> 
         <li>list text</li> 
         <li>list text</li> 
         <li>list text</li> 
        </ul> 
        <div class="row"> 
         <div class="col-xs-4"><img src="#">More content 1</div> 
         <div class="col-xs-4"><img src="#">More content 2</div> 
         <div class="col-xs-4"><img src="#">More content 3</div> 
        </div> 
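
Comments

So you don't want the div itself? That will give you broken HTML. Are you sure that is what you want?

Yes, I'm sure, because I will be importing the data into a new HTML document that already has that tag.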

Answers

0

You could try BeautifulSoup. I'm not really proficient with it, but you can do something like this (or something cleaner, if you read up on BeautifulSoup):

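from bs4 import BeautifulSoup

# "html.parser" is the standard-library parser; "lxml" also works if it is installed.
with open("input.html") as f:
    soup = BeautifulSoup(f, "html.parser")
x = soup.find_all(class_="main-content")
# Print every child of the first match on one line, separated by spaces.
for line in x[0].contents:
    print(line, end=" ")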

You will get output like this:

 Hello 
     <br/> 
<br/> <a href="http://www.stackexchange.com">Click here to visit stack exchange</a> 
<h1>This is an introduction</h1> 
<h3>This is the third header</h3> 
<p>Lorem ipsum dolor sit amet.....</p> 
<p>Lorem ipsum dolor sit amet.....</p> 
<p>Lorem ipsum dolor sit amet.....</p> 
<ul> 
<li>list text</li> 
<li>list text</li> 
<li>list text</li> 
<li>list text</li> 
</ul> 
<div class="row"> 
<div class="col-xs-4"><img src="#"/>More content 1</div> 
<div class="col-xs-4"><img src="#"/>More content 2</div> 
<div class="col-xs-4"><img src="#"/>More content 3</div> 
</div> 

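BeautifulSoup will "fix" the HTML syntax, such as the change from <br> to <br/>, and it will keep the whitespace inside the elements. See the documentation at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/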
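
As a possibly cleaner variant of looping over .contents, BeautifulSoup's Tag.decode_contents() returns the inner markup of an element as one string. The following is only a minimal sketch for Python 3; the inline sample string and the variable names are stand-ins introduced here for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of one local HTML file.
sample = """
<html><body>
<div class="main-content">Hello<br><a href="http://www.stackexchange.com">Click here</a>
<p>Lorem ipsum</p></div>
</body></html>
"""

soup = BeautifulSoup(sample, "html.parser")
# find() returns the first match, or None if no element has the class.
div = soup.find("div", class_="main-content")
# decode_contents() serializes the children only, without the enclosing <div>.
inner = div.decode_contents() if div is not None else ""
print(inner)
```

Because the result is a single string, it can be written straight to a file, which seems closer to the stated goal than printing child by child.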

0

Using lxml:

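from lxml import html

# h is assumed to hold the page source as a string.
xm = html.fromstring(h)
div = xm.xpath("//div[@class='main-content']")[0]
# encoding="unicode" makes tostring() return str; each child's tail text is included by default.
print((div.text or "") + "".join(html.tostring(el, encoding="unicode") for el in div.xpath("./*")))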

Or:

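from lxml import html

# h is assumed to hold the page source as a string.
xm = html.fromstring(h)
eles = xm.xpath("//div[@class='main-content']/text() | //div[@class='main-content']/*")
# Text nodes come back as str; elements are serialized without their tail,
# because that tail is already returned as a separate text() node.
print("".join([ele if isinstance(ele, str) else html.tostring(ele, encoding="unicode", with_tail=False) for ele in eles]))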