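How to scrape all raw HTML inside a certain XPath from a local file in Python
I want to scrape raw HTML from a bunch of local HTML files. I got some help with reading the raw files from this post:
Get all text inside a tag lxml
But my code currently outputs the entire file rather than a subset. I seem to be missing a line where I select the XPath I want to scrape.
Here is the code I currently have: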
import os

from lxml import html
from lxml.etree import tostring


def stringify_children(node):
    # tostring(c) already serializes each child's own text and tail,
    # so only the node's leading text has to be added separately.
    parts = [node.text] + [tostring(c, encoding='unicode') for c in node]
    # filter removes a possible None in node.text
    return ''.join(filter(None, parts))


for filename in os.listdir('../news/article/'):
    if filename.endswith('.html') and not filename.startswith('._'):
        print(filename)
        with open('../news/article/' + filename, 'r') as f:
            page = f.read()
        tree = html.fromstring(page)
        # Still serializes the whole document: no XPath is selected yet
        maincontent = stringify_children(tree)
        print(maincontent)
My end goal is to get only that section as a string, so I can write it out to a local file.
Here is a sample file:
<html>
<head>
  <title>Title</title>
  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/css/bootstrap.min.css">
</head>
<body>
  <div class="container">
    <div class="row">
      <div class="col-xs-4">
        <div class="left-bar"></div>
      </div>
      <div class="col-xs-4">
        <div class="middle-bar"></div>
      </div>
      <div class="col-xs-4">
        <div class="right-bar"></div>
      </div>
    </div>
    <div class="row">
      <div class="col-xs-3">
        <div class="navigation"></div>
      </div>
      <div class="col-xs-9">
        <div class="main-content">
          Hello
          <br>
          <br><a href="http://www.stackexchange.com">Click here to visit stack exchange</a>
          <h1>This is an introduction</h1>
          <h3>This is the third header</h3>
          <p>Lorem ipsum dolor sit amet.....</p>
          <p>Lorem ipsum dolor sit amet.....</p>
          <p>Lorem ipsum dolor sit amet.....</p>
          <ul>
            <li>list text</li>
            <li>list text</li>
            <li>list text</li>
            <li>list text</li>
          </ul>
          <div class="row">
            <div class="col-xs-4"><img src="#">More content 1</div>
            <div class="col-xs-4"><img src="#">More content 2</div>
            <div class="col-xs-4"><img src="#">More content 3</div>
          </div>
        </div>
      </div>
    </div>
  </div>
</body>
</html>
I want to grab everything under the main-content class. Here is the XPath of that element in this file:
XPath: /html/body/div/div[2]/div[2]/div
The program should output the following:
Hello
<br>
<br><a href="http://www.stackexchange.com">Click here to visit stack exchange</a>
<h1>This is an introduction</h1>
<h3>This is the third header</h3>
<p>Lorem ipsum dolor sit amet.....</p>
<p>Lorem ipsum dolor sit amet.....</p>
<p>Lorem ipsum dolor sit amet.....</p>
<ul>
  <li>list text</li>
  <li>list text</li>
  <li>list text</li>
  <li>list text</li>
</ul>
<div class="row">
  <div class="col-xs-4"><img src="#">More content 1</div>
  <div class="col-xs-4"><img src="#">More content 2</div>
  <div class="col-xs-4"><img src="#">More content 3</div>
</div>
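One possible way to get from the whole tree to just that element is sketched below. It is not a confirmed solution, only an illustration under a few assumptions: the helper names inner_html and extract_fragment are made up for this example, the sample string is a shortened stand-in for the file above, and lxml.html.tostring is used so that tags such as <br> are serialized as HTML rather than XML.

```python
from lxml import html


def inner_html(node):
    """Return the markup inside node, without node's own tags."""
    # node.text is the text before the first child; html.tostring(child)
    # includes the text that follows the child (its tail) by default.
    parts = [node.text or ''] + [
        html.tostring(child, encoding='unicode') for child in node
    ]
    return ''.join(parts)


def extract_fragment(page, xpath):
    """Parse page and return the inner HTML of the first XPath match, or None."""
    tree = html.fromstring(page)
    matches = tree.xpath(xpath)
    if not matches:
        return None
    return inner_html(matches[0])


# Hypothetical, shortened version of the sample file (illustration only)
sample = ('<html><body><div class="container"><div class="row"></div>'
          '<div class="row"><div class="col-xs-3"></div>'
          '<div class="col-xs-9"><div class="main-content">Hello<br>'
          '<p>Lorem ipsum</p></div></div></div></div></body></html>')

fragment = extract_fragment(sample, '/html/body/div/div[2]/div[2]/div')
print(fragment)
```

The key step is calling tree.xpath(...) and passing the matched element, not the whole tree, to the serializing helper. Returning None when nothing matches is a design choice for this sketch, so that files with a different layout can be skipped by the caller.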
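So you don't want the div itself? That will give you broken HTML. Are you sure that's what you want?
Yes, I'm sure, because I will be importing the data into a new HTML document that already has that tag created.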
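Since the fragments are meant to be dropped into another document, a batch version could write each one to its own file. The following is only a sketch under assumptions: export_fragments is a hypothetical name, the output directory layout is invented for the demo, and the class-based XPath //div[@class="main-content"] is an alternative to the positional path that would only work if the class attribute is exactly that value.

```python
import os
import tempfile  # used only by the demo at the bottom

from lxml import html


def export_fragments(src_dir, dst_dir, xpath):
    """Write the inner HTML of the first XPath match in each .html file to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for filename in sorted(os.listdir(src_dir)):
        if not filename.endswith('.html') or filename.startswith('._'):
            continue
        tree = html.parse(os.path.join(src_dir, filename))
        matches = tree.xpath(xpath)
        if not matches:
            continue  # no matching element in this file
        node = matches[0]
        # Leading text plus every serialized child (each with its tail text)
        fragment = (node.text or '') + ''.join(
            html.tostring(child, encoding='unicode') for child in node)
        out_path = os.path.join(dst_dir, filename)
        with open(out_path, 'w', encoding='utf-8') as out:
            out.write(fragment)
        written.append(out_path)
    return written


# Demo on a temporary directory with one made-up file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'article')
    dst = os.path.join(tmp, 'fragments')
    os.makedirs(src)
    with open(os.path.join(src, 'a.html'), 'w', encoding='utf-8') as f:
        f.write('<html><body><div class="main-content">Hello<br>'
                '<p>Lorem ipsum</p></div></body></html>')
    paths = export_fragments(src, dst, '//div[@class="main-content"]')
    with open(paths[0], encoding='utf-8') as f:
        result = f.read()

print(result)
```

For the real files, the call would presumably look like export_fragments('../news/article/', some_output_dir, xpath), where some_output_dir is whatever destination is wanted.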