BeautifulSoup `find_all` generator

Is there any way to turn find_all into a more memory-efficient generator?

Given:

soup = BeautifulSoup(content, "html.parser") 
return soup.find_all('item')

I would like to use instead:

soup = BeautifulSoup(content, "html.parser") 
while True: 
    yield soup.next_item_generator()

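(assume the final StopIteration exception is handled properly)

There are some generators built in, but none that yields the next result of a find. find returns only the first item. With thousands of items, find_all uses a lot of memory. For 5792 items, I am seeing just over 1GB of RAM.

I am well aware that there are more efficient parsers, such as lxml, that can accomplish this. Let's assume there are other business constraints preventing me from using anything else.

How can I turn find_all into a generator so that I can iterate over the results in a more memory-efficient way?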


2016-12-29 Jamie Counsell

There is no "find" generator in BeautifulSoup, as far as I know, but we can combine SoupStrainer with the .children generator.

Let's imagine we have this sample HTML:

<div> 
    <item>Item 1</item> 
    <item>Item 2</item> 
    <item>Item 3</item> 
    <item>Item 4</item> 
    <item>Item 5</item> 
</div>

from which we need to get the text of all item nodes.

We can use SoupStrainer to parse only the item tags, then iterate over the .children generator and get the texts:

from bs4 import BeautifulSoup, SoupStrainer 

data = """ 
<div> 
    <item>Item 1</item> 
    <item>Item 2</item> 
    <item>Item 3</item> 
    <item>Item 4</item> 
    <item>Item 5</item> 
</div>""" 

parse_only = SoupStrainer('item') 
soup = BeautifulSoup(data, "html.parser", parse_only=parse_only) 
for item in soup.children: 
    print(item.get_text())

Prints:

Item 1 
Item 2 
Item 3 
Item 4 
Item 5

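In other words, the idea is to cut the tree down to the desired tags and use one of the available generators, such as .children. You can also use one of these generators directly and filter the tags manually by name or other criteria inside the generator body, for example: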

def generate_items(soup): 
    for tag in soup.descendants: 
        if tag.name == "item": 
            yield tag.get_text()

.descendants yields child elements recursively, while .children only considers the direct children of a node.
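To make the difference concrete, here is a minimal sketch. The nested <b> tag is an assumption added to the sample HTML so that the two generators return different results; `isinstance(node, Tag)` is used to skip text nodes.

```python
from bs4 import BeautifulSoup, Tag

# Sample markup (illustrative): the first <item> contains a nested <b> tag.
html = "<div><item>Item 1 <b>bold</b></item><item>Item 2</item></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .children walks only the direct children of <div>.
child_names = [node.name for node in div.children if isinstance(node, Tag)]

# .descendants walks every nested node in document order.
descendant_names = [node.name for node in div.descendants if isinstance(node, Tag)]

print(child_names)       # ['item', 'item']
print(descendant_names)  # ['item', 'b', 'item']
```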


2016-12-29 02:14:59 alecxe

Beautiful. Great way to look at the problem. –

Very nice solution :) – Dekel

The simplest approach is to use find_next:

soup = BeautifulSoup(content, "html.parser") 

def find_iter(tagname): 
    tag = soup.find(tagname) 
    while tag is not None: 
        yield tag 
        tag = tag.find_next(tagname)
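A possible way to use this, sketched with the sample HTML from the other answer. The `root` parameter is an assumption added here so the function does not depend on a global `soup`. Note that the whole document is still parsed into a tree; the generator only avoids building the full result list that find_all returns.

```python
from itertools import islice

from bs4 import BeautifulSoup

html = "<div>" + "".join(f"<item>Item {i}</item>" for i in range(1, 6)) + "</div>"
soup = BeautifulSoup(html, "html.parser")

def find_iter(root, tagname):
    """Lazily yield each matching tag under `root`, in document order."""
    tag = root.find(tagname)
    while tag is not None:
        yield tag
        tag = tag.find_next(tagname)

# Consume only the first two matches; the remaining ones are never looked up.
first_two = [tag.get_text() for tag in islice(find_iter(soup, "item"), 2)]
print(first_two)  # ['Item 1', 'Item 2']

# find_next can also continue from any tag already in hand.
third = soup.find_all("item", limit=3)[-1]
after_third = third.find_next("item").get_text()
print(after_third)  # Item 4
```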


2016-12-29 02:29:16 ekhumoro

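`find_next()` is an interesting idea! – alecxe

@alecxe. Another nice thing about it is that it allows you to start from any point in the document. – ekhumoro

Nice, this looks like a substitute for the "find" generator I mentioned. Thanks. – alecxe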

From the documentation:

"I gave the generators PEP 8-compliant names, and transformed them into properties:"

childGenerator() -> children 
nextGenerator() -> next_elements 
nextSiblingGenerator() -> next_siblings 
previousGenerator() -> previous_elements 
previousSiblingGenerator() -> previous_siblings 
recursiveChildGenerator() -> descendants 
parentGenerator() -> parents

There is a section named Generators in the documentation that you can read.
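As a rough illustration of two of the renamed properties, the sketch below iterates next_siblings and parents on a small sample document (the markup is an assumption, written without whitespace between tags so that no text nodes sit between the siblings).

```python
from bs4 import BeautifulSoup, Tag

html = "<div><item>Item 1</item><item>Item 2</item><item>Item 3</item></div>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("item")

# next_siblings yields the nodes that follow `first` under the same parent.
sibling_texts = [node.get_text() for node in first.next_siblings if isinstance(node, Tag)]

# parents walks upward to the root; the BeautifulSoup object is named "[document]".
parent_names = [parent.name for parent in first.parents]

print(sibling_texts)  # ['Item 2', 'Item 3']
print(parent_names)   # ['div', '[document]']
```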

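SoupStrainer parses only part of the HTML, which can save memory, but it only excludes the irrelevant tags. If your HTML contains a very large number of the tags you want, you will run into the same memory problem.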
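A small sketch of what SoupStrainer does and does not drop. The sample markup is an assumption; the point is that every matching item tag is still kept in the tree, so the saving depends on how much of the document is something else.

```python
from bs4 import BeautifulSoup, SoupStrainer, Tag

html = (
    "<html><head><title>t</title></head><body>"
    "<p>noise</p><item>Item 1</item><p>noise</p><item>Item 2</item>"
    "</body></html>"
)

# Full parse: every tag ends up in the tree.
full = BeautifulSoup(html, "html.parser")
full_tags = [node.name for node in full.descendants if isinstance(node, Tag)]

# Strained parse: only the <item> tags (and their contents) are kept.
strained = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("item"))
strained_tags = [node.name for node in strained.descendants if isinstance(node, Tag)]

print(len(full_tags))  # 8
print(strained_tags)   # ['item', 'item']
```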


2016-12-29 02:29:54
