Wrapping multiple tags with BeautifulSoup

I am writing a Python script that converts an HTML document into a reveal.js slideshow. To do this, I need to wrap multiple tags inside a <section> tag.

Wrapping a single tag inside another tag is easy with the wrap() method. However, I can't figure out how to wrap several tags at once.
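For reference, the single-tag case mentioned above might look like the following minimal sketch. The one-line HTML snippet and the choice of the built-in 'html.parser' are assumptions made for illustration; they are not part of the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, parsed with the built-in parser.
soup = BeautifulSoup("<p>Some text...</p>", "html.parser")

# wrap() puts the tag inside the new wrapper and returns the wrapper.
wrapped = soup.p.wrap(soup.new_tag("section"))

result = str(soup)
print(result)
```

The wrapper replaces the original tag in the tree, so the change is visible on the soup object itself.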

To clarify with an example, here is the original HTML:

html_doc = """ 
<html> 

<head> 
    <title>The Dormouse's story</title> 
</head> 

<body> 

    <h1 id="first-paragraph">First paragraph</h1> 
    <p>Some text...</p> 
    <p>Another text...</p> 
    <div> 
    <a href="http://link.com">Here's a link</a> 
    </div> 

    <h1 id="second-paragraph">Second paragraph</h1> 
    <p>Some text...</p> 
    <p>Another text...</p> 

    <script src="lib/.js"></script> 
</body> 

</html> 
""" 



I want to wrap each <h1> and the tags that follow it inside a <section> tag, like this:

<html> 
<head> 
    <title>The Dormouse's story</title> 
</head> 
<body> 

    <section> 
    <h1 id="first-paragraph">First paragraph</h1> 
    <p>Some text...</p> 
    <p>Another text...</p> 
    <div> 
     <a href="http://link.com">Here's a link</a> 
    </div> 
    </section> 

    <section> 
    <h1 id="second-paragraph">Second paragraph</h1> 
    <p>Some text...</p> 
    <p>Another text...</p> 
    </section> 

    <script src="lib/.js"></script> 
</body> 

</html>

Here is how I make the selection:

from bs4 import BeautifulSoup 
import itertools 
soup = BeautifulSoup(html_doc, 'html.parser')
h1s = soup.find_all('h1') 
for el in h1s: 
    els = [i for i in itertools.takewhile(lambda x: x.name not in [el.name, 'script'], el.next_elements)] 
    els.insert(0, el) 
    print(els)

Output:

[<h1 id="first-paragraph">First paragraph</h1>, 'First paragraph', '\n ', <p>Some text...</p>, 'Some text...', '\n ', <p>Another text...</p>, 'Another text...', '\n ', <div><a href="http://link.com">Here's a link</a> </div>, '\n ', <a href="http://link.com">Here's a link</a>, "Here's a link", '\n ', '\n\n '] 

[<h1 id="second-paragraph">Second paragraph</h1>, 'Second paragraph', '\n ', <p>Some text...</p>, 'Some text...', '\n ', <p>Another text...</p>, 'Another text...', '\n\n ']

The selection is correct, but I can't see how to wrap each selection inside a <section> tag.
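One detail worth noting about the selection: next_elements walks the document in parse order and descends into children, which is why nested nodes such as the <a> inside the <div> show up in the lists. The sibling-level iterator next_siblings stays on one level. The sketch below contrasts the two on a small made-up document (the markup and variable names are assumptions for illustration only).

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical document: one heading followed by a nested div and a paragraph.
soup = BeautifulSoup(
    "<body><h1>Title</h1><div><a>link</a></div><p>text</p></body>",
    "html.parser",
)
h1 = soup.h1

# next_elements follows parse order, so it also yields the nested <a>.
via_elements = [n.name for n in h1.next_elements if isinstance(n, Tag)]

# next_siblings only yields nodes on the same level as the <h1>.
via_siblings = [n.name for n in h1.next_siblings if isinstance(n, Tag)]

print(via_elements)
print(via_siblings)
```

For wrapping, the sibling-level view is usually the convenient one, since moving a parent tag carries its children along with it.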

Source

2015-08-28 Ben

Can you edit your post and show the expected output? – styvane

Please post the expected output. –

I added the explicit output. – Ben

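I finally figured out how to use the wrap method in this case. What I needed to understand is that every change to the soup object is made in place.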

from bs4 import BeautifulSoup
import itertools

# html_doc is the HTML string from the question
soup = BeautifulSoup(html_doc, 'html.parser')

# wrap each h1 and its following siblings in a section
h1s = soup.find_all('h1')
for el in h1s:
    # build the list before moving anything: append() relocates nodes in place
    els = [i for i in itertools.takewhile(
        lambda x: x.name not in [el.name, 'script'],
        el.next_siblings)]
    section = soup.new_tag('section')
    el.wrap(section)
    for tag in els:
        section.append(tag)

print(soup.prettify())

This gives me the output I wanted. Hope this helps.
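The same approach could be packaged as a small reusable helper. This is only a sketch: the function name wrap_in_sections, its heading and stop_at parameters, and the compact test document are hypothetical and do not come from the answer above.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def wrap_in_sections(soup, heading="h1", stop_at=("script",)):
    """Wrap each heading and the siblings that follow it in a new <section>.

    Hypothetical helper: collection stops at the next heading of the same
    name or at any tag named in stop_at.
    """
    boundaries = {heading, *stop_at}
    for el in soup.find_all(heading):
        # Collect the siblings into a list before moving anything,
        # because append() relocates nodes in place.
        els = list(takewhile(lambda x: x.name not in boundaries,
                             el.next_siblings))
        section = soup.new_tag("section")
        el.wrap(section)
        for node in els:
            section.append(node)
    return soup


# Compact stand-in for the document in the question (an assumption).
doc = ("<body><h1>A</h1><p>a1</p><div><a>x</a></div>"
       "<h1>B</h1><p>b1</p><script></script></body>")
soup = wrap_in_sections(BeautifulSoup(doc, "html.parser"))
result = str(soup)
print(result)
```

Passing a different heading name, for example "h2", would group content by second-level headings instead.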

Source

2015-08-29 19:43:13 Ben

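Thanks. I want to point out a few things I learned that may not be obvious. 1) Attaching a tag somewhere else (for example with append) removes it from its previous position. 2) Because of (1), and because .next_siblings is a generator rather than a list, it has to be converted to a list before the loop that calls section.append(tag) iterates over it. Your more complicated 'els = [...]' does that. I didn't need the filtering, so I tried 'els = el.next_siblings'. That failed, because moving the first sibling breaks the sibling chain. 'els = list(el.next_siblings)' works. – wojtow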
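The two points in this comment can be sketched as follows. The tiny document is made up for illustration; the exact behavior of iterating the lazy generator while moving nodes may differ between Beautiful Soup versions, so the sketch only shows the list-based variant that the comment reports as working.

```python
from bs4 import BeautifulSoup

# Hypothetical document: a heading followed by two paragraphs.
soup = BeautifulSoup(
    "<div><h1>Title</h1><p>one</p><p>two</p></div>", "html.parser"
)
h1 = soup.h1

# Snapshot the siblings as a list BEFORE changing the tree.
siblings = list(h1.next_siblings)

section = soup.new_tag("section")
h1.wrap(section)

for node in siblings:
    # append() moves the node: it leaves the <div> and joins the <section>.
    section.append(node)

result = str(soup)
moved_parents = [node.parent.name for node in siblings]
print(result)
```

Each paragraph ends up with the <section> as its parent, which shows that append relocates a node rather than copying it.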
