Parsing HTML file contents with BeautifulSoup

I have 30,911 HTML files in a folder. For each file I need to (1) check whether it contains this tag:

<strong>123</strong>

and (2) extract the content that follows it, up to the end of that section.

The problem I found is that in some files the section ends before

<strong>567</strong>

while in others it ends without that tag, and instead ends before

<strong>89</strong> or some other tag (which I don't know, because I can't check 30K+ files).

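The `p p_number` also differs from file to file, and sometimes there is no number at all.

So first I search with BeautifulSoup, but I don't know how to do the next step of extracting the content: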

import re
import bs4

soup = bs4.BeautifulSoup(fo, "lxml")  # fo: an open file object for one HTML file
m = soup.find("strong", string=re.compile("123"))

By the way, is it possible to save the content as a txt file but keep the line layout it has in the HTML?

line 1 
line 2 
... 
line 50

If I use p.get_text(strip=True), everything runs together:

line1 content line2 content ... 
line50 content....
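One possible way to keep one line per row is the `separator` argument of `get_text`. The following is only a minimal sketch: the sample markup is made up for illustration and assumes the lines inside a paragraph are separated by `<br>` tags, which may not match the real files.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one paragraph whose lines are separated by <br> tags.
html = "<p>line 1 content<br>line 2 content<br>line 50 content</p>"
p = BeautifulSoup(html, "html.parser").find("p")

# Without a separator, the text nodes are joined with nothing between them.
joined = p.get_text(strip=True)

# With separator="\n", each text node ends up on its own line.
per_line = p.get_text(separator="\n", strip=True)
print(per_line)
```

The resulting string can be written to a .txt file as-is, since the line breaks are already part of the text.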

Source

2017-05-28 Michael Lin

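If I understand correctly, you can first find the starting point: the p element that contains a strong element with the "Question-and-Answer Session" text. Then you can iterate over the next siblings of that p element until you hit one containing a strong element with the "Copyright policy" text.

A complete reproducible example: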

import re 

from bs4 import BeautifulSoup 


data = """ 
<body> 
    <p class="p p4" id="question-answer-session"> 
     <strong> 
     Question-and-Answer Session 
     </strong> 
    </p> 

    <p class="p p4"> 
     Hi John and Greg, good afternoon. contents.... 
    </p> 

    <p class="p p14"> 
     <strong> 
     Copyright policy: 
     </strong> 
     other content about the policy.... 
    </p> 
</body> 
""" 

soup = BeautifulSoup(data, "html.parser") 

# Match the <p> that wraps the "Question-and-Answer Session" heading.
def find_question_answer(tag):
    return tag.name == 'p' and tag.find("strong", string=re.compile(r"Question-and-Answer Session"))

question_answer = soup.find(find_question_answer)
# Walk the following <p> siblings and stop at the copyright paragraph.
for p in question_answer.find_next_siblings("p"):
    if p.find("strong", string=re.compile(r"Copyright policy")):
        break

    print(p.get_text(strip=True))

Prints:

Hi John and Greg, good afternoon. contents....
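Since the question mentions files that lack the start tag and files whose end marker differs or is missing, the same idea could be wrapped in a function that tolerates both cases. This is only a sketch under assumptions: `extract_section`, its parameters, and the sample markup are hypothetical names for illustration, and the end markers would have to be collected from the real files.

```python
import re
from bs4 import BeautifulSoup


def extract_section(html, start_pattern, end_patterns=()):
    """Return the paragraph texts after the start marker, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find("strong", string=re.compile(start_pattern))
    start_p = start.find_parent("p") if start is not None else None
    if start_p is None:
        return None  # this file has no such section

    end_res = [re.compile(pat) for pat in end_patterns]
    lines = []
    # Matching on the tag name only, so "p p4", "p p14" or no number all work.
    for p in start_p.find_next_siblings("p"):
        if any(p.find("strong", string=rx) for rx in end_res):
            break  # reached one of the known end markers
        lines.append(p.get_text(" ", strip=True))
    return lines  # runs to the last sibling if no end marker was found


sample = """
<body>
<p class="p p4"><strong>Question-and-Answer Session</strong></p>
<p class="p p4">Hi John and Greg, good afternoon.</p>
<p class="p"><strong>Copyright policy:</strong> other content</p>
</body>
"""
result = extract_section(sample, r"Question-and-Answer Session", [r"Copyright policy"])
print(result)
```

To process a whole folder, one option is to loop over `pathlib.Path(folder).glob("*.html")`, read each file, and skip those for which the function returns None.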

Source

2017-05-28 03:23:11 alecxe

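If I write the content to a new HTML file, the formatting gets messed up, though. –

@MichaelLin Okay, which part do you want to write to the file? – alecxe

I think I solved it: I use p.prettify().encode('ascii', 'ignore').decode('utf-8', 'ignore'), but then it only saves the copyright part. –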
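Note that encoding with 'ascii' and 'ignore' silently drops every non-ASCII character. A hedged alternative is to write the extracted text with an explicit UTF-8 encoding instead. The sketch below uses made-up paragraph text and a temporary output path, both purely for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical extracted paragraphs; the second contains a non-ASCII character.
paragraphs = ["Hi John and Greg, good afternoon.", "Thanks, José."]

# What the ascii/ignore round trip does: the non-ASCII character is dropped.
lossy = paragraphs[1].encode("ascii", "ignore").decode("utf-8", "ignore")

# Writing with an explicit UTF-8 encoding keeps it, one paragraph per line.
out_path = Path(tempfile.mkdtemp()) / "section.txt"
out_path.write_text("\n".join(paragraphs), encoding="utf-8")
saved = out_path.read_text(encoding="utf-8")
print(saved)
```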
