Parsing an unbalanced HTML file with Beautiful Soup 4

I am parsing a partial HTML file whose tags are not balanced.

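Say the first line is missing from this partial HTML file. Can Beautiful Soup still parse the rest of the file so that I can extract the information inside the different tags?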

Thank you very much for your help.

Example Domain</title> <!-- <====missing tag in this line --> 

<meta charset="utf-8" /> 
<meta http-equiv="Content-type" content="text/html; charset=utf-8" /> 
<meta name="viewport" content="width=device-width, initial-scale=1" /> 
<style type="text/css"> 
body { 
    background-color: #f0f0f2; 
    margin: 0; 
    padding: 0; 
    font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif; 

} 
div { 
    width: 600px; 
    margin: 5em auto; 
    padding: 50px; 
    background-color: #fff; 
    border-radius: 1em; 
} 
a:link, a:visited { 
    color: #38488f; 
    text-decoration: none; 
} 
@media (max-width: 700px) { 
    body { 
     background-color: #fff; 
    } 
    div { 
     width: auto; 
     margin: 0 auto; 
     border-radius: 0; 
     padding: 1em; 
    } 
} 
</style>


2017-01-23 DBS

You will need to specify a parser other than the default. You could try 'lxml' or 'html5lib'. I have no experience with either. – Alden

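This is what I get when I try lxml: "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?" When I switch to the html5lib parser I get a similar error: "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?" I tried to pip install both libraries, but that failed. I am using OS X 10.9.5 with Python 3.4.4. Any ideas are appreciated! – DBS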

Did you get an error message from pip? I ran 'pip install html5lib', and the following code works for me: 'from bs4 import BeautifulSoup; soup = BeautifulSoup(" asdf", "html5lib"); print(soup)' – Alden
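
For anyone hitting the same FeatureNotFound error, here is a minimal sketch (not from the thread) of falling back to whichever parser is actually installed. The helper name make_soup and the parser order are assumptions made for illustration; 'html.parser' ships with Python, so it should always be available as a last resort.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # The library backing this parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Shortened stand-in for the broken fragment in the question.
fragment = 'Example Domain</title> <meta charset="utf-8" />'
soup, used = make_soup(fragment)
print(used, soup.find("meta"))
```

Which parser ends up being used depends on what is installed, and the resulting tree may differ slightly between them.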

Use one of the more advanced parsers (html5lib is more robust, but slower). The results will differ:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('foo.html'), 'lxml') 
#<html><body><p>Example Domain <!-- <====missing tag in this line --> 
#<meta charset="utf-8"/> 

soup = BeautifulSoup(open('foo.html'), 'html5lib') 
#<html><head></head><body>Example Domain <!-- <====missing tag in this line --> 
# 
#<meta charset="utf-8"/>
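
To address the second half of the question (pulling information out of the tags that did survive), here is a rough sketch. It assumes a shortened copy of the fragment from the question and uses the built-in 'html.parser' so that nothing extra has to be installed; the variable names are illustrative only, and the other parsers should work the same way for these lookups.

```python
from bs4 import BeautifulSoup

# Shortened copy of the fragment from the question; the opening tags are missing.
fragment = """Example Domain</title> <!-- <====missing tag in this line -->

<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
    background-color: #f0f0f2;
}
</style>
"""

soup = BeautifulSoup(fragment, "html.parser")

# Attributes of every <meta> tag, as plain dicts.
meta_attrs = [dict(tag.attrs) for tag in soup.find_all("meta")]

# Raw CSS text inside the <style> tag.
css = soup.find("style").string

# The stray text that preceded the orphaned </title>.
leading_text = soup.find(string=True).strip()

print(meta_attrs)
print(leading_text)
```

The orphaned closing tag is simply dropped, while the text in front of it and the tags after it remain reachable through the usual find and find_all calls.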


2017-01-23 18:56:11 DyZ
