Python BeautifulSoup returns a wrong list of inputs from find_all()

I have Python 2.7.3, and the Beautiful Soup (bs4) version is 4.4.1.

For some reason, this code

from bs4 import BeautifulSoup # parsing 

html = """ 
<html> 
<head id="Head1"><title>Title</title></head> 
<body> 
    <form id="form" action="login.php" method="post"> 
     <input type="text" name="fname"> 
     <input type="text" name="email" > 
     <input type="button" name="Submit" value="submit"> 
    </form> 
</body> 

</html> 
""" 

html_proc = BeautifulSoup(html, 'html.parser') 

for form in html_proc.find_all('form'):
    for input in form.find_all('input'):
        print("input:" + str(input))

returns a wrong list of inputs:

input:<input name="fname" type="text"> 
<input name="email" type="text"> 
<input name="Submit" type="button" value="submit"> 
</input></input></input> 
input:<input name="email" type="text"> 
<input name="Submit" type="button" value="submit"> 
</input></input> 
input:<input name="Submit" type="button" value="submit"> 
</input>

It should return

input: <input name="fname" type="text"> 
input: <input type="text" name="email"> 
input: <input type="button" name="Submit" value="submit">

What is going on?


2017-04-07 Arrow

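To me, this looks like an artifact of the HTML parser. Using 'lxml' as the parser instead of 'html.parser' seems to make it work. The downside is that you (or your users) need to install lxml; the upside is that lxml is a better/faster parser ;-).

As for why 'html.parser' doesn't seem to work properly in this case, I think it has something to do with the fact that input tags are self-closing. If you explicitly close your inputs, it works: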

<input type="text" name="fname" ></input> 
<input type="text" name="email" ></input> 
<input type="button" name="Submit" value="submit" ></input>
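
One way to see why the explicit closing tags matter is to look at the events emitted by the standard-library HTMLParser, which bs4's 'html.parser' builder is built on. The block below is only an illustrative sketch: EventLogger and tag_events are hypothetical helpers, not part of bs4, and they do nothing more than record start and end tag events for an unclosed and an explicitly closed input.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class EventLogger(HTMLParser):
    """Hypothetical helper (not part of bs4): records tag events in order."""

    def __init__(self):
        HTMLParser.__init__(self)  # explicit call also works on Python 2
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Feed markup to a fresh logger and return the recorded events."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


unclosed = tag_events('<input type="text" name="fname">')
closed = tag_events('<input type="text" name="fname"></input>')

print(unclosed)  # only a start event is recorded
print(closed)    # a start event followed by an end event
```

With the unclosed tag the parser reports a start event and nothing else, so whatever builds the tree on top of it has to decide on its own when the input ends.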

I was curious to see whether we could modify the source code to handle this case... A little experiment monkey-patching bs4 shows that we can:

from bs4 import BeautifulSoup 

from bs4.builder import _htmlparser 

# Monkey-patch the Beautiful soup HTML parser to close input tags automatically. 
BeautifulSoupHTMLParser = _htmlparser.BeautifulSoupHTMLParser 
class FixedParser(BeautifulSoupHTMLParser):
    def handle_starttag(self, name, attrs):
        # Old-style class... No super :-(
        result = BeautifulSoupHTMLParser.handle_starttag(self, name, attrs)
        # <input> never gets an end tag of its own, so close it right away.
        if name.lower() == 'input':
            self.handle_endtag(name)
        return result

_htmlparser.BeautifulSoupHTMLParser = FixedParser 


html = """ 
<html> 
<head id="Head1"><title>Title</title></head> 
<body> 
    <form id="form" action="login.php" method="post"> 
     <input type="text" name="fname" > 
     <input type="text" name="email" > 
     <input type="button" name="Submit" value="submit" > 
    </form> 
</body> 

</html> 
""" 

html_proc = BeautifulSoup(html, 'html.parser') 

for form in html_proc.find_all('form'):
    for input in form.find_all('input'):
        print("input:" + str(input))

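Obviously, this isn't a real fix (I wouldn't submit it as a patch to the BS4 folks), but it does demonstrate the problem. Since there is no end tag, the handle_endtag method never gets called. If we call it ourselves, things tend to work out (as long as the html doesn't also have a closing input tag...).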
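
The effect of the missing handle_endtag call, and the caveat about markup that does contain a closing input tag, can be sketched without bs4 at all. This is a toy model under stated assumptions, not how bs4 builds its tree: DepthTracker, input_depths and VOID_TAGS are hypothetical names, and VOID_TAGS is a partial list chosen for illustration. The tracker keeps a stack of open tags and records how deep each input is when it opens, once naively and once with void tags closed automatically.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

# Partial list of HTML void elements, chosen for illustration only.
VOID_TAGS = {"input", "br", "hr", "img", "meta", "link"}


class DepthTracker(HTMLParser):
    """Hypothetical stand-in for a tree builder: keeps a stack of open tags."""

    def __init__(self, auto_close_void):
        HTMLParser.__init__(self)
        self.auto_close_void = auto_close_void
        self.stack = []         # tags that are currently open
        self.input_depths = []  # stack depth at which each <input> opened

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.input_depths.append(len(self.stack))
        if self.auto_close_void and tag in VOID_TAGS:
            return  # opened and closed at once, so never pushed
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.auto_close_void and tag in VOID_TAGS:
            return  # ignore a stray explicit </input>
        if tag in self.stack:
            # Pop everything up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def input_depths(markup, auto_close_void):
    """Return the depth of every <input> found in markup."""
    tracker = DepthTracker(auto_close_void)
    tracker.feed(markup)
    tracker.close()
    return tracker.input_depths


form = ('<form><input name="fname"><input name="email">'
        '<input name="Submit"></input></form>')

naive = input_depths(form, auto_close_void=False)
fixed = input_depths(form, auto_close_void=True)

print(naive)  # [1, 2, 3]: each input ends up inside the previous one
print(fixed)  # [1, 1, 1]: all inputs are direct children of the form
```

Ignoring the explicit end tag for void elements is one possible way to cope with markup that mixes both styles; the monkey patch above does not do that.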

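I'm not really sure whose responsibility this bug is, but I suppose you could start by reporting it to bs4. They might then ask you to report it on the Python bug tracker, I'm not sure...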


2017-04-07 20:33:17 mgilson

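Thanks. This works. It is strange that I had to use </input> to close my inputs, because that isn't standard HTML: https://www.w3schools.com/tags/tag_input.asp. If someone could report this to the appropriate people, it would be much appreciated. – Arrow

@Arrow - I would probably start by reporting the bug at https://bugs.launchpad.net/beautifulsoup/ – mgilson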

Don't use nested loops for this, and use lxml. Change your code to this:

from bs4 import BeautifulSoup

# `html` is the markup string from the question.
inp = []
html_proc = BeautifulSoup(html, 'lxml')

for form in html_proc.find_all('form'):
    inp.extend(form.find_all('input'))

for item in inp:
    print("input:" + str(item))


2017-04-07 20:30:04 RaminNietzsche
