Parsing an HTML file with BeautifulSoup

I have this HTML file:

<html> 
    <head></head> 
    <body> 
     Text1 
     Text2 
     <a href="XYCL7Q.html"> 
      Text3 
     </a> 
    </body> 
</html>

I want to collect Text1, Text2 and Text3 separately. Text3 is no problem, but I cannot capture Text1 and Text2. This is what I tried:

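from urllib.request import urlopen  # Python 3; in Python 2 this was `from urllib import urlopen`
from bs4 import BeautifulSoup

url = 'myUrl'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
soup.body.get_text()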

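This gives me all of the text at once. That is the first problem, because Text3 is included again. The second problem is that the text is not cleanly separated, since Text1 and Text2 may themselves contain spaces. For example, if Text1 is "hello world" and Text2 is "foo bar", I would like to end up with a list of 2 strings: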

results = ['hello world', 'foo bar']

How can I do that? Thanks for your answers...


2014-12-05 accand

The text you want is the first child node of "body". You can pull it out, split it into lines, and strip the surrounding whitespace:

>>> from bs4 import BeautifulSoup as bs 
>>> soup=bs("""<html> 
...  <head></head> 
...  <body> 
...   Text1 
...   Text2 
...   <a href="XYCL7Q.html"> 
...    Text3 
...   </a> 
...  </body> 
... </html>""") 
... 
>>> body=soup.find('body') 
>>> type(next(body.children)) 
<class 'bs4.element.NavigableString'> 
>>> next(body.children) 
u'\n  Text1 \n  Text2\n  ' 
>>> [stripped for stripped in (item.strip() for item in next(body.children).split('\n')) if stripped] 
[u'Text1', u'Text2']
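The same idea can be written as a small script. This is only a sketch, not the answerer's code: the helper name `direct_text_lines` is hypothetical, the sample HTML uses the "hello world" / "foo bar" values from the question, and `html.parser` is an assumed parser choice. Instead of relying on the text being the first child, it walks every direct child of `body` and keeps only the plain text nodes, so the text inside `<a>` is not picked up a second time.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample document from the question, with the example values filled in.
HTML = """<html>
    <head></head>
    <body>
     hello world
     foo bar
     <a href="XYCL7Q.html">
      Text3
     </a>
    </body>
</html>"""


def direct_text_lines(tag):
    """Return the non-empty, stripped lines of text located directly inside `tag`.

    Hypothetical helper: text inside nested tags (such as <a>) is ignored.
    """
    lines = []
    for child in tag.children:
        # Direct children are either Tag objects or NavigableString text nodes.
        if isinstance(child, NavigableString):
            for line in child.splitlines():
                line = line.strip()
                if line:
                    lines.append(line)
    return lines


soup = BeautifulSoup(HTML, "html.parser")
results = direct_text_lines(soup.body)        # text that belongs to <body> itself
link_text = soup.body.a.get_text(strip=True)  # text inside the <a> tag

print(results)    # ['hello world', 'foo bar']
print(link_text)  # Text3
```

Splitting on line breaks rather than on whitespace is what keeps "hello world" together as one string. Note that HTML comments are also `NavigableString` instances, so a page containing comments directly inside `body` would need an extra check.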


2014-12-05 18:10:20 tdelaney
