How do I return text from HTML without tags using Python and BeautifulSoup?

I'm stuck trying to return text from a website. I want to return the ownerId and unitId from the example below. Any help is greatly appreciated.

<script> 
    h1.config.days = "7"; 
    h1.config.hours = "24"; 
    h1.config.color = "blue"; 
    h1.config.ownerId = 7321; 
    h1.config.locationId = 1258; 
    h1.config.unitId = "164"; 
</script>


2017-08-30 Justin Hill

Since this part is not HTML, use a regular expression to extract the data you want – balki
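The regular-expression approach suggested in this comment might look like the sketch below. The pattern and the helper name extract_config are illustrative assumptions and do not come from the comment; the sketch assumes every assignment has the form h1.config.name = value; and sits on a single line.

```python
import re

script_text = '''
    h1.config.days = "7";
    h1.config.hours = "24";
    h1.config.color = "blue";
    h1.config.ownerId = 7321;
    h1.config.locationId = 1258;
    h1.config.unitId = "164";
'''

# Hypothetical pattern: captures the name after "h1.config." and the value,
# leaving any surrounding double quotes out of the captured value.
CONFIG_RE = re.compile(r'h1\.config\.(\w+)\s*=\s*"?([^";]*)"?\s*;')


def extract_config(text):
    """Return a dict mapping each h1.config name to its value as a string."""
    return dict(CONFIG_RE.findall(text))


config = extract_config(script_text)
print(config['ownerId'], config['unitId'])
```

Every value comes back as a string, so ownerId would still need int() if a number is wanted.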

You can use Beautiful Soup like this:

#!/usr/bin/env python 

from bs4 import BeautifulSoup 

html = ''' 
<script> 
    h1.config.days = "7"; 
    h1.config.hours = "24"; 
    h1.config.color = "blue"; 
    h1.config.ownerId = 7321; 
    h1.config.locationId = 1258; 
    h1.config.unitId = "164"; 
</script> 
''' 

soup = BeautifulSoup(html, "html.parser") 
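jsinfo = soup.find("script")

# Build a dict from the h1.config.* assignments, one per line.
# .string returns the raw script body regardless of how get_text() treats <script> tags.
d = {}
for line in jsinfo.string.split('\n'):
    key, sep, value = line.partition('=')
    if not sep:
        continue  # skip lines that contain no assignment
    d[key.strip().replace('h1.config.', '')] = value.strip().rstrip(';')

print('OwnerId: {}'.format(d['ownerId']))
print('UnitId: {}'.format(d['unitId']))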

This produces the following result:

OwnerId: 7321 
UnitId: "164"

This way you can also access any other variable with d['variable'].
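Note that the values in d are still raw JavaScript literals, which is why UnitId prints with its quotes. A minimal sketch of one way to decode them, assuming the literals are plain numbers or double-quoted strings (which also makes them valid JSON); the helper name to_python is hypothetical.

```python
import json

# Values as the parsing loop above leaves them: raw JavaScript literals.
d = {'days': '"7"', 'color': '"blue"', 'ownerId': '7321', 'unitId': '"164"'}


def to_python(raw):
    """Decode a simple JavaScript literal; fall back to the raw text."""
    try:
        return json.loads(raw)
    except ValueError:
        # Not valid JSON (e.g. a single-quoted string): keep it unchanged.
        return raw


clean = {key: to_python(value) for key, value in d.items()}
print(clean['ownerId'], clean['unitId'])
```

After this step ownerId is an int and unitId is the string 164 without quotes.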

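Update

Now that you have to deal with multiple <script> tags, you can iterate over them like this:

jsinfo = soup.find_all("script")

Now jsinfo is of type <class 'bs4.element.ResultSet'>, which you can iterate over like a normal list.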
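Because a ResultSet iterates like a list, one option is to pick the script by its content rather than by its position. A minimal sketch, assuming the wanted script is the first one that mentions hl.config.; the HTML snippet and the analytics.js file name are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up page with one external script and one inline config script.
html = '''
<script src="analytics.js"></script>
<script>
    hl.config.lat = "28.06794";
    hl.config.lon = "-81.754349";
</script>
'''

soup = BeautifulSoup(html, "html.parser")

config_script = None
for script in soup.find_all("script"):
    # .string is None for a script tag with no inline body, hence the "or ''".
    body = script.string or ''
    if 'hl.config.' in body:
        config_script = body
        break

print(config_script is not None)
```

Searching by content keeps working if the page adds or removes other script tags, which a fixed index would not.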

To extract lat and lon, you can simply do:

#!/usr/bin/env python 

from bs4 import BeautifulSoup 
import requests 

url = 'https://www.your_url' 
# the user-agent you specified in the comments 
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'} 

html = requests.get(url, headers=headers).text 
soup = BeautifulSoup(html, "html.parser") 
jsinfo = soup.find_all("script") 

list_of_interest = ['hl.config.lat', 'hl.config.lon'] 

d = {} 
# jsinfo[9] is assumed to be the <script> tag that holds hl.config on this particular page.
for line in jsinfo[9].string.split('\n'):
    if any(word in line for word in list_of_interest):
        k, v = line.strip().replace('hl.config.', '').split(' = ')
        d[k] = v.strip(';')

print('Lat => {}'.format(d['lat']))
print('Lon => {}'.format(d['lon']))

This produces the following result:

Lat => "28.06794" 
Lon => "-81.754349"

By appending more values to list_of_interest, you can access other variables as well, if you like!


2017-08-30 20:56:16 coder

Thanks for the response. How does this work if there are multiple <script> tags? –

@JustinHill, I updated the answer! – coder

Also, I am using urllib.request.Request with headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17" –
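For readers who fetch the page with urllib instead of requests, the setup described in this comment might look roughly like the sketch below. The example.com URL is a placeholder and the UTF-8 decoding in the commented-out part is an assumption about the page.

```python
import urllib.request

# Placeholder URL; substitute the page you are actually fetching.
url = 'https://www.example.com/'
user_agent = ('Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 '
              '(KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17')

# Building the Request object does not touch the network.
request = urllib.request.Request(url, headers={'User-Agent': user_agent})

# To fetch the page (commented out so the sketch runs offline;
# assumes the response body is UTF-8 encoded):
# with urllib.request.urlopen(request) as response:
#     html = response.read().decode('utf-8')
```

The resulting html string can then be passed to BeautifulSoup(html, "html.parser") exactly as in the answer.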
