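How can I extract the string returned by a function in a <script> tag using Beautiful Soup?

In a given .html page I have a script tag like the one below. How can I use Beautiful Soup to extract the information after "return" in "function getData()"?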

<script>
function getData()
{
    return "zip,city,state,MedianIncome,MedianIncomeRank,CostOfLivingIndex,CostOfLivingRank\n10452,Bronx,NY,20606,2,147.7,74";
}

function getResultsCount()
{
    return "1";
}
</script>

2016-12-06 jerry9855

One way, arguably the simplest, is to use a regular expression both to locate the element and to extract the desired string:

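import re

from bs4 import BeautifulSoup

data = """
<script>
function getData()
{
    return "zip,city,state,MedianIncome,MedianIncomeRank,CostOfLivingIndex,CostOfLivingRank\n10452,Bronx,NY,20606,2,147.7,74";
}

function getResultsCount()
{
    return "1";
}

</script>
"""

soup = BeautifulSoup(data, "html.parser")

# Lazily capture everything between `return "` and the first `";` that ends a line.
# DOTALL lets `.` cross the newline inside the string; MULTILINE lets `$` match at every line end.
pattern = re.compile(r'return "(.*?)";$', re.MULTILINE | re.DOTALL)

# Find the first <script> whose contents match the pattern.
# `string=` is the current argument name; older bs4 releases call it `text=`.
script = soup.find("script", string=pattern)

print(pattern.search(script.text).group(1))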

Prints:

zip,city,state,MedianIncome,MedianIncomeRank,CostOfLivingIndex,CostOfLivingRank 
10452,Bronx,NY,20606,2,147.7,74
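
If the value of one particular function is needed rather than the first `return` in the script, the same regex idea can be keyed on the function name. What follows is only a minimal sketch: `extract_return_string` is a hypothetical helper, not part of Beautiful Soup, and it assumes the function takes no arguments and that its body starts with a single `return "...";` statement. It works on the script text alone, so in practice it would be fed the contents of the `<script>` tag found above.

```python
import re


def extract_return_string(js_source, func_name):
    """Return the quoted string returned by `func_name`, or None if not found.

    Hypothetical helper: assumes a zero-argument function whose body
    begins with `return "...";`.
    """
    pattern = re.compile(
        r'function\s+' + re.escape(func_name) + r'\s*\(\s*\)\s*\{\s*return\s+"(.*?)";',
        re.DOTALL,  # let `.` cross newlines inside the returned string
    )
    match = pattern.search(js_source)
    return match.group(1) if match else None


# Shortened sample data, mirroring the script in the question.
js_source = """
function getData()
{
    return "zip,city,state\n10452,Bronx,NY";
}

function getResultsCount()
{
    return "1";
}
"""

count = extract_return_string(js_source, "getResultsCount")
rows = extract_return_string(js_source, "getData").splitlines()

print(count)  # 1
print(rows)   # ['zip,city,state', '10452,Bronx,NY']
```

A regex like this breaks on strings that contain `";`, so it is a shortcut and not a JavaScript parser. Also note that on a real page the `\n` will most likely arrive as the two characters backslash and n, not as a newline, in which case the rows would have to be split on `"\\n"` instead of with `splitlines()`.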

Alternatively, you can use a JavaScript parser such as slimit (the original answer linked to an example).

2016-12-06 20:38:59 alecxe

When I change the code as follows, I get an error (AttributeError: 'NoneType' object has no attribute 'text'). url = "http://zipwho.com/?zip=91709&city=&filters=--_--_--_&&state=&mode=zip" data = urlopen(url).read() soup = BeautifulSoup(data, "html.parser") – jerry9855

@jerry9855 First, shouldn't the URL be 'http://zipwho.com/?zip=91709&city=&filters=--_--_--_--&state=&mode=zip'? Also, you should switch from 'html.parser' to 'html5lib' (with the 'html5lib' module installed). – alecxe
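
The AttributeError in the comment above means that `soup.find(...)` returned `None`, i.e. no `<script>` matching the pattern was found in the parsed page. A defensive variant could look roughly like the sketch below. It assumes `bs4` is installed; `first_return_string` and the two sample pages are made up for illustration, and the `parser` argument could be set to `"html5lib"` as suggested if that module is available.

```python
import re

from bs4 import BeautifulSoup

RETURN_PATTERN = re.compile(r'return\s+"(.*?)";', re.DOTALL)


def first_return_string(html, parser="html.parser"):
    """Return the first quoted string returned in any <script>, or None."""
    soup = BeautifulSoup(html, parser)
    for script in soup.find_all("script"):
        # script.string is None for empty or external (src=...) scripts
        match = RETURN_PATTERN.search(script.string or "")
        if match:
            return match.group(1)
    return None  # nothing matched; the caller decides how to handle it


# Hypothetical sample pages, not taken from the real site.
page_without_data = "<html><body><p>No scripts here</p></body></html>"
page_with_data = (
    '<script src="x.js"></script>'
    '<script>function getData() { return "a,b"; }</script>'
)

print(first_return_string(page_without_data))  # None
print(first_return_string(page_with_data))     # a,b
```

Returning `None` instead of calling `.text` on a missing tag makes it easy to tell "the page did not contain the data" (for example because of a wrong URL) apart from a bug in the extraction itself.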
