Python: print data from a specific href (with ID tag)

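I'm new to Python and am trying to build one of my first web scrapers. I want to go to a page, open a number of subpages, find a specific link (one with an ID) on each subpage, and then print that link's data. Right now I get the error 'list indices must be integers, not str', which means I've made a mistake in (at least) the last line of the code.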

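What I'm really unsure about is what I need to do to grab and parse the href data from the specific link, since I think the rest (loading the subpages) works. The scraper should grab the URLs of all the Danish municipalities (kommuner) and print them, so the first line printed should be: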

http://www.albertslund.dk

(followed by 97 more). Anyway, here's the code; hopefully someone can tell me what I'm doing wrong. Thanks a bunch in advance.

from BeautifulSoup import BeautifulSoup 
from mechanize import Browser 

f = open("kommuneadresser.txt", "w") 
br = Browser() 
url = "https://bdkv2.borger.dk/foa/Sider/default.aspx?fk=22&foaid=11541520" 
page = br.open(url) 
html = page.read() 
soup = BeautifulSoup(html) 
link = soup.findAll('a') 
kommunelink = link[21:116] 

#we create a loop - for every single kommunelink in the list, 
#'something' will happen 
for kommune in kommunelink: 
    #the link-part is saved as a string 
    kommuneurl = kommune['href'] 
    #we construct a new url from two strings 
    fuldurl = "https://bdkv2.borger.dk/" + kommuneurl 
    #we open the page and save it in a variable 
    kommuneside = br.open(fuldurl) 
    #we read the page 
    html2 = kommuneside.read() 
    #we handle the page in beautifulsoup 
    soup2 = BeautifulSoup(html2) 
    #we find a specific link on the page 
    hjemmesidelink = soup2.findAll('a', attras={'ID':"uscAncHomesite"}) 
    print hjemmesidelink['href']


2012-07-30 kabp

Can you provide an example of the desired output? – 2012-07-30 14:16:15

You might want to fix the indentation. As it stands, it's hard to tell how much of the code is inside the 'for' loop (for example). – mgilson 2012-07-30 14:16:38

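Thanks for the quick feedback. I want to go to https://bdkv2.borger.dk/foa/Sider/default.aspx?fk=22&foaid=11541520, open the 98 subpages (undermyndigheder), and print the URL listed under hjemmeside (http://www.albertslund.dk for the first of the 98 municipalities). – kabp 2012-07-30 14:29:02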

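First of all, BeautifulSoup.findAll() returns a list.

Also, you probably want to run the last findAll on soup2. I don't know which item from hjemmesidelink you will need, so try this for the last 5 lines of code: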

#we handle the page in beautifulsoup 
soup2 = BeautifulSoup(html2) 
#we find a specific link on the page 
hjemmesidelink = soup2.findAll('a', attras={'ID':"uscAncHomesite"}) 
print hjemmesidelink

You would print the first item like this:

print hjemmesidelink[0]
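
As a minimal sketch of why the original error appears, the snippet below uses a plain Python list of dictionaries as a stand-in for the list that findAll() returns. The names `matches` and `first_href` and the sample entry are illustrative assumptions, not real BeautifulSoup objects. Indexing the list with a string raises TypeError; taking an element by integer index first and then looking up 'href' works.

```python
# Stand-in for the result of findAll(): a list of tag-like dicts.
# (Assumption: the real results are BeautifulSoup Tag objects, which also
# support tag['href'] lookups; plain dicts keep this sketch stdlib-only.)
matches = [{"id": "uscAncHomesite", "href": "http://www.albertslund.dk"}]


def first_href(results):
    """Return the href of the first match, or None if the list is empty."""
    if not results:
        return None
    return results[0]["href"]


try:
    matches["href"]  # same mistake as in the question: a list indexed by a str
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

print(first_href(matches))  # href of the first (and only) match
print(first_href([]))       # None, because there is nothing to index
```

Guarding against the empty list matters here, because indexing `[0]` on an empty result would raise IndexError instead.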


2012-07-30 14:19:59 bpgergo

Thanks for your reply! You're right, it should be soup2 rather than soup, but it only prints empty lists. The output looks like this: [] (x98) – kabp 2012-07-30 14:34:00

Any idea why the list is empty? When I do as you suggest, it just returns a bunch of empty lists, whereas I want it to return a URL. – kabp 2012-07-31 12:54:04

Have you tried this?

for link in soup.find_all('a'): 
    print(link.get('href'))


2012-07-30 14:22:51

I tried that, and it prints all the URLs, but I only need it to print the URL with the "uscAncHomesite" ID. – kabp 2012-07-30 14:47:27

Can you show me the output? – 2012-07-30 16:56:23

I get looooooots of output, but I only need the one item with the specific ID. Part of the output: javascript:__doPostBack('ctl00$ctl12$btnToggleAccessibleMode','') #sidetop #AccessabillityShortCutTopics #ctl00_PlaceHolderBDKSearchArea_ctl00_SimpleSearch_keyword #AccessabillityShortCutIndhold #AccessabillityShortCutVaelgkommune #AccessabillityShortCutShortCuts #AccessabillityShortCutBund / None javascript:__doPostBack('ctl00$ctl13$ctl02$LinkButton1','') / – kabp 2012-07-31 09:31:45
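
One thing worth checking in the question's code: the keyword is spelled `attras`, while BeautifulSoup's findAll() takes the filter as `attrs`, so the intended id filter is probably never applied. To show the general idea of keeping only the link with a given id, here is a rough sketch that uses the standard library's html.parser instead of BeautifulSoup. The class `HrefByIdParser`, the helper `find_hrefs_by_id`, and the sample HTML are all hypothetical. It is also only an assumption that the real page might prefix the id (as in `ctl00_..._uscAncHomesite`), guessed from the `ctl00_` prefixes in the output above.

```python
from html.parser import HTMLParser


class HrefByIdParser(HTMLParser):
    """Collect href values of <a> tags whose id matches a target id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        element_id = attributes.get("id") or ""
        # Accept the exact id, or a prefixed one ending in "_<target_id>"
        # (assumption: the server may prepend container names to the id).
        if element_id == self.target_id or element_id.endswith("_" + self.target_id):
            href = attributes.get("href")
            if href is not None:
                self.hrefs.append(href)


def find_hrefs_by_id(html, target_id):
    """Return the hrefs of all matching <a> tags in the given HTML string."""
    parser = HrefByIdParser(target_id)
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Made-up sample markup, loosely modelled on the output quoted above.
SAMPLE_HTML = """
<a href="#sidetop">Top</a>
<a ID="ctl00_Main_uscAncHomesite" href="http://www.albertslund.dk">Hjemmeside</a>
<a href="javascript:__doPostBack('x','')">Toggle</a>
"""

print(find_hrefs_by_id(SAMPLE_HTML, "uscAncHomesite"))
```

In the scraper itself, the string passed to `find_hrefs_by_id` would be the `html2` read from each subpage. The same filtering can presumably be done in BeautifulSoup once the `attrs` keyword is spelled correctly.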
