
I want to write the links from this URL to a file, but the table has two 'td a' tags in each row. I only want the ones with class="pagelink" and href="/search", i.e. I want to separate the 'td a' tags by class using Beautiful Soup.

I tried the code below, hoping to pick up only the tags where "class":"pagelink", but it raises this error:

AttributeError: 'Doctype' object has no attribute 'find_all'

Can anyone help?

import requests 
from bs4 import BeautifulSoup as soup 
import csv 

writer.writerow(['URL', 'Reference', 'Description', 'Address']) 

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=1000&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results" 

response = session.get(url)     #not used until after the iteration begins 
html = soup(response.text, 'lxml') 

for link in html: 
    prop_link = link.find_all("td a", {"class":"pagelink"}) 

    writer.writerow([prop_link]) 

Answers


When you loop over the html variable you iterate over its children, the first of which is a Doctype object that has no find_all method. You need to call find_all or select on the soup object itself to find the nodes you need.

Example:

import requests 
from bs4 import BeautifulSoup as soup 
import csv 

outputfilename = 'Ed_Streets2.csv' 

#inputfilename = 'Edinburgh.txt' 

baseurl = 'https://www.saa.gov.uk' 

outputfile = open(outputfilename, 'wb') 
writer = csv.writer(outputfile) 
writer.writerow(['URL', 'Reference', 'Description', 'Address']) 

session = requests.session() 

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=100&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results" 

response = session.get(url)    
html = soup(response.text, 'lxml') 

# Search the parsed page for anchors whose class attribute is exactly 
# "pagelink button small" (the links the question is after) 
prop_link = html.find_all("a", class_="pagelink button small") 

for link in prop_link: 
    # Turn the relative href into an absolute URL and write it to the CSV 
    prop_url = baseurl+(link["href"]) 
    print prop_url 
    writer.writerow([prop_url, "", "", ""]) 

outputfile.close()
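The answer above mentions select as well as find_all but only demonstrates the latter. Purely as a sketch (assuming a reasonably recent BeautifulSoup with CSS-selector support, and markup as described in the question, where the wanted anchors sit inside table cells and carry the classes pagelink, button and small), the same extraction could be written with a CSS selector; requiring all three classes means the plain class="pagelink" anchor in each row is not matched:

import requests 
from bs4 import BeautifulSoup 

baseurl = 'https://www.saa.gov.uk' 
url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=100&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results" 

html = BeautifulSoup(requests.get(url).text, 'lxml') 

# select() takes a CSS selector; the compound selector needs all three classes, 
# so the other 'td a' link that only has class="pagelink" is skipped 
for link in html.select('td a.pagelink.button.small'): 
    print(baseurl + link['href'])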

I keep getting the same result with this code. It prints each href twice (as there are 2 href tags in each row). Could it be because the second href has class="pagelink button small" and it keeps being picked up because of the word pagelink? –


Thanks for your reply zroq –


I'm sorry, my mistake. I have updated the code. Please note the change to html.find_all("a", class_="pagelink button small"), it now gives the correct output. – Zroq
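To see why that change matters, here is a small self-contained sketch (the markup is made up, but it mirrors the two kinds of link described in the comments above): passing class_="pagelink" matches any tag that has pagelink among its classes, while passing the full string matches only tags whose class attribute is exactly "pagelink button small".

from bs4 import BeautifulSoup 

# Hypothetical table row mirroring the two 'td a' links described above 
row = ('<table><tr>' 
       '<td><a class="pagelink" href="/search/ref">ref</a></td>' 
       '<td><a class="pagelink button small" href="/search/detail">address</a></td>' 
       '</tr></table>') 
html = BeautifulSoup(row, 'html.parser') 

# "pagelink" only has to be one of the tag's classes, so BOTH anchors match 
print(len(html.find_all("a", class_="pagelink")))               # 2 

# the whole class string has to match, so only the second anchor is returned 
print(len(html.find_all("a", class_="pagelink button small")))  # 1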


Try this.
You need to find the links before the loop starts.

import requests 
from bs4 import BeautifulSoup as soup 
import csv 

# Create the CSV writer before writing the header row (filename as in the answer above) 
outputfile = open('Ed_Streets2.csv', 'wb') 
writer = csv.writer(outputfile) 
writer.writerow(['URL', 'Reference', 'Description', 'Address']) 

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=1000&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results" 

response = requests.get(url) 
html = soup(response.text, 'lxml') 

# Find all the matching links once, before the loop starts 
prop_link = html.find_all("a", {"class":"pagelink button small"}) 

for link in prop_link: 
    # find_all only returns Tag objects, so checking for the href attribute is enough 
    if link.has_attr("href"): 
        wr = link["href"] 
        writer.writerow([wr]) 

outputfile.close()

Thanks for the reply. I keep getting the same result with this code: it prints each href twice (since there are 2 href tags in each row). Any other suggestions would be very welcome. –


@OhhranHennessy I have updated the code. It should work now. –
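If duplicates still show up after these changes, one possible workaround, offered only as a sketch and not as part of either answer, is to remember which hrefs have already been written and skip repeats. write_unique_hrefs is a made-up helper name, and html and writer are assumed to be the soup object and csv writer from the answers above:

def write_unique_hrefs(html, writer): 
    # Write each matching href once, even if it appears more than once in the table 
    seen = set() 
    for link in html.find_all("a", {"class": "pagelink button small"}): 
        href = link.get("href") 
        if href and href not in seen: 
            seen.add(href) 
            writer.writerow([href]) 

# usage with the objects defined in the answer above: 
# write_unique_hrefs(html, writer)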