
I want to write the links from this URL to a file, but the table has two 'td a' tags in each row. I only want the ones with class="pagelink" and href="/search", i.e. I want to separate the 'td a' tags by class using Beautiful Soup.

I tried the code below, hoping to pick up only the tags where "class":"pagelink", but it raises this error:

AttributeError: 'Doctype' object has no attribute 'find_all'

Can anyone help?

import requests 
from bs4 import BeautifulSoup as soup 
import csv 

writer.writerow(['URL', 'Reference', 'Description', 'Address']) 

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=1000&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results" 

response = session.get(url)     #not used until after the iteration begins 
html = soup(response.text, 'lxml') 

for link in html: 
    prop_link = link.find_all("td a", {"class":"pagelink"}) 

    writer.writerow([prop_link]) 

Answers


When you loop over the html variable you iterate over its children, the first of which is a Doctype object that has no find_all method. You need to call find_all or select on the soup object itself to find the nodes you need.

Example:

import requests 
from bs4 import BeautifulSoup as soup 
import csv 

outputfilename = 'Ed_Streets2.csv' 

#inputfilename = 'Edinburgh.txt' 

baseurl = 'https://www.saa.gov.uk' 

outputfile = open(outputfilename, 'wb') 
writer = csv.writer(outputfile) 
writer.writerow(['URL', 'Reference', 'Description', 'Address']) 

session = requests.session() 

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=100&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results" 

response = session.get(url)    
html = soup(response.text, 'lxml') 

# Search the parsed page for anchors whose class attribute is exactly 
# "pagelink button small" (the links the question is after) 
prop_link = html.find_all("a", class_="pagelink button small") 

for link in prop_link: 
    # Turn the relative href into an absolute URL and write it to the CSV 
    prop_url = baseurl+(link["href"]) 
    print prop_url 
    writer.writerow([prop_url, "", "", ""]) 

outputfile.close()
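The answer above mentions select as well as find_all but only demonstrates the latter. Purely as a sketch (assuming a reasonably recent BeautifulSoup with CSS-selector support, and markup as described in the question, where the wanted anchors sit inside table cells and carry the classes pagelink, button and small), the same extraction could be written with a CSS selector; requiring all three classes means the plain class="pagelink" anchor in each row is not matched:

import requests 
from bs4 import BeautifulSoup 

baseurl = 'https://www.saa.gov.uk' 
url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=100&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results" 

html = BeautifulSoup(requests.get(url).text, 'lxml') 

# select() takes a CSS selector; the compound selector needs all three classes, 
# so the other 'td a' link that only has class="pagelink" is skipped 
for link in html.select('td a.pagelink.button.small'): 
    print(baseurl + link['href'])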

I keep getting the same result with this code. It prints each href twice (as there are 2 href tags in each row). Could it be because the second href has class="pagelink button small" and it keeps being picked up because of the word pagelink? –


Thanks for your reply zroq –


I'm sorry, my mistake. I have updated the code. Please note the change to html.find_all("a", class_="pagelink button small"), it now gives the correct output. – Zroq
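To see why that change matters, here is a small self-contained sketch (the markup is made up, but it mirrors the two kinds of link described in the comments above): passing class_="pagelink" matches any tag that has pagelink among its classes, while passing the full string matches only tags whose class attribute is exactly "pagelink button small".

from bs4 import BeautifulSoup 

# Hypothetical table row mirroring the two 'td a' links described above 
row = ('<table><tr>' 
       '<td><a class="pagelink" href="/search/ref">ref</a></td>' 
       '<td><a class="pagelink button small" href="/search/detail">address</a></td>' 
       '</tr></table>') 
html = BeautifulSoup(row, 'html.parser') 

# "pagelink" only has to be one of the tag's classes, so BOTH anchors match 
print(len(html.find_all("a", class_="pagelink")))               # 2 

# the whole class string has to match, so only the second anchor is returned 
print(len(html.find_all("a", class_="pagelink button small")))  # 1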


Try this.
You need to find the links before the loop starts.

import requests 
from bs4 import BeautifulSoup as soup 
import csv 

# Create the CSV writer before writing the header row (filename as in the answer above) 
outputfile = open('Ed_Streets2.csv', 'wb') 
writer = csv.writer(outputfile) 
writer.writerow(['URL', 'Reference', 'Description', 'Address']) 

url = "https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&PAGE=0&DISPLAY_COUNT=1000&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&ORIGINAL_SEARCH_TERM=city+of+edinburgh&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY#results" 

response = requests.get(url) 
html = soup(response.text, 'lxml') 

# Find all the matching links once, before the loop starts 
prop_link = html.find_all("a", {"class":"pagelink button small"}) 

for link in prop_link: 
    # find_all only returns Tag objects, so checking for the href attribute is enough 
    if link.has_attr("href"): 
        wr = link["href"] 
        writer.writerow([wr]) 

outputfile.close()

Thanks for the reply. I keep getting the same result with this code: it prints each href twice (since there are 2 href tags in each row). Any other suggestions would be very welcome. –


@OhhranHennessy I have updated the code. It should work now. –
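If duplicates still show up after these changes, one possible workaround, offered only as a sketch and not as part of either answer, is to remember which hrefs have already been written and skip repeats. write_unique_hrefs is a made-up helper name, and html and writer are assumed to be the soup object and csv writer from the answers above:

def write_unique_hrefs(html, writer): 
    # Write each matching href once, even if it appears more than once in the table 
    seen = set() 
    for link in html.find_all("a", {"class": "pagelink button small"}): 
        href = link.get("href") 
        if href and href not in seen: 
            seen.add(href) 
            writer.writerow([href]) 

# usage with the objects defined in the answer above: 
# write_unique_hrefs(html, writer)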