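Scraper fails to print all results

I've written a script in Python to scrape the "name" and "phone" of five items from craigslist. The problem is that when I run the script, it gives only three results instead of five. More specifically, the first two links have no additional (contact info) link on their pages, so they don't need a further request to open another page. However, those two links never get past the "if ano_page_link:" statement in my second function, so they are never printed. How can I fix this flaw so that the scraper prints all five results, whether or not a listing has a phone number?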
The script I've tried:
import re
import requests
from lxml import html

base = "http://bangalore.craigslist.co.in"

url_list = [
    'http://bangalore.craigslist.co.in/reb/d/flat-is-for-sale-at-cooke-town/6266183606.html',
    'http://bangalore.craigslist.co.in/reb/d/prestige-sunnyside/6259128505.html',
    'http://bangalore.craigslist.co.in/reb/d/jayanagar-2nd-block-4000-sft/6221720477.html',
    'http://bangalore.craigslist.co.in/reb/d/prestige-ozone-type-3-r-villa/6259928614.html',
    'http://bangalore.craigslist.co.in/reb/d/zed-homes-3-bedroom-flat-for/6257075793.html'
]

def get_link(medium_link):
    response = requests.get(medium_link).text
    tree = html.fromstring(response)
    try:
        name = tree.cssselect('span#titletextonly')[0].text
    except IndexError:
        name = ""
    try:
        link = base + tree.cssselect('a.showcontact')[0].attrib['href']
    except IndexError:
        link = ""
    parse_doc(name, link)

def parse_doc(title, ano_page_link):
    if ano_page_link:
        page = requests.get(ano_page_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(title, tel)

if __name__ == '__main__':
    for link in url_list:
        get_link(link)
The result I'm getting:
Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364
The result I'm expecting:
A Flat is for sale at Cooke Town
Prestige Sunnyside
Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364
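Are you defining the functions inside the 'for' loop? Why? – Andersson
Sorry, sir. I shouldn't have. I did it that way only for this demo. – SIM
Modified as you suggested, Mr. Andersson. – SIM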
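
One possible way to get all five lines is to make only the extra request conditional and let the output happen either way. The sketch below is a minimal illustration, not a drop-in replacement: `extract_phone` and the `fetch` parameter are hypothetical names introduced here, `fetch` stands in for something like `lambda u: requests.get(u).text`, and `parse_doc` returns the line instead of printing it so that it can be checked without network access.

```python
import re

def extract_phone(page_text):
    """Return the first run of 10 digits in page_text, or "" if none is found."""
    match = re.search(r'\d{10}', page_text)
    return match.group(0) if match else ""

def parse_doc(title, ano_page_link, fetch):
    """Build the output line for one listing.

    `fetch` is a hypothetical callable (URL -> page text), injected here
    so the logic can run without real HTTP requests.
    """
    tel = ""                   # default when the listing has no contact link
    if ano_page_link:          # only the second request depends on the link
        tel = extract_phone(fetch(ano_page_link))
    # Built outside the `if`, so listings without a phone are still reported.
    return (title + " " + tel).strip()

if __name__ == '__main__':
    # Stub standing in for requests.get(url).text; the URL is a placeholder.
    stub_fetch = lambda url: "Contact: 9845012673"
    print(parse_doc("Prestige Sunnyside", "", stub_fetch))
    print(parse_doc("Jayanagar 2nd Block, 4000 sft Plot for Sale",
                    "http://example.invalid/contact", stub_fetch))
```

In the original script the same effect should follow from setting `tel = ""` before the `if` and moving `print(title, tel)` out of the `if` block, so that it runs for every listing.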