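Scraper fails to print all results

I've written a Python script to scrape the "name" and "phone" of five listings from craigslist. The problem is that when I run it, I get only three results instead of five. More specifically, the first two links have no additional (contact info) link on their pages, so no further request is needed for them. However, those two listings never get past the "if ano_page_link:" statement in my second function, so they are never printed. How can I fix this so that the scraper prints all five results, whether or not a listing has a phone number?

The script I've tried: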

import re
import requests
from lxml import html

base = "http://bangalore.craigslist.co.in" 

url_list = [ 
'http://bangalore.craigslist.co.in/reb/d/flat-is-for-sale-at-cooke-town/6266183606.html', 
'http://bangalore.craigslist.co.in/reb/d/prestige-sunnyside/6259128505.html', 
'http://bangalore.craigslist.co.in/reb/d/jayanagar-2nd-block-4000-sft/6221720477.html', 
'http://bangalore.craigslist.co.in/reb/d/prestige-ozone-type-3-r-villa/6259928614.html', 
'http://bangalore.craigslist.co.in/reb/d/zed-homes-3-bedroom-flat-for/6257075793.html' 
] 

def get_link(medium_link):
    response = requests.get(medium_link).text
    tree = html.fromstring(response)
    try:
        name = tree.cssselect('span#titletextonly')[0].text
    except IndexError:
        name = ""
    try:
        link = base + tree.cssselect('a.showcontact')[0].attrib['href']
    except IndexError:
        link = ""
    parse_doc(name, link)

def parse_doc(title, ano_page_link):

    if ano_page_link:
        page = requests.get(ano_page_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(title, tel)

if __name__ == '__main__':
    for link in url_list:
        get_link(link)

The output I get:

Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673 
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364 
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364

The output I expect:

A Flat is for sale at Cooke Town 
Prestige Sunnyside 
Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673 
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364 
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364

2017-08-28 SIM

Are you defining the functions inside the 'for' loop? Why? – Andersson

Sorry, sir. I shouldn't have. I only did that for this demo. – SIM

Edited as you suggested, Mr. Andersson. – SIM

Note that on http://bangalore.craigslist.co.in/reb/d/flat-is-for-sale-at-cooke-town/6266183606.html, for example, there is no link matching the 'a.showcontact' selector, so the following block

try: 
    link = base + tree.cssselect('a.showcontact')[0].attrib['href'] 
except IndexError: 
    link = ""

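will set link = ""

Then, when execution reaches if ano_page_link:, the condition if "" evaluates to False, so every statement inside the if block is skipped and nothing is printed.

You can try the following instead: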

def parse_doc(title, ano_page_link):
    if ano_page_link:
        page = requests.get(ano_page_link).text
        # Search once and reuse the matches
        numbers = re.findall(r'\d{10}', page)
        tel = numbers[0] if numbers else ""
        print(title, tel)
    else:
        # No contact link on the page: print the title alone
        print(title)
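
The reason the original if block was skipped, namely that an empty string is falsy in Python, can be checked without any network access. This is only a minimal sketch: describe is a hypothetical helper that is not part of the original script, and the URL is a placeholder.

```python
def describe(ano_page_link):
    # An empty string is falsy, so the if-branch is skipped for it
    if ano_page_link:
        return "has contact link"
    # Without this fallback, nothing would be reported for such listings
    return "no contact link"

print(describe(""))
print(describe("http://example.com/contact"))  # placeholder URL, not a real listing
```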

2017-08-28 12:42:28 Andersson

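Thank you for your answer, Mr. Andersson. It solved the problem. I had also thought about an "else" block, but my ignorance kept me from trying it. LOL! You're a lifesaver. Thanks again, sir. – SIM

One more thing, sir. Normally, when writing data to a CSV file, I would put the line "writer.writerow([title, tel])" right after or near the print statement. Could you suggest how to modify that line here, since "title" now appears in two print statements? Thanks in advance. – SIM

I'm not sure, as I don't have much experience with '.csv', but you could try something like 'if ano_page_link: ... writer.writerow([title, tel])' 'else: writer.writerow([title, ""])' – Andersson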
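
A minimal sketch of that CSV suggestion, using only the standard csv module. The helper write_row is hypothetical (it is not in the original script), and io.StringIO stands in for a real output file so the snippet runs without touching disk. The sample rows are taken from the expected output in the question.

```python
import csv
import io

def write_row(writer, title, tel=""):
    # Always write two columns; tel stays empty when there is no contact link
    writer.writerow([title, tel])

buffer = io.StringIO()  # stand-in for open("results.csv", "w", newline="")
writer = csv.writer(buffer)
write_row(writer, "Prestige Sunnyside")
write_row(writer, "Jayanagar 2nd Block, 4000 sft Plot for Sale", "9845012673")
print(buffer.getvalue())
```

Defaulting tel to an empty string means a single writerow call covers both branches, so the row shape stays the same whether or not a phone number was found.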

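You can gain more flexibility by separating the two tasks: collecting the data and printing it. That makes it easier to add more information later, when you want to extend the script.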

def collect_info(medium_link):
    response = requests.get(medium_link).text
    tree = html.fromstring(response)

    title = get_title(tree)
    contact_link = get_contact_link(tree)
    tel = get_tel(contact_link) if contact_link else ''

    return title, tel


def get_title(tree):
    try:
        name = tree.cssselect('span#titletextonly')[0].text
    except IndexError:
        name = ""
    return name

def get_contact_link(tree):
    try:
        link = base + tree.cssselect('a.showcontact')[0].attrib['href']
    except IndexError:
        link = ""
    return link

def get_tel(ano_page_link):
    page = requests.get(ano_page_link).text
    tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
    return tel

def print_info(title, tel):
    if tel:
        fmt = 'Title: {title}, Phone: {tel}'
    else:
        fmt = 'Title: {title}'
    print(fmt.format(title=title, tel=tel))

if __name__ == '__main__':
    for link in url_list:
        title, tel = collect_info(link)
        print_info(title, tel)
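
One benefit of this split is that the output step can be exercised on its own, with no requests involved. The sketch below assumes a hypothetical format_info, a variant of print_info that returns the line instead of printing it. The sample pairs are taken from the expected output in the question.

```python
def format_info(title, tel):
    # Hypothetical variant of print_info: returns the line instead of printing it
    if tel:
        fmt = 'Title: {title}, Phone: {tel}'
    else:
        fmt = 'Title: {title}'
    return fmt.format(title=title, tel=tel)

# Sample (title, tel) pairs, as collect_info would return them
collected = [
    ("Prestige Sunnyside", ""),
    ("Jayanagar 2nd Block, 4000 sft Plot for Sale", "9845012673"),
]
for title, tel in collected:
    print(format_info(title, tel))
```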

2017-08-28 13:04:07 CtheSky

Yours works well too, thanks. – SIM
