2015-02-08 96 views

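How to carry a login over from mechanize to pycurl

I need to log in to a website using mechanize in Python and then continue traversing that site with pycurl. So what I need to know is how to transfer the login state established through mechanize to pycurl. I assume it is not just a matter of copying the cookie. Or is it? Sample code is appreciated ;)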

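Why I don't want to use pycurl alone: I am short on time, and my mechanize code worked after five minutes of adapting an existing example. It looks like this: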

import mechanize 
import cookielib 

# Browser 
br = mechanize.Browser() 

# Cookie Jar 
cj = cookielib.LWPCookieJar() 
br.set_cookiejar(cj) 

# Browser options 
br.set_handle_equiv(True) 
br.set_handle_gzip(True) 
br.set_handle_redirect(True) 
br.set_handle_referer(True) 
br.set_handle_robots(False) 

# Follow refresh 0, but do not hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Uncomment for debugging output
#br.set_debug_http(True) 
#br.set_debug_redirects(True) 
#br.set_debug_responses(True) 

# User-Agent (this is cheating, ok?) 
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] 

# Open the site 
r = br.open('https://thewebsite.com') 
html = r.read() 

# Show the source 
print html 
# or 
print br.response().read() 

# Show the html title 
print br.title() 

# Show the response headers 
print r.info() 
# or 
print br.response().info() 

# Show the available forms 
for f in br.forms(): 
    print f 

# Select the first (index zero) form 
br.select_form(nr=0) 

# Fill in the credentials and log in
br.form['username']='someusername' 
br.form['password']='somepwd' 
br.submit() 

print br.response().read() 

# List the links whose URL contains '.com'
for l in br.links(url_regex=r'\.com'):
    print l 

Now, if I could only transfer the right information from the br object to pycurl, I would be done.

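Why I don't want to use mechanize alone: mechanize is built on urllib, and urllib is a nightmare. I have had too many traumatic problems with it. I can swallow one or two calls in order to log in, but please, no more than that. By contrast, pycurl has proven stable, customizable, and fast for me. In my experience, pycurl is to urllib what Star Trek is to the Flintstones.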

PS: In case anyone is wondering, I use BeautifulSoup once I have the HTML.

Answer


Solved it. It is all about the cookies. Here is my code for getting them:

import cookielib 
import mechanize 

def getNewLoginCookieFromSomeWebsite(username = 'someusername', pwd = 'somepwd'): 
    """ 
    returns a login cookie for somewebsite.com by using mechanize 
    """ 
    # Browser 
    br = mechanize.Browser() 

    # Cookie Jar 
    cj = cookielib.LWPCookieJar() 
    br.set_cookiejar(cj) 

    # Browser options 
    br.set_handle_equiv(True) 
    br.set_handle_gzip(True) 
    br.set_handle_redirect(True) 
    br.set_handle_referer(True) 
    br.set_handle_robots(False) 

    # Follows refresh 0 but does not hang on refresh > 0 
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1) 

    # User-Agent 
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) Gecko/20100101 Firefox/26.0')] 

    # Open login site 
    response = br.open('https://www.somewebsite.com') 

    # Select the first (index zero) form 
    br.select_form(nr=0) 

    # Enter credentials 
    br.form['user']=username 
    br.form['password']=pwd 
    br.submit() 

    # Build the value of a Cookie header: "name1=value1; name2=value2".
    # cj is the jar handed to the browser above, so it now holds the login cookies.
    return '; '.join('%s=%s' % (c.name, c.value) for c in cj)

To make pycurl send the cookie, all you have to do is add the following line before c.perform() is called:

c.setopt(pycurl.COOKIE, getNewLoginCookieFromSomeWebsite("username", "pwd")) 
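
For context, a complete pycurl request that sends the cookie string might look roughly like the sketch below. The helper name `fetch_with_cookie` is made up for illustration, and it assumes the string comes from `getNewLoginCookieFromSomeWebsite` above; only the `pycurl.COOKIE` line is taken from the answer itself.

```python
from io import BytesIO

import pycurl


def fetch_with_cookie(url, cookie_string):
    """GET `url` with pycurl, sending `cookie_string` as the Cookie header.

    Hypothetical helper: returns (HTTP status code, response body as bytes).
    """
    body = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(pycurl.URL, url)
        # The login cookies obtained via mechanize; must be set before perform().
        c.setopt(pycurl.COOKIE, cookie_string)
        # Follow redirects, as the mechanize browser was configured to do.
        c.setopt(pycurl.FOLLOWLOCATION, True)
        # Collect the response body in memory.
        c.setopt(pycurl.WRITEFUNCTION, body.write)
        c.perform()
        status = c.getinfo(pycurl.RESPONSE_CODE)
    finally:
        c.close()
    return status, body.getvalue()
```

The returned body can then be handed to BeautifulSoup as before. Logging in once and reusing the same cookie string for several calls keeps the number of mechanize requests to a minimum.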

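Please note: some websites keep updating their cookies through Set-Cookie response headers, and by default pycurl (unlike mechanize) does nothing with cookies automatically. pycurl simply takes the string you give it and leaves the rest to you.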
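
If the site does keep changing its cookies, one option (not part of the solution above, just a sketch) is to hand the cookies to libcurl as a Netscape-format cookie file, so that libcurl's own cookie engine can track later `Set-Cookie` headers. The helper `write_netscape_cookie_file` below is hypothetical; it assumes `jar` is the `cookielib` jar that mechanize filled during login.

```python
def write_netscape_cookie_file(jar, path):
    """Write the cookies in `jar` to `path` in Netscape cookies.txt format.

    Hypothetical helper. Each line has seven tab-separated fields:
    domain, include-subdomains flag, path, secure flag, expiry, name, value.
    """
    lines = ["# Netscape HTTP Cookie File"]
    for c in jar:
        lines.append("\t".join([
            c.domain,
            "TRUE" if c.domain.startswith(".") else "FALSE",
            c.path,
            "TRUE" if c.secure else "FALSE",
            str(c.expires or 0),  # 0 marks a session cookie without expiry
            c.name,
            c.value or "",
        ]))
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return path
```

The file can then be passed to pycurl with `c.setopt(pycurl.COOKIEFILE, path)`, which switches on libcurl's cookie engine and loads the cookies. Adding `c.setopt(pycurl.COOKIEJAR, path)` makes libcurl write the updated cookies back to the file when the handle is closed. With this approach the `pycurl.COOKIE` string is no longer needed.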