2017-07-07 82 views

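Scrape articles from WSJ with requests, CURL and BeautifulSoup

I am a paid WSJ member and I am trying to scrape articles for my NLP project. I thought I had kept the session alive.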

import requests

rs = requests.session()
login_url="https://sso.accounts.dowjones.com/login?client=5hssEAdMy0mJTICnJNvC9TXEw3Va7jfO&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&scope=openid%20idp_id&response_type=code&nonce=18091b1f-2c73-4a93-ab10-77b0d4d4f9d3&connection=DJldap&ui_locales=en-us-x-wsj-3&mg=prod%2Faccounts-wsj&state=NfljSw-Gz-TnT_I6kLjnTa2yxy8akTui#!/signin" 
payload={ 
    "username":"[email protected]", 
    "password":"myPassword", 
} 
result = rs.post(
    login_url, 
    data = payload, 
    headers = dict(referer=login_url) 
) 

This is the article I want to parse:

r = rs.get('https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y') 

Then I found that the HTML I get back is still the non-member version.

I also tried another approach, using CURL to save the cookies after I log in:

curl -c cookies.txt -I "https://www.wsj.com" 
curl -v cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y" > test.html 

The result is the same as with the first approach.

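I am not very familiar with how the authentication mechanism works behind the browser. Could someone explain why both of the approaches above fail, and how I should fix them to achieve my goal? Thank you very much.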

Answer


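Your attempts fail because the protocol used is OAuth 2.0. This is not basic authentication, so posting the username and password to the login page URL does not create a session.

What happens here is: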

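  • Some information is generated server side when the login URL https://accounts.wsj.com/login is called: connection & client_id
  • When the username/password are submitted, the URL https://sso.accounts.dowjones.com/usernamepassword/login is called, which needs some parameters: the previous connection & client_id, plus some static OAuth2 parameters (scope, response_type, redirect_uri)
  • A response is received from the previous login call, which gives a form that auto-submits. This form has 3 parameters: wa, wresult and wctx (wresult is a JWT). The form performs a call to https://sso.accounts.dowjones.com/login/callback to retrieve a URL with a code param like code=AjKK8g0pZZfvYpju
  • The URL https://accounts.wsj.com/auth/sso/login?code=AjKK8g0pZZfvYpju is called, which retrieves the cookies with a valid user session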
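Since the question uses Python, the first two steps above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names extract_login_params and build_login_payload are hypothetical, the static values are copied from the curl call in the script in this answer, and accepting `client` as a fallback for `client_id` is a guess based on the login URL shown in the question.

```python
from urllib.parse import urlsplit, parse_qs


def extract_login_params(location):
    """Return (connection, client_id) parsed from a redirect URL's query string."""
    query = parse_qs(urlsplit(location).query)
    connection = query["connection"][0]
    # Assumption: the parameter may be named `client_id` or `client`.
    values = query.get("client_id") or query.get("client")
    if not values:
        raise KeyError("no client_id/client parameter in redirect URL")
    return connection, values[0]


def build_login_payload(username, password, connection, client_id):
    """Build the form fields for the usernamepassword/login call."""
    return {
        "username": username,
        "password": password,
        "connection": connection,
        "client_id": client_id,
        # Static OAuth2 parameters, decoded from the curl --data string.
        "scope": "openid idp_id",
        "tenant": "sso",
        "response_type": "code",
        "protocol": "oauth2",
        "redirect_uri": "https://accounts.wsj.com/auth/sso/login",
    }
```

The payload would then be posted to https://sso.accounts.dowjones.com/usernamepassword/login with the same session object that requested the login URL, so that any cookies set along the way are kept.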

The following bash script uses curl, grep, pup and jq:

username="[email protected]" 
password="YourPassword" 

login_url=$(curl -s -I "https://accounts.wsj.com/login") 
connection=$(echo "$login_url" | grep -oP "Location:\s+.*connection=\K(\w+)") 
client_id=$(echo "$login_url" | grep -oP "Location:\s+.*client_id=\K(\w+)") 

#connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}') 
#client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}') 

rm -f cookies.txt 

IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
     --data-urlencode "username=$username" \
     --data-urlencode "password=$password" \
     --data-urlencode "connection=$connection" \
     --data-urlencode "client_id=$client_id" \
     --data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')

# replace the HTML entity &#34; with a literal double quote
wctx=$(echo "$wctx" | sed 's/&#34;/"/g')

code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
    --data-urlencode "wa=$wa" \
    --data-urlencode "wresult=$wresult" \
    --data-urlencode "wctx=$wctx" | grep -oP "Location:\s+\K(\S*)")

curl -s -c cookies.txt "$code_url" 

# here call your URL loading cookies.txt 
curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y" 
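For a Python version of the form-handling part of the script, the pup/jq/sed steps could be approximated with the standard library's HTMLParser, which also decodes entities such as &#34; in attribute values. This is a rough sketch, not a tested login client: InputCollector, parse_form_fields and complete_login are hypothetical names, and `session` is assumed to be something like a requests.Session, whose post() follows redirects by default and would therefore also request the URL carrying the code parameter.

```python
from html.parser import HTMLParser

CALLBACK_URL = "https://sso.accounts.dowjones.com/login/callback"


class InputCollector(HTMLParser):
    """Collect name/value pairs from every <input> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if "name" in attributes:
                # Attribute values arrive with HTML entities already decoded.
                self.fields[attributes["name"]] = attributes.get("value") or ""


def parse_form_fields(html):
    """Return a dict of the input fields found in the auto-submit form."""
    parser = InputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


def complete_login(session, form_html):
    """Post wa, wresult and wctx to the callback URL using the given session."""
    fields = parse_form_fields(form_html)
    data = {name: fields[name] for name in ("wa", "wresult", "wctx")}
    return session.post(CALLBACK_URL, data=data)
```

After complete_login returns, the same session would be used to fetch the article URL, much as the last curl call loads cookies.txt.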

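I watched the headers in the network tab and copied those cookies, and it works. Your explanation makes it much clearer what is happening behind the scenes on the network. – Netjimmy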