0

How do I stop a 302 redirect in simple web scraping?

I am trying to scrape a website using Python's requests library. When I try:

>>> import requests
>>> r = requests.get('http://www.cell.com/cell-stem-cell/home', allow_redirects=False)
>>> r.status_code 
302 
>>> r.text 
'The URL has moved <a href="https://secure.jbs.elsevierhealth.com/action/getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell%2Fhome&rc=0&code=cell-site">here</a>\n' 

And when I try:

>>> r = requests.get("https://secure.jbs.elsevierhealth.com/action/getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell%2Fhome&rc=0&code=cell-site") 
>>> 
>>> r.text 
'\n\n\n\n\n<style type="text/css">\n .hidden {\n  display: none;\n  visibility: hidden;\n }\n</style>\n\n<!-- hidden iFrame for each of the SSO URLs -->\n<div class="hidden">\n \n  <iframe src="//acw.secure.jbs.elsevierhealth.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n  <iframe src="//acw.sciencedirect.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n  <iframe src="//acw.scopus.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n  <iframe src="//acw.sciverse.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n  <iframe src="//acw.mendeley.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n  <iframe src="//acw.elsevier.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n</div>\n\n\n\n<noscript>\n <a href="CANT POST LINK BECAUSE OF LACK OF REPUTATION POINTS OF STACK OVERFLOW">Redirect</a>\n</noscript>\n\n<!-- redirect to the product page after all iFrames are rendered -->\n<script>\n setTimeout(redirectFun,2000);\n var iFramesList = document.getElementsByTagName("iframe");\n var renderedIFramesCount = 0;\n var numberOfIFrames = iFramesList.length;\n for (var i = 0; i < iFramesList.length; i++) {\n  var iFrame = iFramesList[i];\n  bindEvent(iFrame, \'load\', function(){\n   renderedIFramesCount = renderedIFramesCount + 1;\n   if (renderedIFramesCount >= numberOfIFrames)\n   {\n    redirectFun();\n   }\n  });\n }\n var doRedirect = true;\n function redirectFun() {\n  if (doRedirect)\n   window.location.href = "CANT POST THIS WEBSITE BECAUSE OF MY REPUTATION POINTS ON STACKOVERFLOW";\n  doRedirect = false;\n }\n\n function bindEvent(el, eventName, 
eventHandler) {\n  if (el.addEventListener){\n   el.addEventListener(eventName, eventHandler, false);\n  } else if (el.attachEvent){\n   el.attachEvent(eventName, eventHandler);\n  }\n }\n</script>\n\n' 

I just want to get the original HTML of the site.

+0

Note that the data you need is already there: r.text says "The URL has moved here\n". Use the link in its "a href" for a second GET request.
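A minimal sketch of what this comment suggests, assuming the 302 body always contains a single anchor tag like the one shown in the question. The names `FirstLinkParser` and `first_href` are hypothetical helpers, not part of requests; only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class FirstLinkParser(HTMLParser):
    """Remember the href of the first <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and self.href is None:
            self.href = dict(attrs).get('href')


def first_href(html):
    """Return the href of the first link in html, or None if there is none."""
    parser = FirstLinkParser()
    parser.feed(html)
    return parser.href


# Body of the 302 response, copied from the question
body = ('The URL has moved <a href="https://secure.jbs.elsevierhealth.com/action/'
        'getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell'
        '%2Fhome&rc=0&code=cell-site">here</a>\n')
print(first_href(body))
```

The extracted URL could then be passed to a second `requests.get` call. Reading the `Location` response header is usually simpler when the server sends one.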

+0

Try using Selenium. It simulates a browser, so it may not have this problem.

Answers

1

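You have to send a User-Agent in the request headers so that the site believes the request comes from a real web browser. So, if you want the content of the non-redirected URL, your code should be: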

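from requests import get

# Browser-like User-Agent, kept on one line so the string literal is valid
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
content = get('http://www.cell.com/cell-stem-cell/home', headers=headers, allow_redirects=False).content
print(content)  # on Python 3, .content is bytes; use .text to get a str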

The output will be:

The URL has moved <a href="https://secure.jbs.elsevierhealth.com/action/getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell%2Fhome&rc=0&code=cell-site">here</a>

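If you want the content of the redirected URL, allow redirects but still include the User-Agent header. This approach works for most sites that do not rely on dynamic content. If you want to scrape data from a site with dynamic content, you have to use a browser automation tool such as Selenium.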
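As a rough sketch of the "allow redirects, keep the User-Agent" case, the snippet below sets the header once on a `requests.Session` and reports which hops were followed. `make_session` and `fetch_following_redirects` are hypothetical helper names, and the timeout value is an assumption; the User-Agent string is the one used above.

```python
import requests

# Assumption: the User-Agent string from this answer; any current browser string should behave similarly.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')


def make_session():
    """Return a Session that sends the browser-like User-Agent on every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': USER_AGENT})
    return session


def fetch_following_redirects(url, timeout=10):
    """Fetch url and let requests follow HTTP redirects (its default for GET)."""
    with make_session() as session:
        response = session.get(url, timeout=timeout)
    # response.history holds the intermediate redirect responses, oldest first
    hops = [(hop.status_code, hop.url) for hop in response.history]
    return response.url, hops, response.text
```

Calling `fetch_following_redirects('http://www.cell.com/cell-stem-cell/home')` would return the final URL, the list of redirect hops, and the HTML. Only HTTP-level redirects are followed this way; a JavaScript redirect such as the `window.location.href` assignment in the question's output is not executed by requests.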

+0

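This definitely makes sense, but could you expand on "If you want the content of the redirected URL, allow redirects but still include the User-Agent header", please?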

+0

If you want the data of the redirected URL, then your request should be `content = get('http://www.cell.com/cell-stem-cell/home', headers={'User-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}).content`. `content` will then hold the HTML source of the redirected URL. requests follows redirects by default, so `allow_redirects=True` is not required. – Mani

0

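You only need a little work to follow the redirect yourself. When a redirect is needed, the server sends a Location header. You simply request the URL given in that Location header.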

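import requests
# Turn off automatic redirects; otherwise requests follows them and the 302 is never seen here
r = requests.get('http://www.cell.com/cell-stem-cell/home', allow_redirects=False)
if r.status_code == 302:
    # Request the redirect target named in the Location header
    r1 = requests.get(r.headers['Location'])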

You will then have the page in r1.content or r1.text.
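The same idea can be extended to a chain of redirects. The sketch below is one possible way to do it, not the only one: `follow_redirects` and `requests_fetch` are hypothetical names, the set of status codes and the hop limit are assumptions, and `urljoin` is used because a Location header may hold a relative URL.

```python
from urllib.parse import urljoin

# Assumption: the status codes treated as redirects that carry a Location header
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(fetch, url, max_hops=10):
    """Follow redirects by hand and return the list of URLs visited.

    fetch(url) must perform one request without automatic redirect
    handling and return (status_code, headers).
    """
    visited = [url]
    for _ in range(max_hops):
        status, headers = fetch(url)
        location = headers.get('Location')
        if status not in REDIRECT_CODES or not location:
            return visited
        # Resolve a possibly relative Location against the current URL
        url = urljoin(url, location)
        visited.append(url)
    raise RuntimeError('gave up after %d requests' % max_hops)


def requests_fetch(url):
    """Adapter for requests, imported lazily so the loop itself has no dependency."""
    import requests
    r = requests.get(url, allow_redirects=False, timeout=10)
    return r.status_code, r.headers
```

With requests installed, `follow_redirects(requests_fetch, 'http://www.cell.com/cell-stem-cell/home')` would return every URL in the chain, and the last entry is the page to download. Keeping the transport in a separate function also makes the loop easy to try out with a fake `fetch`.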

+0

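To handle all redirects in the code above, you can check like this: `if r.status_code // 100 == 3:` (integer division, so it also works on Python 3). That matches every 3xx status code, such as 300, 301, 303, 304, .... – Mani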
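A small sketch of that check, with `is_3xx` as a hypothetical helper name. On Python 3 the `/` operator is true division, so the integer form `//` is needed for the comparison to succeed.

```python
def is_3xx(status_code):
    """Return True for any status code in the 300-399 range."""
    return status_code // 100 == 3


print(302 / 100 == 3)   # False on Python 3, because 302 / 100 is 3.02
print(is_3xx(302))      # True
print(is_3xx(200))      # False
```

Not every 3xx response carries a Location header (304 Not Modified does not), so the header should still be checked before following it. requests also exposes `r.is_redirect` on a response, which is true only when a redirect status comes with a Location header.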