2016-02-12 275 views
0

How do I extract the URLs of links on a web page? I am trying to download all of the PDF files linked from the page below.

Link

First, I tried to extract all of the PDF links (the links outlined in red in this image).

from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the search results page and parse it.
resp = urlopen("https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1")
soup = BeautifulSoup(resp, 'lxml')

# Write the href of every <a> tag to url.txt, one per line.
with open('url.txt', 'w') as f:
    for link in soup.find_all('a', href=True):
        f.write(link['href'] + '\n')

---------------------------------------------------------------- 

<url.txt> 
http://www.osa.org 
# 
https://www.osapublishing.org 
# 
# 
# 
# 
/about.cfm 

/aop 
/ao 
/as 
/boe 
/col 
/jdt 
/jlt 
/jot 
/jocn 
/josaa 
/josab 
/josk 
/optica 
/ome 
/oe 
/ol 
/prj 
/jon 
/josa 
/on 
/aop 
/ao 
/as 
/boe 
/col 
/jdt 
/jlt 
/jot 
/jocn 
/josaa 
/josab 
/josk 
/optica 
/ome 
/oe 
/ol 
/prj 
/jon 
/josa 
/on 
/conferences.cfm 
/conferences.cfm 
/conferences.cfm?findby=conference 
/conference.cfm?meetingid=5 
/conference.cfm?meetingid=124 
/conference.cfm?meetingid=56 
/conference.cfm?meetingid=144&yr=2015 
/conference.cfm?meetingid=153&yr=2015 
/conference.cfm?meetingid=131&yr=2015 
/conference.cfm?meetingid=174&yr=2015 
/conference.cfm?meetingid=109&yr=2015 
#global-nav 
/books/lasers/lasers.cfm 
/oida/reports.cfm 
http://www.osa-opn.org 
/author/author.cfm 
/submit/review/peer_review.cfm 
/library/ 
/osadigitalarchive.cfm 
/isp.cfm 
http://imagebank.osa.org 
/spotlight 
/china/ 
# 
/user 
# 
# 
# 
https://www.osapublishing.org 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
/
# 
# 
/user 
# 
# 
/about.cfm 
/conferences.cfm 
/conferences.cfm 
/conferences.cfm?findby=conference 
/china/ 
/author/author.cfm 
/submit/review/peer_review.cfm 
/library/ 
/books/lasers/lasers.cfm 
/oida/reports.cfm 
http://www.osa-opn.org 
http://imagebank.osa.org 
/spotlight/ 
/china/ 
/about.cfm 
/benefitslog.cfm 
/contactus.cfm 
# 
/privacy.cfm 
/termsofuse.cfm 
https://account.osa.org/eweb/dynamicpage.aspx?sso=1&site=osac&webcode=loginrequired&url_success=https%3A%2F%2Fwww%2Eosapublishing%2Eorg%2Fsearch%2Ecfm%3Fq%3Dcomsol%26meta%3D1%26cj%3D1%26cc%3D1%26usertoken%3D%7Btoken%7D 
https://account.osa.org/eweb/Dynamicpage.aspx?webcode=forgotpassword*Site=osac 
/privacy.cfm 
http://www.osa.org/en-us/help/ 

However, it looks like the URLs of the links I wanted were not extracted.
How can I do this?

+1

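So your goal is the "View: PDF" links, right? The first one I see is: 'PDF'. That could mean a few things: the links are generated dynamically or loaded via AJAX. When I follow the link, I am taken to a page where I have to log in or purchase, so it does not take you directly to the PDF. How do you get the PDF files manually? – Twisty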

+0

The second one loads a full PDF in the browser, and it looks like it is generated dynamically: https://www.osapublishing.org/view_article.cfm?gotourl=https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org= I would add a condition to your script to look for 'PDF'. – Twisty
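A minimal sketch of the "look for 'PDF'" condition suggested in the comment above. It keeps only anchors whose href or link text mentions "pdf". It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages; the class name, function name, and sample markup are made up for illustration and are not taken from the real page. It can only find links that are actually present in the HTML it is given.

```python
from html.parser import HTMLParser


class PdfLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href or link text mentions 'pdf'."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text)
            # Case-insensitive match on either the URL or the visible text.
            if "pdf" in self._href.lower() or "pdf" in label.lower():
                self.pdf_links.append(self._href)
            self._href = None


def find_pdf_links(html):
    """Return the hrefs of all PDF-looking links in an HTML string."""
    collector = PdfLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.pdf_links


# Hypothetical sample markup, not copied from the real search page.
SAMPLE_HTML = """
<a href="/about.cfm">About</a>
<a href="#">PDF</a>
<a href="/DirectPDFAccess/example/oe-21-22-27371.pdf?da=1">Full text</a>
"""

pdf_links = find_pdf_links(SAMPLE_HTML)
print(pdf_links)
```

With BeautifulSoup, the same condition would be an `if` inside the existing `for link in soup.find_all('a', href=True)` loop.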

+0

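Thank you for your answer. Some of them can be downloaded without logging in. I know that the URLs of those links are not in the HTML source. Is there a way to open those links without their URLs? –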

Answers

2

None of the PDF links you are after are in the HTML source returned by "https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1".

The PDF links are loaded via AJAX.

I guess you need to open the URL with POST and set the correct parameters/cookies. For example: "CFID=XXXXXXXX; CFTOKEN=XXXXXXXX; BIGipServerPubsWeb_HTTP=xxxxxxxxx.xxxxx.xxxx; _ga=GAx.x.xxxxxxxxxx.xxxxxxxxxx; _gat=1"
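As a rough sketch of that idea, the snippet below builds (but does not send) a POST request with a Cookie header using only the standard library. Everything specific here is an assumption: the thread does not say which endpoint the AJAX call uses or what form fields it expects, so the search URL and its query parameters are reused as a stand-in, the cookie values stay as placeholders, and `build_search_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the real AJAX endpoint is unknown, so the search URL is reused.
SEARCH_URL = "https://www.osapublishing.org/search.cfm"

# Placeholder values; real ones would come from a browser session.
COOKIES = {
    "CFID": "XXXXXXXX",
    "CFTOKEN": "XXXXXXXX",
}


def build_search_request(url, params, cookies):
    """Build a POST request carrying form parameters and a Cookie header."""
    body = urlencode(params).encode("ascii")
    cookie_header = "; ".join(
        "{}={}".format(name, value) for name, value in cookies.items()
    )
    return Request(url, data=body, headers={"Cookie": cookie_header}, method="POST")


request = build_search_request(
    SEARCH_URL,
    {"q": "comsol", "meta": "1", "cj": "1", "cc": "1"},
    COOKIES,
)
print(request.get_method(), request.get_header("Cookie"))
```

Passing `request` to `urllib.request.urlopen` would send it. Whether the server accepts such a request is exactly the open question raised in this answer.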

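The response will be in JSON format. Each object will contain 'result[0].data.has-pdf = true', which you can use to test whether a PDF exists. The links look like: 'fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")/article/front/article-meta/abstract/p', so you need to match them to the PDF files.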
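A small sketch of how such a response might be processed, under stated assumptions. Only `result[...].data.has-pdf` and the `fn:doc("...")` string format come from the description above. The `"link"` key, the second sample entry, and the helper names are hypothetical, and so is the guess that the PDF file name is the XML file name with a `.pdf` extension (suggested by `oe-21-22-27371.pdf` in the direct-access URL quoted in the comments).

```python
import json
import re

# Captures the document path inside fn:doc("...").
DOC_PATH_RE = re.compile(r'fn:doc\("([^"]+)"\)')


def has_pdf(result):
    """True if the result object is flagged as having a PDF."""
    return result.get("data", {}).get("has-pdf") is True


def pdf_name_from_link(link):
    """Guess the PDF file name from an fn:doc(...) link, or None if no match."""
    match = DOC_PATH_RE.search(link)
    if match is None:
        return None
    xml_name = match.group(1).rsplit("/", 1)[-1]  # e.g. oe-21-22-27371.xml
    stem = xml_name.rsplit(".", 1)[0]             # e.g. oe-21-22-27371
    return stem + ".pdf"


# Hypothetical response body; the "link" key is an assumed field name.
SAMPLE_RESPONSE = r'''
{"result": [
  {"data": {"has-pdf": true},
   "link": "fn:doc(\"/oe/21/22/27371/oe-21-22-27371.xml\")/article/front/article-meta/abstract/p"},
  {"data": {"has-pdf": false},
   "link": "fn:doc(\"/xx/0/0/0/xx-0-0-0.xml\")/article/front/article-meta/abstract/p"}
]}
'''

payload = json.loads(SAMPLE_RESPONSE)
pdf_names = [
    pdf_name_from_link(result["link"])
    for result in payload["result"]
    if has_pdf(result)
]
print(pdf_names)
```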

But I suspect they have IP checks or other security measures, so you may not be able to get any data by POSTing from a domain other than the origin. Just a guess ;)

+0

Some links do not require a login, for example: 'https://www.osapublishing.org/view_article.cfm?gotourl=https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org='. Here you can see the direct URL being passed to the CF script. – Twisty
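To illustrate that last point, here is a short sketch that pulls the direct URL back out of the `gotourl` query parameter with the standard library. The URL is the one from the comment above, with stray spaces removed; `direct_pdf_url` is a hypothetical helper name, and whether the decoded URL can be downloaded without a session is not confirmed anywhere in this thread.

```python
from urllib.parse import parse_qs, urlsplit

VIEW_URL = (
    "https://www.osapublishing.org/view_article.cfm?gotourl="
    "https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F"
    "6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf"
    "%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org="
)


def direct_pdf_url(view_url):
    """Return the percent-decoded 'gotourl' parameter, or None if absent."""
    query = parse_qs(urlsplit(view_url).query)
    values = query.get("gotourl")
    return values[0] if values else None


pdf_url = direct_pdf_url(VIEW_URL)
print(pdf_url)
```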