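How to extract the URLs of links on a web page

I am trying to download all the PDF files linked from the page below.

First, I tried to extract all the PDF links (the links marked in red in this image).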

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for Python 2's urllib2

url = "https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1"
resp = urlopen(url)
soup = BeautifulSoup(resp, 'lxml')

# Write the href of every <a> tag to url.txt, one per line.
with open('url.txt', 'w') as f:
    for link in soup.find_all('a', href=True):
        f.write(link['href'] + '\n')

---------------------------------------------------------------- 

<url.txt> 
http://www.osa.org 
# 
https://www.osapublishing.org 
# 
# 
# 
# 
/about.cfm 

/aop 
/ao 
/as 
/boe 
/col 
/jdt 
/jlt 
/jot 
/jocn 
/josaa 
/josab 
/josk 
/optica 
/ome 
/oe 
/ol 
/prj 
/jon 
/josa 
/on 
/aop 
/ao 
/as 
/boe 
/col 
/jdt 
/jlt 
/jot 
/jocn 
/josaa 
/josab 
/josk 
/optica 
/ome 
/oe 
/ol 
/prj 
/jon 
/josa 
/on 
/conferences.cfm 
/conferences.cfm 
/conferences.cfm?findby=conference 
/conference.cfm?meetingid=5 
/conference.cfm?meetingid=124 
/conference.cfm?meetingid=56 
/conference.cfm?meetingid=144&yr=2015 
/conference.cfm?meetingid=153&yr=2015 
/conference.cfm?meetingid=131&yr=2015 
/conference.cfm?meetingid=174&yr=2015 
/conference.cfm?meetingid=109&yr=2015 
#global-nav 
/books/lasers/lasers.cfm 
/oida/reports.cfm 
http://www.osa-opn.org 
/author/author.cfm 
/submit/review/peer_review.cfm 
/library/ 
/osadigitalarchive.cfm 
/isp.cfm 
http://imagebank.osa.org 
/spotlight 
/china/ 
# 
/user 
# 
# 
# 
https://www.osapublishing.org 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
# 
/
# 
# 
/user 
# 
# 
/about.cfm 
/conferences.cfm 
/conferences.cfm 
/conferences.cfm?findby=conference 
/china/ 
/author/author.cfm 
/submit/review/peer_review.cfm 
/library/ 
/books/lasers/lasers.cfm 
/oida/reports.cfm 
http://www.osa-opn.org 
http://imagebank.osa.org 
/spotlight/ 
/china/ 
/about.cfm 
/benefitslog.cfm 
/contactus.cfm 
# 
/privacy.cfm 
/termsofuse.cfm 
https://account.osa.org/eweb/dynamicpage.aspx?sso=1&site=osac&webcode=loginrequired&url_success=https%3A%2F%2Fwww%2Eosapublishing%2Eorg%2Fsearch%2Ecfm%3Fq%3Dcomsol%26meta%3D1%26cj%3D1%26cc%3D1%26usertoken%3D%7Btoken%7D 
https://account.osa.org/eweb/Dynamicpage.aspx?webcode=forgotpassword*Site=osac 
/privacy.cfm 
http://www.osa.org/en-us/help/

However, it looks like the URLs of the links I want were not extracted.
How can I get them?


2016-02-12 Harutaka Kawamura

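So your goal is the "View: PDF" links, right? The first one I see is: 'PDF'. That could mean a few things: the links are generated dynamically, or they are loaded via AJAX. When I follow the link, I am taken to a page where I have to log in or purchase, so it does not take you directly to the PDF. How do you get the PDF files manually? – Twisty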

The second one loads a full PDF in the browser, and it looks like it is generated dynamically: https://www.osapublishing.org/view_article.cfm?gotourl=https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org= I would add a condition to your script to look for 'PDF'. – Twisty
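A minimal sketch of the "look for 'PDF'" condition suggested above. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; the class name `PdfLinkParser` and the sample markup are made up for illustration and are not taken from the actual page.

```python
from html.parser import HTMLParser


class PdfLinkParser(HTMLParser):
    """Collect href values of <a> tags whose visible text is 'PDF'."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip().upper() == "PDF":
                self.pdf_links.append(self._href)
            self._href = None


# Hypothetical markup, only for illustration.
sample = ('<a href="/about.cfm">About</a> '
          '<a href="/x/y.pdf">PDF</a> '
          '<a href="#">PDF</a>')
parser = PdfLinkParser()
parser.feed(sample)
print(parser.pdf_links)
```

Note that a filter like this can only return what is in the HTML: if a 'PDF' anchor carries just a placeholder `href` such as `#`, that placeholder is all you get.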

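Thanks for your answer. Some of them can be downloaded without logging in. I know the URLs of these links are not in the HTML source. Is there a way to open these links without their URLs? –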

None of the PDF links you are after are in the HTML source of "https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1".

The PDF links are loaded via AJAX.

I guess you need to open the URL with POST and set the correct parameters and cookies on the request. For example: "CFID=XXXXXXXX; CFTOKEN=XXXXXXXX; BIGipServerPubsWeb_HTTP=xxxxxxxxx.xxxxx.xxxx; _ga=GAx.x.xxxxxxxxxx.xxxxxxxxxx; _gat=1"
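As a rough sketch of what such a request could look like with the standard library: the endpoint, the form field `q`, and the helper name `build_post` are assumptions for illustration only, since the real AJAX endpoint and its parameters are not given here. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post(url, params, cookie_header):
    """Build (but do not send) a POST request carrying a Cookie header."""
    body = urlencode(params).encode("ascii")
    req = Request(url, data=body, method="POST")
    req.add_header("Cookie", cookie_header)
    return req


# Placeholder endpoint, form field, and cookie values (assumptions).
req = build_post(
    "https://www.osapublishing.org/search.cfm",
    {"q": "comsol"},
    "CFID=XXXXXXXX; CFTOKEN=XXXXXXXX",
)
print(req.get_method(), req.get_header("Cookie"))
```

Passing `req` to `urllib.request.urlopen` would send it. The cookie values would have to come from a real browser session.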

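The response will be in JSON format. The object will contain 'result[0].data.has-pdf = true', which you can use to test whether a PDF exists. The links look like 'fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")/article/front/article-meta/abstract/p', so you need to match them to the PDF files.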
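A hedged sketch of that matching step. Only `result`, `data`, and `has-pdf` come from the description above; the `link` key, the sample payload, and the rule "same base name with a `.pdf` extension" are assumptions, the last one inferred from the single example path.

```python
import json
import re

# Captures the document path inside fn:doc("...xml"), without the extension.
DOC_PATH = re.compile(r'fn:doc\("([^"]+)\.xml"\)')


def pdf_name_from_link(link):
    """Map 'fn:doc("/oe/.../oe-21-22-27371.xml")/...' to 'oe-21-22-27371.pdf'."""
    match = DOC_PATH.search(link)
    if match is None:
        return None
    return match.group(1).rsplit("/", 1)[-1] + ".pdf"


def pdf_names(payload):
    """Yield PDF file names for result entries whose data has 'has-pdf' set."""
    for entry in json.loads(payload).get("result", []):
        data = entry.get("data", {})
        if data.get("has-pdf"):
            # The "link" key is a hypothetical field name.
            name = pdf_name_from_link(data.get("link", ""))
            if name:
                yield name


# Hypothetical payload shaped after the description above.
sample = json.dumps({"result": [
    {"data": {"has-pdf": True,
              "link": 'fn:doc("/oe/21/22/27371/oe-21-22-27371.xml")'
                      '/article/front/article-meta/abstract/p'}},
    {"data": {"has-pdf": False,
              "link": 'fn:doc("/oe/1/1/1/oe-1-1-1.xml")/article'}},
]})
print(list(pdf_names(sample)))
```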

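But I suspect they have some kind of IP check or other security measure, so you may not be able to get any data by POSTing from a domain other than the origin. Just a guess ;)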


2016-02-13 00:06:23 tomtaylor

Some links do not require a login, for example: 'https://www.osapublishing.org/view_article.cfm?gotourl=https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org='. Here you can see the direct URL being passed to the CF script. – Twisty
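To illustrate that last point, here is a small sketch that pulls the direct URL back out of the `gotourl` query parameter using the standard library. The helper name `direct_pdf_url` is made up, and whether the decoded URL can actually be fetched without a session is not something this snippet checks.

```python
from urllib.parse import urlsplit, parse_qs


def direct_pdf_url(view_url):
    """Return the decoded 'gotourl' query parameter, or None if absent."""
    query = parse_qs(urlsplit(view_url).query)
    values = query.get("gotourl")
    return values[0] if values else None


view_url = (
    "https://www.osapublishing.org/view_article.cfm?gotourl="
    "https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F"
    "6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf"
    "%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org="
)
print(direct_pdf_url(view_url))
```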
