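I am trying to download all the PDF files from the link below. How can I extract the URLs of the links on a web page?
First, I tried to extract all the PDF links (the links are marked in red in this image).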
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Fetch the search results page
resp = urlopen("https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1")
soup = BeautifulSoup(resp, 'lxml')

# Write the href of every <a> tag to url.txt, one per line
with open('url.txt', 'w') as f:
    for link in soup.find_all('a', href=True):
        f.write(link['href'] + '\n')
----------------------------------------------------------------
<url.txt>
http://www.osa.org
#
https://www.osapublishing.org
#
#
#
#
/about.cfm
/aop
/ao
/as
/boe
/col
/jdt
/jlt
/jot
/jocn
/josaa
/josab
/josk
/optica
/ome
/oe
/ol
/prj
/jon
/josa
/on
/aop
/ao
/as
/boe
/col
/jdt
/jlt
/jot
/jocn
/josaa
/josab
/josk
/optica
/ome
/oe
/ol
/prj
/jon
/josa
/on
/conferences.cfm
/conferences.cfm
/conferences.cfm?findby=conference
/conference.cfm?meetingid=5
/conference.cfm?meetingid=124
/conference.cfm?meetingid=56
/conference.cfm?meetingid=144&yr=2015
/conference.cfm?meetingid=153&yr=2015
/conference.cfm?meetingid=131&yr=2015
/conference.cfm?meetingid=174&yr=2015
/conference.cfm?meetingid=109&yr=2015
#global-nav
/books/lasers/lasers.cfm
/oida/reports.cfm
http://www.osa-opn.org
/author/author.cfm
/submit/review/peer_review.cfm
/library/
/osadigitalarchive.cfm
/isp.cfm
http://imagebank.osa.org
/spotlight
/china/
#
/user
#
#
#
https://www.osapublishing.org
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
/
#
#
/user
#
#
/about.cfm
/conferences.cfm
/conferences.cfm
/conferences.cfm?findby=conference
/china/
/author/author.cfm
/submit/review/peer_review.cfm
/library/
/books/lasers/lasers.cfm
/oida/reports.cfm
http://www.osa-opn.org
http://imagebank.osa.org
/spotlight/
/china/
/about.cfm
/benefitslog.cfm
/contactus.cfm
#
/privacy.cfm
/termsofuse.cfm
https://account.osa.org/eweb/dynamicpage.aspx?sso=1&site=osac&webcode=loginrequired&url_success=https%3A%2F%2Fwww%2Eosapublishing%2Eorg%2Fsearch%2Ecfm%3Fq%3Dcomsol%26meta%3D1%26cj%3D1%26cc%3D1%26usertoken%3D%7Btoken%7D
https://account.osa.org/eweb/Dynamicpage.aspx?webcode=forgotpassword*Site=osac
/privacy.cfm
http://www.osa.org/en-us/help/
However, it looks like the URLs of the links I actually want were not extracted.
How can I do this?
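So, your goal is the "View: PDF" links, right? The first one I see is: 'PDF'. This could mean a few things: the links are dynamically generated, or they are loaded through an AJAX call. When I follow the link, I am taken to a page where I have to log in or purchase, so it does not take you directly to the PDF. How do you get the PDF files manually? – Twisty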
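The many bare `#` entries in url.txt fit this reading: a `#` href is often only a placeholder whose real target is filled in by JavaScript, although that is an assumption about this page rather than something the output proves. A minimal sketch for cleaning the list, assuming the hrefs have already been read from url.txt; the helper names `resolve_hrefs` and `count_placeholders` are hypothetical:

```python
from urllib.parse import urljoin

# The page the hrefs were scraped from (taken from the question).
BASE_URL = "https://www.osapublishing.org/search.cfm?q=comsol&meta=1&cj=1&cc=1"


def resolve_hrefs(hrefs, base_url=BASE_URL):
    """Return absolute URLs in first-seen order, skipping fragment-only hrefs."""
    resolved = []
    seen = set()
    for href in hrefs:
        href = href.strip()
        # "#" and "#global-nav" point back into the same page, not to a file.
        if not href or href.startswith("#"):
            continue
        absolute = urljoin(base_url, href)  # "/ao" -> "https://www.osapublishing.org/ao"
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


def count_placeholders(hrefs):
    """Count hrefs that are exactly '#', i.e. likely script-driven links."""
    return sum(1 for href in hrefs if href.strip() == "#")
```

If the number of placeholders is close to the number of 'PDF' links visible on the page, the real targets are probably not present in the static HTML.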
The second one loads a full PDF in the browser, and it looks like it is dynamically generated: https://www.osapublishing.org/view_article.cfm?gotourl=https%3A%2F%2Fwww.osapublishing.org%2FDirectPDFAccess%2F6FA37648-E3C1-262B-6AF76128B6A12104_274099%2Foe-21-22-27371.pdf%3Fda%3D1%26id%3D274099%26seq%3D0%26mobile%3Dno&org= I would add a condition to your script to look for 'PDF'. – Twisty
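One way to read that suggestion, as a rough sketch only: keep an href if it points at a `.pdf` path, or if it carries a `gotourl` query parameter whose decoded value does. The function name `extract_pdf_url` is hypothetical, and the assumption that `gotourl` always holds the PDF address rests on the single URL quoted above.

```python
from urllib.parse import parse_qs, urlsplit


def extract_pdf_url(href):
    """Return the PDF URL hidden in href, or None if there is none."""
    parts = urlsplit(href)
    query = parse_qs(parts.query)  # also percent-decodes the values

    # Case 1: a wrapper such as view_article.cfm?gotourl=<encoded PDF URL>
    if "gotourl" in query:
        target = query["gotourl"][0]
        if urlsplit(target).path.lower().endswith(".pdf"):
            return target

    # Case 2: the href itself already points at a .pdf file
    if parts.path.lower().endswith(".pdf"):
        return href

    return None
```

In the script from the question, this could be called on each `link['href']` so that only non-`None` results are written to url.txt. It does not help with links whose href is just `#`, because those carry no URL to decode.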
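Thank you for your answer. Some of them can be downloaded without logging in. I know that the URLs of these links are not in the HTML source. Is there a way to open these links without their URLs? –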