This should do the job. Tested with Python 3.6, but the code should be Python 2.7 compatible. The main idea is to find the link for each year, then scrape all the links to pdf, htm and txt files from each year's page.
from __future__ import print_function

import os

import requests
from bs4 import BeautifulSoup


def file_links_filter(href):
    """
    href filter: return True for links that end with 'pdf', 'htm' or 'txt'.
    BeautifulSoup calls this with the value of each tag's href attribute.
    """
    if isinstance(href, str):
        return href.endswith('pdf') or href.endswith('htm') or href.endswith('txt')


def get_links(tags_list):
    # The hrefs on the page are site-relative, so prefix the host.
    return [WEB_ROOT + tag.attrs['href'] for tag in tags_list]


def download_file(file_link, folder):
    content = requests.get(file_link).content
    name = file_link.split('/')[-1]
    save_path = os.path.join(folder, name)
    print("Saving file:", save_path)
    with open(save_path, 'wb') as fp:
        fp.write(content)


WEB_ROOT = 'https://www.sec.gov'
# Directory in which files will be saved; open() does not expand '~',
# so do it explicitly and make sure the directory exists.
SAVE_FOLDER = os.path.expanduser('~/download_files/')
if not os.path.isdir(SAVE_FOLDER):
    os.makedirs(SAVE_FOLDER)

r = requests.get("https://www.sec.gov/litigation/suspensions.shtml")
soup = BeautifulSoup(r.content, 'html.parser')

years = soup.select("p#archive-links > a")  # all <a> inside the <p id='archive-links'> tag
years_links = get_links(years)

links_to_download = []
for year_link in years_links:
    page = requests.get(year_link)
    beautiful_page = BeautifulSoup(page.content, 'html.parser')
    links = beautiful_page.find_all("a", href=file_links_filter)
    links_to_download.extend(get_links(links))

# Make a set to exclude duplicate links.
links_to_download = set(links_to_download)
print("Got links:", links_to_download)

for link in links_to_download:
    download_file(link, SAVE_FOLDER)
What is your script? – mtkilic
Quick and dirty way: just `grep -o` all URLs like https://www.sec.gov/litigation/suspensions/2017/34-80766-o.pdf and download them all with `wget` – zyxue
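For what it's worth, a rough Python sketch of that quick-and-dirty approach (the regex pattern and the choice of page are assumptions for illustration, not part of the answer above):

import re
import requests

# One regex pass per page, the way 'grep -o' would do it; run this against
# each year page, since the top-level page links to years, not to files.
html = requests.get("https://www.sec.gov/litigation/suspensions.shtml").text
urls = set(re.findall(r'href="([^"]+\.(?:pdf|htm|txt))"', html))

for url in urls:
    # hrefs on the page are site-relative, so prefix the host.
    full = 'https://www.sec.gov' + url if url.startswith('/') else url
    with open(full.split('/')[-1], 'wb') as fp:  # wget-style: save under the file's own name
        fp.write(requests.get(full).content)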
@mtkilic - Hi, after using Denis's code I get the output "Got links: set([])", and I am not able to download the files. Could you help me figure out what the problem is? –