Python | Web crawler | Am I using it correctly?

So, I'm looking into Python again. I looked at it a long time ago but never really learned the language, and now I'm coming back to it.

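What I'm currently looking into is web crawlers, but I'm not sure that's right. I think that's what I need for this project. Please correct me if I'm wrong, but here is the project I have in mind:

I want to write a program that I can simply launch, enter a website's URL (a specific page or a whole site), and have it scan for embed/iframe code and save the links it finds into a table, for example:

Page Title | iFrames Found | # | -Embed1- -/Embed1- | -Embed2- -/Embed2- | and so on.

Am I looking at the right language and the right area, or should I be looking at something else for this?

Thanks very much for any feedback/support!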

Source

2017-03-09 IndieGuts

[scrapy](https://scrapy.org/) is what you're looking for.

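There are several ways to scrape a website. Here is an example using BeautifulSoup.
BeautifulSoup can be installed with
pip install beautifulsoup4 on Windows (the PyPI package is named beautifulsoup4, not python-bs4)
apt-get install python3-bs4 on Linux (python-bs4 for Python 2)

You can get started here

Working code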

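from urllib.request import urlopen  # Python 3 (Python 2 used urllib.urlopen)
from bs4 import BeautifulSoup

# Fetch the page; in Python 3, urlopen raises HTTPError on responses such as 403
r = urlopen('http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts').read()
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r, 'html.parser')
print(type(soup))
print(soup.prettify()[0:1000])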

Output:

<class 'bs4.BeautifulSoup'> 
<!DOCTYPE html> 
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]--> 
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]--> 
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]--> 
<!--[if gt IE 8]><!--> 
<html class="no-js" lang="en-US"> 
<!--<![endif]--> 
<head> 
    <title> 
    Access denied | www.aflcio.org used Cloudflare to restrict access 
    </title> 
    <meta charset="utf-8"/> 
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/> 
    <meta content="IE=Edge,chrome=1" http-equiv="X-UA-Compatible"/> 
    <meta content="noindex, nofollow" name="robots"/> 
    <meta content="width=device-width,initial-scale=1,maximum-scale=1" name="viewport"/> 
    <link href="/cdn-cgi/styles/cf.errors.css" id="cf_styles-css" media="screen,projection" rel="stylesheet" type="text/css"/> 
    <!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-- 
>>>

You can work with the output to filter out what you need, such as iframes. More details here.
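A minimal sketch of that filtering step, assuming the goal from the question (page title plus the src of every iframe/embed tag). The sample HTML, the example.com URLs, and the helper names find_embeds and format_row are made up for illustration; in practice the HTML would come from the page you downloaded.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for downloaded HTML.
SAMPLE_HTML = """
<html>
  <head><title>Demo Page</title></head>
  <body>
    <iframe src="https://example.com/player/1"></iframe>
    <embed src="https://example.com/clip.swf">
    <iframe></iframe>
  </body>
</html>
"""


def find_embeds(html):
    """Return (page title, list of (tag name, src) pairs) for iframe/embed tags."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    found = []
    for tag in soup.find_all(["iframe", "embed"]):
        src = tag.get("src")
        if src:  # skip tags that carry no src attribute
            found.append((tag.name, src))
    return title, found


def format_row(title, found):
    """Format one row, loosely following the table layout from the question."""
    links = " | ".join(f"{name}: {src}" for name, src in found)
    return f"{title} | {len(found)} found | {links}"


title, found = find_embeds(SAMPLE_HTML)
print(format_row(title, found))
```

This only looks at tags present in the raw HTML; iframes injected later by JavaScript would not show up this way.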

Source

2017-03-09 03:51:42

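Awesome, this is exactly what I was looking for, but when I try to run "pip install python-bs4" I get this error: Could not find a version that satisfies the requirement python-bs4 (from versions: ) No matching distribution found for python-bs4 (I'm on Windows 10). Edit: got it with "pip install beautifulsoup4" – IndieGuts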

Glad to help. If my answer solved your problem, would you mind accepting it? That way the question won't stay listed as unanswered.
