How to write a web proxy in Python

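I'm trying to write a web proxy in Python. The goal is to visit a URL like http://proxyurl/http://anothersite.com/ and see the content of http://anothersite.com just as you normally would. I've gotten decently far by abusing the requests library, but this isn't really the intended use of the requests framework. I've written proxies with twisted before, but I'm not sure how to connect that to what I'm trying to do. Here's where I'm at so far: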

import os 
import urlparse 

import requests 

import tornado.ioloop 
import tornado.web 
from tornado import template 

ROOT = os.path.dirname(os.path.abspath(__file__)) 
path = lambda *a: os.path.join(ROOT, *a) 

loader = template.Loader(path(ROOT, 'templates')) 


class ProxyHandler(tornado.web.RequestHandler):
    def get(self, slug):
        if slug.startswith("http://") or slug.startswith("https://"):
            if self.get_argument("start", None) == "true":
                parsed = urlparse.urlparse(slug)
                self.set_cookie("scheme", value=parsed.scheme)
                self.set_cookie("netloc", value=parsed.netloc)
                self.set_cookie("urlpath", value=parsed.path)
            # external resource
            else:
                response = requests.get(slug)
                headers = response.headers
                if 'content-type' in headers:
                    self.set_header('Content-type', headers['content-type'])
                if 'length' in headers:
                    self.set_header('length', headers['length'])
                for block in response.iter_content(1024):
                    self.write(block)
                self.finish()
                return
        else:
            # absolute
            if slug.startswith('/'):
                slug = "{scheme}://{netloc}{original_slug}".format(
                    scheme=self.get_cookie('scheme'),
                    netloc=self.get_cookie('netloc'),
                    original_slug=slug,
                )
            # relative
            else:
                slug = "{scheme}://{netloc}{path}{original_slug}".format(
                    scheme=self.get_cookie('scheme'),
                    netloc=self.get_cookie('netloc'),
                    path=self.get_cookie('urlpath'),
                    original_slug=slug,
                )
        response = requests.get(slug)
        # get the headers
        headers = response.headers
        # get doctype
        doctype = None
        if '<!doctype' in response.content.lower()[:9]:
            doctype = response.content[:response.content.find('>')+1]
        if 'content-type' in headers:
            self.set_header('Content-type', headers['content-type'])
        if 'length' in headers:
            self.set_header('length', headers['length'])
        self.write(response.content)


application = tornado.web.Application([ 
    (r"/(.+)", ProxyHandler), 
]) 

if __name__ == "__main__": 
    application.listen(8888) 
    tornado.ioloop.IOLoop.instance().start()

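Just a note: I set cookies to preserve the scheme, netloc, and urlpath if start=true is in the query string. That way, any relative or absolute link that then hits the proxy uses those cookies to resolve the full URL.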
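The cookie-based resolution described above can also be written with the standard library's URL joining, which handles root-relative paths, relative paths, and `..` segments in one call. This is only a minimal Python 3 sketch (the handler in the question is Python 2); `resolve_slug` and its parameters are hypothetical names standing in for the three cookie values.

```python
from urllib.parse import urljoin, urlunsplit


def resolve_slug(slug, scheme, netloc, urlpath):
    """Rebuild the full upstream URL for a path that reached the proxy.

    scheme, netloc and urlpath stand in for the cookie values that the
    handler stores when start=true is passed (hypothetical parameters).
    """
    if slug.startswith(("http://", "https://")):
        return slug  # already a full URL, nothing to resolve
    # Base URL of the page that was originally proxied.
    base = urlunsplit((scheme, netloc, urlpath or "/", "", ""))
    # urljoin applies the same rules a browser uses for relative links.
    return urljoin(base, slug)


if __name__ == "__main__":
    print(resolve_slug("/header.png", "http", "anothersite.com", "/us/shop/"))
    print(resolve_slug("css/form.css", "http", "anothersite.com", "/us/shop/"))
```

Compared with string formatting, this avoids doubled or missing slashes when the stored path does not end in `/`.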

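With this code, if you go to http://localhost:8888/http://espn.com/?start=true you'll see the contents of ESPN. However, it doesn't work at all on the following site: http://www.bottegaveneta.com/us/shop/. My question is: what's the best way to do this? Is the way I'm currently implementing it robust, or are there some horrible pitfalls to doing it this way? If it is correct, why do certain sites like the one I pointed out not work at all?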

Thank you for any help.

2013-05-13 Kang Roodle

Bottega Veneta doesn't let you access resources directly. For example, try hitting http://www.bottegaveneta.com/us/shop/css/bottegaveneta/form.css - I get an HTML 404 page. – 2013-05-14 02:29:40

I'm guessing this has to do with the HTTP Referrer. You could try setting that as well. – 2013-05-14 02:30:49

@Cole Oh, you mean the referer? (https://en.wikipedia.org/wiki/HTTP_referer#Origin_of_the_term_referer) – rakslice 2013-10-04 01:32:42
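The Referer suggestion from the comments above could be tried by sending the upstream site's own origin as the Referer when fetching its resources. Whether this is what the site in question actually checks is a guess; the sketch below (Python 3) only shows how such headers might be built, and `upstream_headers` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def upstream_headers(target_url, referer=None):
    """Build headers for a proxied fetch of target_url.

    If no referer is given, fall back to the root of the target's own site,
    so that the request looks like same-site navigation (an assumption about
    what the upstream server expects).
    """
    if referer is None:
        parts = urlsplit(target_url)
        referer = f"{parts.scheme}://{parts.netloc}/"
    # HTTP spells this header with a single "r".
    return {"Referer": referer}


# Usage with requests (not executed here):
#   requests.get(url, headers=upstream_headers(url))
```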

I don't think you need your last block. This seems to work for me:

class ProxyHandler(tornado.web.RequestHandler):
    def get(self, slug):
        print 'get: ' + str(slug)

        if slug.startswith("http://") or slug.startswith("https://"):
            if self.get_argument("start", None) == "true":
                parsed = urlparse.urlparse(slug)
                self.set_cookie("scheme", value=parsed.scheme)
                self.set_cookie("netloc", value=parsed.netloc)
                self.set_cookie("urlpath", value=parsed.path)
            # external resource
            else:
                response = requests.get(slug)
                headers = response.headers
                if 'content-type' in headers:
                    self.set_header('Content-type', headers['content-type'])
                if 'length' in headers:
                    self.set_header('length', headers['length'])
                for block in response.iter_content(1024):
                    self.write(block)
                self.finish()
                return
        else:
            slug = "{scheme}://{netloc}/{original_slug}".format(
                scheme=self.get_cookie('scheme'),
                netloc=self.get_cookie('netloc'),
                original_slug=slug,
            )
            print self.get_cookie('scheme')
            print self.get_cookie('netloc')
            print self.get_cookie('urlpath')
            print slug
        response = requests.get(slug)
        # get the headers
        headers = response.headers
        # get doctype
        doctype = None
        if '<!doctype' in response.content.lower()[:9]:
            doctype = response.content[:response.content.find('>')+1]
        if 'content-type' in headers:
            self.set_header('Content-type', headers['content-type'])
        if 'length' in headers:
            self.set_header('length', headers['length'])
        self.write(response.content)

2013-05-13 16:53:50

-3

You can use the requests module:

import requests 

proxies = { 
    "http": "http://10.10.1.10:3128", 
    "https": "http://10.10.1.10:1080", 
} 

requests.get("http://example.org", proxies=proxies)

requests docs

2013-05-21 08:25:53 sinceq

Why isn't this +1 or more? – sinceq 2013-06-28 10:29:24

Because he's trying to *write* a proxy, not *use* one. – Xavier 2013-08-05 15:22:33

You could use the socket module from the standard library, and if you're on Linux, epoll as well.

You can see example code for a simple asynchronous server here: https://github.com/aychedee/octopus/blob/master/octopus/server.py
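As a rough illustration of the readiness-based style suggested above, here is a small Python 3 sketch. It uses the standard library's selectors module, which picks epoll on Linux, rather than calling select.epoll directly. `relay_until_closed` is a hypothetical helper and is not taken from the linked repository; a real proxy would relay in both directions and handle many connections at once.

```python
import selectors
import socket


def relay_until_closed(src, dst, timeout=1.0):
    """Copy bytes from src to dst until src is closed; return the byte count."""
    sel = selectors.DefaultSelector()  # epoll on Linux, best available elsewhere
    src.setblocking(False)
    sel.register(src, selectors.EVENT_READ)
    total = 0
    try:
        while True:
            if not sel.select(timeout):
                break  # nothing became readable within the timeout
            try:
                chunk = src.recv(4096)
            except BlockingIOError:
                continue  # spurious wake-up, wait again
            if not chunk:
                break  # the peer closed its end
            dst.sendall(chunk)
            total += len(chunk)
    finally:
        sel.unregister(src)
        sel.close()
    return total


if __name__ == "__main__":
    client, proxy_in = socket.socketpair()
    proxy_out, server = socket.socketpair()
    client.sendall(b"GET / HTTP/1.0\r\n\r\n")
    client.close()
    print(relay_until_closed(proxy_in, proxy_out), "bytes relayed")
    print(server.recv(1024))
```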

2013-08-03 18:14:56 aychedee

If you want a real proxy, you can use:

tornado-proxy

or

simple proxy based on Twisted

But I think it wouldn't be hard to adapt either one to your case.

2013-08-26 14:57:37 shirk3y

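I recently wrote a similar web application. Note that this is just the way I did it; I'm not saying you should do it this way. These are some of the pitfalls I ran into:

Changing attribute values from relative to absolute

There is much more involved than just fetching a page and presenting it to the client. Many times you cannot proxy a web page without any errors.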

Why are certain sites like the one I pointed out not working at all?

Many web pages rely on relative paths to resources in order to display the page in a well-formatted manner. For example, this image tag:

<img src="/header.png" />

will result in the client making a request to:

http://proxyurl/header.png

which fails. The 'src' value should be converted to:

http://anothersite.com/header.png.

So you need to parse the HTML document with something like BeautifulSoup, loop over all the tags, and check for attributes such as:

'src', 'lowsrc', 'href'

and change their values accordingly, so that the tag becomes:

<img src="http://anothersite.com/header.png" />

This method applies to more tags than just the image one. a, script, link, li and frame are a few you should change as well.
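A minimal Python 3 sketch of the attribute rewriting described above. To stay free of third-party dependencies it uses the standard library's html.parser instead of BeautifulSoup, so it is a swapped-in technique rather than the approach named in the answer. `AbsoluteUrlRewriter` and `make_links_absolute` are hypothetical names, and re-serialising the markup this way may normalise details such as attribute quoting.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urljoin

URL_ATTRS = {"src", "lowsrc", "href"}


class AbsoluteUrlRewriter(HTMLParser):
    """Re-emit an HTML document with selected attributes made absolute."""

    def __init__(self, base_url):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _emit_tag(self, tag, attrs, closing):
        rendered = []
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                rendered.append(f" {name}")
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            # The parser unescapes attribute values, so escape them again.
            rendered.append(f' {name}="{html.escape(value, quote=True)}"')
        attr_text = "".join(rendered)
        self.out.append(f"<{tag}{attr_text}{closing}")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def make_links_absolute(html_text, base_url):
    parser = AbsoluteUrlRewriter(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(make_links_absolute('<img src="/header.png" />', "http://anothersite.com/"))
```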

HTML shenanigans

The previous method should get you far, but you're not done yet.

Both

<style type="text/css" media="all">@import "/stylesheet.css?version=120215094129002";</style>

and

<div style="position:absolute;right:8px;background-image:url('/Portals/_default/Skins/BE/images/top_img.gif');height:200px;width:427px;background-repeat:no-repeat;background-position:right top;" >

are examples of code that is difficult to reach and modify using BeautifulSoup.

In the first example there is a CSS @import with a relative URI. The second one concerns the 'url()' function in an inline CSS statement.

In my case, I ended up writing horrible code to manually modify these values. You may want to use a regex for this, but I'm not sure.
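If a regex is used for this, it might look roughly like the Python 3 sketch below. It covers only the two patterns shown above (quoted `@import` and `url(...)`) and is not a CSS parser; unusual escaping or comments inside the CSS would not be handled. `absolutize_css` is a hypothetical name.

```python
import re
from urllib.parse import urljoin

# url(...) with optional single or double quotes around the target.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)
# @import "..." or @import '...'; the @import url(...) form is covered by CSS_URL.
CSS_IMPORT = re.compile(r"""(@import\s+)(['"])(.*?)\2""", re.IGNORECASE)


def absolutize_css(css_text, base_url):
    """Rewrite relative targets in url(...) and @import "..." to absolute URLs."""

    def fix_url(match):
        quote, target = match.group(1), match.group(2)
        return f"url({quote}{urljoin(base_url, target)}{quote})"

    def fix_import(match):
        prefix, quote, target = match.groups()
        return f"{prefix}{quote}{urljoin(base_url, target)}{quote}"

    return CSS_IMPORT.sub(fix_import, CSS_URL.sub(fix_url, css_text))


if __name__ == "__main__":
    print(absolutize_css('@import "/stylesheet.css?version=120215094129002";',
                         "http://anothersite.com/"))
```

The same function can be applied both to the text of `<style>` elements and to the value of `style` attributes.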

Redirects

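With Python's requests or urllib2 you can easily follow redirects automatically. Just remember to save the new (base) URI; you will need it for the 'changing attribute values from relative to absolute' operation.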

You also need to deal with 'hardcoded' redirects, such as this one:

<meta http-equiv="refresh" content="0;url=http://new-website.com/">

which needs to be changed to:

<meta http-equiv="refresh" content="0;url=http://proxyurl/http://new-website.com/">
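The rewrite of the refresh target shown above could be sketched as follows in Python 3. The function works on the value of the `content` attribute only; `proxify_refresh`, `page_url`, and the `proxy_prefix` default are hypothetical names, with `http://proxyurl/` taken from the example URLs in this thread.

```python
import re
from urllib.parse import urljoin

# "<delay>;url=<target>", allowing spaces and a comma as separator.
REFRESH = re.compile(r"^\s*([\d.]+)\s*[;,]\s*url\s*=\s*(.+?)\s*$", re.IGNORECASE)


def proxify_refresh(content, page_url, proxy_prefix="http://proxyurl/"):
    """Point a meta refresh target back through the proxy."""
    match = REFRESH.match(content)
    if match is None:
        return content  # a plain delay such as "30" has no target to rewrite
    delay, target = match.groups()
    target = target.strip("'\"")  # some pages quote the URL
    # Resolve relative targets against the page they came from.
    return f"{delay};url={proxy_prefix}{urljoin(page_url, target)}"


if __name__ == "__main__":
    print(proxify_refresh("0;url=http://new-website.com/", "http://anothersite.com/"))
```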

Base tag

The base tag specifies the base URL/target for all relative URLs in a document. You probably want to change its value.
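One way to change or add that value is sketched below in Python 3, using regular expressions rather than a full HTML parser. `set_base_href` is a hypothetical helper, and which URL to put in the tag (the upstream site itself or a proxied form of it) is left to the caller. Note that a replaced base tag loses any `target` attribute it had.

```python
import re

HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
BASE_TAG = re.compile(r"<base\b[^>]*>", re.IGNORECASE)


def set_base_href(html_text, base_url):
    """Replace the first <base> tag, or insert one right after <head>."""
    new_tag = f'<base href="{base_url}">'
    if BASE_TAG.search(html_text):
        # A function is used as the replacement so that backslashes in the
        # URL are never interpreted as regex escapes.
        return BASE_TAG.sub(lambda _match: new_tag, html_text, count=1)
    head = HEAD_OPEN.search(html_text)
    if head is None:
        return html_text  # no <head> to attach the tag to
    return html_text[:head.end()] + new_tag + html_text[head.end():]


if __name__ == "__main__":
    page = "<html><head><title>t</title></head><body></body></html>"
    print(set_base_href(page, "http://anothersite.com/"))
```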

Finally done?

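Nope. Some websites rely heavily on JavaScript to draw their content on screen. These sites are the hardest to proxy. I've been thinking about using something like PhantomJS or Ghost to fetch and evaluate the web pages and present the result to the client.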

Maybe my source code can help you. You can use it in any way you want.

2013-11-01 15:10:50 cpb2

You can stick a `<base>` tag in the document head to fix relative URLs in one fell swoop. (Unless there's already one, though!) – kindall 2013-11-01 15:38:40

I hadn't thought of that! I'll give it a try. Thanks! – cpb2 2013-11-01 15:40:55

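Apparently I'm quite late in answering this question, but I just stumbled upon it. I've been writing something similar to your requirements myself.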

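It's more of an HTTP repeater, but its first task is to be a proxy itself. It isn't fully complete yet, and there's no README for it so far, but both are on my to-do list.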

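I've used mitmproxy to achieve this. It might not be the most elegant piece of code out there, and I've used lots of hacks here and there to get the repeater functionality. I know mitmproxy by default has ways to implement a repeater quite easily, but in my case there were some specific requirements where I couldn't use the features mitmproxy offers.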

You can find the project at https://github.com/c0n71nu3/python_repeater/. The repo is still being updated as and when I make progress.

Hope it helps you.

2015-09-01 11:28:59 qre0ct
