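Ruby code to get the size of a web page in bytes
I want to calculate the size of a web page in bytes. For example, www.google.com is about 44 KB and facebook.com is about 17 KB. I tried using Nokogiri to calculate the length of the HTML, but it gave 8k for Google and 32k for Facebook. I don't want to use any third-party tool; I want to calculate it inside my application.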
Answers
This code sample should help you. It downloads the site and uses the length method to retrieve the size.
require 'net/http'
require 'uri'

class HttpSample
  def download_google_home
    # Behind a proxy, replace Net::HTTP below with
    # Net::HTTP::Proxy('ipaddress', portnumber), using your actual IP and port.
    url = URI.parse('http://www.google.com')
    http_response = Net::HTTP.get_response(url)
    puts http_response.body.length # size of the response body
  end
end

s = HttpSample.new
s.download_google_home
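One caveat about the length method used above: on a string tagged with a text encoding such as UTF-8, `length` counts characters, while `bytesize` always counts bytes. The snippet below is only an illustration with a made-up sample string, not part of the original answer.

```ruby
# A sample string (hypothetical) containing one non-ASCII character.
text = "héllo"                    # 'é' occupies two bytes in UTF-8

char_count = text.length          # number of characters
byte_count = text.bytesize        # number of bytes

# Re-tagging the same bytes as binary makes length count bytes too.
binary_length = text.dup.force_encoding(Encoding::BINARY).length

puts "#{char_count} characters, #{byte_count} bytes, binary length #{binary_length}"
```

If the goal is a size in bytes, calling `bytesize` on the body avoids depending on how the string happens to be encoded.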
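Using Net::HTTP::Head lets you ask a web server about a page without it having to return the page and waste its, and your, bandwidth and CPU time. One of the headers returned should be Content-Length: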
require 'net/http'
request = Net::HTTP.new('google.com', 80)
head = request.request_head('/')
Returns:
#<Net::HTTPMovedPermanently:0x102157ae0
@body_exist = false,
@read = true,
@socket = nil,
attr_accessor :body = nil,
attr_reader :code = "301",
attr_reader :header = {
"location" => [
[0] "http://www.google.com/"
],
"content-type" => [
[0] "text/html; charset=UTF-8"
],
"date" => [
[0] "Thu, 26 Jul 2012 17:46:30 GMT"
],
"expires" => [
[0] "Sat, 25 Aug 2012 17:46:30 GMT"
],
"cache-control" => [
[0] "public, max-age=2592000"
],
"server" => [
[0] "gws"
],
"content-length" => [
[0] "219"
],
"x-xss-protection" => [
[0] "1; mode=block"
],
"x-frame-options" => [
[0] "SAMEORIGIN"
],
"connection" => [
[0] "close"
]
},
attr_reader :http_version = "1.1",
attr_reader :message = "Moved Permanently"
>
This is a redirect, telling the browser that it needs to go elsewhere.
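Unfortunately, not all HTTPds return the content-length header. The page might be created dynamically, so the server can't make a good guess at the size until the content has actually been rendered and sent.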
Following the redirect above with another HEAD request results in:
#<Net::HTTPOK:0x10217e8c0
@body_exist = false,
@read = true,
@socket = nil,
attr_accessor :body = nil,
attr_reader :code = "200",
attr_reader :header = {
"set-cookie" => [
[ 0] "NID=62=c2jRl25ItoF5YkVgNv3g2woB2A3iIqkY__EYX5BGst--KYmjNbfCeVL0FIUcq6jm6PqH_-YV6QFO_yNjy1BzMms-QJKPRsfcq0px030WVzKTMtMF9dJUJpS0XdV1NLOv; expires=Fri, 25-Jan-2013 17:50:22 GMT; path=/; domain=.google.com; HttpOnly",
[ 1] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
[ 2] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
[ 3] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
[ 4] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
[ 5] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
[ 6] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
[ 7] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
[ 8] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
[ 9] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
[10] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
[11] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
[12] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
[13] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
[14] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
[15] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
[16] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
[17] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
[18] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
[19] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
[20] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
[21] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
[22] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
[23] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
[24] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
[25] "PREF=ID=51ce2f15ffbc5de1:FF=0:TM=1343325022:LM=1343325022:S=H8-1NoxuEbX7fepF; expires=Sat, 26-Jul-2014 17:50:22 GMT; path=/; domain=.google.com",
[26] "NID=62=aO6oBKx_v48l5SqQrRDUiNxfOixEE0QnkQIBSZK4u0xS8cHGc7uXTUt6yJhIZTyCe_XWGn6t3-Ov4EvxPE8hAO7I89ao9RR9dLUyYPBB784fR12bJsqbkTaCVaZI7ihT; expires=Fri, 25-Jan-2013 17:50:22 GMT; path=/; domain=.google.com; HttpOnly"
],
"date" => [
[0] "Thu, 26 Jul 2012 17:50:22 GMT"
],
"expires" => [
[0] "-1"
],
"cache-control" => [
[0] "private, max-age=0"
],
"content-type" => [
[0] "text/html; charset=ISO-8859-1"
],
"p3p" => [
[0] "CP=\"This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info.\""
],
"server" => [
[0] "gws"
],
"x-xss-protection" => [
[0] "1; mode=block"
],
"x-frame-options" => [
[0] "SAMEORIGIN"
],
"connection" => [
[0] "close"
]
},
attr_reader :http_version = "1.1",
attr_reader :message = "OK"
>
Notice that no content-length header was returned.
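The manual step of issuing a second HEAD request after a redirect could be automated roughly as follows. This is a sketch under assumptions: the helper name `head_following_redirects` and the limit of five hops are made up for illustration and do not come from the original answer.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: sends HEAD requests, following Location headers
# until a non-redirect response arrives or the hop limit is used up.
def head_following_redirects(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request_head(uri.request_uri)
  end

  if response.is_a?(Net::HTTPRedirection) && response['location']
    # Location may be relative, so resolve it against the current URL.
    head_following_redirects(URI.join(url, response['location']).to_s, limit - 1)
  else
    response
  end
end

# Example (performs real network requests):
#   head = head_following_redirects('http://google.com/')
#   puts head['content-length'].inspect   # nil when the header is absent
```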
Going to a site that returns a static page gave me a different response:
request = Net::HTTP.new('tools.ietf.org', 80)
head = request.request_head('/html/rfc2606')
Returns:
#<Net::HTTPOK:0x100914370
@body_exist = false,
@read = true,
@socket = nil,
attr_accessor :body = nil,
attr_reader :code = "200",
attr_reader :header = {
"date" => [
[0] "Thu, 26 Jul 2012 17:55:23 GMT"
],
"server" => [
[0] "Apache/2.2.21 (Debian)"
],
"content-location" => [
[0] "rfc2606.html"
],
"vary" => [
[0] "negotiate"
],
"tcn" => [
[0] "choice"
],
"last-modified" => [
[0] "Sat, 26 May 2012 22:18:00 GMT"
],
"etag" => [
[0] "\"d44ff-43da-4c0f7db90d600;4c5bf43471540\""
],
"accept-ranges" => [
[0] "bytes"
],
"content-length" => [
[0] "17370"
],
"connection" => [
[0] "close"
],
"content-type" => [
[0] "text/html; charset=UTF-8"
]
},
attr_reader :http_version = "1.1",
attr_reader :message = "OK"
>
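So, yes, it's possible to tell, but sometimes you can't get the information you need from a HEAD request.
In the past, my solution was to try HEAD first, and if that didn't give me what I needed, retrieve the page with a normal GET and take the size from it. That helps reduce wasted bandwidth.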
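The HEAD-first, GET-as-fallback approach described above might look something like this. It is a minimal sketch, not the answerer's actual code: `size_from_head` and `page_size` are hypothetical names, and redirects are not followed here.

```ruby
require 'net/http'
require 'uri'

# Returns the Content-Length reported by a successful HEAD response,
# or nil when the response is not a 2xx or the header is missing.
def size_from_head(head_response)
  head_response.is_a?(Net::HTTPSuccess) ? head_response.content_length : nil
end

# Tries HEAD first; falls back to downloading the body with GET.
def page_size(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    path = uri.request_uri
    size_from_head(http.request_head(path)) || http.get(path).body.bytesize
  end
end

# Example (performs real network requests):
#   puts page_size('http://tools.ietf.org/html/rfc2606')
```

Note that the two branches may not measure quite the same thing: Content-Length describes the bytes sent on the wire, while the GET fallback measures the body as Net::HTTP hands it back.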
If the HTML is, or can be, sent gzipped over the 'net, do you want the size of the gzipped data or the raw uncompressed size from the response? – Phrogz 2012-07-26 16:36:32
Nokogiri isn't the tool to use for this. It's just an XML/HTML parser. – 2012-07-26 17:26:44
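To illustrate the gzip question raised in the first comment, the sketch below compresses a made-up HTML string locally with Ruby's zlib library and compares the two sizes. The sample markup is an assumption chosen only to show that the numbers can differ a lot; no real page is involved.

```ruby
require 'zlib'

# Hypothetical, highly repetitive markup standing in for a real page.
html = '<html><body>' + ('<p>hello world</p>' * 200) + '</body></html>'

compressed = Zlib.gzip(html)

puts "uncompressed: #{html.bytesize} bytes"
puts "gzipped:      #{compressed.bytesize} bytes"
```

Which of the two figures counts as "the size of the page" depends on whether you care about bytes transferred or bytes of markup.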