2009-02-25 65 views
2

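I'm trying to do a little bit of web scraping, but the WWW::Mechanize gem doesn't seem to like the encoding and crashes.
The POST request results in a 302 redirect (which Mechanize follows, so far so good), and it seems to crash on the resulting page. I've searched a lot but so far have not found anything on how to solve this. Do any of you have an idea? The exception raised is Iconv::IllegalSequence when using WWW::Mechanize.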

Code:

require 'rubygems' 
require 'mechanize' 

agent = WWW::Mechanize.new 

agent.user_agent_alias = 'Mac Safari' 
answer = agent.post('https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung', 
{"Country" => "Deutschland", 
"Abholstation" => "Aalen", 
"Abgabestation" => "Aalen", 
"Abholdatum" => "26.02.2009", 
"Abholzeit_stunde" => "13", 
"Abholzeit_minute" => "30", 
"Abgabedatum" => "28.02.2009", 
"Abgabezeit_stunde" => "13", 
"Abgabezeit_minute" => "30", 
"CountryID" => "DE", 
"AbholstationID"=>"AA1", 
"AbgabestationID"=>"AA1" 
} 
) 
puts answer.body 

Error:

D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `iconv': "\204nderungen vorbe"... (Iconv::IllegalSequence) 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `to_native_charset' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_header_handler.rb:29:in `handle' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_body_parser.rb:35:in `handle' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/pre_connect_hook.rb:14:in `handle' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:25:in `handle' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:494:in `fetch_page' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:545:in `fetch_page' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:403:in `post_form' 
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:322:in `post' 
from test.rb:7 

Answers

3

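The page is definitely UTF-8, but Mechanize uses NKF (a core Ruby library) to guess the encoding, and for some reason it comes up with Shift JIS. The quickest way to fix this is to override Mechanize's encoding map so that when it uses Iconv to convert the body to UTF-8, it passes UTF-8 as the source encoding as well. You can do that like this:

WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"

Place that just after the line where you require the mechanize library. You may want to set the value back straight away afterwards or, even better, find the root cause of the problem and submit a patch if necessary.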
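If you do want to set the value back afterwards, one way is to wrap the override in a helper that restores the original entry when the block finishes. This is only a minimal sketch: `FAKE_CODE_DIC` is a hypothetical stand-in for `WWW::Mechanize::Util::CODE_DIC` (with assumed values) so the snippet runs without the gem, and `with_charset_override` is a name made up for illustration, not part of Mechanize.

```ruby
# Hypothetical stand-in for WWW::Mechanize::Util::CODE_DIC (values assumed),
# so this sketch runs without the mechanize gem installed.
FAKE_CODE_DIC = { :SJIS => "SHIFT_JIS", :UTF8 => "UTF-8" }

# Temporarily map `key` to `value` in `dic`, run the block,
# then restore the previous state even if the block raises.
def with_charset_override(dic, key, value)
  had_key  = dic.key?(key)
  original = dic[key]
  dic[key] = value
  yield
ensure
  if had_key
    dic[key] = original
  else
    dic.delete(key)
  end
end

# Inside the block the override is visible; this is where the
# agent.post(...) call from the question would go.
seen = with_charset_override(FAKE_CODE_DIC, :SJIS, "UTF-8") do
  FAKE_CODE_DIC[:SJIS]
end
```

With the real library you would pass `WWW::Mechanize::Util::CODE_DIC` instead of the stand-in hash, assuming the constant is a plain mutable Hash as the one-line override above implies.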

Note: I worked this out by using the backtrace to debug the Mechanize library. The to_native_charset method calls detect_charset, which is where the problem lies.
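When chasing a root cause like this, it can help to check whether the raw bytes are well-formed UTF-8 at all, rather than relying on a guess. The sketch below is not what Mechanize does; it assumes Ruby 1.9 or later (the backtrace above is from Ruby 1.8, which has no `String#valid_encoding?`), and `utf8_bytes?` is a hypothetical helper name.

```ruby
# Returns true when the raw bytes form a well-formed UTF-8 sequence.
# Assumes Ruby 1.9+; works on a copy so the caller's string is untouched.
def utf8_bytes?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

valid_sample   = "\xC3\x84nderungen".b  # UTF-8 bytes for "Änderungen"
invalid_sample = "\x84nderungen".b      # lone continuation byte, not valid UTF-8

valid_result   = utf8_bytes?(valid_sample)
invalid_result = utf8_bytes?(invalid_sample)
```

A body that passes this check can be treated as UTF-8 without conversion; a body that fails it needs its real source encoding identified first.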

+0

Thank you so much! That solved it :D – 2009-02-25 15:51:15

0

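In my case, the get method returned a Mechanize::File, which does not handle encoding at all.
I was able to solve it with a manual Iconv conversion, but this only works if you already know the source encoding.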

require 'iconv'

result = @agent.get uri
# Mechanize::File instead of Mechanize::Page is returned,
# so we have to convert manually (here from ISO-8859-1 to UTF-8)
result = Iconv.conv("utf-8", "iso-8859-1", result.body)
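On newer Rubies, where Iconv is no longer part of the standard library, the same manual conversion can be sketched with the built-in String encoding API. This assumes Ruby 1.9 or later and, as above, that you already know the source encoding; `to_utf8` is a hypothetical helper, and the sample bytes stand in for `result.body`.

```ruby
# Convert raw bytes in a known source encoding to a UTF-8 string.
# Assumes Ruby 1.9+; works on a copy so the caller's string is untouched.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# Stand-in for result.body: ISO-8859-1 bytes for "Änderungen"
body      = "\xC4nderungen".b
converted = to_utf8(body, "ISO-8859-1")
```

If the declared source encoding is wrong, the result will be valid UTF-8 but contain the wrong characters, so this is no substitute for knowing what the server actually sent.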