2012-01-05 83 views
4

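Parsing an HTML table using Nokogiri and Mechanize

Using the following code, I am trying to scrape call logs from our phone provider's web application so that I can import the information into my Ruby on Rails application.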

desc "Import incoming calls" 
task :fetch_incomingcalls => :environment do 

    # Logs into manage.phoneprovider.co.uk and retrieves the list of incoming calls. 
    require 'rubygems' 
    require 'mechanize' 
    require 'logger' 

    # Create a new mechanize object 
    agent = Mechanize.new { |a| a.log = Logger.new(STDERR) } 

    # Load the Phone Provider website 
    page = agent.get("https://manage.phoneprovider.co.uk/login") 

    # Select the first form 
    form = agent.page.forms.first 
    form.username = 'username' 
    form.password = 'password' 

    # Submit the form 
    page = form.submit form.buttons.first 

    # Click on link called Call Logs 
    page = agent.page.link_with(:text => "Call Logs").click 

    # Click on link called Incoming Calls 
    page = agent.page.link_with(:text => "Incoming Calls").click 

    # Prints out table rows 
    # puts doc.css('table > tr') 

    # Print out the body as a test 
    # puts page.body 

end 

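As you can see from the last five lines, I have tested 'puts page.body' and it works, so the code above does what it should. It logs in successfully, then navigates to Call Logs and then to Incoming Calls. The incoming calls table looks like this: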

| Timestamp | Source | Destination | Duration | 
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  

It is generated from the following HTML:

<thead> 
<tr> 
<td>Timestamp</td> 
<td>Source</td> 
<td>Destination</td> 
<td>Duration</td> 
<td>Cost</td> 
<td class='centre'>Recording</td> 
</tr> 
</thead> 
<tbody> 
<tr class='o'> 
<tr> 
<td>03 Jan 13:40</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:01:14</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='e'> 
<tr> 
<td>30 Dec 20:31</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:02:52</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='o'> 
<tr> 
<td>24 Dec 00:03</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:00:09</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='e'> 
<tr> 
<td>23 Dec 14:56</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:00:07</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='o'> 
<tr> 
<td>21 Dec 13:26</td> 
<td>07793770851</td> 
<td>12345679</td> 
<td>00:00:26</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 

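I would like to work out how to select the cells I want (Timestamp, Source, Destination and Duration) and output them. After that I can worry about writing them to the database instead of the terminal.

I have tried using Selector Gadget, but it just shows 'td', or 'tr:nth-child(6) td, tr:nth-child(2) td' if I select multiple cells.

Any help or pointers would be appreciated!

Answers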

回答

10

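There is a pattern in the table that is easy to exploit with XPath. The <tr> tags of the rows that hold the required information lack a class attribute. Fortunately, XPath provides some simple logical operations, including not(). This gives us exactly the functionality we need.

Once we have reduced the number of rows we are dealing with, we can iterate over the rows and extract the text of the necessary columns using XPath's element[n] selector. One important note here is that XPath counts elements starting at 1, so the first column of a table row is td[1].

Example code using Nokogiri (and RSpec):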

require "rspec" 
require "nokogiri" 

HTML = <<HTML 
<table> 
    <thead> 
    <tr> 
     <td> 
     Timestamp 
     </td> 
     <td> 
     Source 
     </td> 
     <td> 
     Destination 
     </td> 
     <td> 
     Duration 
     </td> 
     <td> 
     Cost 
     </td> 
     <td class='centre'> 
     Recording 
     </td> 
    </tr> 
    </thead> 
    <tbody> 
    <tr class='o'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     03 Jan 13:40 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:01:14 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='e'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     30 Dec 20:31 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:02:52 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='o'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     24 Dec 00:03 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:00:09 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='e'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     23 Dec 14:56 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:00:07 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='o'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     21 Dec 13:26 
     </td> 
     <td> 
     07793770851 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:00:26 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    </tbody> 
</table> 
HTML 

class TableExtractor 
  # Returns one hash per data row (rows whose <tr> has no class attribute). 
  def extract_data(html) 
    Nokogiri::HTML(html).xpath("//table/tbody/tr[not(@class)]").collect do |row| 
      timestamp   = row.at("td[1]").text.strip 
      source      = row.at("td[2]").text.strip 
      destination = row.at("td[3]").text.strip 
      duration    = row.at("td[4]").text.strip 
      {:timestamp => timestamp, :source => source, :destination => destination, :duration => duration} 
    end 
  end 
end 

describe TableExtractor do 
  before(:all) do 
    @html = HTML 
  end 

  it "should extract the timestamp properly" do 
    subject.extract_data(@html)[0][:timestamp].should eq "03 Jan 13:40" 
  end 

  it "should extract the source properly" do 
    subject.extract_data(@html)[0][:source].should eq "12345678" 
  end 

  it "should extract the destination properly" do 
    subject.extract_data(@html)[0][:destination].should eq "12345679" 
  end 

  it "should extract the duration properly" do 
    subject.extract_data(@html)[0][:duration].should eq "00:01:14" 
  end 

  it "should extract all informational rows" do 
    subject.extract_data(@html).count.should eq 5 
  end 
end 
+0

I'm not sure how to apply this code to the code I already have. If you look at the following you should get an idea of what I'm thinking: https://gist.github.com/1574942 – dannymcc 2012-01-07 14:53:10

+0

I didn't notice your reply until now. I have [forked your gist and added some code](https://gist.github.com/1592493). I have also answered your other question on this topic. – 2012-01-11 02:03:04

-1

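Using XPath selectors, you should be able to reach the exact node you want, in the worst case by starting from the root. Using XPath with Nokogiri is documented here

For details on how to access all elements with XPath, see here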

+0

Is the documentation you linked only for parsing XML data, or can it also be used with HTML web pages? – dannymcc 2012-01-06 17:59:43

+0

Yes. Check this one too: nokogiri.org/tutorials/searching_a_xml_html_document.html – jake 2012-01-06 19:50:18