Parsing an HTML table with Nokogiri and Mechanize

With the following code I am trying to scrape call logs from our phone provider's web application so that I can feed the information into my Ruby on Rails application.

desc "Import incoming calls" 
task :fetch_incomingcalls => :environment do 

    # Logs into manage.phoneprovider.co.uk and retrieves the list of incoming calls. 
    require 'rubygems' 
    require 'mechanize' 
    require 'logger' 

    # Create a new mechanize object 
    agent = Mechanize.new { |a| a.log = Logger.new(STDERR) } 

    # Load the Phone Provider website 
    page = agent.get("https://manage.phoneprovider.co.uk/login") 

    # Select the first form 
    form = agent.page.forms.first 
    form.username = 'username' 
    form.password = 'password' 

    # Submit the form 
    page = form.submit form.buttons.first 

    # Click on link called Call Logs 
    page = agent.page.link_with(:text => "Call Logs").click 

    # Click on link called Incoming Calls 
    page = agent.page.link_with(:text => "Incoming Calls").click 

    # Prints out table rows 
    # puts doc.css('table > tr') 

    # Print out the body as a test 
    # puts page.body 

end

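As you can see from the last five lines, I tested 'puts page.body', which works, so the code above is functional. It logs in successfully, navigates to Call Logs and then to Incoming Calls. The incoming calls table looks like this: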

| Timestamp | Source | Destination | Duration | 
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |  
| 03 Jan 13:40 | 12345678 | 12345679  | 00:01:01 |

It is generated from the following markup:

<thead> 
<tr> 
<td>Timestamp</td> 
<td>Source</td> 
<td>Destination</td> 
<td>Duration</td> 
<td>Cost</td> 
<td class='centre'>Recording</td> 
</tr> 
</thead> 
<tbody> 
<tr class='o'> 
<tr> 
<td>03 Jan 13:40</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:01:14</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='e'> 
<tr> 
<td>30 Dec 20:31</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:02:52</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='o'> 
<tr> 
<td>24 Dec 00:03</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:00:09</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='e'> 
<tr> 
<td>23 Dec 14:56</td> 
<td>12345678</td> 
<td>12345679</td> 
<td>00:00:07</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr> 
<tr class='o'> 
<tr> 
<td>21 Dec 13:26</td> 
<td>07793770851</td> 
<td>12345679</td> 
<td>00:00:26</td> 
<td></td> 
<td class='opt recording'> 
</td> 
</tr> 
</tr>

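I want to work out how to select just the cells I need (Timestamp, Source, Destination and Duration) and output them. After that I can worry about writing them to the database instead of the terminal.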

I tried using Selector Gadget, but it just shows 'td', or 'tr:nth-child(6) td, tr:nth-child(2) td' if I select more than one cell.

Any help or pointers would be much appreciated!


2012-01-05 dannymcc

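There is a pattern in the table that makes it easy to handle with XPath. The <tr> tags of the rows that hold the required information lack a class attribute. Fortunately, XPath provides some simple logical operations, including not(). That gives us exactly the functionality we need.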

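Once we have reduced the number of rows to process, we can iterate over the rows and extract the text of the necessary columns with XPath's element[n] selector. An important note here is that XPath counts elements starting from 1, so the first column of a table row is td[1].

Example code, using Nokogiri (and RSpec for the tests):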

require "rspec" 
require "nokogiri" 

HTML = <<HTML 
<table> 
    <thead> 
    <tr> 
     <td> 
     Timestamp 
     </td> 
     <td> 
     Source 
     </td> 
     <td> 
     Destination 
     </td> 
     <td> 
     Duration 
     </td> 
     <td> 
     Cost 
     </td> 
     <td class='centre'> 
     Recording 
     </td> 
    </tr> 
    </thead> 
    <tbody> 
    <tr class='o'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     03 Jan 13:40 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:01:14 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='e'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     30 Dec 20:31 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:02:52 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='o'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     24 Dec 00:03 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:00:09 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='e'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     23 Dec 14:56 
     </td> 
     <td> 
     12345678 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:00:07 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    <tr class='o'> 
     <td></td> 
    </tr> 
    <tr> 
     <td> 
     21 Dec 13:26 
     </td> 
     <td> 
     07793770851 
     </td> 
     <td> 
     12345679 
     </td> 
     <td> 
     00:00:26 
     </td> 
     <td></td> 
     <td class='opt recording'></td> 
    </tr> 
    </tbody> 
</table> 
HTML 

class TableExtractor 
    # Returns one hash per data row (rows whose <tr> has no class attribute). 
    def extract_data(html) 
        Nokogiri::HTML(html).xpath("//table/tbody/tr[not(@class)]").collect do |row| 
            # at_xpath makes it explicit that td[n] is XPath (1-based), not CSS. 
            timestamp   = row.at_xpath("td[1]").text.strip 
            source      = row.at_xpath("td[2]").text.strip 
            destination = row.at_xpath("td[3]").text.strip 
            duration    = row.at_xpath("td[4]").text.strip 
            {:timestamp => timestamp, :source => source, :destination => destination, :duration => duration} 
        end 
    end 
end 

describe TableExtractor do 
    before(:all) do 
        @html = HTML 
    end 

    it "should extract the timestamp properly" do 
        expect(subject.extract_data(@html)[0][:timestamp]).to eq "03 Jan 13:40" 
    end 

    it "should extract the source properly" do 
        expect(subject.extract_data(@html)[0][:source]).to eq "12345678" 
    end 

    it "should extract the destination properly" do 
        expect(subject.extract_data(@html)[0][:destination]).to eq "12345679" 
    end 

    it "should extract the duration properly" do 
        expect(subject.extract_data(@html)[0][:duration]).to eq "00:01:14" 
    end 

    it "should extract all informational rows" do 
        expect(subject.extract_data(@html).count).to eq 5 
    end 
end


2012-01-06 20:03:52

I'm not sure how to apply this code to what I already have. Does the following look like the right idea? https://gist.github.com/1574942 – dannymcc 2012-01-07 14:53:10

I didn't notice your reply until now. I have [forked your gist and added some code](https://gist.github.com/1592493). I have also answered your other question on this topic. – 2012-01-11 02:03:04

-1

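Using XPath selectors, you should be able to reach the exact node you want, in the worst case starting from the root. Using XPath with Nokogiri is documented here.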

For details on how to access every element with XPath, see here.


2012-01-06 06:50:46 jake

Is the documentation you linked to only for parsing XML data, or does it also work with HTML web pages? – dannymcc 2012-01-06 17:59:43

Yes. Check this one too: nokogiri.org/tutorials/searching_a_xml_html_document.html – jake 2012-01-06 19:50:18

Your problem is covered in this Railscasts