Parsing badly formatted HTML for table data

I am writing a C# console application to retrieve table information from an external HTML web page.

I want to extract all of the <td> records (data, match, opponent, result, etc.), that is, the row for the 23rd in the linked example above.

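I have no control over this web page, and unfortunately it is not well formed, so the options I have tried, such as HtmlAgilityPack and XML parsing, simply fail. I have also tried a few regular expressions, but my knowledge of them is very poor. An example of what I tried is below: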

This returns a complete list of all the <tr> elements (many records, most of which I don't need), but I can't get the data out of it.

UPDATE

Here is an example of my attempt using HtmlAgilityPack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
{
    foreach (HtmlNode row in table.SelectNodes("tr"))
    {
        foreach (HtmlNode cell in row.SelectNodes("td"))
        {
            Console.WriteLine(cell.InnerText);
        }
    }
}

2014-10-02 Matt Webb

Try taking a look at this SO question: http://stackoverflow.com/questions/14987878/html-agility-pack-parse-table – Icemanind 2014-10-02 22:32:45

As I mentioned in my question, using Html Agility Pack fails because the page is missing a closing meta tag. – 2014-10-02 22:34:16

If the problem is specific and limited to the meta tags, why not do something like html.Replace("malformed meta", "better meta") and fix them? – MatthewMartin 2014-10-02 22:37:30
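A minimal sketch of that pre-cleaning idea, assuming the only problem is `<meta>` tags that are never closed. The helper name `CloseMetaTags` and the sample markup are made up for illustration, and a regex like this will not cope with attribute values that contain `>`. It only uses the .NET base library, and whether it is enough to make a particular page parse is not something the thread confirms.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaFixer
{
    // Hypothetical helper: rewrites every <meta ...> tag as a self-closing <meta ... />.
    public static string CloseMetaTags(string html)
    {
        // Match "<meta", its attributes, and an optional "/" just before ">".
        return Regex.Replace(
            html,
            @"<meta\b([^>]*?)\s*/?>",
            "<meta$1 />",
            RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        // Made-up sample input with an unclosed meta tag.
        string html = "<head><meta charset=\"utf-8\"><title>Results</title></head>";
        Console.WriteLine(CloseMetaTags(html));
    }
}
```

The cleaned string can then be handed to whichever parser was failing, for example through `doc.LoadHtml(...)`.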

I think you just need to fix your HtmlAgilityPack attempt. This works fine for me:

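// Requires "using System.Linq;" for Skip and Take.
// Skip the first table on that page so we just get results
foreach (var table in doc.DocumentNode.SelectNodes("//table").Skip(1).Take(1)) {
    // ".//td" keeps the search inside this table; "//td" would match every cell in the document
    foreach (var td in table.SelectNodes(".//td")) {
        Console.WriteLine(td.InnerText);
    }
}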

This dumps the data from the results table to the console, one cell per line.
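If you need the cells grouped back into records rather than one long list, something along these lines may help. It is only a sketch: `ExtractRows`, the `tableIndex` parameter and the sample HTML are invented for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced. Note that `SelectNodes` returns null when nothing matches, so each call is checked.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, not part of the .NET base library

public static class RowExtractor
{
    // Hypothetical helper: returns the <td> texts of each row in the chosen table.
    public static List<string[]> ExtractRows(string html, int tableIndex)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<string[]>();
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null || tableIndex < 0 || tableIndex >= tables.Count)
            return rows; // no such table

        var trs = tables[tableIndex].SelectNodes(".//tr");
        if (trs == null)
            return rows; // table has no rows

        foreach (var tr in trs)
        {
            var tds = tr.SelectNodes("./td");
            if (tds == null)
                continue; // e.g. a header row that only has <th> cells

            rows.Add(tds.Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                        .ToArray());
        }
        return rows;
    }

    static void Main()
    {
        // Made-up sample: the first table is skipped, the second holds the results.
        string html =
            "<table><tr><td>nav</td></tr></table>" +
            "<table><tr><th>Date</th><th>Opponent</th></tr>" +
            "<tr><td>23rd</td><td>Example FC</td></tr></table>";

        foreach (var row in ExtractRows(html, 1))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

Each `string[]` is one table row, so the fields can be picked out by position once you know the column order of the real page.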

2014-10-02 23:10:58

Thanks! Exactly what I needed :-) – 2014-10-02 23:16:21

No problem. Glad I could help! :) – 2014-10-02 23:16:42

If you want a complete program :). I searched for hours.

using System;
using System.IO;
using System.Linq;
using System.Windows.Forms;

class ReadHTML
{
    internal void ReadText()
    {
        try
        {
            FolderBrowserDialog fbd = new FolderBrowserDialog();
            fbd.RootFolder = Environment.SpecialFolder.MyComputer; // start browsing from My Computer
            if (fbd.ShowDialog() == DialogResult.OK)
            {
                // Change the pattern to target a different file type
                string[] files = Directory.GetFiles(fbd.SelectedPath, "*.html", SearchOption.AllDirectories);
                SaveFileDialog sfd = new SaveFileDialog(); // asks where to save the text output
                //sfd.Filter = "Text File|*.txt"; // filters for text files only
                sfd.FileName = "Html Output.txt";
                sfd.Title = "Save Text File";
                if (sfd.ShowDialog() == DialogResult.OK)
                {
                    string path = sfd.FileName;
                    // The using block flushes and closes the writer on exit
                    using (StreamWriter bw = new StreamWriter(File.Create(path)))
                    {
                        foreach (string f in files)
                        {
                            var html = new HtmlAgilityPack.HtmlDocument();
                            html.Load(f);
                            var tables = html.DocumentNode.SelectNodes("//table");
                            if (tables == null) continue; // SelectNodes returns null when nothing matches
                            foreach (var table in tables.Skip(1).Take(1)) // the table you are looking for
                            {
                                var cells = table.SelectNodes(".//td"); // ".//" limits the search to this table
                                if (cells == null) continue;
                                foreach (var td in cells)
                                {
                                    bw.WriteLine(td.InnerText); // one cell per line in the output file
                                }
                            }
                        } // ends loop over files
                    }
                }
                MessageBox.Show("Files found: " + files.Length);
            }
        }
        catch (UnauthorizedAccessException UAEx)
        {
            MessageBox.Show(UAEx.Message);
        }
        catch (PathTooLongException PathEx)
        {
            MessageBox.Show(PathEx.Message);
        }
    } // method ends
}

2016-02-23 11:49:03
