2017-04-20 102 views

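XPath to retrieve <a> href, text, and <span>

I'm currently crawling some websites and retrieving information from them to store in a database for later use. I'm using HtmlAgilityPack, and I've done this successfully for several websites, but for some reason this one is giving me trouble. I'm still new to XPath syntax, so I may have messed something up there.

Here's what the markup from the website I want to retrieve from looks like: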

<form ... id="_subcat_ids_"> 
    <input ....> 
    <ul ...> 
    <li ....> 
     <input .....> 
     <a class="facet-seleection multiselect-facets " 
     .... href="INeedThisHref#1"> 
     Text I Need       //need to retrieve this text between the <a></a> 
     <span class="subtle-note">(2)</span> //I Need that number from inside the span 
     </a> 
    </li> 
    <li ....> 
     <input .....> 
     <a class="facet-seleection multiselect-facets " 
     .... href="INeedThisHref#2"> 
     Text I Need #2      //need to retrieve this text between the <a></a> 
     <span class="subtle-note">(6)</span> //I Need that number from inside the span 
     </a> 
    </li> 

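Each of those <li> elements represents one item on the page, but I'm only interested in what is inside each <a></a>. I want to retrieve the href value from the <a>, then the text between its opening and closing tags, and then the text inside the <span>. I left out the contents of the other tags because they can't uniquely identify each item; the class on the <a> is the only thing the items share, and they are all inside the form with id="_subcat_ids_".

Here's my code: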

try 
{ 
    string fullUrl = "..."; 
    HtmlWeb web = new HtmlWeb(); 
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12; 
    HtmlDocument html = web.Load(fullUrl); 

    foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']")) //this gets me into the form 
    { 
        foreach (HtmlNode node2 in node.SelectNodes(".//a[@class='facet-selection multiselect-facets ']")) //this should get me into the <a> tags, but it is throwing a fit with 'object reference not set to an instance of an object' 
        { 
            //get the href 
            string tempHref = node2.GetAttributeValue("href", string.Empty); 
            //get the text between <a> 
            string tempCat = node2.InnerText.Trim(); 
            //get the text between <span> 
            string tempNum = node2.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim(); 
        } 
    } 
} 
catch (Exception ex) 
{ 
    Console.Write("\nError: " + ex.ToString()); 
} 

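The first foreach loop runs without error, but the second one throws `object reference not set to an instance of an object` on the line where the second foreach loop starts. As I mentioned, I'm still new to this syntax. I used this same approach on another website with great success, but I'm having trouble with this one. Any tips would be appreciated.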


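Check the details you provided for correctness: there are several typos/inaccuracies in your `XPath` expression and `HTML` sample, such as `seleection`/`selection`, the number of spaces in the class name... – Andersson

Answer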


OK, I figured it out. Here's the code:

// categoryList and LinkCatList are declared elsewhere (assumed to be List<string>)
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']")) 
{ 
    //get the categories, store in list 
    //note the leading "..": the search starts from the form's parent, not from the form itself 
    foreach (HtmlNode node2 in node.SelectNodes("..//a[@class='facet-selection multiselect-facets ']//text()[normalize-space() and not(ancestor::span)]")) 
    { 
        string tempCat = node2.InnerText.Trim(); 
        categoryList.Add(tempCat); 
        Console.Write("\nCategory: " + tempCat); 
    } 
    foreach (HtmlNode node3 in node.SelectNodes("..//a[@class='facet-selection multiselect-facets ']")) 
    { 
        //get href for each category, store in list 
        string tempHref = node3.GetAttributeValue("href", string.Empty); 
        LinkCatList.Add(tempHref); 
        Console.Write("\nhref: " + tempHref); 
        //get the number of items from categories, stripping the surrounding parentheses 
        string tempNum = node3.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim(); 
        tempNum = tempNum.Replace("(", "").Replace(")", ""); 
        Console.Write("\nNumber of items: " + tempNum + "\n\n"); 
    } 
} 

Works like a charm.
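
For anyone landing here later, a hedged note on why the `..//a` version probably works where `.//a` did not. `SelectNodes` returns null rather than an empty collection when nothing matches, which is where the null reference in the question comes from. As far as I know, older HtmlAgilityPack builds also treat `<form>` as an empty element by default, so the `<a>` tags end up as siblings of the form instead of descendants. The sketch below is one possible single-pass variant under those assumptions. `FacetScraper` and `ExtractFacets` are made-up names, `contains(@class, ...)` is swapped in to sidestep the trailing-space issue in the class attribute, and removing `"form"` from `HtmlNode.ElementsFlags` is a process-wide setting, so treat it as something to verify against your HtmlAgilityPack version rather than a definitive fix.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original post.
public static class FacetScraper
{
    public static List<(string Href, string Text, string Count)> ExtractFacets(string html)
    {
        // Assumption: by default <form> may be parsed as an empty element,
        // so its children become siblings. Removing the flag lets <form> keep children.
        // This is a static, process-wide setting.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var results = new List<(string Href, string Text, string Count)>();

        // contains() avoids depending on exact whitespace in the class attribute.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(
            "//form[@id='_subcat_ids_']//a[contains(@class, 'multiselect-facets')]");

        // SelectNodes returns null when nothing matches, so guard before iterating.
        if (anchors == null)
        {
            return results;
        }

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", string.Empty);

            // First non-blank text node directly under <a>; this skips the <span> content.
            HtmlNode textNode = a.SelectSingleNode("text()[normalize-space()]");
            string text = textNode == null ? string.Empty : textNode.InnerText.Trim();

            // The span holds something like "(2)"; strip the surrounding parentheses.
            HtmlNode span = a.SelectSingleNode(".//span[@class='subtle-note']");
            string count = span == null
                ? string.Empty
                : span.InnerText.Trim().Trim('(', ')');

            results.Add((href, text, count));
        }

        return results;
    }
}
```

To use it with the code from the question, something like `FacetScraper.ExtractFacets(html.DocumentNode.OuterHtml)` should do, or the method body can be adapted to take the already loaded `HtmlDocument`. Keeping href, text, and count together in one loop also avoids the two parallel lists drifting out of sync if one of the XPath queries ever matches a different number of nodes.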