2011-03-21 60 views
5

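Can HtmlAgilityPack handle an XML file that comes with an XSL file to render HTML?

I would like to know the best way for HtmlAgilityPack to read an XML document that references an XSL file to render HTML. Is there a setting on the HtmlDocument class that helps with this, or do I have to find a way to run the transformation before loading the result with HtmlAgilityPack? If the latter, does anyone know a good library or method for performing the transformation? Below is an example of a site that returns XML together with an XSL file, along with the code I would like to use.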

var uri = new Uri("http://www.skechers.com/"); 
var request = (HttpWebRequest)WebRequest.Create(uri); 
var cookieContainer = new CookieContainer(); 

request.CookieContainer = cookieContainer; 
request.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"; 
request.Method = "GET"; 
request.AllowAutoRedirect = true; 
request.Timeout = 15000; 

var response = (HttpWebResponse)request.GetResponse(); 
var page = new HtmlDocument(); 
page.OptionReadEncoding = false; 
var stream = response.GetResponseStream(); 
page.Load(stream); 

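This code does not throw any errors, but what gets parsed is the raw XML rather than the result of the transformation, and the transformed output is what I want.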

+2

If you have well-formed XML, why use HtmlAgilityPack at all? – Cameron 2011-03-21 23:50:16

+0

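I am trying to get a page summary, i.e. the page title and meta description, plus a list of the img srcs on the page. I accept any valid URL from the web as input. So, to answer your question, I will not always have well-formed XML, and even when I do, the document title and description will not be in a consistent format. – 2011-03-21 23:57:42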

Answers

3

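Html Agility Pack can help you here in two ways:

1) It makes it easier to get at the XML processing instruction, because it parses the PI data as HTML and so turns it into attributes.

2) HtmlDocument implements IXPathNavigable, so it can be transformed directly by the .NET XSLT transformation engine.

Here is a piece of code that works. I had to add a specific XmlResolver to handle the XSLT transformation correctly, but I think that is specific to this skechers case.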

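// Requires: using System; using System.Net; using System.Text; using System.Xml; using System.Xml.Xsl; using HtmlAgilityPack;
public static void DownloadAndProcessXml(string url, string userAgent, string outputFilePath) 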
{ 
    using (XmlTextWriter writer = new XmlTextWriter(outputFilePath, Encoding.UTF8)) 
    { 
     DownloadAndProcessXml(url, userAgent, writer); 
    } 
} 

public static void DownloadAndProcessXml(string url, string userAgent, XmlWriter output) 
{ 
    UserAgentXmlUrlResolver resolver = new UserAgentXmlUrlResolver(url, userAgent); 

    // WebClient is an easy to use class. 
    using (WebClient client = new WebClient()) 
    { 
     // download Xml doc. set User-Agent header or the site won't answer us... 
     client.Headers[HttpRequestHeader.UserAgent] = resolver.UserAgent; 
     HtmlDocument xmlDoc = new HtmlDocument(); 
     xmlDoc.Load(client.OpenRead(url)); 

     // determine xslt (note the xpath trick as Html Agility Pack does not support xml processing instructions) 
     string xsltUrl = xmlDoc.DocumentNode.SelectSingleNode("//*[name()='?xml-stylesheet']").GetAttributeValue("href", null); 

     // download Xslt doc 
     client.Headers[HttpRequestHeader.UserAgent] = resolver.UserAgent; 
     XslCompiledTransform xslt = new XslCompiledTransform(); 
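     // resolve the href against the page url instead of concatenating strings, so "/x.xsl" and "x.xsl" both work
     xslt.Load(new XmlTextReader(client.OpenRead(new Uri(new Uri(url), xsltUrl))), new XsltSettings(true, false), null); 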

     // transform the Html/Xml doc into a new Xml doc; this is easy because HtmlDocument implements IXPathNavigable 
     // note the custom resolver, which handles the resolve requests this particular Xslt makes 
     xslt.Transform(xmlDoc, null, output, resolver); 
    } 
} 

// This class is needed during transformation otherwise there are errors. 
// This is probably due to this very specific Xslt file that needs to go back to the root document itself. 
public class UserAgentXmlUrlResolver : XmlUrlResolver 
{ 
    public UserAgentXmlUrlResolver(string rootUrl, string userAgent) 
    { 
     RootUrl = rootUrl; 
     UserAgent = userAgent; 
    } 

    public string RootUrl { get; set; } 
    public string UserAgent { get; set; } 

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn) 
    { 
     WebClient client = new WebClient(); 
     if (!string.IsNullOrEmpty(UserAgent)) 
     { 
      client.Headers[HttpRequestHeader.UserAgent] = UserAgent; 
     } 
     return client.OpenRead(absoluteUri); 
    } 

    public override Uri ResolveUri(Uri baseUri, string relativeUri) 
    { 
     if ((relativeUri == "/") && (!string.IsNullOrEmpty(RootUrl))) 
      return new Uri(RootUrl); 

     return base.ResolveUri(baseUri, relativeUri); 
    } 
} 

You call it like this:

string url = "http://www.skechers.com/"; 
string ua = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"; 
DownloadAndProcessXml(url, ua, "skechers.html"); 
+0

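... as shown in the first code snippet. Thanks again. I think the code I have works a little better for my purposes, but as a general guideline I would recommend your code. By the way, HtmlAgilityPack is f*#%ing awesome. – 2011-03-22 18:09:59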

+0

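I would also add that it would be cool to be able to pass an HTML string to the HtmlDocument.Load method instead of creating a stream by hand. I do see that it already has 12 overloads! – 2011-03-22 18:12:25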

+0

@Adrian Adkison - There is a LoadHtml method for that purpose. – 2011-03-22 19:13:05

2

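You should render the output of the XML combined with the XSLT. To do that, you need to download the XML, which you have already done. Next, parse the XML to identify the XSL reference. Then download the XSL and apply it to the XML document.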
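
The steps above can be sketched roughly as follows, using only the System.Xml classes that ship with .NET. This is a minimal sketch, not the answerer's code: the class name StylesheetSketch and both method names are hypothetical, the XML and XSLT are passed in as strings so that no network access is involved, and downloading the stylesheet is left out.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical helper class for illustration; not part of the original answers.
public static class StylesheetSketch
{
    // Returns the href of the top-level <?xml-stylesheet ...?> processing
    // instruction, or null if the document has none.
    public static string FindStylesheetHref(string xml)
    {
        var doc = new XPathDocument(new StringReader(xml));
        XPathNavigator pi = doc.CreateNavigator()
            .SelectSingleNode("/processing-instruction('xml-stylesheet')");
        if (pi == null)
        {
            return null;
        }

        // The PI value is plain text such as: type="text/xsl" href="/style.xsl"
        Match m = Regex.Match(pi.Value, "href=(\"|')(?<url>.*?)(\"|')");
        return m.Success ? m.Groups["url"].Value : null;
    }

    // Applies an XSLT stylesheet (as text) to an XML document (as text)
    // and returns the transformed output as a string.
    public static string Transform(string xml, string xslt)
    {
        var transform = new XslCompiledTransform();
        using (XmlReader xsltReader = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(xsltReader);
        }

        var output = new StringWriter();
        using (XmlReader xmlReader = XmlReader.Create(new StringReader(xml)))
        {
            transform.Transform(xmlReader, null, output);
        }
        return output.ToString();
    }
}
```

In a real scraper you would call FindStylesheetHref on the downloaded XML, fetch the stylesheet from the resolved URL, and pass both texts to Transform before handing the result to an HTML parser. Note that loading a stylesheet this way uses the default XSLT settings, so a stylesheet that relies on the document() function would need an XsltSettings instance that enables it.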

These links may be useful

+0

Thanks, this is what I ended up doing, but I did not mark this as the answer because it has no implementation. – 2011-03-22 04:17:53

0

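Here is the additional code I ended up using once I have received the response. Note that this is fine when the response is "application/xml", and that you will always have to check objects for null. Also, FormAssetSrc is a private function that takes the value of the href, determines whether it is protocol-, root-, or document-relative, and creates a fully qualified URI.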

var xmlStream = response.GetResponseStream(); 
var xmlDocument = new XPathDocument(xmlStream); 
// Find the <?xml-stylesheet ...?> processing instruction; this is null when the document has none.
var styleNode = xmlDocument.CreateNavigator().SelectSingleNode("processing-instruction('xml-stylesheet')"); 
// The PI value is plain text, so the href is pulled out with a regex.
var hrefValue = Regex.Match(styleNode != null ? styleNode.Value : string.Empty, "href=(\"|')(?<url>.*?)(\"|')"); 
if (hrefValue.Success) 
{ 
    var xslHref = FormAssetSrc(hrefValue.Groups["url"].Value, response.ResponseUri); 
    var xslUri = new Uri(xslHref); 
    var xslRequest = CreateWebRequest(xslUri); 
    var xslResponse = (HttpWebResponse)xslRequest.GetResponse(); 
    var xslDocument = new XPathDocument(xslResponse.GetResponseStream()); 
    // XslTransform has been obsolete since .NET 2.0; XslCompiledTransform is its replacement.
    var xslTransform = new XslTransform(); 
    var sw = new System.IO.StringWriter(); 
    xslTransform.Load(xslDocument); 
    xslTransform.Transform(xmlDocument.CreateNavigator(), null, sw); 
    page.Html.LoadHtml(sw.ToString()); 
} 
+0

CreateWebRequest is also a private function; it creates a request like the one in the original question. – 2011-03-22 04:09:47
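
FormAssetSrc itself is not shown in the answer. As a rough idea of what such a helper might look like, here is a minimal sketch that assumes the two-argument Uri constructor is sufficient: it resolves protocol-relative, root-relative, and document-relative hrefs against the URI of the document that referenced them. The class name AssetUrlSketch is hypothetical, and the body is a guess at the behaviour described above, not the author's implementation.

```csharp
using System;

// Hypothetical stand-in for the private FormAssetSrc helper described above.
public static class AssetUrlSketch
{
    // Resolves href against the URI of the document that referenced it and
    // returns a fully qualified URI string. Absolute hrefs pass through unchanged.
    public static string FormAssetSrc(string href, Uri documentUri)
    {
        // new Uri(baseUri, relativeUri) covers "//host/a.xsl", "/a.xsl" and "a.xsl".
        return new Uri(documentUri, href).AbsoluteUri;
    }
}
```

The signature mirrors the call in the snippet above, where the result is passed straight to new Uri(xslHref), so the helper returns a string rather than a Uri.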