How can LINQ to XML ignore HTML code?

I am using XElement (LINQ to XML) to parse some RSS feeds. How can I get LINQ to XML to ignore the HTML code?

RSS example:

<item> 
     <title>Waterfront Ice Skating</title> 
     <link>http://www.eventfinder.co.nz/2011/sep/wellington/wellington-waterfront-ice-skating?utm_medium=rss</link> 
     <description>&lt;p&gt;An ice skating rink in Wellington for a limited time only! 

Enjoy the magic of the New Zealand winter at an outdoor skating experience with all the fun and atmosphere of New York&amp;#039;s Rockefeller Centre or Central Park, ...&lt;/p&gt;&lt;p&gt;Wellington | Friday, 30 September 2011 - Sunday, 30 October 2011&lt;/p&gt;</description> 
     <content:encoded><![CDATA[Today, Wellington Waterfront<br/>Wellington]]></content:encoded> 
     <guid isPermalink="false">108703</guid> 
     <pubDate>2011-09-30T10:00:00Z</pubDate> 
     <enclosure url="http://s1.eventfinder.co.nz/uploads/events/transformed/190501-108703-13.jpg" length="5000" type="image/jpeg"></enclosure> 
    </item>

It all works fine, but the description element contains a lot of HTML markup that I need to remove.

Description:

<description>&lt;p&gt;An ice skating rink in Wellington for a limited time only! 

    Enjoy the magic of the New Zealand winter at an outdoor skating experience with all the fun and atmosphere of New York&amp;#039;s Rockefeller Centre or Central Park, ...&lt;/p&gt;&lt;p&gt;Wellington | Friday, 30 September 2011 - Sunday, 30 October 2011&lt;/p&gt;</description>

Can anyone help?

Source

2011-10-15 Rhys

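What do you mean by "ignore HTML code"? Do you want to extract the text? – adatapost

@AVD Yes, I just want to extract the text and ignore the markup. – Rhys

Take a look at this link - http://www.dotnetperls.com/remove-html-tags – adatapost

If it is an RSS feed, why not use SyndicationFeed from System.ServiceModel.Syndication together with an XmlReader? It will handle the XML-encoded content for you: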

    // A verbatim string (@"...") takes single backslashes.
    using (XmlReader reader = XmlReader.Create(@"C:\Users\justMe\myXml.xml"))
    {
        SyndicationFeed myFeed = SyndicationFeed.Load(reader);
        ...
    }
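As a slightly fuller sketch of this reader-based approach: the snippet below loads a feed from an in-memory string instead of a file and prints each item's title and summary. The inline feed is a hypothetical, shortened version of the one in the question, and the project is assumed to reference System.ServiceModel (or the System.ServiceModel.Syndication package on newer .NET versions). The summary comes back with the XML encoding already undone, so it still contains the HTML tags.

```csharp
using System;
using System.IO;
using System.ServiceModel.Syndication;
using System.Xml;

class FeedDemo
{
    static void Main()
    {
        // Hypothetical minimal RSS 2.0 document modelled on the question's feed.
        const string rss =
            "<rss version=\"2.0\"><channel>" +
            "<title>Events</title><link>http://www.eventfinder.co.nz/</link>" +
            "<description>Sample feed</description>" +
            "<item><title>Waterfront Ice Skating</title>" +
            "<description>&lt;p&gt;An ice skating rink in Wellington&lt;/p&gt;</description>" +
            "</item></channel></rss>";

        using (XmlReader reader = XmlReader.Create(new StringReader(rss)))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                // Summary maps to the RSS <description> element; the text
                // still holds the HTML markup (<p>...</p>) at this point.
                Console.WriteLine(item.Title.Text);
                Console.WriteLine(item.Summary.Text);
            }
        }
    }
}
```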

Then remove the HTML tags with a regular expression, as suggested by @nemesv, or use something like this:

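    // Needs: using System.Text.RegularExpressions; using System.Web;
    // As an extension method, this must be declared inside a static class.
    public static string StripHTML(this string htmlText)
    {
        // Remove anything that looks like a tag, then decode HTML entities.
        var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return HttpUtility.HtmlDecode(reg.Replace(htmlText, string.Empty));
    }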

Source

2011-10-15 10:21:06 tazyDevel

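First, you should run System.Net.HttpUtility.HtmlDecode on the content of the description. This replaces the encoded &lt;p&gt; with <p>. Then you can remove the HTML tags with a regular expression (see "Using C# regular expressions to remove HTML tags") or with some other HTML parsing library.

Source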

2011-10-15 08:34:50 nemesv

No, it is XML-encoded, not HTML-encoded. Just reading XElement.Value is enough; HtmlDecode could go wrong. –
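To illustrate the point about XElement.Value, here is a minimal sketch that assumes a shortened, hypothetical item modelled on the feed in the question. Reading Value undoes the XML encoding and yields the HTML markup; the tags are then stripped with the same regular expression used above. Because the feed also carries a doubly encoded apostrophe (&amp;#039;), the sketch decodes the remaining HTML entities last, using System.Net.WebUtility as a stand-in for the HttpUtility call mentioned in the answers.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml.Linq;

class DescriptionDemo
{
    static void Main()
    {
        // Hypothetical, shortened item modelled on the question's feed.
        XElement item = XElement.Parse(
            "<item><description>&lt;p&gt;New York&amp;#039;s rink&lt;/p&gt;</description></item>");

        // Value undoes the XML encoding: <p>New York&#039;s rink</p>
        string html = item.Element("description").Value;

        // Strip the tags first, then decode the leftover HTML entities.
        string text = Regex.Replace(html, "<[^>]+>", string.Empty);
        text = WebUtility.HtmlDecode(text);

        Console.WriteLine(text); // prints: New York's rink
    }
}
```

A regular expression is only a rough way to remove tags; for arbitrary or malformed HTML, a dedicated HTML parsing library is likely to be more robust.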
