Recursive regex for div tags (not trying to parse HTML with regex)

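I have a bunch of wiki markup, and sometimes people just throw HTML into it at random, and somehow Wikipedia rolls with it, just as it does with all sorts of other poorly formed wiki markup. I want to match everything inside a div.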

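I need to recursively find all <div>blah</div> tags, including div tags nested inside other div tags. I'm trying to match the div tag and everything inside it. I have this, which I believe almost works: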

new Regex(@"\<div.*?\> (?<DEPTH>)     # opening 
      (?>    # now match... 
       [^(\<div.*?\>)(\<\/div\>)]+   # any characters except divs 
      |     # or 
       \<div.*?\> (?<DEPTH>) # a opening div, increasing the depth counter 
      |     # or 
       \<\/div\> (?<-DEPTH>) # a closing div, decreasing the depth counter 
      )*     # any number of times 
      (?(DEPTH)(?!))  # until the depth counter is zero again 
      \<\/div\>     # then match the closing fix", 
      RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

Maybe I should be parsing this some other way, but at this point this is the last regex statement I need.

Here is an example:

<div class="infobox sisterproject" style="font-size: 90%; padding: .5em 1em 1em 1em;"> 
<div style="text-align:center;"> 
Find more about '''{{{display|{{{1|{{PAGENAME}}}}}}}}''' on Wikipedia's [[Wikipedia:Wikimedia sister projects|sister projects]]: 
</div><!-- 
-->{{#ifeq:{{{wikt}}}|no||<!-- 
-->[[File:Wiktionary-logo-en.svg|25px|link=wikt:Special:Search/{{{wikt|{{{1|{{PAGENAME}}}}}}}}|Search Wiktionary]] [[wikt:Special:Search/{{{wikt|{{{1|{{PAGENAME}}}}}}}}|Definitions]] from Wiktionary<br />}}<!-- 
-->{{#ifeq:{{{b}}}|no||<!-- 
-->[[File:Wikibooks-logo.svg|25px|link=b:Special:Search/{{{b|{{{1|{{PAGENAME}}}}}}}}|Search Wikibooks]] [[b:Special:Search/{{{b|{{{1|{{PAGENAME}}}}}}}}|Textbooks]] from Wikibooks<br />}}<!-- 
-->{{#ifeq:{{{q}}}|no||<!-- 
-->[[File:Wikiquote-logo.svg|25px|link=q:Special:Search/{{{q|{{{1|{{PAGENAME}}}}}}}}|Search Wikiquote]] [[q:Special:Search/{{{q|{{{1|{{PAGENAME}}}}}}}}|Quotations]] from Wikiquote<br />}}<!-- 
-->{{#ifeq:{{{s}}}|no||{{#ifeq:{{{author|no}}}|yes|<!-- 
-->[[File:Wikisource-logo.svg|25px|link=s:Special:Search/Author:{{{s|{{{1|{{PAGENAME}}}}}}}}|Search Wikisource]] [[s:Special:Search/Author:{{{s|{{{1|{{PAGENAME}}}}}}}}|Source texts]] from Wikisource<br />|<!-- 
-->[[File:Wikisource-logo.svg|25px|link=s:Special:Search/{{{s|{{{1|{{PAGENAME}}}}}}}}|Search Wikisource]] [[s:Special:Search/{{{s|{{{1|{{PAGENAME}}}}}}}}|Source texts]] from Wikisource<br />}}}}<!-- 
-->{{#ifeq:{{{commons}}}|no||<!-- 
-->[[File:Commons-logo.svg|25px|link=commons:Special:Search/{{{commons|{{{1|{{PAGENAME}}}}}}}}|Search Commons]] [[commons:Special:Search/{{{commons|{{{1|{{PAGENAME}}}}}}}}|Images and media]] from Commons<br />}}<!-- 
-->{{#ifeq:{{{n}}}|no||<!-- 
-->[[File:Wikinews-logo.svg|25px|link=n:Special:Search/{{{n|{{{1|{{PAGENAME}}}}}}}}|Search Wikinews]] [[n:Special:Search/{{{n|{{{1|{{PAGENAME}}}}}}}}|News stories]] from Wikinews<br />}}<!-- 
-->{{#ifeq:{{{v}}}|no||<!-- 
-->[[File:Wikiversity-logo-Snorky.svg|25px|link=v:Special:Search/{{{v|{{{1|{{PAGENAME}}}}}}}}|Search Wikiversity]] [[v:Special:Search/{{{v|{{{1|{{PAGENAME}}}}}}}}|Learning resources]] from Wikiversity<br />}}<!-- 
-->{{#ifeq:{{{species<includeonly>|no</includeonly>}}}|no||<!-- 
-->[[File:Wikispecies-logo.svg|25px|link=species:Special:Search/{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}|Search Wikispecies]] [[species:Special:Search/{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}|{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}]] from Wikispecies}} 
</div><noinclude>

Thanks


2011-05-17 thirsty93

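Admittedly I have no knowledge of wiki markup, but wouldn't it be a better idea to simply strip all the HTML tags? Because as it stands, contrary to the question title, you really are trying to parse HTML markup with a regex ;) – SirViver 2011-05-17 13:46:39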

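This may be defeatist, but usually when the words 'recursive' and 'regular expression' appear in the same sentence, 'impossible' is not far behind, unless you manually run your own state machine to track depth and call multiple regexes yourself. I don't think a regex's state machine can handle this kind of thing. But if you say what you want to get out of that example, it might help. – 2011-05-17 13:48:05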
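The "track the depth yourself" approach mentioned in the comment above can be sketched as follows. This is only an illustration, written in Python rather than the C# used in the question; the names `DIV_TAG` and `find_outer_divs` are hypothetical and do not come from the thread. It assumes div tags never appear inside attribute values or HTML comments, which real wiki markup may violate.

```python
import re

# Matches an opening <div ...> or a closing </div>.
# Group 1 is "/" for a closing tag and "" for an opening tag.
# (Hypothetical helper pattern; not taken from the thread.)
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def find_outer_divs(text):
    """Return every outermost <div>...</div> block, nested divs included."""
    blocks = []
    depth = 0      # number of currently open divs
    start = None   # index where the current outermost div began
    for m in DIV_TAG.finditer(text):
        if m.group(1) == "":       # opening tag
            if depth == 0:
                start = m.start()
            depth += 1
        elif depth > 0:            # closing tag with a matching opener
            depth -= 1
            if depth == 0:
                blocks.append(text[start:m.end()])
        # a stray </div> at depth 0 is simply ignored
    return blocks


if __name__ == "__main__":
    sample = '<div class="a">x<div>y</div>z</div> and <div>q</div>'
    for block in find_outer_divs(sample):
        print(block)
```

The regex here only locates individual tags; the nesting is handled by the explicit counter, so an unclosed div yields no block rather than a partial match.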

"(not trying to parse html with regex)" <- hahaha, nice! Clearly you've read some of the other questions with "html" and "regex" in them. ;-) – 2011-05-17 13:48:10

I think it's better not to parse HTML with a regex; you can use the Html Agility Pack.


2011-05-17 14:03:25 Serghei

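There are absolutely no other tags. This is the only fragment with tags. Will the Html Agility Pack accept it without my having to wrap the code in fake html, head, and body tags? And with invalid HTML characters in it? – thirsty93 2011-05-17 14:10:34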

It lets you parse HTML tags as LINQ to XML – Serghei 2011-05-17 17:06:09

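Yeah, if you have perfectly formed XHTML, then the Html Agility Pack isn't necessary and you can use the .NET regex depth counter. Even so, I still use the Agility Pack; it's great, and it handles sloppy markup. – LoveMeSomeCode 2011-05-17 20:21:00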

new Regex(@"<div\b[^>]*>(?><div\b[^>]*>(?<DEPTH>)|</div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))</div>", RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

I spent a good deal of time fixing my expression, and I had not even gotten halfway through getting the Html Agility Pack set up and working.


2011-05-17 14:16:24 thirsty93
