2012-03-24 76 views
1

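Reliable and efficient custom search and replace function - preg or str replace

I've asked about a few different guises of this "filter" here and on WPSE. I'm now taking a different approach, and I want to make it solid and reliable.

My situation:

  • When I create content in my WordPress CMS, I want to run a filter that searches for specific terms and replaces them with links.

  • I have two arrays of search terms: $glossary_terms and $species_terms

  • $species_terms is a list of scientific names of fish, such as Apistogramma panduro

  • $glossary_terms is a list of fishkeeping glossary terms such as abdomen, caudal-fin and Gram's Method.

There are a few subtleties worth noting:

  • Speed isn't really an issue, because I'll run this filter in the background rather than when a user visits the page or when an author submits/edits a species profile or post.

  • Some of the filtered content may contain these terms inside HTML, such as <img src="image.jpg" title="Apistogramma panduro male" />. Obviously these should not be replaced.

  • Species are often referred to with an abbreviated genus, so instead of Apistogramma panduro you'll often see A. panduro. This means I need to search and replace every species term in its abbreviated form too: Apistogramma panduro and A. panduro, Satanoperca daemon and S. daemon.

  • If both caudal-fin and caudal exist in the glossary terms, caudal-fin should be replaced first.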
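The abbreviated-genus requirement above can be sketched as a small pattern builder. This is only an illustration: speciesPattern is a hypothetical helper name, and it assumes every name starts with a single-word ASCII genus.

```php
<?php
// Hypothetical helper: turn "Genus species" into a regexp fragment that
// matches both the full genus and its one-letter abbreviation.
function speciesPattern($name) {
    // Split the genus (first word) from the rest of the name.
    $parts = explode(' ', $name, 2);
    $genus = preg_quote($parts[0], '/');
    $rest  = isset($parts[1]) ? ' ' . preg_quote($parts[1], '/') : '';
    // First letter of the genus followed by a literal dot, e.g. "A\."
    $abbr  = preg_quote($parts[0][0] . '.', '/');
    return '(?:' . $genus . '|' . $abbr . ')' . $rest;
}

$pattern = '/\b' . speciesPattern('Apistogramma panduro') . '\b/';

var_dump(preg_match($pattern, 'A pair of A. panduro spawned.'));   // int(1)
var_dump(preg_match($pattern, 'Apistogramma panduro is small.'));  // int(1)
var_dump(preg_match($pattern, 'Satanoperca daemon'));              // int(0)
```

Note that a one-letter abbreviation is shared by every genus starting with that letter, so it is the species epithet that keeps the match specific.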

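I was thinking of simply using a preg_replace that matches a search term only when it has a space on the left (i.e. ( )term) and a space, comma, exclamation mark, full stop or hyphen on the right (i.e. term( , . ! -)), but that won't stop me from breaking the image HTML.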


Example content

<br /> 
It looks very similar to fishes of the <i><a href="species/betta-foerschi" rel="species/betta-foerschi/?hover=true" class="link_species">B. foerschi</a></i> group/complex but its breeding strategy, adult size and observed behaviour preclude its inclusion in that <a href="glossary/a/assemblage" rel="glossary/a/assemblage?hover=true" class="link_glossary">assemblage</a>. 

Instead it appears to be a member of the <i><a href="species/betta-coccina" rel="species/betta-coccina/?hover=true" class="link_species">B. coccina</a></i> group which currently includes <i><a href="species/betta-brownorum" rel="species/betta-brownorum/?hover=true" class="link_species">B. brownorum</a></i>, <i><a href="species/betta-burdigala" rel="species/betta-burdigala/?hover=true" class="link_species">B. burdigala</a></i>, <i><a href="species/betta-coccina" rel="species/betta-coccina/?hover=true" class="link_species">B. coccina</a></i>, <i><a href="species/betta-livida" rel="species/betta-livida/?hover=true" class="link_species">B. livida</a></i>, <i>B. miniopinna</i>, <i><a href="species/betta-persephone" rel="species/betta-persephone/?hover=true" class="link_species">B. persephone</a></i>, <i>B. tussyae</i>, <i><a href="species/betta-rutilans" rel="species/betta-rutilans/?hover=true" class="link_species">B. rutilans</a></i> and <i><a href="species/betta-uberis" rel="species/betta-uberis/?hover=true" class="link_species">B. uberis</a></i>. 

Of these it's most similar in appearance to <i><a href="species/betta-uberis" rel="species/betta-uberis/?hover=true" class="link_species">B. uberis</a></i> but can be distinguished by its noticeably shorter <a href="glossary/d/dorsal" rel="glossary/d/dorsal?hover=true" class="link_glossary">dorsal</a>-<a href="glossary/f/fin" rel="glossary/f/fin?hover=true" class="link_glossary">fin</a> <a href="glossary/b/base" rel="glossary/b/base?hover=true" class="link_glossary">base</a> and overall blue-greenish (vs. green/reddish) colouration. 

Members of this group are characterised by their small adult size (&lt; 40 mm SL), a uniform red or black <a href="glossary/b/base" rel="glossary/b/base?hover=true" class="link_glossary">base</a> body colour, the presence of a <a href="glossary/m/midlateral" rel="glossary/m/midlateral?hover=true" class="link_glossary">midlateral</a> body blotch in some <a href="glossary/s/species" rel="glossary/s/species?hover=true" class="link_glossary">species</a> and the fact they have 9 abdominal <a href="glossary/v/vertebrae" rel="glossary/v/vertebrae?hover=true" class="link_glossary">vertebrae</a> compared with 10-12 in the other <a href="glossary/s/species" rel="glossary/s/species?hover=true" class="link_glossary">species</a> groups. In addition all are <a href="glossary/o/obligate" rel="glossary/o/obligate?hover=true" class="link_glossary">obligate</a> <a href="glossary/p/peat" rel="glossary/p/peat?hover=true" class="link_glossary">peat</a> <a href="glossary/s/swamp" rel="glossary/s/swamp?hover=true" class="link_glossary">swamp</a> dwellers (Tan and Ng, 2005).<br /> 

^^^ In this example the correct links have been inserted manually. The filter should not break these links!

It looks very similar to fishes of the B. foerschi group/complex but its breeding strategy, adult size and observed behaviour preclude its inclusion in that assemblage. 

Instead it appears to be a member of the B. coccina group which currently includes B. brownorum, B. burdigala, B. coccina, B. livida, B. miniopinna, B. persephone, B. tussyae, B. rutilans and B. uberis. 

Of these it's most similar in appearance to B. uberis but can be distinguished by its noticeably shorter dorsal-fin base and overall blue-greenish (vs. green/reddish) colouration. 

Members of this group are characterised by their small adult size (< 40 mm SL), a uniform red or black base body colour, the presence of a midlateral body blotch in some species and the fact they have 9 abdominal vertebrae compared with 10-12 in the other species groups. In addition all are obligate peat swamp dwellers (Tan and Ng, 2005). 

^^^ The same example before formatting.

[caption id="attachment_542" align="alignleft" width="125" caption="Amazonas Magazine - now in English!"]<a href="http://www.seriouslyfish.comwp-content/uploads/2011/12/Amazonas-English-1.jpg"><img class="size-thumbnail wp-image-542" title="Amazonas English" src="/wp-content/uploads/2011/12/Amazonas-English-1-288x381.jpg" alt="Amazonas English" width="125" height="165" /></a>[/caption] 

Edited by Hans-Georg Evers, the magazine 'Amazonas' has been widely-regarded as among the finest regular publications in the hobby since its launch in 2005, an impressive achievment considering it's only been published in German to date. The long-awaited English version is just about to launch, and we think a subscription should be top of any serious fishkeeper's Xmas list... 

The magazine is published in a bi-monthly basis and the English version launches with the January/February 2012 issue with distributors already organised in the United States, Canada, the United Kingdom, South Africa, Australia, and New Zealand. There are also mobile apps availablen which allow digital subscribers to read on portable devices. 

It's fair to say that there currently exists no better publication for dedicated hobbyists with each issue featuring cutting-edge articles on fishes, invertebrates, aquatic plants, field trips to tropical destinations plus the latest in husbandry and breeding breakthroughs by expert aquarists, all accompanied by excellent photography throughout. 

U.S. residents can subscribe to the printed edition for just $29 USD per year, which also includes a free digital subscription, with the same offer available to Canadian readers for $41 USD or overseas subscribers for $49 USD. Please see the <a href="http://www.amazonasmagazine.com/">Amazonas website</a> for further information and a sample digital issue! 

Alternatively, subscribe directly to the print version <a href="https://www.amazonascustomerservice.com/subscribe/index2.php">here</a> or digital version <a href="https://www.amazonascustomerservice.com/subscribe/digital.php">here</a>. 

^^^ This one would probably only get links for a few glossary terms and none for species.


Example terms

$species_terms

339 => 'Aulonocara maylandi maylandi', 
340 => 'Aulonocara maylandi kandeensis', 
341 => 'Aulonocara sp. "walteri"', 
342 => 'Aulonocara sp. "stuartgranti maleri"', 
343 => 'Aulonocara stuartgranti', 
344 => 'Benthochromis tricoti', 
345 => 'Boulengerochromis microlepis', 
346 => 'Buccochromis lepturus', 
347 => 'Buccochromis nototaenia', 
348 => 'Betta brownorum', 
349 => 'Betta foerschi', 
350 => 'Betta coccina', 
351 => 'Betta uberis' 

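As you can see above, the general format for these scientific names is "Genus species", but they can often include "sp." or "aff." (for species that haven't been formally described) or follow a "Genus species subspecies" format.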

$glossary_terms

1 => 'abdomen', 
2 => 'caudal', 
3 => 'caudal-fin', 
4 => 'caudal-fin peduncle', 
5 => 'Gram\'s Method' 
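To make sure caudal-fin wins over caudal, one option is to sort the terms longest first before building an alternation, since PCRE tries alternatives from left to right. A minimal sketch, assuming plain-text input and the five glossary terms listed above:

```php
<?php
$glossary_terms = array(
    1 => 'abdomen',
    2 => 'caudal',
    3 => 'caudal-fin',
    4 => 'caudal-fin peduncle',
    5 => 'Gram\'s Method',
);

// Sort a copy by length, longest first, keeping the original ids as keys.
$sorted = $glossary_terms;
uasort($sorted, function ($a, $b) {
    return strlen($b) - strlen($a);
});

// Escape each term and join them into one alternation. Longer terms come
// first, so they are tried before their shorter prefixes.
$quoted = array_map(function ($term) {
    return preg_quote($term, '/');
}, $sorted);
$regexp = '/\b(?:' . implode('|', $quoted) . ')\b/';

$text = 'The caudal-fin peduncle and caudal-fin are behind the abdomen.';
preg_match_all($regexp, $text, $m);
print_r($m[0]); // caudal-fin peduncle, caudal-fin, abdomen
```

Without the sort, the shorter "caudal" would match first, because the hyphen after it counts as a word boundary.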

If anyone can come up with a filter that meets all of these conditions and requirements, I'd like to offer a bounty.

Thanks in advance,

+0

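Just a thought. Would it be possible to use preg_replace_callback to first "extract" all of the existing links and replace each with a unique identifier? After that, you can run your own preg_replace. Afterwards, replace each unique identifier with its original content. Voilà? – 2012-03-27 09:12:15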
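The "extract, replace, restore" idea from the comment above could look roughly like this. It is a sketch only: linkTermsOutsideTags is a hypothetical name, the tag regexp is deliberately naive (it assumes no ">" inside attribute values), and the placeholder bytes are assumed never to occur in the content.

```php
<?php
// Hypothetical helper illustrating the placeholder approach.
function linkTermsOutsideTags($html, $term, $url) {
    $stash = array();

    // 1. Swap whole <a>...</a> elements and every other tag for placeholders.
    $protected = preg_replace_callback(
        '~<a\b[^>]*>.*?</a>|<[^>]+>~is',
        function ($m) use (&$stash) {
            $key = "\x01" . count($stash) . "\x02";
            $stash[$key] = $m[0];
            return $key;
        },
        $html
    );

    // 2. Link the term in what is left, which is now plain text only.
    $linked = preg_replace(
        '~\b' . preg_quote($term, '~') . '\b~',
        '<a href="' . $url . '">$0</a>',
        $protected
    );

    // 3. Put the original markup back.
    return strtr($linked, $stash);
}

$in = '<img src="image.jpg" title="Betta uberis male" /> Betta uberis and '
    . '<a href="species/betta-uberis">Betta uberis</a>';
echo linkTermsOutsideTags($in, 'Betta uberis', 'species/betta-uberis'), "\n";
```

Only the bare occurrence in the middle gets a new link; the title attribute and the existing anchor come back untouched.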

Answers

4

I think it's better to use the DOMDocument functions than regexps. Here is a working prototype:

// Each dynamically constructed regexp will contain at most 70 subpatterns
define('GROUPS_PER_REGEXPS', 70);

$speciesTerms = array(
    339 => '(?:Aulonocara|A\.) maylandi maylandi',
    340 => '(?:Aulonocara|A\.) maylandi kandeensis',
    344 => '(?:Benthochromis|B\.) tricoti',
    345 => '(?:Boulengerochromis|B\.) microlepis',
);

function matchTerms($text) {
    // Globals are not good. I left it for the simplicity
    global $speciesTerms;

    $result = array();
    // Process the terms in chunks of at most GROUPS_PER_REGEXPS,
    // preserving the term ids as array keys
    foreach (array_chunk($speciesTerms, GROUPS_PER_REGEXPS, true) as $chunk) {
        // Maps capturing group identifiers to term ids
        $termMapping = array();

        // Dynamically construct regexp
        $groups = array();
        $c = 1;
        foreach ($chunk as $termId => $termPattern) {
            // Match word boundaries, so we don't capture "B. tricotisomeramblingstring"
            $groups[] = '(\b' . $termPattern . '\b)';
            $termMapping[$c++] = $termId;
        }
        $regexp = '/' . implode('|', $groups) . '/m';
        preg_match_all($regexp, $text, $matches, PREG_OFFSET_CAPTURE);
        for ($i = 1; $i < $c; $i++) {
            foreach ($matches[$i] as $matchData) {
                // matchData[0] holds matched string, e.g. Benthochromis tricoti
                // matchData[1] holds offset, e.g. 15
                if (isset($matchData[0]) && !empty($matchData[0])) {
                    $result[] = array(
                        'text' => $matchData[0],
                        'offset' => $matchData[1],
                        'id' => $termMapping[$i],
                    );
                }
            }
        }
    }
    // Sort by offset in descending order, so that splitting the text node
    // from the end keeps the earlier offsets valid
    usort($result, function($a, $b) {
        return $b['offset'] - $a['offset'];
    });
    return $result;
}

// loadHTML() must be called on an instance, not statically
$doc = new DOMDocument();
$doc->loadHTML($html);

// Stack will be used to avoid recursive functions
$stack = new SplStack;
$stack->push($doc);
while (!$stack->isEmpty()) {
    $node = $stack->pop();
    if ($node->nodeType == XML_TEXT_NODE && $node->parentNode instanceof DOMElement) {
        // $node represents text node
        // and it's inside a tag (second condition in the statement above)

        // Check that this text is not wrapped in <a> tag
        // as we don't want to wrap it twice
        if ($node->parentNode->tagName != 'a') {
            $matches = matchTerms($node->wholeText);
            foreach ($matches as $match) {
                // Create new link element in the DOM
                $link = $doc->createElement('a', $match['text']);
                $link->setAttribute('href', 'species/' . $match['id']);
                $link->setAttribute('class', 'link_species');

                // Save the text after the link
                $remainingText = $node->splitText($match['offset'] + strlen($match['text']));
                // Save the text before the link
                $linkText = $node->splitText($match['offset']);

                // Replace $linkText with $link node
                // i.e. 'something' becomes '<a href="..">something</a>'
                $node->parentNode->replaceChild($link, $linkText);
            }
        }
    }
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $childNode) {
            $stack->push($childNode);
        }
    }
}

$body = $doc->getElementsByTagName('body');
echo $doc->saveHTML($body->item(0));

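Implementation details

I've only shown how to replace species terms; glossary terms would work the same way. Links are formed as "species/$id". Abbreviations are handled correctly. DOMDocument is a very reliable parser: it can handle broken markup, and it's fast.

?: in a regexp stops that subpattern from being counted as a capturing group (documentation on subpatterns). Without a correct subpattern count we can't retrieve the termId. The idea is to build one big regexp pattern by concatenating all the regexps specified in the $speciesTerms array, separated by the pipe |. For the first two species the final regexp is (spaces added for clarity):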

 First capturing group    Alternation  Second capturing group 
((?:Aulonocara|A\.) maylandi maylandi)  |  ((?:Aulonocara|A\.) maylandi kandeensis) 

So the text "Example: Aulonocara maylandi maylandi, A. maylandi kandeensis" will give the following matches:

$matches[1] = array('Aulonocara maylandi maylandi') // Captured by the first group 
$matches[2] = array('A. maylandi kandeensis') // Captured by the second group 

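We can clearly say that all elements in $matches[1] refer to the species Aulonocara maylandi maylandi or A. maylandi maylandi, which has id = 339. If you use subpatterns in $speciesTerms, use the non-capturing form (?:).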
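The group-to-id mapping described above can be checked in isolation with the two example patterns. This sketch uses PREG_SET_ORDER instead of the prototype's PREG_OFFSET_CAPTURE, purely to keep the output short; $groupToId and $found are illustrative names.

```php
<?php
$speciesTerms = array(
    339 => '(?:Aulonocara|A\.) maylandi maylandi',
    340 => '(?:Aulonocara|A\.) maylandi kandeensis',
);

// One capturing group per term; group number N maps back to a term id.
$groups = array();
$groupToId = array();
$n = 1;
foreach ($speciesTerms as $id => $pattern) {
    $groups[] = '(\b' . $pattern . '\b)';
    $groupToId[$n++] = $id;
}
$regexp = '/' . implode('|', $groups) . '/';

$text = 'Example: Aulonocara maylandi maylandi, A. maylandi kandeensis';
preg_match_all($regexp, $text, $matches, PREG_SET_ORDER);

$found = array();
foreach ($matches as $match) {
    // Exactly one group takes part in each match; find it from the end.
    for ($i = count($match) - 1; $i >= 1; $i--) {
        if ($match[$i] !== '') {
            $found[] = array('id' => $groupToId[$i], 'text' => $match[$i]);
            break;
        }
    }
}
print_r($found);
```

If the (?:...) groups were written as plain (...) groups, each term would add extra capturing groups and the numbering in $groupToId would no longer line up with the term ids.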

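UPDATE: Each dynamically created regexp now has a cap on its number of subpatterns, defined by the constant at the top. This avoids PCRE's limit on the number of subpatterns in a regexp.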

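Important notes:

  • If you have a lot of terms, you should rewrite matchTerms, because a regexp has a limit on its number of subpatterns. In that case it's better to pre-build a set of regexps, one for every N terms.
  • matchTerms generates the regexps on every call, which obviously only needs to be done once
  • It is possible to use advanced regexps in speciesTerms
  • The provided $html will be wrapped in a <body> tag (unless it's already wrapped)
  • strlen => mb_strlen if you use a multibyte encoding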
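As a rough illustration of the first two notes, the chunked regexps could be built once up front and then reused for every text node. buildSpeciesRegexps is a hypothetical helper, not part of the prototype above, and the chunk size of 2 is only there to make the splitting visible.

```php
<?php
// Hypothetical helper: compile the term patterns into a list of regexps,
// each holding at most $groupsPerRegexp capturing groups.
function buildSpeciesRegexps(array $terms, $groupsPerRegexp = 70) {
    $compiled = array();
    // Passing true to array_chunk() keeps the term ids as keys.
    foreach (array_chunk($terms, $groupsPerRegexp, true) as $chunk) {
        $groups = array();
        $map = array();
        $n = 1;
        foreach ($chunk as $id => $pattern) {
            $groups[] = '(\b' . $pattern . '\b)';
            $map[$n++] = $id; // capturing group number => term id
        }
        $compiled[] = array(
            'regexp' => '/' . implode('|', $groups) . '/',
            'map'    => $map,
        );
    }
    return $compiled;
}

$speciesTerms = array(
    344 => '(?:Benthochromis|B\.) tricoti',
    345 => '(?:Boulengerochromis|B\.) microlepis',
    348 => '(?:Betta|B\.) brownorum',
);

// Build once, then loop over $regexps for each text node.
$regexps = buildSpeciesRegexps($speciesTerms, 2);
echo count($regexps), "\n"; // 2
```

A matching function would then take $regexps as a parameter instead of rebuilding the patterns from the global array on each call.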
+0

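Great reply, thank you. I'm trying to implement it now, but I'm a bit worried that if it doesn't work it will be too complicated to debug! What does the '(?:' in the species terms do? – dunc 2012-03-24 23:17:55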

+1

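@dunc Yeah, it looks like a lot of code, but it gives you a lot of flexibility. Also, the matching code ('matchTerms') can be tested separately from the replacing code (the 'DOMDocument' manipulation). This means it will be easier to debug and test. – galymzhan 2012-03-25 04:00:53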

+0

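@galymzhan Hi, thanks for your reply. Unfortunately, with 1360 names I get the following error: 'Warning: preg_match_all(): Compilation failed: regular expression is too large at offset 34743 in /dunc/test.php on line 14'. Is there anything that can be done to optimise the expression? I tried raising the relevant PCRE settings with 'ini_set', but it didn't make any difference. – dunc 2012-03-26 13:09:45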

2

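It's better to parse the HTML than to try to use regexps. Regexps are great when you have something specific to match, but they get weird when you're trying to not match something.

Using http://simplehtmldom.sourceforge.net/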

function addLinks($p, $species, $terms) {

    // much easier to say "not in an anchor tag" with parsed content than with regex
    if ($p->tag != 'a') {

        // pull out existing elements so they aren't replaced
        $children = array();
        $x = 0;

        // nodes are objects, so no reference (&) is needed to modify them
        foreach ($p->children as $e) {
            $children[] = $e->outertext;
            $e->outertext = '---child-'.$x.'---';
            $x++;
        }

        foreach ($species as $s) {
            $p->innertext = str_replace(
                $s,
                '<a href="species/'.strtolower(str_replace(' ','-',$s)).'">'.$s.'</a>',
                $p->innertext);
        }

        foreach ($terms as $t) {
            $p->innertext = str_replace(
                $t,
                '<a href="glossary/'.
                    strtolower($t[0]).'/'.
                    strtolower(str_replace(' ','-',$t)).'">'.$t.'</a>',
                $p->innertext);
        }

        // restore previous child elements
        foreach ($children as $x => $e) {
            $p->innertext = str_replace('---child-'.$x.'---', $e, $p->innertext);
        }

        foreach ($p->children() as $e) {
            addLinks($e, $species, $terms);
        }
    }
}


$html = new simple_html_dom();

// you may have to wrap $content in a div. not exactly sure how partial content is handled
$html->load($content);

addLinks($html, $species_terms, $glossary_terms);
$content = $html->save();

I haven't used simple_html_dom a whole lot, but this should get you going in the right direction.

+0

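I realised that replacing innertext would probably cause the same problem with existing anchor tags and images, so I added something that pulls out the existing child elements and then restores them, so that they aren't affected by the replacements in the parent. – 2012-03-24 18:02:52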

+0

I understood what you were doing up until those changes! :) Could you explain how it works? – dunc 2012-03-24 18:31:58

+0

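$p->innertext might return 'this is some text, and here is a link inside of it', including the markup of that link. So str_replace would still catch the text inside that anchor and double-link it. The change I added temporarily removes any child elements, does the replacements, and then adds them back. If I get a chance, I'll try to actually test whether it does what I think. – 2012-03-25 09:19:49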