JS from string


I am trying to understand this code, which extracts specific strings from an array:

function extractLinks(input) {
    var html = input.join('\n');
    var regex = /<a\s+([^>]+\s+)?href\s*=\s*('([^']*)'|"([^"]*)|([^\s>]+))[^>]*>/g;
    var match;
    while (match = regex.exec(html)) {
        var hrefValue = match[3];
        if (hrefValue == undefined) {
            hrefValue = match[4];
        }
        if (hrefValue == undefined) {
            hrefValue = match[5];
        }
        console.log(hrefValue);
    }
}

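By all accounts this is a simple function: it extracts all the href values, but only the real ones. For example, it should not pick up an href that is defined as class="href", or one that sits outside an A tag. What puzzles me is that the regex I came up with for the problem was (<a[\s\S]*?>). When I could not find a solution and looked at the original one, I found this very long regex. I tried the solution with my own regex, and it does not work.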

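Could someone please explain how to read this long regex? Also, the match is returned as an array, right? Let me see if I get the idea of the while loop:

while (match = the regex is found in the string) { something = match[3] /* why 3??? */ then, if undefined, something = match[4]; if undefined, something = match[5]; }

I am having a really hard time understanding the mechanics behind all this, and the logic of the regex.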

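The input is generated by a system that will pass in 10 different arrays of strings, but let's test with one. The code below is parsed into an array of strings whose length is the number of lines; each line is a separate element of the array, and that array is the input parameter of the function.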

<!DOCTYPE html> 
<html> 
<head> 
    <title>Hyperlinks</title> 
    <link href="theme.css" rel="stylesheet" /> 
</head> 
<body> 
<ul><li><a href="/" id="home">Home</a></li><li><a 
class="selected" href=/courses>Courses</a> 
</li><li><a href = 
'/forum' >Forum</a></li><li><a class="href" 
onclick="go()" href= "#">Forum</a></li> 
<li><a id="js" href = 
"javascript:alert('hi yo')" class="new">click</a></li> 
<li><a id='nakov' href = 
http://www.nakov.com class='new'>nak</a></li></ul> 
<a href="#empty"></a> 
<a id="href">href='fake'<img src='http://abv.bg/i.gif' 
alt='abv'/></a><a href="#">&lt;a href='hello'&gt;</a> 
<!-- This code is commented: 
    <a href="#commented">commentex hyperlink</a> --> 
</body>
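As a sanity check, here is a minimal sketch that runs the same pattern over part of the sample above and collects the values instead of logging them. The names `collectLinks` and `sample` are made up for this illustration and are not part of the original code; the regex itself is copied unchanged from the function in the question.

```javascript
// Hypothetical helper: same regex as extractLinks, but returns the values.
function collectLinks(input) {
    var html = input.join('\n');
    var regex = /<a\s+([^>]+\s+)?href\s*=\s*('([^']*)'|"([^"]*)|([^\s>]+))[^>]*>/g;
    var links = [];
    var match;
    while ((match = regex.exec(html)) !== null) {
        // Only one of groups 3, 4 and 5 takes part in any single match.
        var value = match[3] !== undefined ? match[3]
                  : match[4] !== undefined ? match[4]
                  : match[5];
        links.push(value);
    }
    return links;
}

// A subset of the sample page, split into lines like the real input.
var sample = `<ul><li><a href="/" id="home">Home</a></li><li><a
class="selected" href=/courses>Courses</a>
</li><li><a href =
'/forum' >Forum</a></li><li><a class="href"
onclick="go()" href= "#">Forum</a></li>
<a id="href">href='fake'<img src='http://abv.bg/i.gif'
alt='abv'/></a><a href="#">&lt;a href='hello'&gt;</a>
<!-- This code is commented:
    <a href="#commented">commentex hyperlink</a> -->`.split('\n');

console.log(collectLinks(sample));
// [ '/', '/courses', '/forum', '#', '#', '#commented' ]
```

Note that `class="href"`, `id="href"`, the text `href='fake'` and the escaped `&lt;a href='hello'&gt;` are all skipped, while the link inside the HTML comment is still reported, because the pattern has no notion of comments.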

Source

2014-11-14 Sineastra

[**All is lost, he comes**](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – adeneo 2014-11-14 23:25:26

To help you understand what this regex is doing, I have put inline comments on this page, which you can view. I have also copied it here:

<a\s+            # Look for '<a' followed by whitespace
([^>]+\s+)?      # Look for anything else that isn't 'href=',
                 # such as 'class=' or 'id='
href\s*=\s*      # Locate the 'href=' with any whitespace around the '=' character
(
    '([^']*)'    # Look for '...'
  |              # ...or...
    "([^"]*)     # Look for "..." (the closing quote is left to the final [^>]*)
  |              # ...or...
    ([^\s>]+)    # Look for anything that is NOT '>' or whitespace
)
[^>]*>           # Match anything else up to the closing '>'

This just breaks it apart so you can see what each part is doing. As for your question about match, I don't fully understand what you are asking.
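On the numbering of the groups: capture groups are counted by their opening parenthesis, left to right. In this pattern that makes group 1 the optional attributes before `href`, group 2 the whole quoted-or-bare alternation, and groups 3, 4 and 5 the single-quoted, double-quoted and unquoted value respectively. Only one alternative can match at a time, so the other two come back as `undefined`, which appears to be why the loop tries `match[3]`, then `match[4]`, then `match[5]`. With the `g` flag, each `exec` call resumes from `regex.lastIndex` and returns `null` once nothing more matches, which ends the `while` loop. The sketch below uses a hypothetical helper, `showGroups`, and drops the `g` flag because each call looks at a single tag.

```javascript
// Same pattern as in the question, without the g flag.
var pattern = /<a\s+([^>]+\s+)?href\s*=\s*('([^']*)'|"([^"]*)|([^\s>]+))[^>]*>/;

// Hypothetical helper: label the capture groups of one match.
function showGroups(tag) {
    var m = pattern.exec(tag);
    if (m === null) {
        return null; // the tag has no real href attribute
    }
    return {
        before: m[1], // attributes in front of href, if any
        outer: m[2],  // the whole alternation, quotes included
        single: m[3], // value from '...'
        double: m[4], // value from "..."
        bare: m[5]    // unquoted value
    };
}

console.log(showGroups("<a href='/forum' >"));                 // single is '/forum'
console.log(showGroups('<a href="/" id="home">'));             // double is '/'
console.log(showGroups('<a class="selected" href=/courses>')); // bare is '/courses'
console.log(showGroups('<a id="href">'));                      // null
```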

Source

2014-11-14 23:37:45 OnlineCop

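OK, thanks for the regex, I will have a look. The part about the `while` loop is this: why do we need the third element of the match array, and if it is undefined we take the fourth, and then the fifth? – Sineastra 2014-11-14 23:41:16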

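I think what is happening here is that some of the "captured" parts of the URL don't need to be kept. [This one](http://regex101.com/r/qQ3nA9/3) has a few changes so that it only captures the 'href=' part. In that case, you can see the replacement at the bottom of that page. – OnlineCop 2014-11-14 23:49:50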
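One way to capture only the href value is to turn the groups that are not needed into non-capturing groups, `(?:...)`. The sketch below is an assumption made for illustration, not the pattern behind the regex101 link; `hrefValues` is a hypothetical name, and the closing double quote is spelled out in the pattern, which the original does not do. The three possible values then become groups 1, 2 and 3.

```javascript
// Hypothetical variant: only the three value alternatives capture.
function hrefValues(html) {
    var hrefOnly = /<a\s+(?:[^>]+\s+)?href\s*=\s*(?:'([^']*)'|"([^"]*)"|([^\s>]+))[^>]*>/g;
    var values = [];
    var m;
    while ((m = hrefOnly.exec(html)) !== null) {
        // Pick whichever of the three alternatives took part in the match.
        values.push(m[1] !== undefined ? m[1]
                  : m[2] !== undefined ? m[2]
                  : m[3]);
    }
    return values;
}

console.log(hrefValues(
    '<a href="/" id="home">Home</a>' +
    '<a class="selected" href=/courses>Courses</a>' +
    "<a href = '/forum' >Forum</a>"
));
// [ '/', '/courses', '/forum' ]
```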

Thank you, sir. – Sineastra 2014-11-14 23:54:36
