2017-05-05 95 views
-4

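Regex to match an anchor tag and its href

I want to run a regular expression over an HTML string that contains multiple anchor tags and build a dictionary mapping each link's text to its href URL.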

<p>This is a simple text with some embedded <a href="http://example.com/link/to/some/page?param1=77&param2=22">links</a>. This is a <a href="https://exmp.le/sample-page/?uu=1">different link</a>.

How can I extract the text and the href of every <a> tag in one pass?

Edit:

func extractLinks(html: String) -> Dictionary<String, String>? { 

    do { 
     let regex = try NSRegularExpression(pattern: "/<([a-z]*)\b[^>]*>(.*?)</\1>/i", options: []) 
     let nsString = html as NSString 
     let results = regex.matchesInString(html, options: [], range: NSMakeRange(0, nsString.length)) 
     return results.map { nsString.substringWithRange($0.range)} 
    } catch let error as NSError { 
     print("invalid regex: \(error.localizedDescription)") 
     return nil 
    } 
} 
+1

Where is your regex code? – matt

+0

@matt: They're waiting for you to write it. –

+0

It was very bad. – Rao

Answers

1

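First, you need to learn the basic syntax of an NSRegularExpression pattern:

  • The pattern does not contain delimiters (no leading or trailing /)
  • The pattern does not contain modifiers (such as a trailing i); you pass that information through options instead
  • When you want to use the metacharacter \, you need to escape it as \\ inside a Swift string literal

So the line that creates the NSRegularExpression instance should look something like this: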

let regex = try NSRegularExpression(pattern: "<([a-z]*)\\b[^>]*>(.*?)</\\1>", options: .caseInsensitive) 

However, as you may have already noticed, your pattern contains nothing that matches href or captures its value.
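
To make that concrete, here is a minimal sketch that runs the corrected generic pattern over one link from the question's sample. The names genericPattern, genericRegex, sample and captured are my own and are not part of the original code. It captures the tag name and the inner text, but nothing from the attributes:

```swift
import Foundation

// The corrected generic pattern: group 1 = tag name, group 2 = inner text.
let genericPattern = "<([a-z]*)\\b[^>]*>(.*?)</\\1>"
let genericRegex = try! NSRegularExpression(pattern: genericPattern, options: .caseInsensitive)

let sample = "This is a <a href=\"https://exmp.le/sample-page/?uu=1\">different link</a>."
let ns = sample as NSString

// Collect (tag name, inner text) for every match; the href is never captured.
var captured: [(tag: String, inner: String)] = []
for m in genericRegex.matches(in: sample, options: [], range: NSRange(location: 0, length: ns.length)) {
    captured.append((tag: ns.substring(with: m.range(at: 1)),
                     inner: ns.substring(with: m.range(at: 2))))
}
print(captured)
```

The href sits inside the part consumed by [^>]*, which is not a capture group, so it is matched and then thrown away.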

Something like this would work for your example html:

import Foundation

// Group 1: the quoted href value; group 2: the text between <a ...> and </a>.
let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
let regex = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)
let html = "<p>This is a simple text with some embedded <a\n" +
    "href=\"http://example.com/link/to/some/page?param1=77&param2=22\">links</a>.\n" +
    "This is a <a href=\"https://exmp.le/sample-page/?uu=1\">different link</a>."
let matches = regex.matches(in: html, options: [], range: NSRange(0..<html.utf16.count))
var resultDict: [String: String] = [:]
for match in matches {
    // Skip the quote characters surrounding the captured href value.
    // (range(at:) is the Swift 4+ spelling of Swift 3's rangeAt(_:).)
    let hrefRange = NSRange(location: match.range(at: 1).location+1, length: match.range(at: 1).length-2)
    let innerTextRange = match.range(at: 2)
    let href = (html as NSString).substring(with: hrefRange)
    let innerText = (html as NSString).substring(with: innerTextRange)
    resultDict[innerText] = href
}
print(resultDict)
//->["different link": "https://exmp.le/sample-page/?uu=1", "links": "http://example.com/link/to/some/page?param1=77&param2=22"] (dictionary order may vary)
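
Since the question asked for a function, the same idea could be wrapped up roughly as follows. This is only a sketch: the non-optional return type and the use of Range(_:in:) instead of NSString are my own choices, and it assumes Swift 4 or later.

```swift
import Foundation

/// Sketch of a helper that maps each link's inner text to its href.
/// Links that share the same text overwrite each other.
func extractLinks(html: String) -> [String: String] {
    let pattern = "<a\\b[^>]*\\bhref\\s*=\\s*(\"[^\"]*\"|'[^']*')[^>]*>((?:(?!</a).)*)</a\\s*>"
    // The pattern is a fixed literal, so a failure here would be a programmer error.
    let regex = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)
    let fullRange = NSRange(html.startIndex..., in: html)
    var links: [String: String] = [:]
    for match in regex.matches(in: html, options: [], range: fullRange) {
        guard let quotedRange = Range(match.range(at: 1), in: html),
              let textRange = Range(match.range(at: 2), in: html) else { continue }
        // Drop the surrounding quote characters from the captured href.
        let href = String(html[quotedRange].dropFirst().dropLast())
        links[String(html[textRange])] = href
    }
    return links
}

let links = extractLinks(html: "Go <a href='https://exmp.le/sample-page/?uu=1'>here</a> or <A HREF=\"http://example.com/\">there</A>.")
print(links)
```

Using the link text as the dictionary key follows the question, but it means two links with identical text cannot both be kept; an array of (text, href) pairs would avoid that.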

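Keep in mind that my pattern above may wrongly match malformed a-tags or miss some nested structures, and it has no support for HTML character entities...

If you want your code to be more robust and general, you should consider using an HTML parser, as ColGraff and Rob suggested.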
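
To illustrate the entity limitation: in well-formed HTML an & inside an href is written as &amp;, and the regex returns it undecoded. A rough sketch of a workaround, assuming you only care about a handful of common entities (decodeBasicEntities is a hypothetical helper of mine, not a Foundation API):

```swift
import Foundation

// Hypothetical helper covering only five common entities; real HTML has many more.
func decodeBasicEntities(_ text: String) -> String {
    // "&amp;" is replaced last so that "&amp;lt;" becomes "&lt;" rather than "<".
    let replacements = [("&lt;", "<"), ("&gt;", ">"), ("&quot;", "\""), ("&#39;", "'"), ("&amp;", "&")]
    var result = text
    for (entity, character) in replacements {
        result = result.replacingOccurrences(of: entity, with: character)
    }
    return result
}

let rawHref = "http://example.com/page?param1=77&amp;param2=22"
let decodedHref = decodeBasicEntities(rawHref)
print(decodedHref)
```

This does not handle numeric references in general or the full list of named entities, which is one more reason to prefer a real parser.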