Merging multiple strings with the same content but overlapping HTML tags in Python

The title may not be a clear question on its own, so here is an example instead.

I have an example string:

Created and managed websites for clients to communicate securely

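This string has many "versions". In each version, a word or phrase of the string is wrapped in an HTML div tag, e.g. <div style="font-size: 0.1000000">foo bar</div>. (The tags are arbitrary: the number in the font-size attribute stands for a score that will later be used for other CSS properties, which is irrelevant for now.) Here are 4 versions of the string: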

Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely 
Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely 
Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely 
<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely

I want to merge all of these versions into this:

<div style="font-size: 4">Created</div> and <div style="font-size: 2"><div style="font-size: 1">managed</div> websites</div> for clients to <div style="font-size: 3">communicate</div> securely

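Here we can see that some tags overlap (the font-size: 2 and font-size: 1 tags). The number of versions of the string can be anywhere from 1 to 50, so there may be several overlaps.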

Here is what I have so far, using regular expressions:

import re 

div_str = "<div style=.*</div>" # the div tags 
div_text_str = "(?<=(>)).*(?=(</div>))" # the content inside the div tags 

# compile the regexes 
div_regex = re.compile(div_str) 
div_text_regex = re.compile(div_text_str) 

def merge_strings(str1, str2): 

    # grab the div tag off the first version 
    div = div_regex.search(str1).group() 
    # grab the contents of that div tag 
    div_text = div_text_regex.search(div).group() 

    # find the div content in the second version, then substitute 
    # with the div tag 
    return re.sub(div_text, div, str2)

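I run this function in a loop, merging two strings at a time until I get the final output. The problem I am facing is that the function does not work for overlapping tags, because the regex pattern does not match them. Substituting several div tags at once also fails.

Any help would be much appreciated!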
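To make the failure concrete, here is a small sketch of how the div pattern behaves once a string holds more than one div. The `greedy` pattern is the one from the code above; the `lazy` variant and the two sample strings are illustrative additions, not part of the original code.

```python
import re

# Pattern from the question (greedy) and a non-greedy variant for comparison.
greedy = re.compile(r"<div style=.*</div>")
lazy = re.compile(r"<div style=.*?</div>")

# A string that already contains two separate divs.
two_divs = (
    'Created and <div style="font-size: 1">managed</div> websites '
    'for clients to <div style="font-size: 3">communicate</div> securely'
)
# A string that contains nested divs.
nested = (
    'Created and <div style="font-size: 2"><div style="font-size: 1">'
    'managed</div> websites</div> for clients'
)

# Greedy: runs from the first opening tag to the LAST closing tag,
# swallowing the plain text between the two divs.
greedy_match = greedy.search(two_divs).group()
print(greedy_match)

# Non-greedy: stops at the FIRST closing tag, so on nested input it
# returns two opening tags but only one closing tag.
lazy_match = lazy.search(nested).group()
print(lazy_match)
```

Neither quantifier can track nesting depth, which is why a parser is a better fit for this input.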

Source

2017-08-27 kug3lblitz

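I figured it out. I replaced the regexes with BeautifulSoup to make parsing easier, and I sort the versions by the length of the text between their div tags, to avoid running into problems with finding substrings.

Using the same sample: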

Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely 
Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely 
Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely 
<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely

The lines are held in a list, which is then sorted by the length of the text between each version's div tags, extracted with BeautifulSoup. Here is the code:

from bs4 import BeautifulSoup

def __merge_strings(final_str, version):
    # final_str: the merged string so far; version: the div tag to add
    soup = BeautifulSoup(final_str, "html.parser")

    for fixed_div in soup.find_all("div"):
        if not fixed_div.text == version.text:
            # wrap the matching plain text in this version's div tag
            return final_str.replace(version.text, str(version))

    return final_str

# found_terms starts out as the list of version strings
found_terms = (
    (i, BeautifulSoup(i, "html.parser").find("div"))
    for i in found_terms
) # pairs of the version and its div tag
found_terms = sorted(
    found_terms, key=lambda x: len(x[-1].text), reverse=True
) # sort on the length of the div text to avoid issues with substrings

current_div = found_terms[0][0] # version with the largest div text
for i in range(1, len(found_terms)):
    current_div = __merge_strings(current_div, found_terms[i][-1])
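The reason for sorting by length can be checked with plain string operations. The sketch below uses two of the sample versions; the variable names are illustrative and not taken from the code above.

```python
plain = "Created and managed websites for clients to communicate securely"
inner_text = "managed"
inner_tag = '<div style="font-size: 1">managed</div>'
outer_text = "managed websites"
outer_tag = '<div style="font-size: 2">managed websites</div>'

# Shorter text first: the inserted tag splits "managed websites",
# so the longer phrase can no longer be found as a substring.
short_first = plain.replace(inner_text, inner_tag)
outer_still_found = outer_text in short_first

# Longer text first: "managed" is still intact inside the outer div,
# so the shorter replacement nests correctly.
long_first = plain.replace(outer_text, outer_tag)
inner_still_found = inner_text in long_first
merged = long_first.replace(inner_text, inner_tag)

print(outer_still_found, inner_still_found)
print(merged)
```

Note that `str.replace` rewrites every occurrence of the text, so this approach assumes each tagged phrase appears only once in the string and never inside a tag's attributes.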

Source

2017-08-29 22:07:45 kug3lblitz

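This is not a proper answer.

I will mention that parsing HTML with regular expressions usually makes life unnecessarily difficult. It is better to use a parser such as BeautifulSoup, lxml, scrapy, etc.

It is easy to recover the text from each of the lines you provided as examples. I assume each one is part of a larger structure, so I have enclosed each of them in a div.

Here I use BeautifulSoup to get the text from each line.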

>>> import bs4
>>> for line in open('temp.htm').readlines():
...     line = line.strip()
...     print(line)
...     soup = bs4.BeautifulSoup(line, 'lxml')
...     soup.find('div').text
...
<div>Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely</div> 
'Created and managed websites for clients to communicate securely' 
<div>Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely</div> 
'Created and managed websites for clients to communicate securely' 
<div>Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely</div> 
'Created and managed websites for clients to communicate securely' 
<div><div style="font-size: 4">Created</div> and managed websites for clients to communicate securely</div> 
'Created and managed websites for clients to communicate securely'

Unfortunately, I do not understand how to map the input lines to the output HTML.
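One possible way to think about that mapping, offered only as a sketch: since every version shares the same plain text, each div can be recorded as a (start, end) offset into that text, and the merged HTML can be rebuilt by emitting the tags at those offsets with longer spans opened first. The code below uses the standard-library `html.parser` rather than BeautifulSoup so that it has no dependencies; `SpanCollector` and `merge_versions` are hypothetical names, and the approach assumes that spans nest cleanly and never cross.

```python
from html import escape
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the plain text and a (start, end, opening_tag) span per div."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = ""        # plain text seen so far
        self.open_divs = []   # (start offset, raw opening tag) of open divs
        self.spans = []       # finished (start, end, raw opening tag) triples

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.open_divs.append((len(self.text), self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            start, opening = self.open_divs.pop()
            self.spans.append((start, len(self.text), opening))

    def handle_data(self, data):
        self.text += data


def merge_versions(versions):
    """Merge versions of one string into a single string with nested divs."""
    text, spans = None, []
    for version in versions:
        collector = SpanCollector()
        collector.feed(version.strip())
        collector.close()
        if text is None:
            text = collector.text
        elif collector.text != text:
            raise ValueError("versions do not share the same plain text")
        spans.extend(collector.spans)
    text = text or ""

    # Earlier spans first; at the same start, longer (outer) spans first.
    spans.sort(key=lambda span: (span[0], -span[1]))
    out, open_ends, pos = [], [], 0

    def emit_text_until(offset):
        nonlocal pos
        out.append(escape(text[pos:offset], quote=False))
        pos = offset

    def close_innermost():
        emit_text_until(open_ends.pop())
        out.append("</div>")

    for start, end, opening in spans:
        # Close every div that ends before this one starts.
        while open_ends and open_ends[-1] <= start:
            close_innermost()
        if open_ends and end > open_ends[-1]:
            raise ValueError("div spans cross and cannot be nested")
        emit_text_until(start)
        out.append(opening)
        open_ends.append(end)
    while open_ends:
        close_innermost()
    emit_text_until(len(text))
    return "".join(out)


versions = [
    'Created and <div style="font-size: 1">managed</div> websites for clients to communicate securely',
    'Created and <div style="font-size: 2">managed websites</div> for clients to communicate securely',
    'Created and managed websites for clients to <div style="font-size: 3">communicate</div> securely',
    '<div style="font-size: 4">Created</div> and managed websites for clients to communicate securely',
]
print(merge_versions(versions))
```

Working on offsets rather than substring replacement means repeated words are not a problem, at the cost of rejecting inputs whose spans partially overlap.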

Source

2017-08-27 17:16:00

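BeautifulSoup is already being used in the project to parse the input HTML! The text you extracted from these lines already exists; it is what the different analyses work on to return these different versions. Ultimately, the project parses an HTML file -> runs the analyses and generates these versions -> replaces the original HTML content with all of these versions merged, as in my example. – kug3lblitz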
