2016-07-15 26 views

How do I replace keywords in the values of dictionaries in a list (case-insensitive)?

I have a list of keywords:

keywords = ["test", "Ok", "great stuff", "PaaS", "mydata"] 

and a list of dictionaries of this form:

statements = [ 
{"id":"1","text":"Test, this is OK, great stuff, PaaS."}, 
{"id":"2","text":"I would like to test this, Great stuff."} 
] 

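Desired behaviour

When a keyword is present in statement['text'] (regardless of case), I want to replace it with a "marked-up" version of the keyword, i.e. the matched keyword Test would become: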

<span class="my_class" data-mydata="<a href=&quot;#&quot;>test</a>">Test</span> 

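What I have tried

Below is what I have tried. Observations and caveats:

01) It does not replace the keywords.

02) If it did, I would not want further matches to occur inside the markup once it has been applied, i.e. mydata inside the markup should not be matched.

03) I may have started off in the wrong direction and need to redesign the logic from scratch.

Python 2.7 code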

import re 

keywords = ["test", "ok", "great stuff", "paas"] 

statements = [ 
{"id":"1","text":"Test, this is OK, great stuff, PaaS."}, 
{"id":"2","text":"I would like to test this, Great stuff."} 
] 

keyword_markup = {} 

print "\nKEYWORDS (all lowercase):\n" 

for i in keywords: 
    print "\"" + i + "\" " 

print "\nORIGINAL STATEMENTS:\n" 

for statement in statements: 
    print statement['text'] + "\n" 

statement_counter = 1 
# for each statement 
for statement in statements: 
    print "\nIN STATEMENT " + str(statement_counter) + ": \n" 
    # get the original statement 
    original_statement = statement['text'] 
    # for each keyword in the keyword list 
    for keyword in keywords: 
     # if the keyword is not in the keyword_markup dict 
     # add it (with a lowercase key) 
     if keyword.lower() not in keyword_markup: 
      keyword_markup[keyword.lower()] = "<span class=\"my_class\" data-mydata=\"<a href=&quot;#&quot;>" + keyword + "</a>\">" + keyword + "</span>" 
      print "The key added to the keyword_markup dict is: " + keyword.lower() 
     # if the keyword is in a lowercase version of the statement 
     if keyword in original_statement.lower(): 
      # sanity check - print the matched keyword 
      print "The keyword matched in the statement is: " + keyword 
      # change the text value of the statement "in place" 
      # by replacing the keyword, with its marked up equivalent. 
      # using the original_statement as the source string 
      statement['text'] = re.sub(keyword,keyword_markup[keyword.lower()],original_statement) 
    statement_counter += 1 

print "\nMARKED UP KEYWORDS AVAILABLE:\n" 

for i in keyword_markup: 
    print keyword_markup[i] 

print "\nNEW STATEMENTS:\n" 

for statement in statements: 
    print statement['text'] + "\n" 

Result

KEYWORDS (all lowercase): 

"test" 
"ok" 
"great stuff" 
"paas" 

ORIGINAL STATEMENTS: 

Test, this is OK, great stuff, PaaS. 

I would like to test this, Great stuff. 


IN STATEMENT 1: 

The key added to the keyword_markup dict is: test 
The keyword matched in the statement is: test 
The key added to the keyword_markup dict is: ok 
The keyword matched in the statement is: ok 
The key added to the keyword_markup dict is: great stuff 
The keyword matched in the statement is: great stuff 
The key added to the keyword_markup dict is: paas 
The keyword matched in the statement is: paas 

IN STATEMENT 2: 

The keyword matched in the statement is: test 
The keyword matched in the statement is: great stuff 

MARKED UP KEYWORDS AVAILABLE: 

<span class="my_class" data-mydata="<a href=&quot;#&quot;>test</a>">test</span> 
<span class="my_class" data-mydata="<a href=&quot;#&quot;>paas</a>">paas</span> 
<span class="my_class" data-mydata="<a href=&quot;#&quot;>ok</a>">ok</span> 
<span class="my_class" data-mydata="<a href=&quot;#&quot;>great stuff</a>">great stuff</span> 

NEW STATEMENTS: 

Test, this is OK, great stuff, PaaS. 

I would like to test this, Great stuff. 

Have you tried tokenizing the input, marking up the special tokens, and then reassembling the tokens into the output? https://docs.python.org/3/library/re.html#writing-a-tokenizer – IceArdor

Answer

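I was able to do this without regular expressions, but if regex is the direction you want to go, re.sub or re.findall with re.IGNORECASE is a good place to start (as you have discovered).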
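For readers who do want the regex route, here is a minimal sketch of what that could look like. It is not the code from this answer: the function name markup_with_regex is hypothetical, the markup string is copied from the question, and the keyword list is assumed to be non-empty. All keywords go into one alternation and re.sub is called once with a replacement function, so text produced by the markup is never scanned again.

```python
import re


def markup_with_regex(text, keywords):
    """Hypothetical helper: wrap every keyword found in text, ignoring case."""
    # Longest keywords first, so a longer keyword is preferred over a
    # shorter one that starts at the same position.
    ordered = sorted(keywords, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '+' in a keyword literal.
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)

    def replace(match):
        found = match.group(0)  # the keyword exactly as written in text
        return ('<span class="my_class" data-mydata="<a href=&quot;#&quot;>'
                '{}</a>">{}</span>').format(found.lower(), found)

    # A single pass over the original text: the inserted markup is not rescanned.
    return pattern.sub(replace, text)


print(markup_with_regex("I would like to test this, Great stuff.",
                        ["test", "great stuff", "mydata"]))
```

Like the code in this answer, the sketch matches keywords inside longer words as well. Wrapping the alternation in \b word boundaries would be one way to restrict it to whole words, if that is what you need.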

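I also started thinking about writing a single-pass tokenizer, but decided that a multi-pass system is easier to understand and maintain than some ugly state machine.

The following code is optimized for readability rather than performance.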

def main(): 
    keywords = ["test", "ok", "great stuff", "paas"] 

    statements = [ 
     {"id":"1","text":"Test, this is OK, great stuff, PaaS."}, 
     {"id":"2","text":"I would like to test this, Great stuff."} 
    ] 

    for statement in statements: 
     m = markup_statement(statement['text'], keywords) 
     print('id={}, text={}'.format(statement['id'], m)) 

This produces the following output:

id=1, text=<a href="#">Test</a>, this is <a href="#">OK</a>, <a href="#">great stuff</a>, <a href="#">PaaS</a>. 
id=2, text=I would like to <a href="#">test</a> this, <a href="#">Great stuff</a>. 

Here are the supporting functions:

def markup_statement(statement, keywords): 
    """Returns a string where keywords in statement are marked up 

    >>> markup_statement('ThIs is a tEst stAtement', ['is', 'test']) 
    'Th<a href="#">Is</a> <a href="#">is</a> a <a href="#">tEst</a> stAtement' 
    """ 
    markedup_statement = [] 
    keywords_lower = {k.lower() for k in keywords} 
    for token in tokenize(statement, keywords): 
     if token.lower() in keywords_lower: 
      markedup_statement.append(markup(token)) 
     else: 
      markedup_statement.append(token) 
    return ''.join(markedup_statement) 

def markup(keyword): 
    """returns the marked up version of a keyword/token (retains the original case) 
    This function provides the same markup regardless of keyword, but it could be 
    modified to provide keyword-specific markup 

    >>> markup("tEst") 
    '<a href="#">tEst</a>' 
    """ 
    return '<a href="#">{}</a>'.format(keyword) 

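This tokenizer makes multiple passes over the statement, one pass per keyword. The order of the keywords can affect the tokens that tokenize returns. For example, if the markup replacement function were markup = {'at': lambda x: '@', 'statement': lambda x: '<code>{}</code>'.format(x)}.get, then 'This is a statement' could become either 'This is a st@ement' or 'This is a <code>statement</code>'.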

def tokenize(statement, keywords): 
    """Adapted from https://docs.python.org/3/library/re.html#writing-a-tokenizer 
    Splits statement on keywords 
    Assumes that there is no overlap between keywords in statement 

    >>> tokenize('ThIs is a tEst stAtement', ['is', 'test']) 
    ['Th', 'Is', ' ', 'is', ' a ', 'tEst', ' stAtement'] 
    >>> ''.join(tokenize(statement, keywords)) == statement 
    True 
    """ 
    statement_fragments = [statement] 
    for keyword in keywords: 
     statement_fragments = list(split(statement_fragments, keyword)) 
    return statement_fragments 
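The order sensitivity is easiest to see with two keywords that overlap in the text. The sketch below restates tokenize and the split helper in compact form so that it runs on its own. The input string and both keywords are hypothetical and were chosen only to make the overlap visible.

```python
def split(statement_fragments, keyword):
    """Compact restatement of split(): cut each fragment at every keyword hit."""
    keyword_lower = keyword.lower()
    length = len(keyword)
    for fragment in statement_fragments:
        i = fragment.lower().find(keyword_lower)
        while i != -1:
            yield fragment[:i]
            yield fragment[i:i + length]
            fragment = fragment[i + length:]
            i = fragment.lower().find(keyword_lower)
        yield fragment  # whatever is left over


def tokenize(statement, keywords):
    """Compact restatement of tokenize(): one pass per keyword, in list order."""
    fragments = [statement]
    for keyword in keywords:
        fragments = list(split(fragments, keyword))
    return fragments


text = "a test case study"  # hypothetical input with overlapping keywords
first = tokenize(text, ["test case", "case study"])
second = tokenize(text, ["case study", "test case"])
print(first)   # ['a ', 'test case', ' study']
print(second)  # ['a test ', 'case study', '']
```

Whichever keyword is processed first claims the shared word "case", so the other keyword no longer appears intact in any fragment.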

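The split helper below is not particularly fast, but it is simple enough to explain the idea. I could have used re.split(pattern, string, flags=re.IGNORECASE) here, but I avoid regular expressions when plain ("vanilla") Python logic works, because regex code is rarely readable and is not particularly fast either.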

def split(statement_fragments, keyword): 
    """Split each statement fragment by keywords 
    statement_fragments: list of strings 
    keyword: string 
    yields strings; the result may contain as many items as statement_fragments, or more 

    This repeatedly trims and lowercases strings. If it's a bottleneck, 
    rewrite it with a start and end index slices 

    >>> list(split(['ThIs is a tEst stAtement'], 'is')) 
    ['Th', 'Is', ' ', 'is', ' a tEst stAtement'] 
    """ 
    keyword_lower = keyword.lower() 
    length = len(keyword) 
    for fragment in statement_fragments: 
     i = fragment.lower().find(keyword_lower) 
     while i != -1: 
      yield fragment[:i] 
      yield fragment[i:i+length] 
      fragment = fragment[i+length:] 
      i = fragment.lower().find(keyword_lower) 
     # yield whatever is left over 
     yield fragment 

Without the comments, this is about 30 lines of code with no imports.
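For comparison, the re.split alternative mentioned earlier might look roughly like the sketch below. It is only an illustration: the name tokenize_with_re is hypothetical, and a non-empty keyword list is assumed. Putting the alternation inside a capturing group makes re.split keep the matched keywords in the result, so the tokens can still be joined back into the original statement.

```python
import re


def tokenize_with_re(statement, keywords):
    """Hypothetical regex counterpart of tokenize(): split statement on keywords."""
    # Longest keywords first, so a longer keyword is preferred over a
    # shorter one that starts at the same position.
    ordered = sorted(keywords, key=len, reverse=True)
    # The capturing group (...) makes re.split return the separators too.
    pattern = "(" + "|".join(re.escape(k) for k in ordered) + ")"
    return re.split(pattern, statement, flags=re.IGNORECASE)


tokens = tokenize_with_re('ThIs is a tEst stAtement', ['is', 'test'])
print(tokens)  # ['Th', 'Is', ' ', 'is', ' a ', 'tEst', ' stAtement']
```

Because all keywords are matched in a single left-to-right scan, the result does not depend on the order of the keyword list, unlike the multi-pass version.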

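It took me a few hours to fully understand and appreciate this, as it is slightly above my skill level, but thank you. Kudos for keeping readability in mind! I made a few modifications and it works as expected. In the 'markup()' function I replaced 'return '<a href="#">{}</a>'.format(keyword)' with a 'return' that fills the markup from '(keyword.lower(),keyword)'. I also wanted the changed dictionary values to persist in the 'statements' list, so I replaced the 'print' statement in 'main()' with 'statement['text'] = m', and it worked! – user1063287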