2016-12-03 112 views
2

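Python - Extracting hashtags from text; ending at punctuation

For the end of my programming class, I have to create a function based on the following description:

The parameter is a tweet. The function should return a list containing all of the hashtags in the tweet, in the order they appear in the tweet. Each hashtag in the returned list should have the initial hash symbol removed, and hashtags should be unique. (If the tweet uses the same hashtag twice, it is included in the list only once. The order of the hashtags should match the order of each tag's first occurrence in the tweet.)

I'm not sure how to make a hashtag end when punctuation is encountered (see the second doctest example). My current code does not output anything: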

def extract(start, tweet): 
    """ (str, str) -> list of str 

    Return a list of strings containing all words that start with a specified character. 

    >>> extract('@', "Make America Great Again, vote @RealDonaldTrump") 
    ['RealDonaldTrump'] 
    >>> extract('#', "Vote Hillary! #ImWithHer #TrumpsNotMyPresident") 
    ['ImWithHer', 'TrumpsNotMyPresident'] 
    """ 

    words = tweet.split() 
    return [word[1:] for word in words if word[0] == start] 

def strip_punctuation(s): 
    """ (str) -> str 

    Return a string, stripped of its punctuation. 

    >>> strip_punctuation("Trump's in the lead... damn!") 
    'Trumps in the lead damn' 
    """ 
    return ''.join(c for c in s if c not in '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~') 

def extract_hashtags(tweet): 
    """ (str) -> list of str 

    Return a list of strings containing all unique hashtags in a tweet. 
    Outputted in order of appearance. 

    >>> extract_hashtags("I stand with Trump! #MakeAmericaGreatAgain #MAGA #TrumpTrain") 
    ['MakeAmericaGreatAgain', 'MAGA', 'TrumpTrain'] 
    >>> extract_hashtags("NEVER TRUMP. I'm with HER. Does #this! work?") 
    ['this'] 
    """ 

    hashtags = extract('#', tweet) 

    no_duplicates = [] 

    for item in hashtags: 
     if item not in no_duplicates and item.isalnum(): 
      no_duplicates.append(item) 

    result = [] 
    for hash in no_duplicates: 
     for char in hash: 
      if char.isalnum() == False and char != '#': 
       hash == hash[:char.index()] 
       result.append() 
    return result 

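I'm quite lost at this point; any help would be appreciated. Thank you in advance.

Note: we are *not* allowed to use regular expressions or import any modules.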

+1

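So.. if you need to end at punctuation, and there aren't *that* many punctuation marks, why not check whether the next character is punctuation? – Pythonista

Answers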

0

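You do seem a bit lost. The key to solving these types of problems is to divide the problem into smaller parts, solve those, and then combine the results. You've got every piece you need..: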

def extract_hashtags(tweet): 
    # strip the punctuation on the tags you've extracted (directly) 
    hashtags = [strip_punctuation(tag) for tag in extract('#', tweet)] 
    # hashtags is now a list of hash-tags without any punctuation, but possibly with duplicates 

    result = [] 
    for tag in hashtags: 
     if tag not in result: # check that we haven't seen the tag already (we know it doesn't contain punctuation at this point) 
      result.append(tag) 
    return result 
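
Note that stripping punctuation keeps any characters that follow it, so a word like `#this!that` would come out as `thisthat`. If the intent is for the hashtag to end at the first punctuation mark, as the comment on the question suggests, one possible sketch is to cut each tag at its first non-alphanumeric character. The name `extract_hashtags_truncating` is hypothetical and not from the thread, and the sketch assumes that only alphanumeric characters belong to a tag (so an underscore also ends it) and that tags start a whitespace-separated word, as in `extract` above. It uses no imports and no regular expressions:

```python
def extract_hashtags_truncating(tweet):
    """Return unique hashtags in order of first appearance, cutting each
    tag at its first non-alphanumeric character (hypothetical helper)."""
    result = []
    for word in tweet.split():
        if not word.startswith('#'):
            continue  # not a hashtag
        tag = ''
        for char in word[1:]:
            if not char.isalnum():
                break  # punctuation (or any non-alphanumeric) ends the tag
            tag += char
        # skip empty tags (a bare '#') and tags already collected
        if tag and tag not in result:
            result.append(tag)
    return result


print(extract_hashtags_truncating("NEVER TRUMP. I'm with HER. Does #this! work?"))
```

Whether truncating or stripping is the right behaviour depends on how the assignment defines the end of a hashtag; both give `['this']` for the second doctest example.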

PS: this is a problem very well suited to a regular-expression solution, but if you want a fast strip_punctuation you can use:

def strip_punctuation(s): 
    return s.translate(None, '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
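
The two-argument `translate(None, ...)` call above is the Python 2 form; on Python 3, `str.translate` takes a single table argument. A rough Python 3 equivalent, assuming the same punctuation set, builds a deletion table with the built-in `str.maketrans` (no import needed). The names `PUNCTUATION` and `strip_punctuation_py3` are hypothetical and used only for this sketch:

```python
# Same punctuation characters as in the thread's strip_punctuation
PUNCTUATION = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'


def strip_punctuation_py3(s):
    """Return s with all punctuation characters removed (Python 3 sketch)."""
    # The third argument of str.maketrans lists characters to delete
    table = str.maketrans('', '', PUNCTUATION)
    return s.translate(table)


print(strip_punctuation_py3("Trump's in the lead... damn!"))
```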