Regex to skip some specific characters

I want to clean a string so that it contains no punctuation or digits; it must contain only a-z and A-Z. For example, given the string:

"coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"

the desired output is:

['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

My solution is:

re.findall(r"([A-Za-z]+)" ,string)

My output:

['coMPuter', 'scien', 'tist', 's', 'are', 'the', 'rock', 'stars', 'of', 'tomorrow', 'cool']

Source

2017-03-04 Raja Hammad Farooq

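Your best option is a simple substitution that removes every character that is not a-z, A-Z or a space: '[^ A-Za-z]+' (you can use '\s' in place of the space inside the brackets), and then split the string using the space as the separator. In a regular expression, you can –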

Could you please elaborate? –

@cfqueryparam Thanks, I see what you mean: re.sub(r'([^a-zA-Z\s]+)', '', s).split() –

You don't need to use a regular expression:

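(Convert the string to lowercase if you want every word in lowercase.) Split it into words, keep only the words that start with a letter, and strip the non-letter characters from each of them: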

>>> s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????" 
>>> [filter(str.isalpha, word) for word in s.lower().split() if word[0].isalpha()] 
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

In Python 3.x, filter(str.isalpha, word) should be replaced with ''.join(filter(str.isalpha, word)), because in Python 3.x filter returns a filter object rather than a string.
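As a minimal sketch of the Python 3 variant described above (the function name clean_words is only illustrative and not part of the original answer), the same split-and-filter idea could be written like this. Note that str.isalpha also accepts non-ASCII letters, so this assumes the input only contains ASCII letters where it matters.

```python
def clean_words(text):
    """Lowercase the text, split on whitespace, and keep only the letters
    of those words whose first character is a letter."""
    return [
        "".join(filter(str.isalpha, word))  # join is needed in Python 3
        for word in text.lower().split()
        if word[0].isalpha()                # skips tokens such as "<cool>" and "????"
    ]


s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
print(clean_words(s))
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```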

Source

2017-03-04 03:50:12 falsetru

Thanks, it works for me... Could you please tell me which is more time-efficient, the regex or the loop approach? –

@RajaHammadFarooq, no regex answer has been given, so I can't compare. – falsetru
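One way to answer the timing question yourself is to measure both approaches with the standard timeit module. The sketch below is only an assumption about how such a comparison might be set up: the helper names with_regex and with_filter are hypothetical, the two functions are not strictly equivalent (the regex version keeps "cool"), and the numbers will depend on your machine and input.

```python
import re
import timeit

s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"

# Precompiled pattern: runs of anything that is not a letter or whitespace.
NON_LETTERS = re.compile(r"[^a-zA-Z\s]+")


def with_regex(text):
    # Remove non-letters, lowercase, then split on whitespace.
    return NON_LETTERS.sub("", text).lower().split()


def with_filter(text):
    # Split first, then keep the letters of words that start with a letter.
    return ["".join(filter(str.isalpha, w))
            for w in text.lower().split() if w[0].isalpha()]


# Time each approach; only the relative difference is meaningful.
for func in (with_regex, with_filter):
    seconds = timeit.timeit(lambda: func(s), number=2000)
    print(func.__name__, round(seconds, 4))
```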

Using re, although I'm not sure this is what you want, since you said you don't want 'cool' left in the result.

import re 

s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????" 

REGEX = r'([^a-zA-Z\s]+)' 

cleaned = re.sub(REGEX, '', s).split() 
# ['coMPuter', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool']

Edit

WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)') 
CLEAN_REGEX = re.compile(r'([^a-zA-Z])') 

def cleaned(match_obj): 
    return re.sub(CLEAN_REGEX, '', match_obj.group(1)).lower() 

[cleaned(x) for x in re.finditer(WORD_REGEX, s)] 
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

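WORD_REGEX uses a positive lookahead for any word character and a negative lookahead for <...>. Whatever non-whitespace text makes it past the lookaheads is captured in a group: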

(?!<?\S+>) # negative lookahead 
(?=\w) # positive lookahead 
(\S+) #group non-whitespace

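cleaned takes the match object, removes every non-letter character from the captured group with CLEAN_REGEX, and lowercases the result.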
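To see what each stage contributes, it can help to look at the raw tokens that WORD_REGEX captures before CLEAN_REGEX is applied. The following is just an illustrative sketch reusing the pattern from the answer; the variable name raw_tokens is made up for this example.

```python
import re

# Same pattern as in the answer above.
WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')

s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"

# Tokens that survive both lookaheads, before any cleaning is done.
raw_tokens = [m.group(1) for m in WORD_REGEX.finditer(s)]
print(raw_tokens)
# ['coMPuter', 'scien_tist-s', 'are,,,', 'the', 'rock__stars', 'of', 'tomorrow_']
```

The "<cool>" token is rejected by the negative lookahead and "????" by the positive one; the punctuation still inside the surviving tokens is what CLEAN_REGEX removes afterwards.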

Source

2017-03-04 04:11:06 Crispin

The OP wants ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow'] – falsetru

Yes, this is also a good approach. I also want to skip whatever is inside the angle brackets; how should I do that? –

With the suggestions from everyone who answered, I arrived at the solution I really wanted. Thanks to everyone...

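import re

s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
# Remove <...> spans first, then any run of characters that are not letters or whitespace.
cleaned = re.sub(r'(<.*>|[^a-zA-Z\s]+)', '', s).split()
print(cleaned)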

Source

2017-03-04 04:53:41

What if 'cool' were surrounded by quotes instead, as in 'tomorrow_ "cool"'? Should it be included? – falsetru

Yes, then it should be included. –

I think that's a good approach – Crispin
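For the case raised in the comments, here is a hedged sketch of a slightly stricter variant. It is not the poster's code: the name clean, the use of <[^>]*> instead of the greedy <.*>, and the added lower() call are assumptions made for illustration. With the greedy form, a string containing two bracketed parts would lose everything between the first "<" and the last ">"; the negated character class avoids that, while quoted words are still kept.

```python
import re

# <[^>]*> matches one bracketed span at a time; the second alternative
# removes runs of characters that are neither letters nor whitespace.
PATTERN = re.compile(r"<[^>]*>|[^a-zA-Z\s]+")


def clean(text):
    return PATTERN.sub("", text).lower().split()


print(clean('the rock__stars of tomorrow_ "cool" <not> this <or> that'))
# ['the', 'rockstars', 'of', 'tomorrow', 'cool', 'this', 'that']
```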
