To test Zipf's law, I need to write a Python script that removes every word containing non-alphabetic characters from a text file. For example, how do I turn
[email protected] said: I've taken 2 reports to the boss
into
taken reports to the boss
How should I proceed?
Using a regular expression that matches tokens made up only of letters (and underscores), you can do this:
import re
s = "[email protected] said: I've taken 2 reports to the boss"
# s = open('text.txt').read()
tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
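Since the end goal is testing Zipf's law, the cleaned tokens can be fed straight into a frequency count. Here is a minimal sketch using the standard library's collections.Counter on the example sentence (for a real test you would read your whole text file instead):

```python
import re
from collections import Counter

s = "[email protected] said: I've taken 2 reports to the boss"
tokens = [t for t in s.strip().split() if re.match(r'[^\W\d]*$', t)]

# Rank words by frequency; under Zipf's law, frequency is roughly
# proportional to 1/rank.
freq = Counter(tokens)
for rank, (word, count) in enumerate(freq.most_common(), start=1):
    print(rank, word, count)
```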
string = "[email protected] said: I've taken 2 reports to the boss"
array = string.split(' ')
result = []
for word in array:
    if word.isalpha():
        result.append(word)
string = ' '.join(result)
Try this:
sentence = "[email protected] said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']
result = ' '.join(words)
# taken reports to the boss
You can use regular expressions, or you can use Python's built-in functions such as isalpha().
Example using isalpha():
with open('file path') as f:
    line = f.readline()
    words = line.split()
    for word in words:
        if word.isalpha():
            print(word + ' ', end='')
str.join() plus a comprehension gives you a one-line solution:
sentence = "[email protected] said: I've taken 2 reports to the boss"
' '.join([i for i in sentence.split() if i.isalpha()])
#'taken reports to the boss'
You can use split() and isalpha() to get a list of the words that contain only alphabetic characters and have at least one character:
>>> sentence = "[email protected] said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']
You can then use join() to turn that list back into a string:
>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
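Note that isalpha() is strict: a single non-letter character anywhere in the word disqualifies it, which is why "I've" and "2" are dropped along with the e-mail address. A quick illustration:

```python
# isalpha() returns True only when every character in the string is a letter
print("boss".isalpha())       # True
print("I've".isalpha())       # False: apostrophe
print("gmail.com".isalpha())  # False: dot
print("2".isalpha())          # False: digit
```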
The nltk package specializes in processing text, and gives you a variety of functions for "tokenizing" text into words. You can use RegexpTokenizer, or word_tokenize with slight modifications. The simplest and easiest is RegexpTokenizer:
import nltk
text = "[email protected] said: I've taken 2 reports to the boss. I didn't do the other things."
result = nltk.RegexpTokenizer(r'\w+').tokenize(text)
which returns:
['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']
Or you can use the slightly smarter word_tokenize, which is able to split most contractions, such as didn't into did and n't:
import re
import nltk
nltk.download('punkt') # You only have to do this once
def contains_letters(phrase):
return bool(re.search('[a-zA-Z]', phrase))
text = "[email protected] said: I've taken 2 reports to the boss. I didn't do the other things."
result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]
which returns:
['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']
Looks like a job for regular expressions. –