To test Zipf's law, I need to write a Python script that removes every word containing non-alphabetic characters from a text file. For example, how do I turn
[email protected] said: I've taken 2 reports to the boss
into
taken reports to the boss
How should I proceed?
Using a regular expression that matches tokens made up only of letters (and underscores), you can do this:
import re
s = "[email protected] said: I've taken 2 reports to the boss"
# s = open('text.txt').read()
tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
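Since the end goal is testing Zipf's law, the cleaned tokens can be fed straight into a frequency count. Here is a minimal sketch using the standard library's collections.Counter on the example sentence (for a real test you would read your whole text file instead):

```python
import re
from collections import Counter

s = "[email protected] said: I've taken 2 reports to the boss"
tokens = [t for t in s.strip().split() if re.match(r'[^\W\d]*$', t)]

# Rank words by frequency; under Zipf's law, frequency is roughly
# proportional to 1/rank.
freq = Counter(tokens)
for rank, (word, count) in enumerate(freq.most_common(), start=1):
    print(rank, word, count)
```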
string = "[email protected] said: I've taken 2 reports to the boss"
array = string.split(' ')
result = []
for word in array:
    if word.isalpha():
        result.append(word)
string = ' '.join(result)
Try this:
sentence = "[email protected] said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']
result = ' '.join(words)
# taken reports to the boss
You can use regular expressions, or you can use Python's built-in functions such as isalpha().
Example using isalpha():
with open('file path') as f:
    line = f.readline()
    words = line.split()
    for word in words:
        if word.isalpha():
            print(word + ' ', end='')
str.join() plus a comprehension gives you a one-line solution:
sentence = "[email protected] said: I've taken 2 reports to the boss"
' '.join([i for i in sentence.split() if i.isalpha()])
#'taken reports to the boss'
You can use split() and isalpha() to get a list of the words that contain only alphabetic characters and have at least one character:
>>> sentence = "[email protected] said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']
You can then use join() to turn that list back into a string:
>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
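Note that isalpha() is strict: a single non-letter character anywhere in the word disqualifies it, which is why "I've" and "2" are dropped along with the e-mail address. A quick illustration:

```python
# isalpha() returns True only when every character in the string is a letter
print("boss".isalpha())       # True
print("I've".isalpha())       # False: apostrophe
print("gmail.com".isalpha())  # False: dot
print("2".isalpha())          # False: digit
```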
The nltk package specializes in processing text, and gives you a variety of functions for "tokenizing" text into words. You can use RegexpTokenizer, or word_tokenize with slight modifications. The simplest and easiest is RegexpTokenizer:
import nltk
text = "[email protected] said: I've taken 2 reports to the boss. I didn't do the other things."
result = nltk.RegexpTokenizer(r'\w+').tokenize(text)
which returns:
['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']
Or you can use the slightly smarter word_tokenize, which is able to split most contractions, such as didn't into did and n't:
import re
import nltk
nltk.download('punkt') # You only have to do this once
def contains_letters(phrase):
return bool(re.search('[a-zA-Z]', phrase))
text = "[email protected] said: I've taken 2 reports to the boss. I didn't do the other things."
result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]
which returns:
['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']
Looks like a job for regular expressions. –