2017-09-29 60 views

Answers

3

To match words made up only of letters (and underscores) with a regular expression, you can do this:

import re 

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read() 

tokens = s.strip().split() 
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)] 
# ['taken', 'reports', 'to', 'the', 'boss'] 
clean_s = ' '.join(clean_tokens) 
# 'taken reports to the boss' 
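
Note that `[^\W\d]` also accepts underscores, and `*` allows an empty match. Below is a minimal sketch of a stricter variant, assuming you want letters only and at least one character. The helper name `letters_only` and the sample sentence are made up for illustration and are not part of the answer above.

```python
import re

# Letters only: exclude non-word characters, digits and the underscore,
# and require at least one character.
LETTERS_ONLY = re.compile(r'[^\W\d_]+')

def letters_only(text):
    """Return the whitespace-separated tokens of text that consist solely of letters."""
    return [t for t in text.split() if LETTERS_ONLY.fullmatch(t)]

sample = "snake_case said: I've taken 2 reports to the boss"
print(' '.join(letters_only(sample)))
```

Using `fullmatch` removes the need for the trailing `$` anchor in the pattern.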
0

Perhaps this will help:

# `string` holds the input text
array = string.split(' ')
result = []
for word in array:
    if word.isalpha():  # keep words made up only of letters
        result.append(word)
string = ' '.join(result)
2

Try this:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()] 
# ['taken', 'reports', 'to', 'the', 'boss'] 

result = ' '.join(words) 
# taken reports to the boss 
0

You can use a regular expression, or you can use one of Python's built-in string methods such as isalpha().

Example using isalpha():

result = ''
with open('file path') as f:
    line = f.readline()  # reads only the first line of the file
    a = line.split()
    for i in a:
        if i.isalpha():
            print(i + ' ', end='')
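
The snippet above looks at the first line only. Below is a rough sketch of how the same idea could cover every line of a file, assuming the file is UTF-8 text. The function name `alpha_words_in_file` is a placeholder, not something from the answer.

```python
def alpha_words_in_file(path):
    """Collect the purely alphabetic words from every line of the file at `path`."""
    words = []
    with open(path, encoding='utf-8') as f:
        for line in f:  # iterate over the file line by line
            words.extend(w for w in line.split() if w.isalpha())
    return words
```

You would then call something like `' '.join(alpha_words_in_file('text.txt'))`, where `text.txt` stands for your own file.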
0

str.join() plus a comprehension gives you a one-line solution:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
' '.join([i for i in sentence.split() if i.isalpha()]) 
#'taken reports to the boss' 
2

You can use split() and isalpha() to get a list of the words that consist only of alphabetic characters and contain at least one character.

>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()] 
>>> print(alpha_words) 
['taken', 'reports', 'to', 'the', 'boss'] 

You can then use join() to turn the list into a single string:

>>> alpha_only_string = " ".join(alpha_words) 
>>> print(alpha_only_string) 
taken reports to the boss 
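
One caveat: isalpha() rejects a word that has punctuation attached, such as "boss." at the end of a sentence. Below is a small sketch of one way to handle that, assuming you want to trim leading and trailing punctuation before the check. The function name `alpha_words` is illustrative only.

```python
import string

def alpha_words(sentence):
    """Keep words that are purely alphabetic after trimming punctuation at the edges."""
    stripped = (w.strip(string.punctuation) for w in sentence.split())
    return [w for w in stripped if w.isalpha()]

text = "I've taken 2 reports to the boss. I didn't do the other things."
print(" ".join(alpha_words(text)))
```

Words with punctuation inside them, like "I've" and "didn't", are still dropped, because strip() only touches the ends of each word.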
1

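The nltk package specializes in text processing and provides various functions for "tokenizing" text into words.

You can use either RegexpTokenizer or word_tokenize, with slight adaptation.

The easiest and simplest is RegexpTokenizer: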

import nltk 

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = nltk.RegexpTokenizer(r'\w+').tokenize(text) 

This returns:

`['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']` 
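
For a pattern as simple as `\w+`, the standard library's `re.findall` should give the same kind of token list without installing nltk. The sketch below assumes that is all you need; the extra isalpha() filter is an optional addition, not part of the answer, for dropping tokens such as '2'.

```python
import re

text = "I've taken 2 reports to the boss. I didn't do the other things."

# \w+ grabs each maximal run of letters, digits and underscores.
tokens = re.findall(r'\w+', text)

# Optionally drop tokens that are not purely alphabetic (for example '2').
alpha_tokens = [t for t in tokens if t.isalpha()]
print(alpha_tokens)
```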

Or you can use the slightly smarter word_tokenize, which is able to split most contractions, such as didn't into did and n't:

import re 
import nltk 
nltk.download('punkt') # You only have to do this once 

def contains_letters(phrase): 
    return bool(re.search('[a-zA-Z]', phrase)) 

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = [word for word in nltk.word_tokenize(text) if contains_letters(word)] 

This returns:

['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things'] 