2011-02-23 56 views

I have written the following code to count the number of sentences, words, and characters in the input file sample.txt, which contains a paragraph of text. It counts sentences and words correctly, but it does not give an accurate count of the characters (excluding whitespace and punctuation).

lines, blanklines, sentences, words = 0, 0, 0, 0
num_chars = 0

print '-'*50

try:
    filename = 'sample.txt'
    textf = open(filename, 'r')
except IOError:
    print 'cannot open file %s for reading' % filename
    import sys
    sys.exit(0)

for line in textf:
    print line
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    else:
        sentences += line.count('.') + line.count('!') + line.count('?')
        tempwords = line.split(None)
        print tempwords
        words += len(tempwords)

textf.close()

print '-'*50
print 'Lines:', lines
print 'Blank lines:', blanklines
print 'Sentences:', sentences
print 'Words:', words

import nltk
import nltk.data
import nltk.tokenize

with open('sample.txt', 'r') as f:
    for line in f:
        num_chars += len(line)

num_chars = num_chars - (words + 1)

pcount = 0
from nltk.tokenize import TreebankWordTokenizer
with open('sample.txt', 'r') as f1:
    for line in f1:
        #tokenised_words = nltk.tokenize.word_tokenize(line)
        tokenizer = TreebankWordTokenizer()
        tokenised_words = tokenizer.tokenize(line)
        for w in tokenised_words:
            if ((w == '.') | (w == ';') | (w == '!') | (w == '?')):
                pcount = pcount + 1
print "pcount:", pcount
num_chars = num_chars - pcount
print "characters:", num_chars

Here pcount is the number of punctuation tokens. Can someone suggest what changes I need to make to get the exact number of characters, excluding whitespace and punctuation?


Is this homework? If not, I'm pretty sure a few lines of shell script would get you the answer. – 2011-02-23 17:56:49

Answers


One thing you can do is increment the character count as you read each line, iterating over its characters:

for character in line:
    if character.isalnum():
        num_chars += 1

P.S. You may want to change the condition of the if statement to suit your particular needs, for example if you also want to count '$'.
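For instance, a minimal sketch of such a customized condition (the helper `count_chars` and the extra character set `EXTRA` are assumptions for illustration, not part of the original answer):

```python
# Characters to count in addition to letters and digits (assumed set)
EXTRA = set('$%')

def count_chars(line):
    """Count alphanumeric characters plus any characters in EXTRA."""
    num_chars = 0
    for character in line:
        if character.isalnum() or character in EXTRA:
            num_chars += 1
    return num_chars
```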


You can also use a regular expression to replace all non-alphanumeric characters and then count the characters in each line.
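A sketch of this approach (the helper name `count_alnum_chars` and the pattern `[^A-Za-z0-9]` are assumptions used for illustration):

```python
import re

def count_alnum_chars(path):
    """Count the characters in a file, ignoring whitespace and punctuation."""
    total = 0
    with open(path) as f:
        for line in f:
            # Strip everything that is not a letter or digit, count what remains
            total += len(re.sub(r'[^A-Za-z0-9]', '', line))
    return total
```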

import string

#
# Per-line counting functions
#
def countLines(ln):      return 1
def countBlankLines(ln): return 0 if ln.strip() else 1
def countWords(ln):      return len(ln.split())

def charCounter(validChars):
    vc = set(validChars)
    def counter(ln):
        return sum(1 for ch in ln if ch in vc)
    return counter

countSentences = charCounter('.!?')
countLetters   = charCounter(string.letters)
countPunct     = charCounter(string.punctuation)

#
# do counting
#
class FileStats(object):
    def __init__(self, countFns, labels=None):
        super(FileStats, self).__init__()
        self.fns = countFns
        self.labels = labels if labels else [fn.__name__ for fn in countFns]
        self.reset()

    def reset(self):
        self.counts = [0]*len(self.fns)

    def doFile(self, fname):
        try:
            with open(fname) as inf:
                for line in inf:
                    for i, fn in enumerate(self.fns):
                        self.counts[i] += fn(line)
        except IOError:
            print('Could not open file {0} for reading'.format(fname))

    def __str__(self):
        return '\n'.join('{0:20} {1:>6}'.format(label, count) for label, count in zip(self.labels, self.counts))

fs = FileStats(
    (countLines, countBlankLines, countSentences, countWords, countLetters, countPunct),
    ("Lines", "Blank Lines", "Sentences", "Words", "Letters", "Punctuation")
)
fs.doFile('sample.txt')
print(fs)

Result

Lines     101 
Blank Lines    12 
Sentences    48 
Words     339 
Letters    1604 
Punctuation    455 

Try this for counting the numbers of words and sentences, and for getting the probability of a given word:

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize


text_file = open("..//..//static//output.txt", "r")
lines = text_file.readlines()
x = 0
total_words = 0
tokenized_words = [word_tokenize(i) for i in lines]
for i in tokenized_words:

    print(i)            # list of tokens for this line
    print(str(len(i)))  # word count for this line
    total_words += len(i)

    for j in i:
        if j == 'words':  # simple count of occurrences of the word 'words'
            x = x + 1

tokenized_sents = [sent_tokenize(k) for k in lines]

for k in tokenized_sents:
    print("Sentences" + str(k))                  # list of sentences for this line
    print("number of sentences " + str(len(k)))  # sentence count for this line

print("number of words " + str(x))
print("Probability of 'words' in text file " + str(x / total_words))