2011-02-23 56 views

I have written the following code to count the number of sentences, words, and characters in the input file sample.txt, which contains a paragraph of text. It counts sentences and words correctly, but it does not give an accurate count of the characters (excluding whitespace and punctuation).

lines, blanklines, sentences, words = 0, 0, 0, 0
num_chars = 0

print '-'*50

try:
    filename = 'sample.txt'
    textf = open(filename, 'r')
except IOError:
    print 'cannot open file %s for reading' % filename
    import sys
    sys.exit(0)

for line in textf:
    print line
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    else:
        sentences += line.count('.') + line.count('!') + line.count('?')
        tempwords = line.split(None)
        print tempwords
        words += len(tempwords)

textf.close()

print '-'*50
print 'Lines:', lines
print 'Blank lines:', blanklines
print 'Sentences:', sentences
print 'Words:', words

import nltk
import nltk.data
import nltk.tokenize

with open('sample.txt', 'r') as f:
    for line in f:
        num_chars += len(line)

num_chars = num_chars - (words + 1)

pcount = 0
from nltk.tokenize import TreebankWordTokenizer
with open('sample.txt', 'r') as f1:
    for line in f1:
        #tokenised_words = nltk.tokenize.word_tokenize(line)
        tokenizer = TreebankWordTokenizer()
        tokenised_words = tokenizer.tokenize(line)
        for w in tokenised_words:
            if ((w == '.') | (w == ';') | (w == '!') | (w == '?')):
                pcount = pcount + 1
print "pcount:", pcount
num_chars = num_chars - pcount
print "characters:", num_chars

Here pcount is the number of punctuation tokens. Can someone suggest what changes I need to make to get the exact number of characters, excluding whitespace and punctuation?


Is this homework? If not, I'm pretty sure a few lines of shell script would get you the answer. – 2011-02-23 17:56:49

Answers


One thing you can do is increment the character count as you read each line, iterating over its characters:

for character in line:
    if character.isalnum():
        num_chars += 1

P.S. You may want to change the condition of the if statement to suit your particular needs, for example if you also want to count '$'.
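For instance, a minimal sketch of such a customized condition (the helper `count_chars` and the extra character set `EXTRA` are assumptions for illustration, not part of the original answer):

```python
# Characters to count in addition to letters and digits (assumed set)
EXTRA = set('$%')

def count_chars(line):
    """Count alphanumeric characters plus any characters in EXTRA."""
    num_chars = 0
    for character in line:
        if character.isalnum() or character in EXTRA:
            num_chars += 1
    return num_chars
```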


You can also use a regular expression to replace all non-alphanumeric characters and then count the characters in each line.
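A sketch of this approach (the helper name `count_alnum_chars` and the pattern `[^A-Za-z0-9]` are assumptions used for illustration):

```python
import re

def count_alnum_chars(path):
    """Count the characters in a file, ignoring whitespace and punctuation."""
    total = 0
    with open(path) as f:
        for line in f:
            # Strip everything that is not a letter or digit, count what remains
            total += len(re.sub(r'[^A-Za-z0-9]', '', line))
    return total
```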

import string

#
# Per-line counting functions
#
def countLines(ln):      return 1
def countBlankLines(ln): return 0 if ln.strip() else 1
def countWords(ln):      return len(ln.split())

def charCounter(validChars):
    vc = set(validChars)
    def counter(ln):
        return sum(1 for ch in ln if ch in vc)
    return counter

countSentences = charCounter('.!?')
countLetters   = charCounter(string.letters)
countPunct     = charCounter(string.punctuation)

#
# do counting
#
class FileStats(object):
    def __init__(self, countFns, labels=None):
        super(FileStats, self).__init__()
        self.fns = countFns
        self.labels = labels if labels else [fn.__name__ for fn in countFns]
        self.reset()

    def reset(self):
        self.counts = [0]*len(self.fns)

    def doFile(self, fname):
        try:
            with open(fname) as inf:
                for line in inf:
                    for i, fn in enumerate(self.fns):
                        self.counts[i] += fn(line)
        except IOError:
            print('Could not open file {0} for reading'.format(fname))

    def __str__(self):
        return '\n'.join('{0:20} {1:>6}'.format(label, count) for label, count in zip(self.labels, self.counts))

fs = FileStats(
    (countLines, countBlankLines, countSentences, countWords, countLetters, countPunct),
    ("Lines", "Blank Lines", "Sentences", "Words", "Letters", "Punctuation")
)
fs.doFile('sample.txt')
print(fs)

Result

Lines     101 
Blank Lines    12 
Sentences    48 
Words     339 
Letters    1604 
Punctuation    455 

Try this for counting the numbers of words and sentences, and for getting the probability of a given word:

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize


text_file = open("..//..//static//output.txt", "r")
lines = text_file.readlines()
x = 0
total_words = 0
tokenized_words = [word_tokenize(i) for i in lines]
for i in tokenized_words:

    print(i)            # list of tokens for this line
    print(str(len(i)))  # word count for this line
    total_words += len(i)

    for j in i:
        if j == 'words':  # simple count of occurrences of the word 'words'
            x = x + 1

tokenized_sents = [sent_tokenize(k) for k in lines]

for k in tokenized_sents:
    print("Sentences" + str(k))                  # list of sentences for this line
    print("number of sentences " + str(len(k)))  # sentence count for this line

print("number of words " + str(x))
print("Probability of 'words' in text file " + str(x / total_words))