2011-04-13 72 views
2

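Read a file and skip the header section of a text file in Python

I downloaded a book in plain-text format from gutenberg.org and am trying to read in the text, but I want to skip the beginning of the file and then parse the rest with the process function I wrote. How can I do this?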

Here is the start of the text file.

> The Project Gutenberg EBook of The Kama Sutra of Vatsyayana, by Vatsyayana 

This eBook is for the use of anyone anywhere at no cost and with 
almost no restrictions whatsoever. You may copy it, give it away or 
re-use it under the terms of the Project Gutenberg License included 
with this eBook or online at www.gutenberg.net 


Title: The Kama Sutra of Vatsyayana 
     Translated From The Sanscrit In Seven Parts With Preface, 
     Introduction and Concluding Remarks 

Author: Vatsyayana 

Translator: Richard Burton 
      Bhagavanlal Indrajit 
      Shivaram Parashuram Bhide 

Release Date: January 18, 2009 [EBook #27827] 

Language: English 


*** START OF THIS PROJECT GUTENBERG EBOOK THE KAMA SUTRA OF VATSYAYANA *** 




Produced by Bruce Albrecht, Carla Foust, Jon Noring and 
the Online Distributed Proofreading Team at 
http://www.pgdp.net 

And here is my code, which currently processes the whole file.

import string

def process_file(filename):
    """Opens a file and passes back a dict mapping each word to its count."""
    h = dict()
    fin = open(filename)
    for line in fin:
        process_line(line, h)
    return h

def process_line(line, h):
    # Treat hyphens as word separators.
    line = line.replace('-', ' ')

    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        word = word.lower()

        h[word] = h.get(word, 0) + 1

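Don't forget to close the file. You may want to use the 'with' keyword, i.e. 'with open(filename) as fin:'. When you exit the with context, the context manager closes the file for you. And upvoted nightcraker's answer. – 2011-04-13 19:51:18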
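As a minimal sketch of what that comment suggests, the question's process_file could be written with a with block so the file is closed even if processing raises an exception. The function names come from the question; only the with statement and the docstrings are new.

```python
import string

def process_line(line, h):
    """Update the word-count dict h with the words found on one line."""
    line = line.replace('-', ' ')
    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace).lower()
        h[word] = h.get(word, 0) + 1

def process_file(filename):
    """Return a dict mapping each word in the file to its count."""
    h = {}
    # The context manager closes fin when the block exits, even on error.
    with open(filename) as fin:
        for line in fin:
            process_line(line, h)
    return h
```

The behaviour is otherwise the same as the original: the whole file is still processed.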

Answers

3

Add this:

for line in fin:
    if "START OF THIS PROJECT GUTENBERG EBOOK" in line:
        break

just before your own "for line in fin:" loop.
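This approach works because a file object is its own iterator: a second for loop over the same object resumes at the line after the one where the first loop stopped. The sketch below illustrates that with io.StringIO standing in for an open file; the sample text is made up for the demonstration.

```python
import io

# Stand-in for an open text file; io.StringIO iterates line by line too.
fin = io.StringIO(
    "header line\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "first body line\n"
    "second body line\n"
)

# First loop: consume lines up to and including the marker line.
for line in fin:
    if "START OF THIS PROJECT GUTENBERG EBOOK" in line:
        break

# Second loop: continues from the line right after the marker.
body = [line.rstrip() for line in fin]
print(body)
```

If the marker never appears, the first loop reads to the end of the file and the second loop sees nothing, so a typo in the marker string silently produces an empty result.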

3

Well, you can just read the input until you hit your criterion for skipping the beginning:

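def process_file(filename):
    """Opens a file and passes back a dict mapping each word to its count."""
    h = dict()

    # The with block closes the file automatically when it exits.
    with open(filename) as fin:
        # Skip everything up to and including the start marker.
        for line in fin:
            if line.rstrip() == "*** START OF THIS PROJECT GUTENBERG EBOOK THE KAMA SUTRA OF VATSYAYANA ***":
                break

        # Process the remaining lines.
        for line in fin:
            process_line(line, h)

    return h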

Note that in this example I used line.rstrip() == "*** START OF THIS PROJECT GUTENBERG EBOOK THE KAMA SUTRA OF VATSYAYANA ***" as a strict criterion, but you are free to set your own.
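One looser criterion, sketched here under some assumptions, is to match only the start of the marker line so the same code works for other books. The helper name body_lines and both prefix defaults are hypothetical, and the "*** END OF" footer marker is an assumption: it does not appear in the excerpt quoted in the question.

```python
import io

def body_lines(lines, start_prefix="*** START OF", end_prefix="*** END OF"):
    """Yield the lines between the start and end marker lines (exclusive).

    Both prefixes are assumptions about the file layout; adjust as needed.
    """
    lines = iter(lines)  # lets the two loops share one position, even for a list
    # Skip the header, up to and including the start marker line.
    for line in lines:
        if line.startswith(start_prefix):
            break
    # Yield body lines until the end marker (or the end of the input).
    for line in lines:
        if line.startswith(end_prefix):
            return
        yield line

# Made-up sample standing in for an open file.
sample = io.StringIO(
    "Title: Example\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "first line\n"
    "second line\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "license text\n"
)
result = list(body_lines(sample))
```

Inside process_file, the second loop could then become "for line in body_lines(fin): process_line(line, h)".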