Using Python with a .txt file

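I have downloaded the following dictionary from Project Gutenberg: http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB, so avoid clicking the link if you are on a slow connection).

In the file, the keywords I am looking for are in uppercase, for example HALLUCINATION. Each keyword is followed by a few lines dedicated to pronunciation, which are of no use to me.

What I want to extract is the definition, which is marked by "Defn:", and then print its lines. I have come up with this rather ugly 'solution':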

def lookup(search):
    find = search.upper()                # transform the search parameter to all uppercase
    output = []                          # empty dummy list
    infile = open('webster.txt', 'r')    # open the webster file for reading
    for line in infile:
        for part in line.split():
            if (find == part):
                for line in infile:
                    if (line.find("Defn:") == 0):    # ugly I know, but my only guess so far
                        output.append(line[6:])
                        print(output)    # uncertain about how to proceed
                        break

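Of course, this only prints the first line that comes right after "Defn:". I am new to working with .txt files in Python, so I am clueless about how to proceed. I did read the lines into a tuple and noticed that they contain special newline characters.

So I suppose I want to tell Python to keep reading until it runs out of newline characters, while still reading the last line of the definition.

Could someone please point me to useful functions I might be able to use to solve this problem? A minimal example would be appreciated.

Example of the desired output: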

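lookup("hallucinate")

Out: To wander; to go astray; to err; to blunder; -- used of mental processes. [R.] Byron.

lookup("hallucination")

Out: The perception of objects which have no reality, or of \r\n sensations which have no corresponding external cause, arising from \r\n disorder or the nervous system, as in delirium tremens; delusion. \r\n Hallucinations are always evidence of cerebral derangement and are common phenomena of insanity. W. A. Hammond.

From the text: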

HALLUCINATE 
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of 
hallucinari, alucinari, to wander in mind, talk idly, dream.] 

Defn: To wander; to go astray; to err; to blunder; -- used of mental 
processes. [R.] Byron. 

HALLUCINATION 
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.] 

1. The act of hallucinating; a wandering of the mind; error; mistake; 
a blunder. 
This must have been the hallucination of the transcriber. Addison. 

2. (Med.) 

Defn: The perception of objects which have no reality, or of 
sensations which have no corresponding external cause, arising from 
disorder or the nervous system, as in delirium tremens; delusion. 
Hallucinations are always evidence of cerebral derangement and are 
common phenomena of insanity. W. A. Hammond. 

HALLUCINATOR 
Hal*lu"ci*na`tor, n. Etym: [L.]

2014-10-20 Spaced

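Why not use 'urllib' to access the file? – Beginner 2014-10-20 17:12:23

@Beginner, I did not know about that function; I have only been coding in Python for 3 weeks :-) But thanks for mentioning it, I will have to google it. Accessing the file is not my problem, though; 'reading' it is. – Spaced 2014-10-20 17:13:37

@Beginner: Did the OP ask about fetching the file? No.. – RickyA 2014-10-20 17:13:44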

Here is a function that returns the first definition of a term:

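def lookup(word):
    word_upper = word.upper()
    found_word = False    # True once the headword line has been seen
    found_def = False     # True once the "Defn:" line has been seen
    defn = ''
    with open('dict.txt', 'r') as file:
        for line in file:
            l = line.strip()
            if not found_word and l == word_upper:
                found_word = True
            elif found_word and not found_def and l.startswith("Defn:"):
                found_def = True
                defn = l[6:]    # skip the six characters "Defn: "
            elif found_def and l != '':
                defn += ' ' + l
            elif found_def and l == '':
                return defn
    # also covers a definition that runs to the end of the file without a blank line
    return defn if found_def else False

print(lookup('hallucination'))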

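Explanation: we have to consider four different cases.

1. We have not found the word yet. We compare the current line to the word we are looking for, in uppercase. If they are equal, we have found the word.
2. We have found the word, but not yet the start of the definition. We therefore look for a line that starts with Defn:. If we find one, we add the line to the definition (excluding the six characters of "Defn: ").
3. We have already found the start of the definition. In this case, we just add the line to the definition.
4. We have already found the start of the definition and the current line is empty. The definition is complete and we return it.

If we find nothing, we return False.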

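Note: some entries, e.g. CRANE, have multiple definitions. The code above cannot handle that; it only returns the first definition. However, given the format of the file, writing a perfect solution is far from easy.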
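One possible way to collect every definition of a term is sketched below. It is not part of the original answer and rests on assumptions: a headword is taken to be any line written entirely in uppercase, repeated headwords (as with CRANE) are assumed to be adjacent, and each definition is assumed to end at a blank line. The name `lookup_all` and the `lines` parameter are made up for this illustration; the sample text is the HALLUCINATE entry quoted in the question.

```python
import io

def lookup_all(term, lines):
    """Return every 'Defn:' paragraph found under the headword `term`.

    `lines` is any iterable of text lines, e.g. an open file.
    """
    target = term.upper()
    definitions = []
    current = None      # lines of the definition being collected, or None
    in_entry = False    # True once the headword line has been seen
    for raw in lines:
        line = raw.strip()
        if current is not None:
            if line:
                current.append(line)            # definition continues
            else:
                definitions.append(" ".join(current))
                current = None                  # blank line ends the definition
        elif not in_entry:
            in_entry = (line == target)
        elif line.startswith("Defn:"):
            current = [line[len("Defn:"):].strip()]
        elif line.isupper() and line != target:
            break                               # assumed next headword: stop
    if current is not None:                     # definition ran to end of input
        definitions.append(" ".join(current))
    return definitions

SAMPLE = """HALLUCINATE
Hal*lu"ci*nate, v. i.

Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.

HALLUCINATION
"""

# Prints a one-element list holding the HALLUCINATE definition.
print(lookup_all("hallucinate", io.StringIO(SAMPLE)))
```

With the real file, the same function could be called as `lookup_all('crane', open_file)` inside a `with open(...)` block. A term that is not found yields an empty list rather than False.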

2014-10-20 17:43:38

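From here I learned an easy way to handle memory-mapped files and use them as if they were strings. You can then use something like this to get the first definition of a term.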

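import mmap

def lookup(search):
    term = search.upper().encode('utf-8')    # mmap exposes bytes, so search with bytes
    with open('webster.txt', 'rb') as f:
        s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # a headword sits on its own line, directly after a blank line
        index = s.find(b'\r\n\r\n' + term + b'\r\n')
        if index == -1:
            return None
        definition = s.find(b'Defn:', index) + len(b'Defn:') + 1
        endline = s.find(b'\r\n\r\n', definition)
        return s[definition:endline].decode('utf-8', errors='replace')

print(lookup('hallucination'))
print(lookup('hallucinate'))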

Assumptions:

There is at least one definition per term. If there is more than one, only the first is returned.

2014-10-20 17:42:51 dreyescat

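I will have to read up a lot to understand it, but it looks like a good approach. Is there a way to make the lookup 'unique'? Meaning that it finds the exact word; for example, lookup("vaccination") returned the definition of an anti-vaccination entry. – Spaced 2014-10-20 17:53:38

Assuming every term comes right after a double \r\n, we can find the specific term. See my edit. – dreyescat 2014-10-20 18:00:08

This will also find partial matches – 2014-10-20 18:26:27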

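You can split the text into paragraphs and, using the index of the search word, find the first paragraph after it that starts with Defn:

import re

def find_def(path, word):
    # newline='' keeps the file's \r\n line endings, which the searches below rely on
    with open(path, newline='', encoding='utf-8', errors='replace') as f:
        lines = f.read()
    try:
        # the leading \r\n makes the whole headword line match, not just its tail
        start = lines.index("\r\n{}\r\n".format(word))
    except ValueError:
        return "Cannot find search term"
    # split into paragraphs; maxsplit=10 because no entry has a longer run of paragraphs before its definition
    paras = re.split(r"\s+\r\n", lines[start:], 10)
    for para in paras:
        if para.startswith("Defn:"):    # this paragraph is the definition we need
            return para

print(find_def("in.txt", "HALLUCINATION"))

Using the whole file, this returns: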

In [5]: print(find_def("gutt.txt","VACCINATOR"))
Defn: One who, or that which, vaccinates. 

In [6]: print(find_def("gutt.txt","HALLUCINATION"))
Defn: The perception of objects which have no reality, or of 
sensations which have no corresponding external cause, arising from 
disorder or the nervous system, as in delirium tremens; delusion. 
Hallucinations are always evidence of cerebral derangement and are 
common phenomena of insanity. W. A. Hammond.

A slightly shorter version:

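import re

def find_def(path, word):
    # newline='' keeps the file's \r\n line endings, which the searches below rely on
    with open(path, newline='', encoding='utf-8', errors='replace') as f:
        lines = f.read()
    try:
        start = lines.index("\r\n{}\r\n".format(word))
    except ValueError:
        return "Cannot find search term"
    defn = lines.index("Defn:", start)    # position of the first "Defn:" after the headword
    return re.split(r"\s+\r\n", lines[defn:], 1)[0]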
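Substring searches like the ones above depend on the exact line endings around the headword. As an alternative, and only as a sketch that is not part of the original answers, a regular expression anchored to whole lines can require an exact headword match with either \r\n or \n endings. The function name `lookup_exact` is made up for this illustration, and it keeps the earlier assumption that every term has at least one definition (otherwise the next term's definition is returned).

```python
import re

# A definition runs from "Defn:" up to the next blank line or the end of the text.
DEFN = re.compile(r"Defn:\s*(.*?)(?:\r?\n[ \t]*\r?\n|\Z)", re.DOTALL)

def lookup_exact(term, text):
    """Return the first definition after the exact headword line, or None."""
    # ^ and $ match at line boundaries with re.MULTILINE; \r? absorbs a
    # carriage return left over from \r\n line endings.
    headword = re.compile(
        r"^" + re.escape(term.upper()) + r"[ \t]*\r?$", re.MULTILINE
    )
    match = headword.search(text)
    if match is None:
        return None
    defn = DEFN.search(text, match.end())    # search only after the headword
    if defn is None:
        return None
    return " ".join(defn.group(1).split())   # collapse line breaks to spaces

SAMPLE = (
    "HALLUCINATE\r\n"
    "Hal*lu\"ci*nate, v. i.\r\n"
    "\r\n"
    "Defn: To wander; to go astray; to err; to blunder; -- used of mental\r\n"
    "processes. [R.] Byron.\r\n"
    "\r\n"
    "HALLUCINATOR\r\n"
    "Hal*lu\"ci*na`tor, n. Etym: [L.]\r\n"
)

print(lookup_exact("hallucinate", SAMPLE))   # the HALLUCINATE definition on one line
print(lookup_exact("hallucinat", SAMPLE))    # None: partial headwords do not match
```

For the full dictionary, `text` would be the file contents read with `newline=''` so that the original line endings are preserved.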

2014-10-20 18:05:17
