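Python unicode search not giving the correct answer

I am trying to search for Hindi words, stored one per line in file 1, and find them in the lines of file 2. I have to print the line numbers along with the number of words found in each line. Here is the code: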

import codecs

hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []

for counter, line in enumerate(hypernyms):
    count_arr.append(0)
    for word in words:
        if line.find(word) >= 0:
            count_arr[counter] += 1

for iterator, count in enumerate(count_arr):
    if count > 0:
        print iterator, ' ', count

This finds some words and ignores others. The input files are as follows. File 1:

पौधा 
वनस्पति

File 2:

वनस्पति, पेड़-पौधा 
वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग 
पादप_समूह, पेड़-पौधे, वनस्पति_समूह 
पेड़-पौधा

This gives the output:

0 1 
3 1

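Clearly, it ignores वनस्पति and only searches for पौधा. I have tried other inputs as well, and it only ever searches for one word. Any ideas on how to correct this?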

2012-04-07 rarora7777

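It is because you did not strip the "\n" character from the end of each line, so you are searching for "some_pattern\n" instead of "some_pattern". Use the strip() function to chop it off, like this: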

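import codecs

# Strip surrounding whitespace (including the trailing newline) from each word.
words = [word.strip() for word in codecs.open("hypernyms_en2hi.txt", "r", "utf-8")]
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8")
count_arr = []

for line in hypernyms:
    count_arr.append(0)
    for word in words:
        # (word in line) is a bool, which adds as 0 or 1.
        count_arr[-1] += (word in line)

# Print the index and count of every line that matched at least one word.
for iterator, count in enumerate(count_arr):
    if count:
        print iterator, ' ', count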
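
For readers on Python 3, the same strip-then-search idea can be sketched as a small function. This is only an illustration: the name count_matches is hypothetical, the sample lists stand in for the two files as readlines() would return them, and blank words are skipped on the assumption that an empty search term is never wanted (an empty string is a substring of every line).

```python
def count_matches(lines, words):
    """Return (line_index, count) pairs for lines containing at least one word."""
    # Strip whitespace, including the trailing newline, and drop blank entries.
    cleaned = [w.strip() for w in words if w.strip()]
    results = []
    for index, line in enumerate(lines):
        # Count how many of the cleaned words occur as substrings of this line.
        count = sum(1 for w in cleaned if w in line)
        if count:
            results.append((index, count))
    return results


# Sample data from the question, with the newlines readlines() would keep.
words = ["पौधा \n", "वनस्पति\n"]
lines = [
    "वनस्पति, पेड़-पौधा \n",
    "वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग \n",
    "पादप_समूह, पेड़-पौधे, वनस्पति_समूह \n",
    "पेड़-पौधा\n",
]

for index, count in count_matches(lines, words):
    print(index, count)
```

Note that this is still a substring search, so वनस्पति also matches inside वनस्पति_समूह. Whether that is desirable depends on the data.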

2012-04-07 11:20:11

I think the problem is here:

words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()

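.readlines() leaves the trailing newline in place, so you are not searching for पौधा, you are searching for पौधा\n, and you will only match at the end of a line. If I use .read().split() instead, I get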

0 2 
2 1 
3 1
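
The difference between the two calls can be seen without any files. The sketch below is written for Python 3 and uses io.StringIO as a hypothetical in-memory stand-in for the word file. The contents are assumed for illustration and are not the poster's actual files.

```python
import io

# Hypothetical in-memory stand-in for the word file, one word per line.
word_file = io.StringIO("पौधा\nवनस्पति\n")
raw_words = word_file.readlines()       # each entry keeps its trailing "\n"

word_file.seek(0)
split_words = word_file.read().split()  # splits on whitespace, newlines dropped

# A line in which a space follows the last word, before the newline.
line = "वनस्पति, पेड़-पौधा \n"

raw_hits = [w in line for w in raw_words]
split_hits = [w in line for w in split_words]

print(repr(raw_words), raw_hits)
print(repr(split_words), split_hits)
```

Because .split() breaks on any whitespace, it would also split an entry that contains a space. For strictly one-entry-per-line files, stripping each line keeps such entries whole.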

2012-04-07 10:59:03 DSM

Thanks. That was the problem. I am very new to Python. – rarora7777 2012-04-07 12:37:04

Add this code and you will see why this happens: it is because of a space. The first word in file 1 is पौधा[space] ...

for i in hypernyms: 
    print "file1",i 

for i in words: 
    print "file2",i

Place it after count_arr = [] and before the for counter, line ... loop.
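
A trailing space or newline is hard to spot in plain print output. One possible variation on the same debugging idea, sketched for Python 3, prints the repr() of each entry so that invisible characters show up inside the quotes. The helper name visible and the sample list are hypothetical.

```python
def visible(items):
    """Return repr() of each item so trailing spaces and newlines can be seen."""
    # repr() escapes the newline as \n and keeps trailing spaces inside the quotes.
    return [repr(item) for item in items]


# Sample entries as readlines() might return them (assumed for illustration).
words = ["पौधा \n", "वनस्पति\n"]

for shown in visible(words):
    print("file1", shown)
```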

2012-04-07 11:33:58 TLSK
