2016-11-29 64 views
1

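Extracting sentences containing specific words using pandas

I have an Excel file with a text column. For each row, I need to extract the sentences in the text column that contain specific words.

I tried defining a function.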

import pandas as pd 
from nltk.tokenize import sent_tokenize 
from nltk.tokenize import word_tokenize 

#################Reading in excel file##################### 

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function ##################### 

def sentence_finder(text,word): 
    sentences=sent_tokenize(text) 
    return [sent for sent in sentences if word in word_tokenize(sent)] 
################# Finding Context ########################## 
str_df['context'] = str_df['text'].apply(sentence_finder,args=('snakes',)) 

################# Output file ################################# 
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")

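But can someone help me with the case where I have to find sentences containing any of several specific words, such as snakes, venomous, anaconda? A sentence should contain at least one of the words. I can't work out how to handle multiple words with nltk.tokenize.

Words to be searched: words = ['snakes','venomous','anaconda']

Input Excel file: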

    text 
    1. Snakes are venomous. Anaconda is venomous. 
    2. Anaconda lives in Amazon.Amazon is a big forest. It is venomous. 
    3. Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an anaconda.Because it is venomous. 
    4. Python is dangerous too. 

Desired output:

A column called context appended next to the text column above. The context column should look like this:

1. [Snakes are venomous.] [Anaconda is venomous.] 
2. [Anaconda lives in Amazon.] [It is venomous.] 
3. [Snakes,snakes,snakes everywhere!] [The least I expect is an anaconda.Because it is venomous.] 
4. NULL 

Thanks in advance.

+0

Please post a [mcve](http://stackoverflow.com/help/mcve) of your 'str_df' along with your desired output. –

+1

@JulienMarrec Edited. Thanks. – user7140275

+0

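Your third example has two sentences joined by 'Because', which suggests you want coreference resolution, and that is not easy. If you only need to extract sentences, it is much easier (i.e. split the text on !). Also, please show your current output, even if it is wrong. –

Answers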

1

Here's how:

In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text) 
             if any(True for w in word_tokenize(sent) 
               if w.lower() in searched_words)]) 

0 [Snakes are venomous., Anaconda is venomous.] 
1 [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.] 
2 [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.] 
3 [] 
Name: text, dtype: object 

You can see there are a couple of issues, because sent_tokenize didn't split the text properly where the punctuation is not followed by a space.
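One possible workaround for the missing-space problem is to normalise the text before tokenizing it. The sketch below is not part of the original answer: `add_missing_spaces` is a hypothetical helper that inserts a space after `.`, `!` or `?` whenever a letter follows immediately. It assumes the text has no abbreviations such as `U.S.A`, which it would also split.

```python
import re

# Hypothetical helper, not from the original answer.
# Matches '.', '!' or '?' that is immediately followed by an ASCII letter.
_MISSING_SPACE = re.compile(r'([.!?])(?=[A-Za-z])')

def add_missing_spaces(text):
    """Insert a space after sentence-ending punctuation glued to the next word."""
    return _MISSING_SPACE.sub(r'\1 ', text)

print(add_missing_spaces("Anaconda lives in Amazon.Amazon is a big forest. It is venomous."))
print(add_missing_spaces("Mummyyyyyyy!!!The least I expect is an anaconda.Because it is venomous."))
```

Decimal numbers such as 3.5 are left alone, because only a letter after the punctuation triggers the substitution. The cleaned text can then be passed to sent_tokenize as before.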


Update: handling plurals.

Here's an updated df:

text 
Snakes are venomous. Anaconda is venomous. 
Anaconda lives in Amazon. Amazon is a big forest. It is venomous. 
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous. 
Python is dangerous too. 
I have snakes 


df = pd.read_clipboard(sep='0') 

We can use a stemmer (Wikipedia), such as the PorterStemmer

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

First, let's stem and lowercase the searched words:

searched_words = ['snakes','Venomous','anacondas'] 
searched_words = [stemmer.stem(w.lower()) for w in searched_words] 
searched_words 

> ['snake', 'venom', 'anaconda'] 

Now we can rework the code above to include stemming as well:

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text) 
          if any(True for w in word_tokenize(sent) 
            if stemmer.stem(w.lower()) in searched_words)])) 

0 [Snakes are venomous., Anaconda is venomous.] 
1 [Anaconda lives in Amazon., It is venomous.] 
2 [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.] 
3 [] 
4 [I have snakes] 
Name: text, dtype: object 

If you only want substring matching, make sure searched_words contains singular forms, not plurals.

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text) 
          if any([(w2.lower() in w.lower()) for w in word_tokenize(sent) 
            for w2 in searched_words]) 
           ]) 
) 

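By the way, this is the point where I'd probably write a function with a regular for loop, because this lambda with list comprehensions is getting out of hand.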
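A minimal sketch of what such a function might look like, using plain for loops and substring matching. The names `find_context`, `simple_sentences` and `simple_words` are hypothetical and not from the original answer. The two regex-based splitters are naive stand-ins for NLTK's sent_tokenize and word_tokenize, used here only so that the example runs without NLTK; they can be swapped out through the function's parameters.

```python
import re

def simple_sentences(text):
    """Naive stand-in for sent_tokenize: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_words(sentence):
    """Naive stand-in for word_tokenize: runs of letters, digits and underscores."""
    return re.findall(r'\w+', sentence)

def find_context(text, searched_words,
                 split_sentences=simple_sentences, split_words=simple_words):
    """Return the sentences of `text` with a token containing any searched word."""
    searched = [w.lower() for w in searched_words]
    matches = []
    for sentence in split_sentences(text):
        for token in split_words(sentence):
            token = token.lower()
            # Substring match: 'snake' matches the token 'snakes'.
            if any(word in token for word in searched):
                matches.append(sentence)
                break  # one hit is enough, move on to the next sentence
    return matches

searched_words = ['snake', 'venom', 'anaconda']
print(find_context("Snakes are venomous. Python is dangerous too.", searched_words))
```

With a DataFrame this could be applied as df['text'].apply(find_context, args=(searched_words,)), and the NLTK tokenizers can be used instead of the stand-ins by calling find_context(text, searched_words, sent_tokenize, word_tokenize).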

+0

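Thanks, it works. Yeah, I even ran into a problem with text like "Snakes are also venomous.Python". I expected the output to be [Snakes are venomous], but I got [Snakes are also venomous.Python], because there is no space at the start of the next sentence. – user7140275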

+0

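Is there also a way to pick up sentences containing 'snake' even if I gave 'snakes' in the word list? What I need is substring matching against the specified words, so that I don't lose any data for analysing the context. – user7140275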

+0

Yes, I'll update accordingly –