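Extracting sentences containing specific words using pandas

I have an Excel file with a text column. For each row, I need to extract the sentences in the text column that contain specific words.

I tried doing this by defining a function.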

import pandas as pd 
from nltk.tokenize import sent_tokenize 
from nltk.tokenize import word_tokenize 

#################Reading in excel file##################### 

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx") 

################# Defining a function ##################### 

def sentence_finder(text,word): 
    sentences=sent_tokenize(text) 
    return [sent for sent in sentences if word in word_tokenize(sent)] 
################# Finding Context ########################## 
str_df['context'] = str_df['text'].apply(sentence_finder,args=('snakes',)) 

################# Output file ################################# 
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")

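But could someone help me with the case where I have to find sentences containing any of several specific words, such as snakes, venomous, anaconda? A sentence should be kept if it contains at least one of the words. I can't work out how to handle multiple words with nltk.tokenize.

Words to be searched: words = ['snakes','venomous','anaconda']

Input Excel file: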

    text 
    1. Snakes are venomous. Anaconda is venomous. 
    2. Anaconda lives in Amazon.Amazon is a big forest. It is venomous. 
    3. Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an anaconda.Because it is venomous. 
    4. Python is dangerous too.

Desired output:

A column called context appended next to the text column above. The context column should look like this:

1. [Snakes are venomous.] [Anaconda is venomous.] 
2. [Anaconda lives in Amazon.] [It is venomous.] 
3. [Snakes,snakes,snakes everywhere!] [The least I expect is an anaconda.Because it is venomous.] 
4. NULL

Thanks in advance.

Source

2016-11-29 user7140275

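Please post a [mcve](http://stackoverflow.com/help/mcve) of your 'str_df', along with your desired output. –

@JulienMarrec Edited. Thanks. – user7140275

Your third example keeps two sentences together around 'Because', which suggests you want coreference resolution, and that is not easy. It is much easier if you only need to extract sentences (i.e. split the text on !). Also, please show your current output, even if it is wrong. –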

Here's how:

In [1]: searched_words = ['snakes', 'venomous', 'anaconda']

In [2]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text) 
             if any(True for w in word_tokenize(sent) 
               if w.lower() in searched_words)]) 

0 [Snakes are venomous., Anaconda is venomous.] 
1 [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.] 
2 [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.] 
3 [] 
Name: text, dtype: object

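As you can see, there are a couple of problems: sent_tokenize did not split the text properly, because the input has no space after some of the sentence-ending punctuation.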
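One possible workaround for the missing spaces, offered only as a sketch and not as part of the original answer, is to normalize the text before tokenizing it. The helper name `add_missing_spaces` is hypothetical, and the rule it applies (a `.`, `!` or `?` directly followed by a capital letter marks a sentence boundary) is an assumption that will wrongly split abbreviations such as "U.S.A".

```python
import re

# Assumption: ., ! or ? immediately followed by a capital letter is a
# sentence boundary that is merely missing its space.
_MISSING_SPACE = re.compile(r'([.!?])(?=[A-Z])')

def add_missing_spaces(text):
    """Insert a space after ., ! or ? when a capital letter follows directly."""
    return _MISSING_SPACE.sub(r'\1 ', text)

fixed = add_missing_spaces(
    "Anaconda lives in Amazon.Amazon is a big forest. It is venomous."
)
print(fixed)
```

The cleaned string could then be passed to sent_tokenize in place of the raw text, for example by applying the helper to the text column first.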

Update: handling plurals.

Here's an updated df:

text 
Snakes are venomous. Anaconda is venomous. 
Anaconda lives in Amazon. Amazon is a big forest. It is venomous. 
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous. 
Python is dangerous too. 
I have snakes 


df = pd.read_clipboard(sep='0')
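If copying the rows to the clipboard is inconvenient, the same frame can probably be built directly. This is a sketch that simply types in the five example rows shown above; the variable name `rows` is only illustrative.

```python
import pandas as pd

# The five example rows from the updated text column, entered by hand.
rows = [
    "Snakes are venomous. Anaconda is venomous.",
    "Anaconda lives in Amazon. Amazon is a big forest. It is venomous.",
    "Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect "
    "is an anaconda. Because it is venomous.",
    "Python is dangerous too.",
    "I have snakes",
]

# One row per string, in a single column named "text".
df = pd.DataFrame({"text": rows})
```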

We can use a stemmer (Wikipedia), such as the PorterStemmer.

from nltk.stem.porter import PorterStemmer 
stemmer = PorterStemmer()

First, let's stem and lowercase the searched words:

searched_words = ['snakes','Venomous','anacondas'] 
searched_words = [stemmer.stem(w.lower()) for w in searched_words] 
searched_words 

> ['snake', 'venom', 'anaconda']

Now we can rework the above so that it stems each token as well:

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text) 
          if any(True for w in word_tokenize(sent) 
            if stemmer.stem(w.lower()) in searched_words)])) 

0 [Snakes are venomous., Anaconda is venomous.] 
1 [Anaconda lives in Amazon., It is venomous.] 
2 [Snakes,snakes,snakes everywhere!, The least I expect is an anaconda., Because it is venomous.] 
3 [] 
4 [I have snakes] 
Name: text, dtype: object

If you only want substring matching, make sure searched_words contains the singular forms, not the plurals.

print(df['text'].apply(lambda text: [sent for sent in sent_tokenize(text) 
          if any([(w2.lower() in w.lower()) for w in word_tokenize(sent) 
            for w2 in searched_words]) 
           ]) 
)

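By the way, this is the point where I would probably write a function with a regular for loop, because this lambda with nested list comprehensions is getting out of hand.

Source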
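A plain-loop version might look like the sketch below. It is not the answerer's code: the name `find_sentences` and its parameters are hypothetical, and the tokenizers are passed in as arguments so the logic can be read (and tried) on its own. The two regex-based splitters are crude stand-ins for nltk's sent_tokenize and word_tokenize, used here only for the demonstration.

```python
import re

def find_sentences(text, words, sent_split, word_split, normalize=str.lower):
    """Return the sentences of `text` that contain at least one of `words`."""
    targets = {normalize(w) for w in words}
    matches = []
    for sentence in sent_split(text):
        for token in word_split(sentence):
            if normalize(token) in targets:
                matches.append(sentence)
                break  # one matching word is enough for this sentence
    return matches

# Hypothetical stand-in tokenizers, much cruder than nltk's.
def simple_sentences(text):
    # Split on whitespace that follows ., ! or ?
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def simple_words(sentence):
    # Runs of letters and apostrophes count as words.
    return re.findall(r"[A-Za-z']+", sentence)

result = find_sentences(
    "Snakes are venomous. Python is dangerous too.",
    ['snakes', 'anaconda'],
    simple_sentences,
    simple_words,
)
print(result)
```

In the setting of this answer, one would presumably pass sent_tokenize and word_tokenize as the two splitters, and a normalize function that lowercases and then stems each word with the PorterStemmer, before applying the function to the text column.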

2016-11-29 09:56:44

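Thank you, it works. Yes, I also ran into problems with text like "Snakes are venomous too.Python". I expected the output to be [Snakes are venomous] but I got [Snakes are venomous too.Python], because there is no space before the start of the next sentence. – user7140275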

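Even though I have given 'snakes' in the word list, is there a way to also pick up sentences that contain 'snake'? What I need is substring matching against the specified words, so that I don't lose any data when analysing the context. – user7140275

Yes, I will update accordingly. –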
