2017-10-18 123 views
2

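Python: combining str.contains and merge in pandas

I have two dataframes that look roughly like the ones below (the 'Content' column in df1 actually holds the full text of an article, not just a single sentence as in my example):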

   PDF   Content
1  1234  This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2  1111  Johannes writes about apples and oranges and that's great.
3  8000  Content that cannot be matched to the anything in df1.
4  3993  There is an interesting piece on bananas plus kiwis as well.
   ...

(df1, 5709 entries in total)

   Author    Title
1  Johannes  Apples and oranges
2  Peter     Bananas and pears and grapes
3  Hannah    Bananas plus kiwis
4  Helena    Mangos and peaches
   ...

(df2, 10228 entries in total)

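I want to merge the two dataframes by searching for the 'Title' from df2 in the 'Content' of df1. A title counts as a match if it appears within the first 2500 characters of the content. Note: it is important to keep all entries from df1. By contrast, I only want to keep the df2 entries that were matched (i.e. a left join). Note: all titles are unique values.

Desired output (column order does not matter):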

   Author    Title                         PDF   Content
1  Peter     Bananas and pears and grapes  1234  This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2  Johannes  Apples and oranges            1111  Johannes writes about apples and oranges and that's great.
3  NaN       NaN                           8000  Content that cannot be matched to the anything in df2.
4  Hannah    Bananas plus kiwis            3993  There is an interesting piece on bananas plus kiwis as well.
   ...

I think I need some combination of pd.merge and str.contains, but I cannot figure out how!

+1

What behavior do you want/expect if there are multiple matches? – ASGM

+0

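All entries in the Title column are unique. As for the Content column, I would like each title entry to be paired with the first match found among the content entries. – NynkeLys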

+0

"First match found", as in...? First in the dataset (row by row), or first by position in the string? – ctwheels

Answers

0

Warning: this solution may be slow :).
1. Get the list of titles
2. Create an idx based on the order of the title list
3. Concat df1 and df2 on idx

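import pandas as pd

# Lower-cased titles; a title's position in this list doubles as the join index.
lst = [item.lower() for item in df2.Title.tolist()]
end = len(lst)

def func(row):
    global end  # needed because end is reassigned below
    content = row[:2500].lower()  # only the first 2500 characters count
    for i, item in enumerate(lst):
        if item in content:
            return i  # index of the first title found in the content
    # No title found: hand out a fresh index past the last title.
    end += 1
    return end - 1

df1 = df1.assign(idx=df1.Content.apply(func))

# idx is positional, so df2 is given a plain 0..n-1 index before aligning.
# Note: concat along axis=1 expects unique idx values, i.e. at most one row per title.
res = pd.concat([df1.set_index('idx'), df2.reset_index(drop=True)], axis=1)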

Output

      PDF                                            Content    Author  \
0  1111.0  Johannes writes about apples and oranges and t...  Johannes
1  1234.0  This article is about bananas and pears and gr...     Peter
2  3993.0  There is an interesting piece on bananas plus ...    Hannah
3     NaN                                                NaN    Helena
4  8000.0  Content that cannot be matched to the anything...       NaN

                          Title
0            Apples and oranges
1  Bananas and pears and grapes
2            Bananas plus kiwis
3            Mangos and peaches
4                           NaN
+0

Even though both dataframes initially contain only non-null objects, I get the following error: --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in () 2 # in the first 2500 characters of the second df. ----> 4 lst = [item.lower() for item in df2.Title.tolist()] 5 end = len(lst) 6 def func(row): AttributeError: 'float' object has no attribute 'lower'. Any ideas? – NynkeLys

+0

@NynkeLys change the content to str – galaxyan

+0

I used the following command but still get the same error: df1.Content = df1.Content.astype('str') – NynkeLys
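A note on the traceback quoted above: the failing line iterates over df2.Title, not df1.Content, so converting df1.Content to str would not be expected to change anything. A 'float' in a column of strings is usually a missing value (NaN). The sketch below is only an illustration with a made-up df2; `df2_clean` is a hypothetical name, and dropping untitled rows is one option among several. Casting the titles with astype(str) instead would turn NaN into the string 'nan', which happens to be a substring of "bananas".

```python
import numpy as np
import pandas as pd

# Hypothetical df2 with one missing title, to mimic the reported situation.
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Helena"],
    "Title": ["Apples and oranges", np.nan, "Mangos and peaches"],
})

# NaN is a float, so calling .lower() on it raises AttributeError.
# Dropping rows without a title before building the list avoids that.
df2_clean = df2.dropna(subset=["Title"]).reset_index(drop=True)
lst = [item.lower() for item in df2_clean.Title.tolist()]
print(lst)
```

Whether untitled rows should be dropped or kept under a placeholder depends on the data; the point is only that the title list has to contain strings.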

0

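You can do a full Cartesian join / cross product of df1 and df2 and then filter. Since you cannot do a hash lookup anyway, it should not be any slower than an equivalent "join" statement: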

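import pandas as pd

# Same constant key on both sides, so the merge pairs every df1 row with every df2 row.
df1['key'] = 1
df2['key'] = 1
df3 = pd.merge(df1, df2, on='key')
# Reuse 'key' as a boolean mask: True where the title occurs in the first 2500 characters.
df3['key'] = df3.apply(lambda row: row['Title'].lower() in row['Content'][:2500].lower(), axis=1)
# Keep only the matching pairs and the columns of interest.
df3 = df3.loc[df3['key'], ['PDF', 'Author', 'Title', 'Content']]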

Which produces the table:

       PDF    Author                         Title  \
0   1234.0  Johannes            Apples and oranges
1   1234.0     Peter  Bananas and pears and grapes
4   1111.0  Johannes            Apples and oranges
14  3993.0    Hannah            Bananas plus kiwis

                                              Content
0   This article is about bananas and pears and gr...
1   This article is about bananas and pears and gr...
4   Johannes writes about apples and oranges and t...
14  There is an interesting piece on bananas plus ...
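
For the left join the question asks for, a minimal sketch along the same cross-product lines could look like the following. It is not the answerer's code: `match_titles`, `limit`, and the helper columns `_key` and `_pos` are names made up for illustration. It assumes that PDF uniquely identifies a df1 row, that Content and Title contain only strings, and that when several titles match, the one appearing earliest in the text should win (the comments under the question leave that choice open).

```python
import pandas as pd

# Sample frames taken from the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, "
        "but also mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df2.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})


def match_titles(df1, df2, limit=2500):
    """Left-join df2 onto df1 where Title occurs in the first `limit` chars of Content."""
    # Cross product via a temporary constant key.
    pairs = pd.merge(df1.assign(_key=1), df2.assign(_key=1), on="_key")
    pairs = pairs.drop(columns="_key")
    # Position of the title in the truncated, lower-cased content; -1 means absent.
    pairs["_pos"] = [
        content[:limit].lower().find(title.lower())
        for content, title in zip(pairs["Content"], pairs["Title"])
    ]
    hits = pairs[pairs["_pos"] >= 0]
    # Assumption: among several matching titles, keep the earliest one in the text.
    best = hits.sort_values("_pos").drop_duplicates("PDF")
    # Left merge so every df1 row survives; unmatched rows get NaN Author/Title.
    return df1.merge(best[["PDF", "Author", "Title"]], on="PDF", how="left")


result = match_titles(df1, df2)
print(result)
```

The cross product has len(df1) * len(df2) rows, so with 5709 articles and 10228 titles it may need to be built in chunks of df1 to fit in memory.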
+0

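Thanks! I tried it, but got the following error: ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series. Any ideas? – NynkeLys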

+0

Any ideas? Running your code keeps producing an error, even with exactly the same dfs I created for my question. I am using Python 2.7. – NynkeLys
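
One possible cause of a ValueError like this, offered as a guess rather than a diagnosis: if the two frames carry different constant values in 'key' (for example 1 and 2), the merge finds no matching rows and returns an empty frame, and the later column assignment then has nothing to work with. The sketch below uses made-up frames (`left`, `right_same`, `right_diff` are hypothetical names) and only shows the difference in row counts.

```python
import pandas as pd

left = pd.DataFrame({"PDF": [1234, 1111], "key": 1})
right_same = pd.DataFrame({"Title": ["a", "b", "c"], "key": 1})
right_diff = pd.DataFrame({"Title": ["a", "b", "c"], "key": 2})

cross = pd.merge(left, right_same, on="key")  # every pairing: 2 * 3 rows
empty = pd.merge(left, right_diff, on="key")  # keys never match: no rows
print(len(cross), len(empty))
```

Checking len(df3) right after the merge is a quick way to tell whether the cross product was actually built.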