Python: Combining str.contains and merge in pandas

I have two dataframes that look somewhat like the ones below (the 'Content' column in df1 is actually the full text of an article, not just a single sentence as in my example):

PDF  Content 
1 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun! 
2 1111 Johannes writes about apples and oranges and that's great. 
3 8000 Content that cannot be matched to the anything in df1.  
4 3993 There is an interesting piece on bananas plus kiwis as well. 
    ...

(5709 entries in total)

Author  Title 
1 Johannes  Apples and oranges 
2 Peter   Bananas and pears and grapes 
3 Hannah  Bananas plus kiwis 
4 Helena  Mangos and peaches 
    ...

(10228 entries in total)

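I want to merge the two dataframes by searching for the 'Title' from df2 in the 'Content' of df1. If the title appears within the first 2500 characters of the content, it is a match. Note: it is important to keep all entries from df1. In contrast, I only want to keep the matching entries from df2 (i.e. a left join). Note: all titles are unique values.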
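To make the matching rule concrete, here is a minimal sketch on the sample rows above. The helper name `title_in_content` is hypothetical, and the case-insensitive comparison is an assumption based on the samples (the titles are capitalised while the article text is not).

```python
import pandas as pd

# Sample frames rebuilt from the tables in the question.
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df1.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

def title_in_content(title, content, limit=2500):
    """Hypothetical helper: True if title occurs in the first `limit` characters."""
    return title.lower() in content[:limit].lower()

peter_hit = title_in_content(df2.Title[1], df1.Content[0])   # Peter's title vs. PDF 1234
helena_hit = title_in_content(df2.Title[3], df1.Content[0])  # Helena's title vs. PDF 1234
print(peter_hit, helena_hit)
```

Note that PDF 1234 also contains "apples and oranges", so a single article can match more than one title.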

Desired output (column order does not matter):

Author  Title      PDF  Content 
1 Peter  Bananas and pears and grapes 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun! 
2 Johannes Apples and oranges   1111 Johannes writes about apples and oranges and that's great. 
3 NaN  NaN       8000 Content that cannot be matched to the anything in df2.  
4 Hannah  Bananas plus kiwis   3993 There is an interesting piece on bananas plus kiwis as well. 
    ...

I think I need some combination of pd.merge and str.contains, but I can't figure out how!

Source

2017-10-18 NynkeLys

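What behavior do you want or expect if there are multiple matches? – ASGM

All entries in the Title column are unique. As for the Content column, I want the title entry to be matched with the first match found in the content entry. – NynkeLys

"First match found", as in...? First in the dataset (row by row), or first by position in the string? – ctwheels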

Warning: this solution may be slow :).
1. Get the list of titles
2. Create an index based on the order of the title list
3. Concat df1 and df2 on idx

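import itertools
import pandas as pd

# Lower-cased titles; a title's position in this list becomes the join index.
lst = [item.lower() for item in df2.Title.tolist()]
# Unmatched rows get fresh indices past the end of the title list.
unmatched = itertools.count(len(lst))

def func(content):
    content = content[:2500].lower()
    for i, item in enumerate(lst):
        if item in content:
            return i
    return next(unmatched)

df1 = df1.assign(idx=df1.Content.apply(func))

# Align on idx; this needs unique idx values (no title matching two df1 rows).
res = pd.concat([df1.set_index('idx'), df2], axis=1)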

Output

 PDF           Content Author \ 
0 1111.0 Johannes writes about apples and oranges and t... Johannes 
1 1234.0 This article is about bananas and pears and gr...  Peter 
2 3993.0 There is an interesting piece on bananas plus ... Hannah 
3  NaN            NaN Helena 
4 8000.0 Content that cannot be matched to the anything...  NaN 

          Title 
0   Apples and oranges 
1 Bananas and pears and grapes 
2   Bananas plus kiwis 
3   Mangos and peaches 
4       NaN

Source

2017-10-18 16:12:37 galaxyan

Even at the very first step I get the following error (both dataframes contain only non-null objects): AttributeError Traceback (most recent call last) in () 2 # in the first 2500 characters of the second df. ----> 4 lst = [item.lower() for item in df2.Title.tolist()] 5 end = len(lst) 6 def func(row): AttributeError: 'float' object has no attribute 'lower'. Any ideas? – NynkeLys

@NynkeLys change the content to str – galaxyan

I used the following command but still get the same error: df1.Content = df1.Content.astype('str') – NynkeLys
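A note on the traceback above: the failing line iterates over df2.Title, not df1.Content, so casting df1.Content alone would not be expected to change it. Below is a minimal sketch, assuming the offending float is a missing title (NaN). The frame is made-up sample data, and dropping missing titles is only one possible way to handle them.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing title; NaN is a float, so it has no .lower().
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Helena"],
    "Title": ["Apples and oranges", np.nan, "Mangos and peaches"],
})

# Show which Python types actually occur in the column.
print(df2.Title.map(type).value_counts())

# Drop missing titles before lower-casing. Casting with astype(str) instead
# can turn NaN into the literal string "nan", which might match article text.
titles = df2.Title.dropna()
lst = [item.lower() for item in titles.tolist()]
print(lst)
```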

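You can do a full Cartesian join / cross product of the two dataframes, then filter. Since you can't do a hash lookup, it shouldn't be any slower than an equivalent "join" statement: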

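import pandas as pd

# Same constant key on both frames, so the merge pairs every df1 row with every df2 row.
df1['key'] = 1
df2['key'] = 1
df3 = pd.merge(df1, df2, on='key')
# Reuse 'key' as a boolean mask: True where the title occurs in the first 2500 characters.
df3['key'] = df3.apply(lambda row: row['Title'].lower() in row['Content'][:2500].lower(), axis=1)
df3 = df3.loc[df3['key'], ['PDF', 'Author', 'Title', 'Content']]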

Which produces the table:

 PDF Author       Title \ 
0 1234.0 Johannes   Apples and oranges 
1 1234.0  Peter Bananas and pears and grapes 
4 1111.0 Johannes   Apples and oranges 
14 3993.0 Hannah   Bananas plus kiwis 

               Content 
0 This article is about bananas and pears and gr... 
1 This article is about bananas and pears and gr... 
4 Johannes writes about apples and oranges and t... 
14 There is an interesting piece on bananas plus ...
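The table above keeps only matching pairs, so PDF 8000 is gone and PDF 1234 appears twice. Below is a minimal sketch of one way to get back to the left join the question asks for. It assumes that "first match" means the title appearing earliest in the text (the question's comments leave this open) and that PDF values are unique in df1. It uses `how="cross"`, which needs pandas 1.2 or later; the sample frames are rebuilt so the block runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df1.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Pair every article with every title (cross join, pandas >= 1.2).
pairs = df1.merge(df2, how="cross")

# Position of the title in the first 2500 characters, ignoring case; -1 if absent.
pairs["pos"] = [c[:2500].lower().find(t.lower())
                for t, c in zip(pairs["Title"], pairs["Content"])]
matches = pairs[pairs["pos"] >= 0]

# Assumption: when several titles match, keep the one that appears earliest.
best = (matches.sort_values("pos", kind="mergesort")
               .drop_duplicates(subset="PDF", keep="first"))

# Left join back onto df1 so unmatched articles survive with NaN author/title.
result = df1.merge(best[["PDF", "Author", "Title"]], on="PDF", how="left")
print(result[["PDF", "Author", "Title"]])
```

On the sample rows this pairs PDF 1234 with Peter and leaves PDF 8000 unmatched. With the full data (5709 articles and 10228 titles) the cross join would hold roughly 58 million pairs, so processing df1 in chunks may be needed.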

Source

2017-10-18 16:25:02 scnerd

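Thanks! I tried it, but got the following error: ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series. Any ideas? – NynkeLys

Any ideas? Running your code keeps producing an error. I am using Python 2.7, even with exactly the same dfs as the ones I created for my question. – NynkeLys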
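One way this error can come about, offered as a guess rather than a diagnosis: if the two constant `key` columns hold different values, the merge has no rows to pair up, and the later column assignment then operates on an empty frame. A minimal sketch with made-up two-row frames (the names `left` and `right` are for illustration only):

```python
import pandas as pd

left = pd.DataFrame({"PDF": [1234, 1111], "key": 1})
right = pd.DataFrame({"Title": ["Apples and oranges", "Bananas plus kiwis"], "key": 2})

# Different constants never match, so the inner merge is empty.
empty = pd.merge(left, right, on="key")
print(len(empty))

# With the same constant on both sides, every row pairs with every row.
right["key"] = 1
full = pd.merge(left, right, on="key")
print(len(full))
```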
