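Python: combining str.contains and merge in pandas

I have two dataframes that look roughly like the ones below (the 'Content' column in df1 actually holds the full text of an article, not just a single sentence as in my example):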
PDF Content
1 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2 1111 Johannes writes about apples and oranges and that's great.
3 8000 Content that cannot be matched to the anything in df2.
4 3993 There is an interesting piece on bananas plus kiwis as well.
...
(5709 rows in total)
Author Title
1 Johannes Apples and oranges
2 Peter Bananas and pears and grapes
3 Hannah Bananas plus kiwis
4 Helena Mangos and peaches
...
(10228 rows in total)
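I want to merge the two dataframes by searching for the 'Title' from df2 in the 'Content' of df1. If the title appears within the first 2500 characters of the content, it counts as a match. Note: it is important to keep all entries from df1. By contrast, I only want to keep the df2 entries that match (i.e. a left join). Note: all titles are unique values.
Desired output (column order doesn't matter):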
Author Title PDF Content
1 Peter Bananas and pears and grapes 1234 This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2 Johannes Apples and oranges 1111 Johannes writes about apples and oranges and that's great.
3 NaN NaN 8000 Content that cannot be matched to the anything in df2.
4 Hannah Bananas plus kiwis 3993 There is an interesting piece on bananas plus kiwis as well.
...
I think I need some combination of pd.merge and str.contains, but I can't figure out how!
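One possible way to combine the two steps, given as a minimal sketch rather than a definitive answer: first work out which title (if any) each article contains, store it in a new 'Title' column on df1, then do an ordinary left merge on that column. The helper `first_title`, its `limit` parameter and the `titles` list are hypothetical names introduced for illustration. The sketch also assumes the match should be case-insensitive (the sample titles are capitalised while the article text is not) and that, when several titles occur, the one starting earliest in the text wins.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df2.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# (original title, lower-cased title) pairs, computed once
titles = [(t, t.lower()) for t in df2["Title"]]

def first_title(content, limit=2500):
    """Return the title occurring earliest in the first `limit` characters, or None."""
    head = content[:limit].lower()  # assumption: case-insensitive matching
    best_pos, best_title = None, None
    for title, needle in titles:
        pos = head.find(needle)
        if pos != -1 and (best_pos is None or pos < best_pos):
            best_pos, best_title = pos, title
    return best_title

# Attach the matched title to each article, then left-join on it
df1["Title"] = df1["Content"].map(first_title)
result = df1.merge(df2, on="Title", how="left")
print(result[["Author", "Title", "PDF"]])
```

Because the merge is a left join keyed on df1, every article is kept and unmatched ones get NaN for 'Author'. The plain Python loop compares every article with every title, so on the full data (5709 articles, 10228 titles) it may be slow.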
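What behavior do you want/expect if there are multiple matches? – ASGM
All entries in the Title column are unique. As for the Content column, I would like the title to be matched with the first match found in the content. – NynkeLys
"First match found", as in...? First in the dataset (row by row), or first by position in the string? – ctwheels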
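The two readings in the last comment give different answers for the first sample article, which contains two of the titles. A small sketch, assuming case-insensitive matching and using the hypothetical variable names `hits`, `first_in_df2_order` and `first_in_text`:

```python
content = ("This article is about bananas and pears and grapes, but also "
           "mentions apples and oranges, so much fun!")
titles = ["Apples and oranges", "Bananas and pears and grapes",
          "Bananas plus kiwis", "Mangos and peaches"]  # df2 row order

head = content[:2500].lower()

# (position in text, title) for every title that occurs, kept in df2 row order
hits = [(head.find(t.lower()), t) for t in titles if t.lower() in head]

first_in_df2_order = hits[0][1]  # reading 1: first matching row of df2
first_in_text = min(hits)[1]     # reading 2: match starting earliest in the string

print(first_in_df2_order)
print(first_in_text)
```

The desired output in the question pairs PDF 1234 with "Bananas and pears and grapes", which agrees with the second reading (earliest position in the string).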