I have a relatively simple task but one problem: a merge yields an unexpected length.
I have two dataframes. The first is df_sample,
which I read from a CSV:
+------+-----------+-------+-----------+
| key | Full Text | Date | Publisher |
+------+-----------+-------+-----------+
| abcd | foofoo | date1 | a |
| bcde | barbar | date2 | b |
| cdef | foobar | date3 | c |
+------+-----------+-------+-----------+
len(df_sample) = 20000
The second is df_labels,
which I read from Excel:
+------+----------+--------+--------+
| key | relevant | other | other2 |
+------+----------+--------+--------+
| abcd | yes | blabla | blabla |
| bcde | no | blabla | blabla |
| cdef | no | blabla | blabla |
| defg | yes | blabla | blabla |
+------+----------+--------+--------+
len(df_labels) = 219000
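I want to join these two tables on key, so that each key in the first dataframe is assigned
its relevant value from df_labels. The desired output would look like this: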
+------+-----------+-------+-----------+----------+
| key | Full Text | Date | Publisher | relevant |
+------+-----------+-------+-----------+----------+
| abcd | foofoo | date1 | a | yes |
| bcde | barbar | date2 | b | no |
| cdef | foobar | date3 | c | no |
+------+-----------+-------+-----------+----------+
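I seem to have managed this, but why does the line below give me a result with 27377 rows instead of 20000 (the length of the original left table)?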
df = pd.merge(left=df_sample, right=df_labels, on="key")
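Have you checked that the key column values are unique in the second df? If they are duplicated, you will get duplicate rows. Also, do you have any `NaN` values in the key column? – EdChum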
Indeed, there are some duplicates in the second df... Thanks a lot for pointing me in the right direction! – pawelty
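
For anyone landing here later, the check suggested in the comments can be sketched roughly as follows. This is only an illustration: the toy data is made up to mimic the tables in the question, and keeping the first label per key is an assumption about what you want when a key appears more than once in df_labels.

```python
import pandas as pd

# Toy stand-ins for the two frames in the question (values are made up).
df_sample = pd.DataFrame({
    "key": ["abcd", "bcde", "cdef"],
    "Full Text": ["foofoo", "barbar", "foobar"],
})
df_labels = pd.DataFrame({
    "key": ["abcd", "bcde", "bcde", "cdef", "defg"],  # "bcde" appears twice
    "relevant": ["yes", "no", "no", "no", "yes"],
})

# Diagnose: count repeated keys and missing keys in the right-hand frame.
n_dup_keys = int(df_labels["key"].duplicated().sum())
n_nan_keys = int(df_labels["key"].isna().sum())

# The merge from the question: every repeated key on the right adds a row.
inflated = pd.merge(left=df_sample, right=df_labels, on="key")

# One possible fix (assumption: the first label per key is the one to keep).
labels_unique = df_labels[["key", "relevant"]].drop_duplicates(subset="key", keep="first")

# how="left" keeps every row of df_sample; validate="many_to_one" raises
# a MergeError if the right-hand keys are still not unique.
merged = pd.merge(
    left=df_sample,
    right=labels_unique,
    on="key",
    how="left",
    validate="many_to_one",
)

print(n_dup_keys, n_nan_keys, len(inflated), len(merged))  # 1 0 4 3
```

With `how="left"`, keys from df_sample that have no label are kept with `NaN` in `relevant` rather than dropped, so `len(merged)` should match `len(df_sample)`. If the duplicated keys carry conflicting labels, dropping all but the first hides that, so it is worth inspecting them before deduplicating.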