How to remove columns with duplicate names but keep the data

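I am using a pandas DataFrame as a dataset whose attributes are English words. After stemming, I have multiple columns with the same name. Here is a snapshot of the sample data: after stemming, accept, acceptable and accepted all become accept. I want to apply bitwise_or across all columns that share a name and delete the duplicate columns. I tried this code: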

import numpy
from nltk.stem import *
import pandas as pd
ps = PorterStemmer()
dataset = pd.read_csv('sampleData.csv')
stemmed_words = []

for w in list(dataset):
    stemmed_words.append(ps.stem(w))

dataset.columns = stemmed_words
new_word = stemmed_words[0]

for w in stemmed_words:
    if new_word == w:
        numpy.bitwise_or(dataset[new_word], dataset[w])
        del dataset[w]
    else:
        new_word = w

print(dataset)

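The problem is that when the for loop executes

del dataset['accept']

it deletes all columns with that name. I don't know in advance how many columns will share a name, and this code raises the exception KeyError: 'accept'.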
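
The behaviour described above can be reproduced in a few lines. This is a minimal sketch on a hypothetical three-column frame (the values are made up, not taken from sampleData.csv): deleting by a duplicated label removes every column carrying that label, so a second delete of the same label raises KeyError.

```python
import pandas as pd

# Hypothetical frame with a duplicated column label (not the real data).
df = pd.DataFrame([[0, 1, 0], [1, 0, 0], [0, 0, 1]],
                  columns=["accept", "accept", "access"])

# Selecting a duplicated label returns a DataFrame holding every match.
print(type(df["accept"]).__name__, df["accept"].shape)

# Deleting by label drops every column with that label at once.
del df["accept"]
remaining = list(df.columns)
print(remaining)

# A second delete of the same label therefore fails with KeyError.
try:
    del df["accept"]
    second_delete_failed = False
except KeyError:
    second_delete_failed = True
print(second_delete_failed)
```

This is why the loop in the question breaks on the second 'accept' it visits: the first `del` has already removed all of them.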

I want to apply bitwise_or to all three accept columns, save the result in a new column named 'accept', and delete the old columns.
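
One way to get that result while still using bitwise_or, sketched under assumptions: the small frame below is hypothetical, `labels` and `merged` are illustrative names, and `numpy.bitwise_or.reduce` is used to OR all same-named columns in a single call instead of pairwise inside a loop that deletes columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the stemmed columns from the question.
df = pd.DataFrame(
    [[0, 0, 1, 1, "C"], [1, 0, 0, 0, "A"], [0, 1, 0, 0, "H"]],
    columns=["accept", "accept", "accept", "access", "Class"],
)

labels = df.pop("Class")            # keep the non-binary column out of the OR
merged = pd.DataFrame(index=df.index)

# Visit each distinct column name once, in first-seen order.
for name in df.columns.unique():
    # Boolean mask selects every column carrying this name.
    block = df.loc[:, df.columns == name].to_numpy(dtype=int)
    # OR the same-named columns together, row by row.
    merged[name] = np.bitwise_or.reduce(block, axis=1)

result = merged.join(labels)
print(result)
```

Building a new frame sidesteps the KeyError, because nothing is deleted while the column names are being iterated.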

I hope I won't get downvoted this time.

Here is the sample data:

able  abundance  academy  accept  accept  accept  access  accommodation  accompany  Class
   0          0        0       0       0       1       1              0          0      C
   0          0        0       1       0       0       0              0          0      A
   0          0        0       0       1       0       0              0          0      H
   0          0        0       0       0       1       0              1          0      G
   0          0        0       1       0       0       0              0          0      G

The output should be:

Class  able  abundance  academy  accept  access  accommodation  accompany
    C     0          0        0       1       1              0          0
    A     0          0        0       1       0              0          0
    H     0          0        0       1       0              0          0
    G     0          0        0       1       0              1          0
    G     0          0        0       1       0              0          0

Source

2017-05-07 Abrar

IIUC, you can group by column name (axis=1).

DataFrame:

In [101]: df
Out[101]:
   able  abundance  academy  accept  accept  accept  access  accommodation  accompany  Class
0     0          0        0       0       0       1       1              0          0      C
1     0          0        0       1       0       0       0              0          0      A
2     0          0        0       0       1       0       0              0          0      H
3     0          0        0       0       0       1       0              1          0      G
4     0          0        0       1       0       0       0              0          0      G

Solution:

In [103]: df.pop('Class').to_frame() \
     ...: .join(df.groupby(df.columns, axis=1).any(1).mul(1))
Out[103]:
   Class  able  abundance  academy  accept  access  accommodation  accompany
0      C     0          0        0       1       1              0          0
1      A     0          0        0       1       0              0          0
2      H     0          0        0       1       0              0          0
3      G     0          0        0       1       0              1          0
4      G     0          0        0       1       0              0          0

An even better solution (@ayhan, thank you for the hint!):

In [114]: df = df.pop('Class').to_frame().join(df.groupby(df.columns, axis=1).max())

In [115]: df
Out[115]:
   Class  able  abundance  academy  accept  access  accommodation  accompany
0      C     0          0        0       1       1              0          0
1      A     0          0        0       1       0              0          0
2      H     0          0        0       1       0              0          0
3      G     0          0        0       1       0              1          0
4      G     0          0        0       1       0              0          0
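
A note for newer pandas, offered as a sketch rather than as part of the answer above: as far as I am aware, `groupby(..., axis=1)` is deprecated from pandas 2.1 onward, and the usual workaround is to transpose, group on the index labels, and transpose back. The frame below is rebuilt by hand from the sample rows in the question; `labels` and `merged` are illustrative names.

```python
import pandas as pd

# Sample rows copied from the question; the frame is rebuilt here by hand.
cols = ["able", "abundance", "academy", "accept", "accept", "accept",
        "access", "accommodation", "accompany", "Class"]
rows = [
    [0, 0, 0, 0, 0, 1, 1, 0, 0, "C"],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, "A"],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, "H"],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, "G"],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, "G"],
]
df = pd.DataFrame(rows, columns=cols)

# Set the label column aside so only the binary columns are grouped.
labels = df.pop("Class").to_frame()

# Transpose so the duplicated names become index labels, group on them,
# take the per-group max, then transpose back to the original orientation.
merged = df.T.groupby(level=0).max().T

result = labels.join(merged)
print(result)
```

Popping 'Class' first matters here: it keeps the transposed frame purely numeric, so the max is taken over integers only.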

Source

2017-05-07 11:13:31 MaxU

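Can you explain this approach a bit more? It does not produce the desired output: it does not group the columns with the same name. I replaced the for loop in the OP with your 'df.groupby(df.columns, axis=1).any(1).mul(1)'. – Abrar

@Abrar, please provide a small __reproducible__ sample data set (3-5 rows, in text/CSV format, so we can copy and paste it) and the desired data set [in your question](http://stackoverflow.com/posts/43830707/edit). – MaxU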

@MaxU I think that instead of mul, the OP is looking for groupby.sum (since the columns are binary, their sum will behave like 'any': 1 if any of them is 1). – ayhan
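
A small check of the sum suggestion, as a sketch on a hypothetical frame: with binary columns the group sum matches 'any' only when at most one of the same-named columns is 1 in a given row. If two are 1 the sum is 2, so either `max` or a clipped sum is needed. The transpose is used here only so the snippet avoids the `axis=1` argument.

```python
import pandas as pd

# Hypothetical data: in the first row two same-named columns are both 1.
df = pd.DataFrame([[1, 1, 0], [0, 1, 0], [0, 0, 0]],
                  columns=["accept", "accept", "accept"])

# Group the transposed frame on its index labels (the column names).
grouped = df.T.groupby(level=0)

row_sum = grouped.sum().T["accept"].tolist()
row_max = grouped.max().T["accept"].tolist()
row_clipped = grouped.sum().T.clip(upper=1)["accept"].tolist()

print(row_sum)      # sum counts how many duplicates are set
print(row_max)      # max stays within {0, 1}
print(row_clipped)  # clipping the sum gives the same values as max
```

On the sample data in the question no row has more than one 'accept' column set, so sum and max happen to agree there.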
