2017-09-13 95 views

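Pandas: change values in a column based on their relationship to another column

I am using the sklearn.datasets.fetch_20newsgroups() dataset. Some of its documents belong to more than one newsgroup. I want to treat each such document as two separate entities, each belonging to a single newsgroup. To do this, I put the document IDs and group numbers into a DataFrame.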

import os

import pandas as pd
from sklearn import datasets

data = datasets.fetch_20newsgroups()

# The file name (last path component) serves as the document ID.
filepaths = data.filenames.astype(str)
keys = [os.path.split(path)[1] for path in filepaths]

groups = pd.DataFrame(keys, columns = ['Document_ID']) 
groups['Group'] = data.target 
groups.head() 

>> Document_ID Group 
0 102994  7 
1 51861  4 
2 51879  4 
3 38242  1 
4 60880  14 

print (len(groups)) 
>>11314 
print (len(groups['Document_ID'].drop_duplicates())) 
>>9840 
print (len(groups['Group'].drop_duplicates())) 
>>20 

For each Document_ID that is assigned more than one group number, I want to change its value. For example:

groups[groups['Document_ID']=='76139'] 

>> Document_ID Group 
5392 76139 6 
5680 76139 17 

I would like this to become:

>> Document_ID Group 
5392 76139 6 
5680 12345 17 

Here, 12345 is a random new ID that is not already in the keys list.

How can I do this?

Answers

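You can use the duplicated method to find every row whose Document_ID repeats an earlier one (all occurrences after the first). Then create a list of new IDs starting at one more than the current maximum ID, and use the loc indexing operator to overwrite the duplicate keys with the new IDs.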

groups['Document_ID'] = groups['Document_ID'].astype(int) 
dupes = groups.Document_ID.duplicated(keep='first') 
max_id = groups.Document_ID.max() + 1 
new_id = range(max_id, max_id + dupes.sum()) 
groups.loc[dupes, 'Document_ID'] = new_id 

Test case

groups.loc[[5392,5680]] 

     Document_ID Group 
5392  76139  6 
5680  179489  17 

Confirm that no duplicates remain.

groups.Document_ID.duplicated(keep='first').any() 
False 
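
The same approach can be tried end to end on a small frame. This is only a minimal sketch: the four rows are made-up sample values rather than real dataset entries, and the range is wrapped in list() purely as a precaution.

```python
import pandas as pd

# Made-up sample rows (assumption): document 76139 appears in two groups.
groups = pd.DataFrame({
    "Document_ID": ["102994", "51861", "76139", "76139"],
    "Group": [7, 4, 6, 17],
})

# IDs arrive as strings (file names), so convert them before doing arithmetic.
groups["Document_ID"] = groups["Document_ID"].astype(int)

# True for every repeat of an ID after its first occurrence.
dupes = groups["Document_ID"].duplicated(keep="first")

# New IDs start one past the current maximum, so they cannot collide
# with any existing ID or with each other.
start = groups["Document_ID"].max() + 1
groups.loc[dupes, "Document_ID"] = list(range(start, start + int(dupes.sum())))

print(groups)
```

Note that the new IDs are sequential rather than random; they only need to be unused, which starting above the maximum guarantees.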

A bit hacky, but why not!

import numpy as np
import pandas as pd

data = {"Document_ID": [102994, 51861, 51879, 38242, 60880, 76139, 76139],
        "Group": [7, 1, 3, 4, 4, 6, 17]}
groups = pd.DataFrame(data)

# Build a dictionary that maps each unique doc ID to the list of its group IDs
DocDict = {doc_id: [] for doc_id in groups['Document_ID'].unique()}
for index, row in groups.iterrows():
    DocDict[row['Document_ID']].append(row['Group'])

# For all doc IDs with multiple entries, create a new ID with the group ID as a decimal fraction.
# Dividing by 100 keeps the fraction below 1 for group numbers 0-19; dividing by 10
# would carry groups >= 10 into the integer part (76139 + 17/10 = 76140.7).
groups['DupID'] = groups['Document_ID'].apply(lambda x: len(DocDict[x]))
groups["Document_ID"] = np.where(groups['DupID'] > 1, groups["Document_ID"] + groups["Group"] / 100, groups["Document_ID"])
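
The same idea can probably be written without the dictionary and the row loop. The sketch below is a variant, not the code above: it marks shared IDs with duplicated(keep=False) and, as an assumption made here, appends the group number as a string suffix ("76139_6") instead of a decimal fraction, which avoids floating-point IDs.

```python
import pandas as pd

groups = pd.DataFrame({
    "Document_ID": [102994, 51861, 51879, 38242, 60880, 76139, 76139],
    "Group": [7, 1, 3, 4, 4, 6, 17],
})

# True for every row whose Document_ID occurs more than once (all occurrences).
shared = groups["Document_ID"].duplicated(keep=False)

# Hypothetical naming scheme: "<doc>_<group>" for shared IDs, plain "<doc>" otherwise.
ids = groups["Document_ID"].astype(str)
groups["Document_ID"] = ids.where(~shared, ids + "_" + groups["Group"].astype(str))

print(groups)
```

Unlike the first answer, this renames every row of a shared document, including the first occurrence, and the ID column ends up holding strings.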

Hope this helps...