Check whether column names exist

I have a DataFrame df that contains many field names for a range of years:

            field 
year description            
1993 bar0          a01arb92 
    bar1          a01svb92 
    bar2          a01fam92 
    bar3          a08 
    bar4          a01bea93

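Then, for each year, I have a Stata file that contains an id column, some (or all) of the field names listed in df as columns, and other columns. For example, 1993.dta might be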

id a01arb92 a01svb92 a08 a01bea93 
0   1  1 1  1 
0   1  1 1  2

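For each year, I need to check in the corresponding file whether all the fields listed in df really exist (as columns). I then want to save the result back into the original DataFrame. Is there a nice way to do this, rather than iterating over every single field?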

Expected output:

            field exists 
year description            
1993 bar0          a01arb92  1 
    bar1          a01svb92  1 
    bar2          a01fam92  0 
    bar3          a08    1 
    bar4          a01bea93  1

That is, for the case where every field except a01fam92 exists as a column in the 1993 file.

Source

2014-10-27 FooBar

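Try going year by year: filter the DataFrame to get the fields associated with each particular year, then check whether each element is in the Stata file or not.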

Read the Stata file using read_stata:

import pandas as pd
d = pd.read_stata("file")

Read your CSV file and store it in a DataFrame:

import pandas as pd 
df= pd.read_csv("file")

Filter and extract the fields for each year.

df[df["year"]==1993].fields #Output: List of fields in year 1993

You can loop over the list of years:

l = df.year.unique()  # Each year once
for x in l:
    f = df[df["year"] == x].fields
    # Then check whether the fields in f are in the Stata file.

The "filter fields using Pandas" reference explains in detail how to generalize this process.

Compare the list with the Stata fields you have

You can use the built-in all() function.

all(item in d for item in f)

If it is True, then all the fields in the list are columns of the Stata file.
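As a minimal sketch of that check, the snippet below uses a small in-memory DataFrame as a stand-in for the contents of 1993.dta (an assumption made for illustration; with a real file, d would come from pd.read_stata). It relies on the fact that `item in d` tests membership in the column labels of a DataFrame, not in its values.

```python
import pandas as pd

# Hypothetical stand-in for the contents of 1993.dta from the question.
d = pd.DataFrame({
    "id": [0, 0],
    "a01arb92": [1, 1],
    "a01svb92": [1, 1],
    "a08": [1, 1],
    "a01bea93": [1, 2],
})

# Fields listed for 1993 in the question's df.
f = ["a01arb92", "a01svb92", "a01fam92", "a08", "a01bea93"]

# True only if every listed field is a column of d.
all_present = all(item in d for item in f)

# The fields that are listed but missing from the file.
missing = [item for item in f if item not in d]
```

Here a01fam92 is not a column of d, so all_present is False and missing contains only that field.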

Putting it together:

l = df.year.unique()  # List of years, each once
IsInDic = {}  # Dictionary mapping year -> whether all its fields are in the Stata file, e.g. {1993: True}
for x in l:
    f = df[df["year"] == x].fields
    # Check whether every field in f is a column of the Stata file d.
    isInList = all(item in d for item in f)
    IsInDic[x] = isInList  # Store the result so you can look it up later.

UPDATE

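def isInList(year):
    fields = df[df["year"] == year].fields.tolist()
    # Keep the columns of d that are listed for this year, then compare with all of d's columns.
    return [col for col in d if col in fields] == list(d)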
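If a per-field result is wanted rather than a single True/False per year, one possible sketch is to loop over the years only and let Series.isin handle all the fields of a year at once. The names columns_by_year and exists are assumptions introduced for illustration, and the question's df is rebuilt here with year and description as ordinary columns and the column name field, as in the question.

```python
import pandas as pd

# The question's df, with year and description as ordinary columns.
df = pd.DataFrame({
    "year": [1993] * 5,
    "description": ["bar0", "bar1", "bar2", "bar3", "bar4"],
    "field": ["a01arb92", "a01svb92", "a01fam92", "a08", "a01bea93"],
})

# Hypothetical mapping from year to the column names of that year's Stata file.
# With real files it could be filled from pd.read_stata(...).columns per year.
columns_by_year = {1993: ["id", "a01arb92", "a01svb92", "a08", "a01bea93"]}

df["exists"] = 0
for year, group in df.groupby("year"):
    available = columns_by_year.get(year, [])
    # isin() flags every field of this year in one vectorized call.
    df.loc[group.index, "exists"] = group["field"].isin(available).astype(int)
```

This still iterates over the years (one file each), but not over the individual fields.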

Source

2014-10-28 01:20:28

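Well, that was my initial idea. But it iterates over every file, and then, after saving the result as a dictionary, I assume I would have to iterate over it again to get it back into the original DataFrame. Is there no way to use the fact that both 'df' and 'd' are DataFrames? – FooBar 2014-10-28 16:55:38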

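@FooBar Check the update. If we can use filtering, we build a filtered list by adding each element of d (if it is in the fields), then compare the result with d. If we get the same list, it means all the elements are in the fields; otherwise the result is False. – 2014-10-29 13:30:51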

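I think your update should read 'return [...] == df[df["year"] == x].fields'. However, that only tells me whether it contains *all* of the fields. To recover the expected output in the question, I would still need to iterate over all the fields, wouldn't I? – FooBar 2014-10-29 14:48:57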

Here is one way to do this, making use of the fact that pandas automatically fills missing index entries with NaN.

First, prepare the data. You may have already completed this step.

df1 = pd.read_csv(r'c:\temp\test1.txt', sep=' ') 

df1 
Out[30]: 
    year description  field 
0 1993  bar0 a01arb92 
1 1993  bar1 a01svb92 
2 1993  bar2 a01fam92 
3 1993  bar3  a08 
4 1993  bar4 a01bea93 

df1 = df1.set_index(['year', 'description', 'field']) 

df2 = pd.read_csv(r'c:\temp\test2.txt', sep=' ') 

df2 
Out[33]: 
    year description  field 
0 1993  bar0 a01arb92 
1 1993  bar1 a01svb92 
2 1993  bar3  a08 
3 1993  bar4 a01bea93 

df2 = df2.set_index(['year', 'description', 'field'])

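Next, create a new column in df2 and use pandas to copy that column over to the previous DataFrame. Rows missing from df2 are filled with NaN. Then use fillna to set them to 0.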

df2['exists'] = 1 

df1['exists'] = df2['exists'] 

df1 
Out[37]: 
          exists 
year description field   
1993 bar0  a01arb92  1 
    bar1  a01svb92  1 
    bar2  a01fam92  NaN 
    bar3  a08   1 
    bar4  a01bea93  1 

df1.fillna(0) 
Out[38]: 
          exists 
year description field   
1993 bar0  a01arb92  1 
    bar1  a01svb92  1 
    bar2  a01fam92  0 
    bar3  a08   1 
    bar4  a01bea93  1
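The same NaN-then-fillna idea can be sketched for the case where each yearly file carries the fields as columns. This sketch swaps the index assignment above for a left merge on (year, field), which is a different technique with the same effect. The mapping columns_by_year is a hypothetical stand-in for the column names read from each Stata file.

```python
import pandas as pd

df1 = pd.DataFrame({
    "year": [1993] * 5,
    "description": ["bar0", "bar1", "bar2", "bar3", "bar4"],
    "field": ["a01arb92", "a01svb92", "a01fam92", "a08", "a01bea93"],
})

# Hypothetical: column names found in each year's Stata file.
columns_by_year = {1993: ["id", "a01arb92", "a01svb92", "a08", "a01bea93"]}

# One row per (year, column) pair that actually exists in a file.
present = pd.DataFrame(
    [(year, col, 1) for year, cols in columns_by_year.items() for col in cols],
    columns=["year", "field", "exists"],
)

# A left merge keeps every row of df1; unmatched pairs get NaN in "exists".
result = df1.merge(present, on=["year", "field"], how="left")
result["exists"] = result["exists"].fillna(0).astype(int)
```

If the original MultiIndex layout is wanted, result.set_index(['year', 'description']) restores it.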

Source

2014-10-29 05:25:10

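Thanks for your answer. It looks like my question was quite unclear: 'df2' does not have the same structure as 'df1'; it has the 'fields' listed in 'df1' as columns. I updated the question, and I hope that helps. – FooBar 2014-10-29 14:43:42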
