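Dropping variables that have too many invalid/missing values

Say my dataset has a lot of missing/invalid values. If a variable (column) contains too many invalid values, I want to drop the entire variable.

In the example below, the variable 'gender' has a lot of "#N/A" values. I want to drop a variable if a certain percentage of its data points are "#N/A", say more than 50% or more than 30%. I also want the percentage to be configurable, i.e. if more than x% of the observations for a variable are "#N/A", the whole variable should be dropped. And I want to be able to define what counts as an invalid value: it could be "#N/A", "invalid value", "", or anything else I predefine.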

data dat; 
    input id score gender $; 
    cards; 
    1 10 1 
    1 10 1 
    1 9 #N/A 
    1 9 #N/A 
    1 9 #N/A 
    1 8 #N/A 
    2 9 #N/A 
    2 8 #N/A 
    2 9 #N/A 
    2 9 2 
    2 10 2 
    ; 
run;

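Please make the solution as general as possible. For example, if the real dataset contains thousands of variables, I need to be able to loop through all of them rather than referencing them by name one at a time. Also, the dataset may contain not only "#N/A" as a bad value but also things like ".", "invalid observation", or "N.A.", possibly all at once. PS: I actually thought of a way to make this problem easier. We can read all the data points in as numeric, so that every "#N/A", "N.A.", "" and so on becomes ".", which makes the drop criterion simpler. Hope that helps you solve the problem...

Update: below is the code I have been working on. I am stuck on the last piece.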

data dat; 
    input id $ score $ gender $; 
    cards; 
    1 10 1 
    1 10 1 
    1 9 #N/A 
    1 9 #N/A 
    1 9 #N/A 
    1 8 #N/A 
    2 9 #N/A 
    2 8 #N/A 
    2 9 #N/A 
    2 9 2 
    2 10 2 
    ; 
run; 

proc contents data=dat out=test0(keep=name type) noprint; 

/*A DATA step is used to subset the test0 data set to keep only the character */ 
/*variables and exclude the one ID character variable. A new list of numeric*/ 
/*variable names is created from the character variable name with a "_n"  */ 
/*appended to the end of each name.           */               

data test0;             
set test0;             
if type=2;     
newname=trim(left(name))||"_n";                    

/*The macro system option SYMBOLGEN is set to be able to see what the macro*/ 
/*variables resolved to in the SAS log.         */              

options symbolgen;           

/*PROC SQL is used to create three macro variables with the INTO clause. One */ 
/*macro variable named c_list will contain a list of each character variable */ 
/*separated by a blank space. The next macro variable named n_list will  */ 
/*contain a list of each new numeric variable separated by a blank space. The */ 
/*last macro variable named renam_list will contain a list of each new numeric */ 
/*variable and each character variable separated by an equal sign to be used on*/ 
/*the RENAME statement.              */               

proc sql noprint;           
select trim(left(name)), trim(left(newname)),    
     trim(left(newname))||'='||trim(left(name))   
into :c_list separated by ' ', :n_list separated by ' ', 
    :renam_list separated by ' '       
from test0; 
quit;                            


/*The DATA step is used to convert the character values to numeric. An ARRAY */ 
/*statement is used for the list of character variables and another ARRAY for */ 
/*the list of numeric variables. A DO loop is used to process each variable */ 
/*to convert the value from character to numeric with the INPUT function. The */ 
/*DROP statement is used to prevent the character variables from being written */ 
/*to the output data set, and the RENAME statement is used to rename the new */ 
/*numeric variable names back to the original character variable names.  */               

data test2;            
set dat;             
array ch(*) $ &c_list;          
array nu(*) &n_list;          
do i = 1 to dim(ch);          
    nu(i)=input(ch(i),8.);         
end;              
drop i &c_list;           
rename &renam_list;                      
run; 




data test3;            
set test2;             
array myVars(*) &c_list;        
countTotal=1; 
do i = 1 to dim(myVars); 
    myCounter = count(.,myVars(i)); 
/* if sum(countMissing)/sum(countTotal) lt 0.5 then drop VNAME(myVars(i)); */ 
end; 

run;

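The problem, and where I am stuck, is that I cannot drop the variables I want to drop. The reason is that I don't want to use variable names in the DROP statement. Instead, I want to do it in a loop where I can refer to the variable by the loop index "i". I tried using the array element "myVars(i)", but it doesn't seem to work with DROP.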

Source

2015-10-20 user4564894

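Stack Overflow is not a code-writing service. You should attempt to solve the problem yourself and come back with questions about your solution, not just ask for a whole problem to be solved. – Joe

I agree with Joe - you already seem to have a fairly clear idea of what you want to do, so have a go at it first. If you get stuck on a particular step, then by all means post your code and ask for help. – user667489

Now that I have provided more details and code, please remove the downvote, since I am no longer asking for a code-writing service, @Joe – user4564894

In general, you will find that things like this are simpler using the built-in procs - that is SAS's bread and butter. You just need to restate the problem.

What you want is to drop variables whose percentage of missing/bad data is above 50%, so you need a frequency table of the variables, right?

So - use PROC FREQ. This is a simplified version (it only looks for "#N/A"), but it should be easy to modify the final step so that it looks for other values (and sums their percentages). Alternatively, as you will see in the linked question (from my comment on this question), you can use a special format that maps all invalid values to one formatted value and all valid values to another. (You would have to construct this format.)

Concept: use PROC FREQ to get the frequency tables, then look in that dataset for rows where the percent is > 50 and the value in the F_ column is invalid.

This does not work for actual missing values (" " or .); if you have those as well, you need to add the /MISSING option to the TABLES statement in PROC FREQ.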

data dat; 
    input id $ score $ gender $; 
    cards; 
    1 10 1 
    1 10 1 
    1 9 #N/A 
    1 9 #N/A 
    1 9 #N/A 
    1 8 #N/A 
    2 9 #N/A 
    2 8 #N/A 
    2 9 #N/A 
    2 9 2 
    2 10 2 
    ; 
run; 

*shut off ODS for the moment, and only use ODS OUTPUT, so we do not get a mess in our results window; 
ods exclude all; 
ods output onewayfreqs=freq_tables; 
proc freq data=dat; 
    tables id score gender; 
run; 
ods output close; 
ods exclude none; 

*now we check for variables that match our criteria;  
data has_missing; 
    set freq_tables; 
    if coalescec(of f_:) ='#N/A' and percent>50; 
    varname = substr(table,7); 
run; 

*now we put those into a macro variable to drop; 
proc sql; 
    select varname 
    into :droplist separated by ' ' 
    from has_missing; 
quit; 

*and we drop them; 
data dat_fixed; 
    set dat; 
    drop &droplist.; 
run;
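
The format-based, configurable variant described in the text before the code might look roughly like the sketch below. It rests on several assumptions that are not part of the original post: the format name $invalid, the particular list of invalid strings, and the macro variable threshold are all hypothetical, and every variable in dat is assumed to be character, as in the updated question (numeric variables would need a numeric format of their own).

```sas
/* Hypothetical threshold: drop a variable when more than this percent is invalid */
%let threshold = 50;

/* Hypothetical format: every value regarded as invalid maps to one label */
proc format;
    value $invalid
        '#N/A', 'N.A.', 'invalid', ' ' = 'INVALID'
        other                          = 'VALID';
run;

/* Frequency tables for all variables, grouped by the formatted value */
ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
    tables _all_ / missing;          /* MISSING keeps blank values in the percentages */
    format _character_ $invalid.;
run;
ods exclude none;

/* Keep only the INVALID rows whose share exceeds the threshold */
data has_missing;
    set freq_tables;
    length varname $32;
    if coalescec(of f_:) = 'INVALID' and percent > &threshold.;
    varname = substr(table, 7);      /* TABLE holds "Table <name>" */
run;

/* Reset first so that an empty result leaves an empty list */
%let droplist = ;
proc sql noprint;
    select varname
    into :droplist separated by ' '
    from has_missing;
quit;

data dat_fixed;
    set dat;
    drop &droplist.;
run;
```

Because all invalid values collapse into the single formatted value INVALID, there is no need to sum percentages over several rows per variable. With the sample data, gender has 7 invalid values out of 11 (about 64%), so it exceeds a threshold of 50.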

Source

2015-10-20 19:49:32 Joe

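Thanks Joe, you are definitely a pro. – user4564894

When I first saw this problem I thought of PROC FREQ, but then I struggled to extract exactly the information I needed from its output table, not knowing about functions like substr, coalescec, etc. A quick question: in the second-to-last line of the code, why is there a period at the end of &droplist? Also, if I set the percentage threshold very high, the final output still drops the column with #N/A. – user4564894

One problem I ran into: if the dataset has_missing is empty, varname is empty, so droplist fails to resolve. – user4564894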
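
One possible way around both issues raised in these comments is sketched below; it is an editorial illustration, not something from the answer itself. When the SELECT returns no rows, PROC SQL leaves the INTO macro variable untouched, so droplist is either undefined or still holds the list from an earlier run, which would also explain a column being dropped despite a very high threshold. Resetting it first avoids both. The period after &droplist simply marks the end of the macro variable name; it is optional here because the name is followed by a semicolon.

```sas
/* Make sure the macro variable exists and is empty before the query */
%let droplist = ;

proc sql noprint;
    select varname
    into :droplist separated by ' '
    from has_missing;
quit;

/* Write the resolved list to the log for inspection */
%put NOTE: droplist resolves to [&droplist.];

/* With an empty list the statement below becomes "drop ;" and nothing is dropped */
data dat_fixed;
    set dat;
    drop &droplist.;
run;
```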

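My understanding is that SAS processes DROP statements during data step compilation, i.e. before it looks at any data in any input dataset. So you cannot use a function like vname in this way to select the variables to drop, because the variable name is not evaluated until the data step has finished compiling and moved on to execution.

You will need to output a temporary dataset or view containing all of your variables (including the ones you don't want), build up a list of the variables you want to drop in a macro variable, and then drop them in a subsequent data step.

See this paper, page 3 in particular, for more detail on what runs during compilation rather than execution: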

http://www.lexjansen.com/nesug/nesug11/ds/ds04.pdf
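
The two-pass approach described in this answer could be sketched as follows. This is a minimal illustration under assumptions: it reuses test2 and &c_list from the question's code (where the variables have already been converted to numeric), and the names threshold, miss_counts and to_drop are hypothetical. PROC MEANS is one way to get the missing counts; the answer does not prescribe it.

```sas
/* Hypothetical threshold: drop when more than this fraction is missing */
%let threshold = 0.5;

/* Pass 1: one row holding the number of missing values per variable;
   _FREQ_ holds the total number of observations */
proc means data=test2 noprint;
    var &c_list.;
    output out=miss_counts(drop=_type_) nmiss=;
run;

/* Turn that row into one observation per variable that exceeds the threshold */
data to_drop(keep=varname);
    set miss_counts;
    array cnt(*) &c_list.;
    length varname $32;
    do i = 1 to dim(cnt);
        if cnt(i) / _freq_ > &threshold. then do;
            varname = vname(cnt(i));   /* resolved at execution time, which is fine here */
            output;
        end;
    end;
run;

/* Build the list in a macro variable, reset first in case nothing qualifies */
%let droplist = ;
proc sql noprint;
    select varname
    into :droplist separated by ' '
    from to_drop;
quit;

/* Pass 2: the list is plain text by the time this step compiles */
data test3;
    set test2;
    drop &droplist.;
run;
```

The point of the split is that vname runs in the first data step, at execution time, while the DROP statement in the last step only ever sees variable names that the macro processor has already substituted.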

Source

2015-10-20 19:12:39 user667489

Thanks for your input. That makes sense. I am looking at the link you attached and will come back with my findings :) – user4564894
