2010-04-10 216 views
5

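R: How to remove outliers from a smoother in ggplot2?

I have the following data set that I am trying to plot with ggplot2. It is a time series of three experiments, A1, B1 and C1, and each experiment has three replicates.

I would like to add a stat that detects and removes outliers before returning a smoother (mean and variance?). I have written my own outlier function (not shown), but I expect there is already a function that does this; I just have not found it.

I have looked at stat_sum_df("median_hilow", geom = "smooth") from some examples in the ggplot2 book, but I did not understand the Hmisc help documentation well enough to tell whether it removes outliers or not.

Is there a function in ggplot that removes outliers like this, or where would I amend my code below to add my own function?

Edit: I just saw this (How to use Outlier Tests in R Code) and noticed that Hadley recommends using a robust method such as rlm. I am plotting bacterial growth curves, so I don't think a linear model is the best choice, but any advice on other models, or on using robust models in this situation, would be appreciated.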

library(ggplot2)

# Three experiments (A1, B1, C1), three replicates each, measured on days 1, 3, 5 and 7
data <- data.frame(
  day = rep(c(1, 3, 5, 7), times = 9),
  od = c(
    0.1,  1.0,  0.5,  0.7,
    0.13, 0.33, 0.54, 0.76,
    0.1,  0.35, 0.54, 0.73,
    1.3,  1.5,  1.75, 1.7,
    1.3,  1.3,  1.0,  1.6,
    1.7,  1.6,  1.75, 1.7,
    2.1,  2.3,  2.5,  2.7,
    2.5,  2.6,  2.6,  2.8,
    2.3,  2.5,  2.8,  3.8),
  series_id = rep(c("A1", "B1", "C1"), each = 12),
  # replicate labels A1.1, A1.2, ..., C1.3, four time points per replicate
  replicate = rep(paste0(rep(c("A1", "B1", "C1"), each = 3), ".", 1:3), each = 4))

> data
   day   od series_id replicate
1    1 0.10        A1      A1.1
2    3 1.00        A1      A1.1
3    5 0.50        A1      A1.1
4    7 0.70        A1      A1.1
5    1 0.13        A1      A1.2
6    3 0.33        A1      A1.2
7    5 0.54        A1      A1.2
8    7 0.76        A1      A1.2
9    1 0.10        A1      A1.3
10   3 0.35        A1      A1.3
11   5 0.54        A1      A1.3
12   7 0.73        A1      A1.3
13   1 1.30        B1      B1.1
... etc...

This is what I have so far. It works nicely, but the outliers are not removed:

r <- ggplot(data = data, aes(x = day, y = od)) 
r + geom_point(aes(group = replicate, color = series_id)) + # add points 
    geom_line(aes(group = replicate, color = series_id)) + # add lines 
    geom_smooth(aes(group = series_id)) # add smoother, average of each replicate 

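Edit: I have just added two charts below that use real data, rather than the example data above, to illustrate my outlier problem.

The first plot shows series p26s4. Around day 32 something really weird went on in two of the replicates, producing 2 outliers.

The second plot shows series p22s5. On day 18 something weird went on with the reading that day, likely a machine error I think.

At the moment I am eyeballing the data to check that the growth curves look OK. After taking Hadley's advice and setting family = "symmetric", I believe the loess smoother does a decent job of ignoring the outliers.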

[Figure: series p26s4 with loess smoother] http://img696.imageshack.us/img696/8743/p26s4loess.png
[Figure: series p22s5 with loess smoother] http://img521.imageshack.us/img521/8083/p22s5loess.png

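@Peter / @Hadley, the next thing I would like to do is fit a logistic, Gompertz or Richards growth curve to this data instead of a loess, and calculate the growth rate in the exponential phase. Eventually I plan to use the grofit package in R (http://cran.r-project.org/web/packages/grofit/index.html), but for now I would like to plot these manually with ggplot2 if possible. If you have any pointers, they would be much appreciated.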
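As one possible starting point for the logistic case, here is a minimal sketch using base R's nls() with the self-starting SSlogis() model. The data are simulated purely for illustration (they are not the real readings), and the names growth, fit, max_slope and rate_const are made up for this example.

```r
# Simulated growth data (an assumption for illustration): 3 replicates, 15 time points
set.seed(7)
growth <- data.frame(day = rep(seq(1, 29, by = 2), times = 3))
growth$od <- 2.5 / (1 + exp((12 - growth$day) / 3)) +
  rnorm(nrow(growth), sd = 0.05)

# SSlogis models od = Asym / (1 + exp((xmid - day) / scal)) and picks its own start values
fit <- nls(od ~ SSlogis(day, Asym, xmid, scal), data = growth)
est <- coef(fit)

# Steepest slope of the fitted curve (OD units per day), reached at day = xmid
max_slope <- unname(est["Asym"] / (4 * est["scal"]))

# Logistic rate constant (per day), i.e. the specific growth rate at low density
rate_const <- unname(1 / est["scal"])

print(est)
print(c(max_slope = max_slope, rate_const = rate_const))
```

To draw the fitted curve in ggplot2, something like geom_smooth(method = "nls", formula = y ~ SSlogis(x, Asym, xmid, scal), se = FALSE) should work; se = FALSE is needed because predict() for nls fits does not return standard errors.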

Answers

14

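Have you tried the family = "symmetric" argument to geom_smooth (which in turn gets passed on to loess)? This makes the loess smooth resistant to outliers.

However, looking at your data, why do you think a linear fit is inadequate? You only have 4 x values, and there seems to be no strong evidence of departure from linearity.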
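To see what family = "symmetric" does, here is a small sketch that calls loess() directly on made-up data (not the data from the question) containing one gross outlier. The variable names and the size of the outlier are assumptions chosen for illustration.

```r
# Made-up data: a smooth S-shaped curve, small noise, one injected outlier
set.seed(1)
x <- 1:30
truth <- 3 / (1 + exp(-(x - 15) / 3))
y <- truth + rnorm(30, sd = 0.05)
y[16] <- y[16] + 3  # a single wild reading

fit_ls  <- loess(y ~ x, family = "gaussian")   # default: local least squares
fit_rob <- loess(y ~ x, family = "symmetric")  # robust: iteratively downweights large residuals

# Compare both fits with the true curve at the contaminated point
pred_ls  <- predict(fit_ls,  data.frame(x = 16))
pred_rob <- predict(fit_rob, data.frame(x = 16))
err_ls  <- abs(pred_ls  - truth[16])
err_rob <- abs(pred_rob - truth[16])
print(c(least_squares = unname(err_ls), robust = unname(err_rob)))
```

The robust fit should sit closer to the underlying curve at x = 16 because the wild reading is downweighted rather than removed from the data.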

+0

I get 'Error: Unknown parameters: family' when I try this. – JayCo 2016-06-21 23:55:38

+1

Figured it out! The correct syntax is 'geom_smooth(method = loess, method.args = list(family = "symmetric"))' – JayCo 2016-06-22 00:06:19
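For reference, a self-contained sketch of that syntax, assuming a ggplot2 version that has the method.args argument. The demo data frame is simulated for illustration (more time points than in the question, with one suspect reading injected), and its name and values are made up.

```r
library(ggplot2)

# Simulated data (an assumption): 2 series x 3 replicates x 15 time points
set.seed(42)
demo <- expand.grid(day = seq(1, 29, by = 2), rep = 1:3,
                    series_id = c("A1", "B1"))
demo$replicate <- paste(demo$series_id, demo$rep, sep = ".")
plateau <- ifelse(demo$series_id == "A1", 1, 2.5)
demo$od <- plateau / (1 + exp(-(demo$day - 12) / 3)) +
  rnorm(nrow(demo), sd = 0.03)
demo$od[demo$day == 17 & demo$replicate == "B1.2"] <- 0.2  # one suspect reading

p <- ggplot(demo, aes(x = day, y = od, color = series_id)) +
  geom_point() +
  geom_line(aes(group = replicate)) +
  # arguments for loess() go through method.args, not directly to geom_smooth()
  geom_smooth(aes(group = series_id), method = "loess",
              method.args = list(family = "symmetric"))
print(p)
```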

2

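First, I am not sure that an "outlier" can even be properly defined on data this small. Second, you need to decide what you mean by "outlier": is it one of the drugs, one of the replicates, or one of the time points? As Hadley points out, there is little evidence of departure from linearity.

Finally, I think part of the point of using a smoother is that it handles outliers well, provided there is enough data. But you have very little.

So I have to ask why you want to remove the outliers. That is, what are you going to do with this data (other than make a nice plot)?

I hope this helps.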