2017-01-23 242 views

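I am fairly new to R, so any help understanding how, in R, to evaluate two vectors based on two other vectors would be appreciated.

A sample dataset is below; an image of the dataset was also attached.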

a           B          C   D
12.97221,   64.78909   1   2
69.64817,   321.90037  2   28
318.87946,  259.29946  3   5
326.17622,  94.7089    9   8
137.54006,  325.34917  5   88
258.06002,  94.77531   6   63
258.92824,  322.20164  7   64
98.57514,   12.96828   8   34
98.46303,   139.27264  9   21
317.22764,  261.25563  10  97

My goal: I need to

1) look at value in column A 
2) find the nearest/closest number in column B 
3) test to see if the value in column B has already been selected 
4) if the value in column B has already been selected, then ignore and choose the next closest value. 
5) once a new, non-duplicated, value is chosen from column B, then 
6) Test to see if the value in column C that is on the same row as the value of interest in column A is not the same as the value in column D on the same row as the nearest chosen value in column B 
7) if the values in column C and D are NOT the same, then 
8) return the value from column B into a new column 
9) if the value in column C and D are the same, then repeat steps 4-7 until a) a new, non-duplicated value is chosen, and b) the value in C and D are not equal. 
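
The steps above can be sketched as a single greedy loop. This is only a minimal illustration, not code from the original post: the function name `match_nearest` and the vector names `cc` and `dd` are hypothetical, and it assumes the four columns are available as plain numeric vectors. A row for which no candidate qualifies is left as `NA`.

```r
# Hypothetical helper (not from the original post): greedy nearest match
# without replacement, skipping candidates whose D equals the row's C.
match_nearest <- function(a, b, c, d) {
  stopifnot(length(a) == length(c), length(b) == length(d))
  used <- rep(FALSE, length(b))     # positions of b that are already taken
  out  <- rep(NA_real_, length(a))  # stays NA if no candidate qualifies
  for (i in seq_along(a)) {
    # Steps 1-2: positions in b, ordered from nearest to farthest from a[i]
    for (j in order(abs(b - a[i]))) {
      # Steps 3-4: skip values already chosen; steps 6-7: require C != D
      if (!used[j] && c[i] != d[j]) {
        out[i]  <- b[j]             # step 8: record the accepted value
        used[j] <- TRUE
        break
      }
      # Step 9: otherwise fall through to the next closest candidate
    }
  }
  out
}

# Sample data from the table above (commas removed)
a  <- c(12.97221, 69.64817, 318.87946, 326.17622, 137.54006,
        258.06002, 258.92824, 98.57514, 98.46303, 317.22764)
b  <- c(64.78909, 321.90037, 259.29946, 94.7089, 325.34917,
        94.77531, 322.20164, 12.96828, 139.27264, 261.25563)
cc <- c(1, 2, 3, 9, 5, 6, 7, 8, 9, 10)
dd <- c(2, 28, 5, 8, 88, 63, 64, 34, 21, 97)

e <- match_nearest(a, b, cc, dd)
print(cbind(a, e))
```

Because the loop is greedy, the result depends on the order of the rows in column A: an early row can take a value that a later row would have matched more closely.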

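Here is the code I have so far. It solves the problem of finding the closest number "without replacement", but it does not take into account whether the values in columns C and D match before a value in column B is selected. It was developed by "Chase" here: How to get the closest element in a vector for every element in another vector without duplicates?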

foo <- function(a, b) {

    out <- cbind(a, bval = NA)

    for (i in seq_along(a)) {
        # Which value of b is closest?
        whichB <- which.min(abs(b - a[i]))
        # Assign that value to the bval column
        out[i, "bval"] <- b[whichB]
        # Remove that value of b so it cannot be chosen again
        b <- b[-whichB]
    }

    return(out)
}

Hopefully this (below) is a better description and example of my problem.

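See the adjusted table, which shows my problem better. Look at the value 12.97221 in column A, then evaluate column B and choose the value 12.96828. Then evaluate the value in column C that corresponds to 12.97221, which is 1, and look at the value in column D that corresponds to 12.96828 (D = 34). Since 12.96828 in column B has not been selected before and the values in columns C and D do not match, I expect it to return 12.96828 in column E.

Next, it looks at the second value in column A, 69.64817, and compares it with the values in column B; it should choose 64.78909 and then check whether that value has already been selected. It then compares the value in column C that corresponds to the value in column A (C = 2) with the value in column D that corresponds to the selected value in column B (D = 2). Although this is the first time 64.78909 has been selected, the values in columns C and D are the same, so I need to choose the next closest value in column B, 94.7089, and check whether it has been selected; it has not. Then compare the value in column C that corresponds to column A (C = 2) with the value in column D that corresponds to 94.7089 (D = 8 in the sample table). Since 94.7089 has not been selected and the values in column C (C = 2) and column D (D = 8) are not the same, return 94.7089 in column E.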
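
The walk-through for the second row can be checked directly. The snippet below is a small sketch that only uses the sample table; the names `a2`, `c2`, `ord`, and `candidates` are made up for illustration.

```r
# Columns B and D from the sample table
b <- c(64.78909, 321.90037, 259.29946, 94.7089, 325.34917,
       94.77531, 322.20164, 12.96828, 139.27264, 261.25563)
d <- c(2, 28, 5, 8, 88, 63, 64, 34, 21, 97)

a2 <- 69.64817   # second value in column A
c2 <- 2          # value in column C on the same row

# Positions in B ordered from nearest to farthest from a2
ord <- order(abs(b - a2))
candidates <- data.frame(B = b[ord], D = d[ord], dist = abs(b[ord] - a2))
print(head(candidates, 3))

# The nearest candidate, 64.78909, is rejected because its D equals c2;
# the next one, 94.7089, has D = 8 in the sample table and is accepted.
accepted <- candidates$B[which(candidates$D != c2)[1]]
print(accepted)
```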

Moving forward, I hope I have described my problem adequately for column A.

98 rows

 a   b c d 
1 12.97221 297.91173 1 1 
2 69.64817 298.19087 2 2 
3 318.87946 169.03864 3 3 
4 326.17622 169.32014 4 4 
5 137.54006 336.65953 5 5 
6 258.06002 94.70890 6 6 
7 258.92824 94.77531 7 7 
8 98.57514 290.19832 8 8 
9 98.46303 290.40790 9 9 
10 317.22764 154.38380 10 10 
11 316.64421 148.78655 11 11 
12 310.73702 153.32877 12 12 
13 237.32708 107.83971 13 13 
14 250.65386 108.05706 14 14 
15 337.09543 180.63118 15 15 
16 337.03365 181.02949 16 16 
17 301.22772 185.20628 17 17 
18 332.93530 185.97922 18 18 
19 340.84127 220.40438 19 19 
20 357.42706 220.83922 20 20 
21 244.89806 83.18630 21 21 
22 244.84391 83.28693 22 22 
23 97.16921 338.39649 23 23 
24 114.62798 338.43398 24 24 
25 178.90640 53.22144 25 25 
26 175.59257 57.77149 26 26 
27 173.32116 60.62938 27 27 
28 172.20906 61.93639 28 28 
29 246.51226 150.04782 29 29 
30 258.00836 150.65750 30 30 
31 259.85790 156.03397 31 31 
32 326.10208 230.30117 32 32 
33 324.96532 230.59314 33 33 
34 319.40851 233.05470 34 34 
35 146.11989 10.86714 35 35 
36 144.63489 12.96828 36 36 
37 139.89335 18.90677 37 37 
38 119.96566 18.75278 38 38 
39 109.18017 28.03931 39 39 
40 108.24683 28.87934 40 40 
41 302.29211 230.30386 41 41 
42 297.28305 233.96142 42 42 
43 244.72843 77.53609 43 43 
44 244.55468 77.62372 44 44 
45 243.47944 78.07812 45 45 
46 181.89548 55.90604 46 46 
47 180.80139 55.99444 47 47 
48 150.37128 59.83512 48 48 
49 51.28074 279.08373 49 49 
50 50.95031 279.21971 50 50 
51 50.57658 279.37713 51 51 
52 48.12937 281.07891 52 52 
53 154.16485 22.38683 53 53 
54 153.48482 22.52214 54 54 
55 145.03992 27.13075 55 55 
56 108.21414 31.28673 56 56 
57 270.96258 182.05611 57 57 
58 269.78887 149.38115 58 58 
59 256.37371 154.75579 59 59 
60 153.74159 25.74645 60 60 
61 151.10381 21.27617 61 61 
62 97.67447 25.97402 62 62 
63 60.73636 259.29946 63 63 
64 11.86492 261.25563 64 64 
65 287.19987 262.01448 65 65 
66 312.08016 234.55050 66 66 
67 315.96324 234.79214 67 67 
68 323.03643 235.31352 68 68 
69 32.71810 333.35849 69 69 
70 59.63687 337.21593 70 70 
71 276.34373 115.55930 71 71 
72 276.31857 115.67837 72 72 
73 275.19374 119.76535 73 73 
74 97.94697 288.88226 74 74 
75 97.60657 289.19108 75 75 
76 97.53337 289.26658 76 76 
77 173.02153 84.88042 77 77 
78 171.27572 86.35787 78 78 
79 169.44530 87.38803 79 79 
80 87.67228 297.48545 80 80 
81 87.54748 297.88451 81 81 
82 86.59445 301.10765 82 82 
83 332.49688 185.82157 83 83 
84 331.19924 186.74459 84 84 
85 222.30368 63.98160 85 85 
86 221.44599 64.24739 86 86 
87 219.66419 64.78909 87 87 
88 229.48482 139.27264 88 88 
89 228.76817 109.94767 89 89 
90 214.77135 105.61337 90 90 
91 208.44254 107.75702 91 91 
92 224.10799 84.52048 92 92 
93 222.94849 87.27893 93 93 
94 222.54903 88.00606 94 94 
95 222.13538 88.80756 95 95 
96 110.52286 321.90037 96 96 
97 109.56354 322.20164 97 97 
98 75.80737 325.34917 98 98 

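Based on your description, it is almost impossible to follow what you want. If you walk us through the conditions you want with an example, you are much more likely to get an answer. For example: for the first row, we find that the value in column "a" is "12.97221". The closest value in column "b" is "12.96828", in the eighth row. Since this is the first row, we know the eighth row has never been selected. And so on. – Barker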


Hello, thanks for the quick reply. – jaz1240


I have updated my post; hopefully that makes things clearer. Thanks! – jaz1240

Answer


So here is your answer; the explanation is embedded in the answer itself (I removed the commas from the dataset).

setwd("~/Desktop/")
df <- read.table("trial.txt",header=T,sep="\t")
names(df) <- c("a","B","C","D")
df_backup <- df
df$newcol <- NA

# values of B that have already been assigned to some row
used <- c()
for (i in seq(1,length(df$a),1)){
    print("######## Separator ########")
    print(paste("searching right match that fits criteria for ",df$a[i],"in column 'a'",sep=""))
    valueA <- df[i,1]
    # row indices of B, ordered from nearest to farthest from valueA
    orderx <- order(abs(df$B-valueA))

    index=1
    # walk the candidates until one is accepted or none are left
    while (is.na(df$newcol[i]) && index <= length(orderx)) {
        j=orderx[index]
        if (df$B[j] %in% used){
            # this value of B was already assigned to an earlier row
            print(paste("passing ",df$B[j], "as it has already been used",sep=""))
            index=index+1
            next
        } else {
            indexB <- j
            valueB <- df$B[indexB]
            print(paste("trying ",valueB,sep=""))

            # accept only if C on the row of A differs from D on the row of B
            if (df$C[i] != df$D[indexB]) {
                df$newcol[i] <- df$B[indexB]
                print(paste("using ",valueB,sep=""))
                used <- c(used,df$B[indexB])
            } else {
                df$newcol[i] <- NA
                print(paste("cant use ",valueB,"as the column C (related to index in A) and D (related to index in B) values are matching",sep=""))
            }

            index=index+1
        }
    }
}

The output looks like this:

[1] "######## Separator ########" 
[1] "searching right match that fits criteria for 12.97221in column 'a'" 
[1] "trying 12.96828" 
[1] "using 12.96828" 
[1] "######## Separator ########" 
[1] "searching right match that fits criteria for 69.64817in column 'a'" 
[1] "trying 64.78909" 
[1] "cant use 64.78909as the column C (related to index in A) and D (related to index in B) values are matching" 
[1] "trying 94.7089" 
[1] "using 94.7089" 
[1] "######## Separator ########" 
[1] "searching right match that fits criteria for 318.87946in column 'a'" 
[1] "trying 321.90037" 
[1] "using 321.90037" 
[1] "######## Separator ########" 
[1] "searching right match that fits criteria for 326.17622in column 'a'" 
[1] "trying 325.34917" 
[1] "using 325.34917" 
[1] "######## Separator ########" 
[1] "searching right match that fits criteria for 137.54006in column 'a'" 
[1] "trying 139.27264" 
[1] "using 139.27264" 
[1] "######## Separator ########" 
[1] "searching right match that fits criteria for 258.06002in column 'a'" 
[1] "trying 259.29946" 
[1] "using 259.29946" 
[1] "######## Separator ########" 
[1] "searching right match that fits criteria for 258.92824in column 'a'" 
[1] "passing 259.29946as it has already been used" 
[1] "trying 261.25563" 
[1] "using 261.25563" 
[1] "######## Separator ########" 
[1] "searching right match that fits criteria for 98.57514in column 'a'" 
[1] "trying 94.77531" 
[1] "using 94.77531" 
[1] "######## Separator ########" 
[1] "searching right match that fits criteria for 98.46303in column 'a'" 
[1] "passing 94.77531as it has already been used" 
[1] "passing 94.7089as it has already been used" 
[1] "trying 64.78909" 
[1] "using 64.78909" 
[1] "######## Separator ########" 
[1] "searching right match that fits criteria for 317.22764in column 'a'" 
[1] "passing 321.90037as it has already been used" 
[1] "trying 322.20164" 
[1] "using 322.20164" 

The final table looks like this:

1 12.97221 64.78909 1 2 12.96828 
2 69.64817 321.90037 2 28 94.7089 
3 318.87946 259.29946 3 5 321.90037 
4 326.17622 94.7089 9 8 325.34917 
5 137.54006 325.34917 5 88 139.27264 
6 258.06002 94.77531 6 63 259.29946 
7 258.92824 322.20164 7 64 261.25563 
8 98.57514 12.96828 8 34 94.77531 
9 98.46303 139.27264 9 21 64.78909 
10 317.22764 261.25563 10 97 322.20164 
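
As a sanity check, the two constraints from the question can be tested against this final table. The snippet below is only a sketch: it retypes the table by hand into a data frame named `res` (a hypothetical name) and assumes the values in column B are unique, as they are in the sample.

```r
# Final table from above, retyped by hand
res <- data.frame(
  a = c(12.97221, 69.64817, 318.87946, 326.17622, 137.54006,
        258.06002, 258.92824, 98.57514, 98.46303, 317.22764),
  B = c(64.78909, 321.90037, 259.29946, 94.7089, 325.34917,
        94.77531, 322.20164, 12.96828, 139.27264, 261.25563),
  C = c(1, 2, 3, 9, 5, 6, 7, 8, 9, 10),
  D = c(2, 28, 5, 8, 88, 63, 64, 34, 21, 97),
  newcol = c(12.96828, 94.7089, 321.90037, 325.34917, 139.27264,
             259.29946, 261.25563, 94.77531, 64.78909, 322.20164)
)

# Constraint 1: no value from B is used more than once
no_duplicates <- anyDuplicated(res$newcol) == 0

# Constraint 2: C on each row differs from the D that sits on the
# row of B where the chosen value came from
d_of_choice <- res$D[match(res$newcol, res$B)]
c_differs_from_d <- all(res$C != d_of_choice)

print(c(no_duplicates = no_duplicates, c_differs_from_d = c_differs_from_d))
```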

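Thanks a lot, Mandar. When I run it on my data I get this error: "Error in seq.default(1, length(df$a), 1) : wrong sign in 'by' argument". I tried to troubleshoot the function but had no luck; any further assistance would be greatly appreciated. – jaz1240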


Check your output for the following. Use this file to start with your data: [link](http://www.filedropper.com/trial_2) 1) df <- read.table("yourdata_trial.txt", header = T) 2) dim(df) 3) df$a 4) length(df$a) – Mandar


Hi, thanks again, Mandar. The results: > dim(df) [1] 98 4 > df$a NULL > length(df$a) [1] 0 – jaz1240
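
A `NULL` from `df$a` with 98 rows and 4 columns suggests that the data frame has no column literally named `a` at that point. The loop then calls `seq(1, 0, 1)`, which is what fails. The sketch below reproduces this with a made-up two-row data frame; the column names `V1` to `V4` are an assumption, since the thread does not show what `names(df)` actually returned.

```r
# Hypothetical stand-in for the asker's data: four columns, none named "a"
df <- data.frame(V1 = c(12.97221, 69.64817), V2 = c(297.91173, 298.19087),
                 V3 = 1:2, V4 = 1:2)

missing_col <- df$a        # `$` returns NULL for a column that does not exist
n <- length(missing_col)   # so the length is 0

# seq() cannot count from 1 to 0 in steps of +1, which raises the error
err <- tryCatch(seq(1, n, 1), error = function(e) conditionMessage(e))
print(err)

# Inspect the real column names, then rename them as the answer does
print(names(df))
names(df) <- c("a", "B", "C", "D")

# seq_len() is a safer way to build the loop index: it returns an empty
# vector for zero rows instead of raising an error
rows <- seq_len(nrow(df))
print(rows)
```

Running `names(df)` right after `read.table()` and renaming the columns before the loop may be enough to resolve the error, but that depends on how the file was read.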