2017-04-07 52 views
2

我有一個如下所示的數據庫。比較並獲得行之間的間隔交點

pos1<-c(5,15,25,40,80,5,18,22,38,84,5,16,50,92,31,50,20,30,50,70,27,50,60,50,90,20,40) 
pos2<-c(10,17,30,42,90,10,20,24,42,87,10,19,52,100,40,70,25,32,60,90,30,60,71,60,100,25,50) 
chr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2) 
n<-c(25,65,78,56,35,78,58,98,14,25,65,85,98,74,20,36,48,98,52,69,21,47,53,10,12,37,82) 
pop<-c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","A","A","A","A","B","B","B","C","C","D","D") 
data<-data.frame(pos1,pos2,chr,pop,n) 

位置1和位置2設計用於每個字符和人口的間隔的開始和結束點。我的目的是要獲得流行音樂A,B和C(不是D)之間的哪個區間相交,哪些區間對於每個人羣是唯一的。

所以,獨特的間隔我將有一個結果data.frame類似如下:

pos1.u<-c(25,50,92,20,30,27,90) 
pos2.u<-c(30,52,100,25,32,30,100) 
chr.u<-c(1,1,1,2,2,2,2) 
pop.u<-c("A","B","C","A","A","B","C") 
n.u<-c(78,98,74,48,98,21,12) 
data.u<-data.frame(pos1.u,pos2.u,chr.u,pop.u,n.u) 

而對於那些3個種羣之間相交下面這樣的data.frame間隔:

pos1.c<-c(5,15,40,80,5,38,85,5,16,50,70,50,60,50) 
pos2.c<-c(10,17,42,90,10,42,87,10,19,60,90,60,71,60) 
chr.c<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2) 
pop.c<-c("A","A","A","A","B","B","B","C","C","A","A","B","B","C") 
n.c<-c(25,65,56,35,78,14,25,65,85,52,69,47,53,10) 
data.c<-data.frame(pos1.c,pos2.c,chr.c,pop.c,n.c) 

我不知道如何寫一個腳本來做到這一點,你能幫助我嗎?

+0

你是什麼意思「這3個人口之間的相交」?最好我可以告訴,在A,B和C中出現的'pos1','pos2'和'chr'只有兩種組合:5,10和1以及50,60和2. – ulfelder

+0

那些是具有完整相交的線段。但我對每個重疊的細分都感興趣。也許我應該使用重疊而不是交叉...抱歉。所以我想找到重疊的每個片段和每個不重疊的片段!感謝您的問題!我希望你可以進一步幫助我... – Cisco

+0

而「重疊」,你的意思是,在'pos1'開始到'pos2'結束的序列的一部分也出現在特定的組合'chr'和'pop'中至少有一個序列具有相同的'chr'值,但'pop'值不同,對吧? – ulfelder

回答

1

我認爲下面的代碼可以滿足您的要求,但它會產生與您不同的結果 - 請仔細檢查!我認爲這種差異在於開放和封閉區間的定義。以下假定既沒有包括端點,也沒有包括端點,但我懷疑這可能不是您的意思(否則(15,18)和(17,19)不會被視爲重疊,因爲沒有整數值落入兩者) 。所以你可能需要調整下面的打開/關閉定義。

pos1<-c(5,15,25,40,80,5,18,22,38,84,5,16,50,92,31,50,20,30,50,70,27,50,60,50,90,20,40) 
pos2<-c(10,17,30,42,90,10,20,24,42,87,10,19,52,100,40,70,25,32,60,90,30,60,71,60,100,25,50) 
chr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2) 
n<-c(25,65,78,56,35,78,58,98,14,25,65,85,98,74,20,36,48,98,52,69,21,47,53,10,12,37,82) 
pop<-c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","A","A","A","A","B","B","B","C","C","D","D") 
data<-data.frame(pos1,pos2,chr,pop,n,stringsAsFactors = FALSE) 

library(intervals) 
data<-data[data$pop!="D",] #remove irrelevant D entries 
rownames(data) <- seq_len(nrow(data)) #reset rownames to allow for removed Ds 

#set ints as a list of intervals (as required by intervals package) 
ints <- tapply(1:nrow(data),data$pop,function(v) 
     Intervals(as.matrix(data[v,c("pos1","pos2")]), 
     closed=c(FALSE,FALSE), #this is where you adjust open/closed lower and upper ends of the intervals - TRUE means end value included 
     type="Z")) #Z is integers 
pops <- unique(data$pop) #unique values of pop 
popidx <- lapply(pops,function(x) which(data$pop==x)) #list of indices of these values in data 
names(popidx) <- pops 

#sets is a df of all pairwise combinations to check 
sets <- expand.grid(pops,pops,stringsAsFactors = FALSE) 
sets <- sets[sets$Var1!=sets$Var2,] 

olap <- lapply(1:nrow(sets),function(i) 
     interval_overlap(ints[[sets$Var1[i]]],ints[[sets$Var2[i]]])) #list of overlaps 
olap <- lapply(1:nrow(sets),function(i) { 
    df<-as.data.frame(olap[[i]],stringsAsFactors=FALSE) 
    df$pos1 <- as.numeric(rownames(df)) 
    df$pos2 <- sapply(1:nrow(df),function(j) popidx[[sets$Var2[i]]][df[j,1][[1]][1]]) 
    return(df)}) #tidy up as dfs, with correct indices in data (rather than in ints) 
olap <- do.call(rbind,olap)[,-1] #join dataframes 
olap$olaps <- !is.na(olap$pos2) #identify those with overlaps 

#group by unique pos1 and identify max and min no of overlaps with other groups 
olap <- data.frame(minoverlap=tapply(olap$olaps,olap$pos1,min),maxoverlap=tapply(olap$olaps,olap$pos1,max)) 
olap$rowno <- as.numeric(rownames(olap)) 

uniques <- data[olap$rowno[olap$maxoverlap==0],] #intervals appearing in just one pop 
commons <- data[olap$rowno[olap$minoverlap>0],] #intervals with an overlap in all other pops 
+0

感謝!它完美的工作! – Cisco

+0

這是一種解脫!不過,我確定必須有更優雅的方式。 –

+0

安德魯。我有個問題。我一直在使用這個腳本,現在已經修改了很久。我只知道可能不考慮變量「Chr」。我希望你能幫助我比較同一個「Chr」中的間隔(在這個例子中是1和2)。我能解釋一下自己嗎? – Cisco