2013-03-06 19 views
-4

我正在研究一個程序,該程序可以估計統計量Tajima在一系列穿過染色體的滑動窗口中的D.染色體本身也被分成許多不同的區域(有望)具有功能意義。滑動窗口分析由我的腳本在每個區域執行。蟒蛇是令人費解的縮短步長,每次迭代的滑動窗口分析

在程序開始時,我定義了滑動窗口的大小以及從一個窗口移動到另一個窗口的步驟大小。我導入一個包含每個不同染色體區域座標的文件,並導入另一個包含我正在處理的所有SNP數據的文件(這是逐行讀取的,因爲它是一個大文件)。程序遍歷染色體位置列表。對於每個位置,它會生成分析的步驟和窗口索引,將SNP數據分成輸出文件(與步驟相對應),計算每個步驟文件的關鍵統計量,並將這些統計量結合起來以估算每個窗口的田島氏D.

該程序適用於SNP數據的小文件。它也適用於第一次染色體斷點的第一次迭代。然而,對於SNP數據的大文件,分析中的步長大大減少,因爲程序在每個染色體區域進行迭代。對於第一個染色體區域,步長爲2500個核苷酸(這是假設的)。然而,對於第二個染色體片段,步長爲1966,第三個爲732.

如果有人對此提出任何建議,請告訴我。我特別難過,因爲這個程序似乎適用於小文件而不適用於大文件。

我的代碼如下:

import sys 
import math 
import fileinput 
import shlex 
import string 
windowSize = int(500) 
stepSize = int(250) 
n = int(50)  #number of individuals in the anaysis 
SNP_file = open("SNPs-1.txt",'r') 
SNP_file.readline() 
breakpoints = open("C:/Users/gwilymh/Desktop/Python/Breakpoint coordinates.txt", 'r') 
breakpoints = list(breakpoints) 
numSegments = len(breakpoints) 
# Open a file to store the Tajima's D results: 
outputFile = open("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/Tajima's D estimates.txt", 'a') 
outputFile.write(str("segmentNumber\tchrSegmentName\tsegmentStart\tsegmentStop\twindowNumber\twindowStart\twindowStop\tWindowSize\tnSNPs\tS\tD\n")) 

#Calculating parameters a1, a2, b1, b2, c1 and c2 
numPairwiseComparisons=n*((n-1)/2) 
b1=(n+1)/(3*(n-1)) 
b2=(2*(n**2+n+3))/(9*n*(n-1)) 
num=list(range(1,n))    # n-1 values as a list 
i=0 
a1=0 
for i in num: 
    a1=a1+(1/i) 
    i=i+1 
j=0 
a2=0 
for j in num: 
    a2=a2+(1/j**2) 
    j=j+1 
c1=(b1/a1)-(1/a1**2) 
c2=(1/(a1**2+a2))*(b2 - ((n+2)/(a1*n))+ (a2/a1**2)) 

counter6=0 
#For each segment, assign a number and identify the start and stop coodrinates and the segment name 
for counter6 in range(counter6,numSegments): 
    segment = shlex.shlex(breakpoints[counter6],posix = True) 
    segment.whitespace += '\t' 
    segment.whitespace_split = True 
    segment = list(segment) 
    segmentName = segment[0] 
    segmentNumber = int(counter6+1) 
    segmentStartPos = int(segment[1]) 
    segmentStopPos = int(segment[2]) 
    outputFile1 = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_Count of SNPs and mismatches per step.txt")%(str(segmentNumber),str(segmentName))), 'a') 

#Make output files to index the lcoations of each window within each segment 
    windowFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_windowFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'a') 
    k = segmentStartPos - 1 
    windowNumber = 0 
    while (k+1) <=segmentStopPos: 
     windowStart = k+1 
     windowNumber = windowNumber+1 
     windowStop = k + windowSize 
     if windowStop > segmentStopPos: 
      windowStop = segmentStopPos 
     windowFileIndex.write(("%s\t%s\t%s\n")%(str(windowNumber),str(windowStart),str(windowStop))) 
     k=k+stepSize 
    windowFileIndex.close() 

# Make output files for each step to export the corresponding SNP data into + an index of these output files 
    stepFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_stepFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'a') 
    i = segmentStartPos-1 
    stepNumber = 0 
    while (i+1) <= segmentStopPos: 
     stepStart = i+1 
     stepNumber = stepNumber+1 
     stepStop = i+stepSize 
     if stepStop > segmentStopPos: 
      stepStop = segmentStopPos 
     stepFile = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(stepNumber))), 'a') 
     stepFileIndex.write(("%s\t%s\t%s\n")%(str(stepNumber),str(stepStart),str(stepStop))) 
     i=i+stepSize 
    stepFile.close() 
    stepFileIndex.close() 

# Open the index file for each step in current chromosomal segment 
    stepFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_stepFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'r') 
    stepFileIndex = list(stepFileIndex) 
    numSteps = len(stepFileIndex) 

    while 1: 
     currentSNP = SNP_file.readline() 
     if not currentSNP: break 
     currentSNP = shlex.shlex(currentSNP,posix=True) 
     currentSNP.whitespace += '\t' 
     currentSNP.whitespace_split = True 
     currentSNP = list(currentSNP) 
     SNPlocation = int(currentSNP[0]) 
     if SNPlocation > segmentStopPos:break 
     stepIndexBin = int(((SNPlocation-segmentStartPos-1)/stepSize)+1) 
     #print(SNPlocation, stepIndexBin) 
     writeFile = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(stepIndexBin))), 'a') 
     writeFile.write((("%s\n")%(str(currentSNP[:])))) 
     writeFile.close() 

    counter3=0 
    for counter3 in range(counter3,numSteps): 
# open up each step in the list of steps across the chromosomal segment: 
     L=shlex.shlex(stepFileIndex[counter3],posix=True) 
     L.whitespace += '\t' 
     L.whitespace_split = True 
     L=list(L) 
     #print(L) 
     stepNumber = int(L[0]) 
     stepStart = int(L[1]) 
     stepStop = int(L[2]) 
     stepSize = int(stepStop-(stepStart-1)) 
#Now open the file of SNPs corresponding with the window in question and convert it into a list: 
     currentStepFile = open(("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(counter3+1)),'r') 
     currentStepFile = list(currentStepFile) 
     nSNPsInCurrentStepFile = len(currentStepFile) 
     print("number of SNPs in this step is:", nSNPsInCurrentStepFile) 
     #print(currentStepFile) 
     if nSNPsInCurrentStepFile == 0: 
      mismatchesPerSiteList = [0] 
     else: 
# For each line of the file, estimate the per site parameters relevent to Tajima's D 
      mismatchesPerSiteList = list() 
      counter4=0 
      for counter4 in range(counter4,nSNPsInCurrentStepFile): 
       CountA=0 
       CountG=0 
       CountC=0 
       CountT=0 
       x = counter4 
       lineOfData = currentStepFile[x] 
       counter5=0 
       for counter5 in range(0,len(lineOfData)): 
        if lineOfData[counter5]==("A" or "a"): CountA=CountA+1 
        elif lineOfData[counter5]==("G" or "g"): CountG=CountG+1 
        elif lineOfData[counter5]==("C" or "c"): CountC=CountC+1 
        elif lineOfData[counter5]==("T" or "t"): CountT=CountT+1 
        else: continue 
       AxG=CountA*CountG 
       AxC=CountA*CountC 
       AxT=CountA*CountT 
       GxC=CountG*CountC 
       GxT=CountG*CountT 
       CxT=CountC*CountT 
       NumberMismatches = AxG+AxC+AxT+GxC+GxT+CxT 
       mismatchesPerSiteList=mismatchesPerSiteList+[NumberMismatches] 
     outputFile1.write(str(("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n")%(segmentNumber, segmentName,stepNumber,stepStart,stepStop,stepSize,nSNPsInCurrentStepFile,sum(mismatchesPerSiteList)))) 
    outputFile1.close() 

    windowFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_windowFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'r') 
    windowFileIndex = list(windowFileIndex) 
    numberOfWindows = len(windowFileIndex) 
    stepData = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_Count of SNPs and mismatches per step.txt")%(str(segmentNumber),str(segmentName))), 'r') 
    stepData = list(stepData) 
    numberOfSteps = len(stepData) 

    counter = 0 
    for counter in range(counter, numberOfWindows): 
     window = shlex.shlex(windowFileIndex[counter], posix = True) 
     window.whitespace += "\t" 
     window.whitespace_split = True 
     window = list(window) 
     windowNumber = int(window[0]) 
     firstCoordinateInCurrentWindow = int(window[1]) 
     lastCoordinateInCurrentWindow = int(window[2]) 
     currentWindowSize = lastCoordinateInCurrentWindow - firstCoordinateInCurrentWindow +1 
     nSNPsInThisWindow = 0 
     nMismatchesInThisWindow = 0 

     counter2 = 0 
     for counter2 in range(counter2,numberOfSteps): 
      step = shlex.shlex(stepData[counter2], posix=True) 
      step.whitespace += "\t" 
      step.whitespace_split = True 
      step = list(step) 
      lastCoordinateInCurrentStep = int(step[4]) 
      if lastCoordinateInCurrentStep < firstCoordinateInCurrentWindow: continue 
      elif lastCoordinateInCurrentStep <= lastCoordinateInCurrentWindow: 
       nSNPsInThisStep = int(step[6]) 
       nMismatchesInThisStep = int(step[7]) 
       nSNPsInThisWindow = nSNPsInThisWindow + nSNPsInThisStep 
       nMismatchesInThisWindow = nMismatchesInThisWindow + nMismatchesInThisStep 
      elif lastCoordinateInCurrentStep > lastCoordinateInCurrentWindow: break 
     if nSNPsInThisWindow ==0 : 
      S = 0 
      D = 0 
     else: 
      S = nSNPsInThisWindow/currentWindowSize 
      pi = nMismatchesInThisWindow/(currentWindowSize*numPairwiseComparisons) 
      print(nSNPsInThisWindow,nMismatchesInThisWindow,currentWindowSize,S,pi) 
      D = (pi-(S/a1))/math.sqrt(c1*S + c2*S*(S-1/currentWindowSize)) 
     outputFile.write(str(("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n")%(segmentNumber,segmentName,segmentStartPos,segmentStopPos,windowNumber,firstCoordinateInCurrentWindow,lastCoordinateInCurrentWindow,currentWindowSize,nSNPsInThisWindow,S,D))) 
+8

首先,停下來。將你的代碼分解成單獨的函數,測試每個函數。如果你仍然無法工作,請在這裏發佈代碼的*相關部分*。另外,在StackOverflow中搜索「滑動窗口迭代器」,那裏有很多實現。 – 2013-03-06 22:26:31

+2

此外,你的問題可能與行'stepSize = int(stepStop-(stepStart-1))'有關。 – 2013-03-06 22:28:30

+1

不要'輸入字符串':'模塊'字符串'''警告:'你在這裏看到的大部分代碼現在通常不會被使用。從Python 1.6開始,許多這些函數被實現爲標準字符串對象上的方法。 – askewchan 2013-03-06 22:34:24

回答

5

快速搜索顯示,你就改變你的stepSize上線110:

stepStart = int(L[1]) 
    stepStop = int(L[2]) 
    stepSize = int(stepStop-(stepStart-1)) 

stepStopstepStart似乎取決於你的文件的內容,所以我們無法進一步調試它。