How can I read two files in a for-loop and update values in one file based on matching values in the other?

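I want to update the values in a column by reading two files at the same time: for each row of one file, I look up a matching value in the other file and use it to fill in the missing value.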

main_file has the following data:

contig pos GT PGT_phase PID PG_phase PI 
2 1657 ./. . . ./. . 
2 1738 0/1 . . 0|1 935 
2 1764 0/1 . . 1|0 935 
2 1782 0/1 . . 0|1 935 
2 1850 0/0 . . 0/0 . 
2 1860 0/1 . . 1|0 935 
2 1863 0/1 . . 0|1 935 
2 2969 0/1 . . 1|0 3352 
2 2971 0/0 . . 0/0 . 
2 5207 0/1 0|1 5185 1|0 1311 
2 5238 0/1 . . 0|1 1311 
2 5241 0/0 . . 0/0 . 
2 5258 0/1 . . 1|0 1311 
2 5260 0/0 . . 0/0 . 
2 5319 0/0 . . 0/0 . 
2 5398 0/1 0|1 5398 1|0 1311 
2 5403 0/1 0|1 5398 1|0 1311 
2 5426 0/1 0|1 5398 1|0 1311 
2 5427 0/1 0|1 5398 0/1 . 
2 5434 0/1 0|1 5398 1|0 1311 
2 5454 0/1 0|1 5398 0/1 . 
2 5457 0/0 . . 0/0 . 
2 5467 0/1 0|1 5467 0|1 1311 
2 5480 0/1 0|1 5467 0|1 1311 
2 5483 0/0 0|1 5482 0/0 . 
2 6414 0/1 . . 0|1 1667 
2 6446 0/1 0|1 6446 0|1 1667 
2 6448 0/1 0|1 6446 0|1 1667 
2 6465 0/1 0|1 6446 0|1 1667 
2 6636 0/1 . . 1|0 1667 
2 6740 0/1 . 6740 0|1 1667 
2 6748 0/1 . 6740 0|1 .

The other file, match_file, has the following kind of information:

**PI  PID** 
1309 3617741,3617753,3617788,3618156,3618187,3618289 
131  11793586 
1310  
1311 5185,5398,5467,5576 
1312 340692,340728 
1313 18503498 
1667 6740,12237,12298

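What I am trying to do:

I want to create a new column (new_PI) that holds the updated PI values.

How the update works:

If a line in main_file has a PI value, it is simple: new_PI = main_PI, then continue.
If both main_PI and main_PID are '.', then new_PI = '.', then continue.
If the PI value is '.' but the PID value is an integer, look in match_file for the PI whose list of PIDs contains that value. If a matching PID is found, new_PI = PI_match_file, then continue.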
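
The three rules above can be written as one small function. This is a minimal sketch, not the code from the question: the name `compute_new_pi` and the `pid_to_pi` dictionary (mapping each PID to its PI, as it might be built from match_file) are assumptions made for illustration.

```python
def compute_new_pi(pi, pid, pid_to_pi, default='.'):
    """Return the new_PI value for one row of main_file.

    pid_to_pi is a hypothetical dict mapping each PID string to the PI
    that lists it in match_file.
    """
    if pi != '.':
        # Rule 1: an existing PI is kept unchanged.
        return pi
    if pid == '.':
        # Rule 2: neither PI nor PID is known.
        return '.'
    # Rule 3: PI is missing but PID is known, so look the PID up.
    return pid_to_pi.get(pid, default)


# Sample values taken from the tables in the question.
pid_to_pi = {'5185': '1311', '5398': '1311', '6740': '1667'}
print(compute_new_pi('935', '.', pid_to_pi))    # 935
print(compute_new_pi('.', '.', pid_to_pi))      # .
print(compute_new_pi('.', '5398', pid_to_pi))   # 1311
```

Checking the PI first means the "both are '.'" case falls out of the second test, so the three rules need only two conditions.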

I have written the following code:

main_file = open("2ms01e_chr2_table.txt", 'r+')
match_file = open('updated_df_table.txt', 'r+')

main_header = main_file.readline()
match_header = match_file.readline()

main_data = main_file.read().rstrip('\n').split('\n')
match_data = match_file.read().rstrip('\n').split('\n')

file_update = open('PI_updates.txt', 'w')
file_update.write('contig pos GT PGT_phase PID PG_phase PI new_PI\n')
file_update.close()

for line in main_data:
    main_column = line.split('\t')
    PID_main = main_column[4]
    PI_main = main_column[6]
    if PID_main == '.' and PI_main == '.':
        new_PI = '.'
        continue

    if PI_main != '.':
        new_PI = PI_main
        continue

    if PI_main == '.' and PID_main != '.':
        for line in match_data:
            match_column = line.split('\t')
            PI_match = match_column[0]
            PID_match = match_column[1].split(',')
            if PID_main in PID_match:
                new_PI = PI_match
                continue

    file_update = open('PI_updates.txt', 'a')
    file_update.write(line + '\t' + str(new_PI)+ '\n')
    file_update.close()

I am not getting any errors, but it looks like I have not written the code correctly for reading the two files.

My output should look something like this:

contig pos GT PGT PID PG PI new_PI
2 5426 0/1 0|1 5398 1|0 1311 1311
2 5427 0/1 0|1 5398 0/1 . 1311
2 5434 0/1 0|1 5398 1|0 1311 1311
2 5454 0/1 0|1 5398 0/1 . 1311
2 5457 0/0 . . 0/0 . .
2 5467 0/1 0|1 5467 0|1 1311 1311
2 5480 0/1 0|1 5467 0|1 1311 1311
2 5483 0/0 0|1 5482 0/0 1667 1667
2 5518 1/1 1|1 5467 1/1 . 1311
2 5519 0/0 . . 0/0 . .
2 5547 1/1 1|1 5467 1/1 . 1311
2 5550 ./. . . ./. . .
2 5559 1/1 1|1 5467 1/1 . 1311
2 5561 0/0 . . 0/0 . .
2 5576 0/1 0|1 5576 1|0 1311 1311
2 5599 0/1 0|1 5576 1|0 1311 1311
2 5602 0/0 . . 0/0 . .
2 5657 0/1 . . 1|0 1311 1311
2 5723 0/1 . . 1|0 1311 1311
2 6414 0/1 . . 0|1 1667 1667
2 6446 0/1 0|1 6446 0|1 1667 1667
2 6448 0/1 0|1 6446 0|1 1667 1667
2 6465 0/1 0|1 6446 0|1 1667 1667
2 6636 0/1 . . 1|0 1667 1667
2 6740 0/1 . 6740 0|1 1667 1667
2 6748 0/1 . 6740 0|1 . 1667

Thanks in advance!

Source

2016-12-27 everestial007

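Your code looks fine, except that in most cases it never appends a line to the PI_updates file. A continue statement ends the current loop iteration and moves on to the next one, so the file-write lines at the bottom of the loop are skipped. This does not happen when the third if statement is entered, because the continue there only ends the current iteration of the inner loop.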
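
The effect is easy to reproduce in isolation. The sketch below uses made-up rows (the `rows` list and both function names are hypothetical) to show that a statement placed after `continue` never runs for that iteration, while an `if`/`elif` chain lets every row reach the final statement.

```python
rows = [('.', '.'), ('.', '935'), ('5398', '.')]  # (PID, PI) pairs, illustrative only


def with_continue(rows):
    written = []
    for pid, pi in rows:
        if pid == '.' and pi == '.':
            new_pi = '.'
            continue  # jumps to the next row; the append below is skipped
        if pi != '.':
            new_pi = pi
            continue  # same here
        new_pi = 'looked up'
        written.append(new_pi)
    return written


def with_elif(rows):
    written = []
    for pid, pi in rows:
        if pid == '.' and pi == '.':
            new_pi = '.'
        elif pi != '.':
            new_pi = pi
        else:
            new_pi = 'looked up'
        written.append(new_pi)  # reached for every row
    return written


print(with_continue(rows))  # ['looked up']
print(with_elif(rows))      # ['.', '935', 'looked up']
```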

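On a related note, there is a quick win available: you have two nested for loops. You can replace the iteration over match_data with a dictionary lookup, which can give a substantial speedup on larger files. You may also want to collect the new_PI values and perform a single write at the end of the code. File I/O is generally expensive and should be done as little as possible.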

Edit:

# main_file and match_file are opened, and their header lines read,
# exactly as in the question
main_data = main_file.read().rstrip('\n').split('\n')
match_data = match_file.read().rstrip('\n').split('\n')
match_map = {}  # maps each PID to the PI that lists it
for line in match_data:
    PI, PIDs = line.split('\t')
    # update the dict with all the PIDs from this line;
    # PIDs is a comma-separated string, so split it first
    match_map.update({PID: PI for PID in PIDs.split(',') if PID})

PI_updates = 'contig\tpos\tGT\tPGT_phase\tPID\tPG_phase\tPI\tnew_PI\n'

for line in main_data:
    # the main file has seven columns: PID is the fifth, PI the seventh
    _, _, _, _, PID_main, _, PI_main = line.split('\t')
    if PID_main == '.' and PI_main == '.':
        new_PI = '.'
    elif PI_main != '.':
        new_PI = PI_main
    else:
        # dict.get(key, default) returns default if key is not in the dict
        new_PI = match_map.get(PID_main, 'no match found')
    # append the result to the PI_updates string
    PI_updates += line + '\t' + str(new_PI) + '\n'

# let the with statement take care of closing the file
with open('PI_updates.txt', 'w') as file_update:
    file_update.write(PI_updates)
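
For a version that can be run without the input files, here is a sketch of the same idea on a few rows copied from the question. It assumes tab-separated columns and collects the output lines in a list instead of growing a string; the helper names `build_match_map` and `update_rows` are hypothetical.

```python
def build_match_map(match_lines):
    """Map every PID to the PI whose row lists it."""
    match_map = {}
    for line in match_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue  # a PI with no PID column, nothing to index
        pi, pids = fields[0], fields[1]
        for pid in pids.split(','):
            if pid.strip():
                match_map[pid.strip()] = pi
    return match_map


def update_rows(main_lines, match_map, missing='.'):
    """Return the main rows with a new_PI column appended."""
    out = []
    for line in main_lines:
        line = line.rstrip('\n')
        cols = line.split('\t')
        pid, pi = cols[4], cols[6]
        if pi != '.':
            new_pi = pi
        elif pid == '.':
            new_pi = '.'
        else:
            new_pi = match_map.get(pid, missing)
        out.append(line + '\t' + new_pi)
    return out


# A few rows from the question, written with explicit tabs.
match_lines = ['1311\t5185,5398,5467,5576', '1310\t', '1667\t6740,12237,12298']
main_lines = [
    '2\t5427\t0/1\t0|1\t5398\t0/1\t.',
    '2\t5457\t0/0\t.\t.\t0/0\t.',
    '2\t6636\t0/1\t.\t.\t1|0\t1667',
    '2\t6748\t0/1\t.\t6740\t0|1\t.',
]

result = update_rows(main_lines, build_match_map(match_lines))
print('\n'.join(result))
```

To write the result in a single operation, the joined string can be passed to one `write` call inside a `with open(...)` block, as in the code above.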

Source

2016-12-27 00:54:10 Sebastiaan

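Hi @Sebastiaan, I actually tried using a dictionary but did not get it to work. Let me know if you can help. Thanks! – everestial007

I have added an example of how to use a dictionary; I hope it helps. If my answer was useful to you, would you mind marking it as the accepted answer? – Sebastiaan

I should have used break instead of continue (example). The continue statements elsewhere were not helping either.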

main_file = open("2ms01e_chr2_table.txt", 'r+') 
match_file = open('updated_df_table.txt', 'r+') 


main_header = main_file.readline() 
match_header = match_file.readline() 
print(match_header, "\n**") 

main_data = main_file.read().rstrip('\n').split('\n') 
match_data = match_file.read().rstrip('\n').replace('[', '')\ 
    .replace("'", "").replace(']', '').replace(" ", '') 
match_data = match_data.split('\n') 

file_update = open('PI_updates.txt', 'w') 
file_update.write('contig pos GT PGT_phase PID PG_phase PI new_PI\n') 
file_update.close() 

for line in main_data: 
    main_column = line.split('\t') 
    PID_main = main_column[4] 
    PI_main = main_column[6] 
    chrom = main_column[0] 
    pos = main_column[1] 
    if PID_main == '.' and PI_main == '.':
        new_PI = '.'

    if PI_main != '.':
        new_PI = PI_main

    elif PI_main == '.' and PID_main != '.':
        for line1 in match_data:
            match_column = line1.split('\t')
            PI_match = match_column[0]
            PID_match = match_column[1].split(',')
            if PID_main in PID_match:
                new_PI = PI_match
                break
            elif PID_main not in PID_match:
                new_PI = str(chrom) + '_' + str(PID_main)

    file_update = open('PI_updates.txt', 'a')
    file_update.write(line + '\t' + str(new_PI)+ '\n')
    file_update.close()
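
The inner loop can also be written with Python's `for ... else`, where the `else` branch runs only if the loop finished without hitting `break`. This is a sketch, assuming `match_rows` holds `(PI, [PIDs])` pairs already parsed from match_file; the name `find_pi` is hypothetical.

```python
def find_pi(pid, chrom, match_rows):
    """Return the PI whose PID list contains pid, else 'chrom_pid'."""
    for pi_match, pid_match in match_rows:
        if pid in pid_match:
            new_pi = pi_match
            break  # stop at the first match
    else:
        # No break happened, so no row listed this PID.
        new_pi = chrom + '_' + pid
    return new_pi


match_rows = [('1311', ['5185', '5398']), ('1667', ['6740', '12237'])]
print(find_pi('6740', '2', match_rows))  # 1667
print(find_pi('5482', '2', match_rows))  # 2_5482
```

Compared with setting the fallback on every non-matching row, this form assigns it once and also covers the case where `match_rows` is empty.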

Source

2016-12-27 01:58:24 everestial007
