2017-06-05 61 views
0

我試圖實現遺傳算法將DNA片段組裝成一個只給出光譜序列。我唯一的運營商是Edge Recombination,我確信它足以獲得相當不錯的結果。正確的邊緣重組交叉DNA裝配

但是......我無法擊敗80%(最佳分數的百分比),並且有500個碎片的實例可能需要2小時(如果在100次迭代中沒有改善,算法會停止)。我甚至執行它的權利?我沒有發現任何交叉運營商應該選擇匹配更好的元素(片段重疊 - 在大多數論文中,它只是我們挑選的),但是我實現了它,它選擇了最匹配的一個,如果有很多 - 隨機一個。沒有選擇最好的一個,它甚至不會達到40%。

我應該實施更多的交叉?或者它不是如何工作......或者我錯過了什麼?我在這一點上很絕望,等待24小時爲40個實例簡單的啓發式(山登山者)可以在5秒做...

有代碼(排列是一個洗牌的頻譜 - 字符串列表)

def crossover(g1, g2, arr, ind): # edge recombination operator 
    neigh_list = {} # adjacency list 
    length = len(g1.permutation) # expected length of a child 
    for i, base in enumerate(g1.permutation): # create nodes 
     neigh_list[base] = {g1.permutation[i - 1], g1.permutation[(i + 1) % length]} 
    for i, base in enumerate(g2.permutation): # add neighbours to each node 
     neigh_list[base].add(g2.permutation[i - 1]) 
     neigh_list[base].add(g2.permutation[(i + 1) % length]) 

    # a starting point of a child is a starting point of one of the parents 
    neigh_chosen = [g1.permutation[0], g2.permutation[0]][random.randint(0, 1)] 
    child = [neigh_chosen] 

    while len(child) < length: # run until child has desired length 
     min_length = 5 # each list has lower length than 5 
     min_neigh_list = [] 
     for k in neigh_list: # for every node 
      if neigh_chosen in neigh_list[k]: # remove a chosen fragment from the node 
       neigh_list[k].remove(neigh_chosen) 
     for k in neigh_list[neigh_chosen]: # if a node is a neighbour of previously chosen 
      if len(neigh_list[k]) < min_length: # remember nodes with the fewest neighbours 
       min_length = len(neigh_list[k]) 
       min_neigh_list = [k] 
      elif len(neigh_list[k]) == min_length: 
       min_neigh_list.append(k) 
     del neigh_list[neigh_chosen] # delete list of the chosen node 
     if len(min_neigh_list) > 0: # if the chosen node has any neighbours 
      # get the best match out of neighbours as next 
      max_overlap = overlap(neigh_chosen, max(min_neigh_list, key=lambda x: overlap(neigh_chosen, x))) 
      possibilities = list(filter(lambda x: overlap(neigh_chosen, x) == max_overlap, min_neigh_list)) 
      neigh_chosen = possibilities[random.randint(0, len(possibilities) - 1)] 
     else: 
      # get the best match out of every node as next 
      max_overlap = overlap(neigh_chosen, max(neigh_list, key=lambda x: overlap(neigh_chosen, x))) 
      possibilities = list(filter(lambda x: overlap(neigh_chosen, x) == max_overlap, neigh_list)) 
      neigh_chosen = possibilities[random.randint(0, len(possibilities) - 1)] 
     child.append(neigh_chosen) # add the node to the solution 
    arr[ind] = Gene(child) 
+1

你將很難得到這裏的答案,我認爲(或任何),因爲這是一個相當專業了。我對你正在做的事情很感興趣(我沒有說自己能夠回答),但有一件事可以幫助我,如果你堅持[PEP8](https://www.python.org/dev/peps/pep-0008 /)來打破這個巨大的代碼塊。儘管如此,我認爲通過啓發式方法需要一些實實在在的努力。 – roganjosh

+1

您可能會在新的生物信息學堆棧交換站點上獲得很好的答案:https://area51.stackexchange.com/proposals/109245/bioinformatics – nbryans

回答

0

好吧,我設法解決這個做一個很簡單的伎倆。問題在於理解這個操作員實際做了什麼以及在哪裏使用它。爲了讓幫助做DNA裝配問題只是

min_neigh_list = neigh_list[neight_chosen] 
    del neigh_list[neigh_chosen] 

它的工作原理的原因是更換

for k in neigh_list[neigh_chosen]: # if a node is a neighbour of previously chosen 
     if len(neigh_list[k]) < min_length: # remember nodes with the fewest neighbours 
      min_length = len(neigh_list[k]) 
      min_neigh_list = [k] 
     elif len(neigh_list[k]) == min_length: 
      min_neigh_list.append(k) 
    del neigh_list[neigh_chosen] 

,因爲我們並不尋求與鄰國的最小數量的片段(這可能是重要的其他類型的圖形相關的問題!),但最好的匹配鄰居!

它的工作原理就像一個魅力這個簡單的變化:)

+0

只是好奇,是否還需要2個小時? – mitoRibo

+0

現在它給出95 +%的分數,每個問題實例需要5秒到240秒(300個片段到700個片段) –