2014-12-04 87 views
3

我正在嘗試從漢明距離矩陣創建一個字符串列表。每個字符串必須是20個字符長,並帶有4個字母的字母表(A,B,C,D)。例如,假設我有以下的漢明距離矩陣:確定滿足海明距離矩陣的字符串

S1 S2 S3 
S1 0 5 12 
S2 5 0 14 
S3 12 14 0 

從這個矩陣,我需要創建3個字符串,例如:

S1 = "ABBBBAAAAAAAAAABBBBB" 
S2 = "BAAAAAAAAAAAAAABBBBB" 
S3 = "CBBBABBBBBBBBBBBBBBB" 

我手工創建這些字符串,但我需要做的這對於代表100個絃線的漢明距離矩陣來說是不切實際的,以手動進行。任何人都可以提出一個算法可以做到這一點?

謝謝,克里斯

回答

1

這是一個有趣的練習。 :-)

以下octave腳本隨機產生n字符串的長度爲len。隨後它計算所有這些字符串之間的海明距離。

接下來要做的是字符串是兩兩比較的。例如,如果您搜索[5 12 14],則會發現表N包含512之間的字符串,以及1214分開的字符串。接下來的挑戰當然是找到其中相隔512的電路可以與1214分開的電路放在一起,以使電路「閉合」。

 
% We generate n strings of length len 
n=50; 
len=20; 

% We have a categorical variable of size 4 (ABCD) 
cat=4; 

% We want to generate strings that correspond with the following hamming distance matrix 
search=[5 12 14]; 
%search=[10 12 14 14 14 16]; 
S=squareform(search); 

% Note that we generate each string totally random. If you need small distances it makes sense to introduce 
% correlations across the strings 
X=randi(cat-1,n,len); 

% Calculate the hamming distances 
t=pdist(X,'hamming')*len; 

% The big matrix we have to find our little matrix S within 
Y=squareform(t); 

% All the following might be replaced by something like submatrix(Y,S) if that would exist 
R=zeros(size(S),size(Y)); 
for j = 1:size(S) 
    M=zeros(size(Y),size(S)); 
    for i = 1:size(Y) 
     M(i,:)=ismember(S(j,:),Y(i,:)); 
    endfor 
    R(j,:)=all(M'); 
endfor 

[x,y]=find(R); 

% A will be a set of cells that contains the indices of the columns/rows that will make up our submatrices 
A = accumarray(x,y,[], @(v) {sort(v).'}); 

% If for example the distance 5 doesn't occur at all, we can already drop out 
if (sum(cellfun(@isempty,A)) > 0) 
    printf("There are no matches\n"); 
    return 
endif 

% We are now gonna get all possible submatrices with the values in "search" 
C = cell(1, numel(A)); 
[C{:}] = ndgrid(A{:}); 

N = cell2mat(cellfun(@(v)v(:), C, 'UniformOutput',false)); 
N = unique(sort(N,2), 'rows'); 

printf("Found %i potential matches (but contains duplicates)\n", size(N,1)); 

% We are now further filtering (remove duplicates) 
[f,g]=mode(N,2); 
h=g==1; 
N=N(h,:); 

printf("Found %i potential matches\n", size(N,1)); 

M=zeros(size(N),size(search,2)); 
for i = 1:size(N) 
    f=N(i,:); 
    M(i,:)=squareform(Y(f,f))'; 
endfor 

F=squareform(S)'; 

% For now we forget about wrong permutations, so for search > 3 you need to filter these out! 
M = sort(M,2); 
F = sort(F,2); 

% Get the sorted search string out of the (large) table M 
% We search for the ones that "close" the circuit 
D=ismember(M,F,'rows'); 
mf=find(D); 

if (mf) 
    matches=size(mf,1); 
    printf("Found %i matches\n", matches); 
    for i = 1:matches 
     r=mf(i); 
     printf("We return match %i (only check permutations now)\n", r); 
     t=N(r,:)'; 
     str=X(t,:); 
     check=squareform(pdist(str,'hamming')*len); 
     strings=char(str+64) 
     check 
    endfor 
else 
    printf("There are no matches\n"); 
endif 

它會生成字符串如:

 
ABAACCBCACABBABBAABA 
ABACCCBCACBABAABACBA 
CABBCBBBABCBBACAAACC 
+0

感謝安妮!此方法可行,但Octave在嘗試執行更多字符串(11個字符串)時會拋出「內存不足」錯誤。第一次使用Octave可能是我的無知,但具體使用:search = [0 0 3 3 5 5 5 3 8 8 3 3 5 5 5 3 8 8 3 2 2 2 3 8 8 5 5 5 6 8 11 0 0 3 9 8 0 3 9 8 3 9 8 8 5 8]與n = 10000(我已經使用較低的n,但返回「沒有匹配」)。 – ChrisUofR 2014-12-15 18:59:57

+0

@ChrisUofR。如果您希望使用您指定的漢明距離矩陣的100個字符串(或僅略大一些),則此算法很有用。你必須爲此運行該算法> 100次。它每次都不會找到一組字符串;你必須運行它幾次,直到它。利用11x11的矩陣,它將創建一個巨大的搜索空間。你將需要找到一個優化的算法。而且,當前隨機生成器很少遇到諸如0,2和3的距離。所以,你還需要提供一個不同於「randi」的產品。 – 2014-12-16 11:23:46

+0

此外,你必須小心從哪裏得到大矩陣。像'search = [10 0 0]'一樣小的矩陣是不可解的:'S_1'與'S_2'不同,與'S_3'相同。然而,'S_2'與'S_3'相同,這就導致矛盾。 – 2014-12-16 11:34:22