How to compute the similarity between two sentences (syntactic and semantic)

I am supposed to take two sentences at a time and compute whether they are similar. By similar I mean both syntactically and semantically.

INPUT1: Obama signs the law. A new law is signed by Obama.

INPUT2: A Bus is stopped here. A vehicle stops here.

INPUT3: Fire in NY. NY is burnt down.

INPUT4: Fire in NY. 50 died in NY fire.

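I don't want to rely solely on an ontology tree. I wrote code that computes the Levenshtein distance (LD) between the sentences and then decides whether the second sentence:

can be ignored (INPUT1 and 2),
should replace the first sentence (INPUT3), or
should be stored along with the first sentence (INPUT4).

I'm not happy with the code because LD only captures the syntactic level (what other methods are there?). How can semantics be incorporated (for example, that a bus is a kind of vehicle)?

The code is here: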

%# Once the distance is computed, decide whether the new event (string 2) 
%# is ignored, replaces the existing event (string 1), or is stored 
%# separately. The higher the LD, the more the two strings differ: a low 
%# distance indicates identical or similar events, while a high distance 
%# marks the new event as a fresh one. 
%# Assumes str1 and str2 are char row vectors already in the workspace. 

%#......................................................................... 
%# Calculating the LD between two strings of events. 
%#......................................................................... 
L1=length(str1)+1; 
L2=length(str2)+1; 
L=zeros(L1,L2); %# Preallocate the (L1 x L2) edit-distance table. 

g=+1;    %# cost of an insertion or deletion (gap) 
m=+0;    %# a match costs nothing; we seek to minimize 
d=+1;    %# a mismatch (substitution) is more costly. 

%# Boundary conditions: aligning a prefix with the empty string costs one gap per character. 
L(:,1)=([0:L1-1]*g)'; 
L(1,:)=[0:L2-1]*g; 

%# Calculating required edits. 
for idx=2:L1 
    for idy=2:L2 
     if(str1(idx-1)==str2(idy-1)) 
      score=m; 
     else 
      score=d; 
     end 
     m1=L(idx-1,idy-1) + score; 
     m2=L(idx-1,idy) + g; 
     m3=L(idx,idy-1) + g; 
     L(idx,idy)=min(m1,min(m2,m3)); %# keep the cheapest of the three edits. 
    end 
end 
%# The LD between two strings. 
D=L(L1,L2); 

%#.................................................................... 
%# Making decision on what to do with the new event (string 2). 
%#................................................................... 
if (D<=4)  %# Distance is so small that string 2 is treated as identical to string 1. 
    store=str1;  %# Hence string 2 is ignored; string 1 remains stored. 
elseif (D>=5 && D<=15) %# Too different to be identical, but not enough to 
    %# make string 2 a separate event. 
    store=str2;  %# String 2 is somewhat similar to string 1, 
         %# so string 1 is replaced with string 2. 
else 
    %# For all other distances, string 2 is stored along with string 1. 
    store={str1; str2}; 
end
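
As one other purely surface-level measure besides LD, word overlap (Jaccard similarity) could be sketched roughly as below. The names `tokenize` and `jaccard` and the tokenization rule (lowercase, keep runs of letters and digits) are assumptions made for illustration; they are not part of the code above.

```matlab
%# Word-level Jaccard similarity: (shared words) / (all distinct words).
%# Assumed tokenizer: lowercase the sentence, keep runs of letters/digits.
tokenize = @(s) unique(regexp(lower(s), '[a-z0-9]+', 'match'));
jaccard  = @(a, b) numel(intersect(a, b)) / max(1, numel(union(a, b)));

s1 = 'Fire in NY.';
s2 = '50 died in NY fire.';

%# J is 1 when both sentences use the same set of words, 0 when they share none.
J = jaccard(tokenize(s1), tokenize(s2));
```

For this pair, three of the five distinct words are shared, so `J` is 0.6. Like LD, this only looks at surface form: it ignores word order and knows nothing about meaning.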

Any help is appreciated.

Source

2010-09-07 Tinglin

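"Semantically". There is no simple textbook algorithm. Natural language (English in particular) is a very complex and fickle beast. – 2010-09-07 22:16:49

@Amro: Is the '#' there to grey them out, the way comments appear here on SO? – Lazer 2010-09-14 08:41:33

@Lazer: Yes, it's easier on the eyes.. I wish StackOverflow would introduce a way to enclose code blocks in something like: '...', so that they are highlighted properly for that specific language – Amro 2010-09-14 15:54:46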

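"Semantically". There is no simple textbook algorithm. Natural language (English in particular) is a very complex and fickle beast. Let's look at (just a small part of) the cases provided: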

INPUT1: Obama signs the law. A new law is signed by Obama.

Need to know that signing a law is what makes it a "new" law.

INPUT2: A Bus is stopped here. A vehicle stops here.

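Need to know that a bus is a type of vehicle, as well as some kind of temporal relationship. Also, what if the bus did stop, but does not normally stop here or is no longer stopped? It could be taken several ways.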

INPUT3: Fire in NY. NY is burnt down.

Need to know that fire can burn things down.

INPUT4: Fire in NY. 50 died in NY fire.

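Need to know that fire can kill things (see below). Need to associate the "news headline" figure (the 50) with people. The brain can do this trivially. A computer program is not a brain.

And I'm not an English major :-)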

Source

2010-09-07 22:21:26

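Very true. I have found some work based on WordNet that tries to relate a word to several other words (for example, a lock is part of a door, a bus is a vehicle, fire can burn, etc.). But I'm not sure how to implement it in my code. – Tinglin 2010-09-08 03:46:28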
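
One way such word relations could be wired into the distance code is sketched below: each word is replaced by a more general term before the strings are compared. This is only a minimal sketch. The table is a tiny hand-written stand-in, not WordNet data, and the names `isA`, `words` and `normalized` are hypothetical.

```matlab
%# Toy "is-a" table (hypothetical, hand-written entries for illustration;
%# a real system would fill this from a lexical resource such as WordNet).
isA = containers.Map({'bus', 'car', 'truck'}, {'vehicle', 'vehicle', 'vehicle'});

%# Lowercase the sentence and split it into words.
words = regexp(lower('A Bus is stopped here.'), '[a-z0-9]+', 'match');

%# Replace every word that has an entry with its more general term.
for k = 1:numel(words)
    if isKey(isA, words{k})
        words{k} = isA(words{k});
    end
end

%# Rebuild the sentence; here it becomes 'a vehicle is stopped here'.
normalized = strjoin(words, ' ');
```

The normalized strings could then be passed to the LD code as str1 and str2, so that "bus" and "vehicle" no longer count as different words. This covers only the is-a relation; tense, negation and world knowledge ("fire can kill") are not addressed.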
