Java - Regex matching re-reads words

I am trying to write a lexical analyzer for Delphi in Java. Here is the sample code:

String[] keywords={"array","as","asm","begin","case","class","const","constructor","destructor","dispinterface","div","do","downto","else","end","except","exports","file","finalization","finally","for","function","goto","if","implementation","inherited","initialization","inline","interface","is","label","library","mod","nil","object","of","out","packed","procedure","program","property","raise","record","repeat","resourcestring","set","shl","shr","string","then","threadvar","to","try","type","unit","until","uses","var","while","with"}; 
String[] relation={"=","<>","<",">","<=",">="}; 
String[] logical={"and","not","or","xor"}; 
Matcher matcher = null; 
for(int i=0;i<keywords.length;i++){ 
    matcher=Pattern.compile(keywords[i]).matcher(line); 
    if(matcher.find()){ 
    System.out.println("Keyword"+"\t\t"+matcher.group()); 
    } 
} 
for(int i1=0;i1<logical.length;i1++){ 
    matcher=Pattern.compile(logical[i1]).matcher(line); 
    if(matcher.find()){ 
    System.out.println("logic_op"+"\t\t"+matcher.group()); 
    } 
}  
for(int i2=0;i2<relation.length;i2++){ 
    matcher=Pattern.compile(relation[i2]).matcher(line); 
    if(matcher.find()){ 
    System.out.println("relational_op"+"\t\t"+matcher.group()); 
    } 
}

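When I run the program it works, but it re-reads certain words and reports them as two tokens. For example, record is a keyword, but the same text is then read again and reported as the logical operator or, taken from the "or" inside rec"or"d. How can I stop the text from being re-read? Thanks!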


2017-10-10 quSci

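As the answer by EvanM says, you need to add a \b word-boundary matcher before and after each keyword to prevent it from matching as a substring inside a longer word.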

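For better performance, you should also use the | regex alternation operator to match one of several values instead of creating a separate matcher per keyword. That way you scan line only once and need only one compiled regex.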
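As a rough sketch of that idea, the alternation can be built from the keyword array with String.join rather than typed by hand. The class name KeywordAlternation, the helper findKeywords, and the shortened keyword list are assumptions made for illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordAlternation {
    // Shortened keyword list for illustration; the full list works the same way.
    static final String[] KEYWORDS = {"array", "as", "asm", "begin", "do", "downto", "end", "record"};

    // One pattern, compiled once: \b(array|as|asm|...)\b
    static final Pattern KEYWORD_PATTERN =
            Pattern.compile("\\b(" + String.join("|", KEYWORDS) + ")\\b");

    // Returns every keyword found in the line, in order of appearance.
    static List<String> findKeywords(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = KEYWORD_PATTERN.matcher(line);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findKeywords("begin asm record end"));
    }
}
```

Because the pattern is a static constant, it is compiled once instead of once per keyword per line, which is where most of the saving should come from.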

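You can even combine the 3 kinds of tokens you are looking for into a single regex and use capture groups to tell them apart, so the whole of line is scanned only once for all token types.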

Like this:

String regex = "\\b(array|as|asm|begin|case|class|const|constructor|destructor|dispinterface|div|do|downto|else|end|except|exports|file|finalization|finally|for|function|goto|if|implementation|inherited|initialization|inline|interface|is|label|library|mod|nil|object|of|out|packed|procedure|program|property|raise|record|repeat|resourcestring|set|shl|shr|string|then|threadvar|to|try|type|unit|until|uses|var|while|with)\\b" + 
       "|(=|<[>=]?|>=?)" +          // group 2: relational operators
       "|\\b(and|not|or|xor)\\b";   // group 3: logical operators
for (Matcher m = Pattern.compile(regex).matcher(line); m.find();) { 
    // start(n) is -1 when group n did not take part in the match
    if (m.start(1) != -1) { 
        System.out.println("Keyword\t\t" + m.group(1)); 
    } else if (m.start(2) != -1) { 
        System.out.println("relational_op\t\t" + m.group(2)); 
    } else { 
        System.out.println("logic_op\t\t" + m.group(3)); 
    } 
}

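You can optimize it even further by merging common prefixes of keywords, e.g. as|asm can become asm?, i.e. as optionally followed by m. That makes the keyword list less readable, but it should perform better.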
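A small sketch of the prefix merging, assuming only four of the keywords for brevity. The class name PrefixFolding and the helper sameResult are made up for this example; it only checks that the merged form accepts the same words as the plain alternation and makes no claim about how much faster it is.

```java
import java.util.regex.Pattern;

class PrefixFolding {
    // Plain alternation and the prefix-merged form of the same four keywords.
    static final Pattern PLAIN  = Pattern.compile("\\b(as|asm|do|downto)\\b");
    static final Pattern FOLDED = Pattern.compile("\\b(asm?|do(?:wnto)?)\\b");

    // True when both patterns agree on whether the whole word is a keyword.
    static boolean sameResult(String word) {
        return PLAIN.matcher(word).matches() == FOLDED.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"as", "asm", "do", "downto", "down", "ask"}) {
            System.out.println(w + " -> " + FOLDED.matcher(w).matches());
        }
    }
}
```

The non-capturing group (?:wnto)? keeps the group numbering unchanged, which matters when other capture groups follow in the combined regex.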

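In the code above, I did that for the relational operators, both to show how it is done and to fix a matching error in the original code, where a >= in line would be reported 3 times, as =, >, and >= in that order. That problem is the same kind of issue as the sub-keyword problem asked about in the question.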


2017-10-11 05:04:42 Andreas

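Thanks! I found that it was reading certain combined symbols just as you described, where '>=' would be split into 3 separate symbols. This helped me with that too. Thanks! – quSci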

Add \b to your regex to mark a word boundary. So:

Pattern.compile("\\b" + keywords[i] + "\\b")

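will ensure that the characters on either side of your word are not word characters (letters, digits, or underscore).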

That way the input "record" will match only the keyword "record", not the operator "or".
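To see the difference in isolation, here is a minimal sketch that searches for a token with and without the boundaries. The class name WordBoundaryDemo and the helper containsToken are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Searches line for token, optionally wrapped in \b word boundaries.
    static boolean containsToken(String token, String line, boolean useBoundary) {
        String regex = useBoundary ? "\\b" + token + "\\b" : token;
        return Pattern.compile(regex).matcher(line).find();
    }

    public static void main(String[] args) {
        // Without \b, "or" is found inside "record".
        System.out.println(containsToken("or", "record", false));
        // With \b, "or" is only found when it stands alone.
        System.out.println(containsToken("or", "record", true));
        System.out.println(containsToken("or", "a or b", true));
    }
}
```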


2017-10-10 03:50:09 EvanM

Thank you very much! It works! – quSci

Although keywords are unlikely to contain special characters, you should still escape them: 'Pattern.compile("\\b" + Pattern.quote(keywords[i]) + "\\b")' – Andreas
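The effect of Pattern.quote can be sketched with a token that contains a regex metacharacter. The token "a.b" is purely hypothetical (no Delphi keyword looks like this), as are the class name QuoteDemo and its helpers.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Token inserted into the regex as-is: "." keeps its meaning "any character".
    static boolean findUnquoted(String token, String line) {
        return Pattern.compile("\\b" + token + "\\b").matcher(line).find();
    }

    // Token escaped with Pattern.quote: every character is matched literally.
    static boolean findQuoted(String token, String line) {
        return Pattern.compile("\\b" + Pattern.quote(token) + "\\b").matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(findUnquoted("a.b", "axb")); // "." matches the x
        System.out.println(findQuoted("a.b", "axb"));   // literal "a.b" is not present
        System.out.println(findQuoted("a.b", "a.b"));
    }
}
```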
