2017-02-07

Parsing fixed-width input using ANTLR4

I have input in a strange format along these lines:

ACOMAND          1.0       1.0
ACOMAND
ACOMAND          1.0
ACOMAND          1.0       1.0    1300.2                  .9       1.0
ACOMAND          1.0       1.0    1300.2                  .9
ACOMAND          OKK       1.0    1300.2                  .9       1.0       WOW
ACOMAND          1.0       1.0    1300.2

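Each line is a command in its own right, and missing or blank columns are implicitly zero. Basically, the first string is left-aligned and all the others are right-aligned to columns 20, 30, 40, ..., 80. The first column is always an ID. Every other column is either an ID or a float. An empty column (filled with spaces, or not present at all) is implicitly zero.

How should I parse this?

I was thinking of: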

grammar WeirdGrammar; 
comm: KEYWORD NEWLINE 
    | KEYWORD COLUMN NEWLINE 
    | KEYWORD COLUMN COLUMN NEWLINE 
    | KEYWORD COLUMN COLUMN COLUMN NEWLINE 
    | KEYWORD COLUMN COLUMN COLUMN COLUMN NEWLINE 
    | KEYWORD COLUMN COLUMN COLUMN COLUMN COLUMN NEWLINE 
    | KEYWORD COLUMN COLUMN COLUMN COLUMN COLUMN COLUMN NEWLINE 
    | KEYWORD COLUMN COLUMN COLUMN COLUMN COLUMN COLUMN COLUMN NEWLINE 
    ; 

KEYWORD: [A-Z] {getCharPositionInLine() == 1}? ([A-Z]|'-')* WS*? {getCharPositionInLine() == 10}? ; 
COLUMN: .+? {(getCharPositionInLine() % 10) == 0}? ; 
NEWLINE : '\r'? '\n' ; 
WS : [ \t] ; 

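The basic idea is to list every combination, from a lone KEYWORD up to a KEYWORD followed by seven COLUMNs. The width limit of 10 per field is enforced by non-greedily matching anything until the char position modulo 10 is zero. The keyword should start at the beginning of the line, hence the first predicate in that token rule, and it should not extend past column 10, hence the second predicate. At the moment, however, this does not work and instead returns: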

line 1:0 mismatched input 'ACOMAND   1' expecting KEYWORD 

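This still does not handle trailing whitespace, even in my naive approach, but I suppose it would not be a problem to require that there is no trailing whitespace.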


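1) Does it matter to check that ACOMAND starts in column 1 and that the other values are aligned at fixed positions? Otherwise, why not simply 'ID VALUE*'? 2) Please give all the grammar that is needed so that we can run it. WS is missing and I get 'implicit token definition'. – BernardK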


1) Yes, it matters, so I think I have to use predicates to ensure the correct alignment. 2) I have now added the missing WS rule; sorry for leaving it out. – rooms

Answer


1) Using ANTLR 4.6 with the given grammar and input, I get the following message:

line 3:0 no viable alternative at input 'ACOMAND 1.0 1.0\nACOMAND\nACOMAND ' 

When debugging a grammar, it is very useful to list the tokens seen by the lexer:

$ echo $CLASSPATH 
.:/usr/local/lib/antlr-4.6-complete.jar 
$ alias grun 
alias grun='java org.antlr.v4.gui.TestRig' 
$ grun Question question -tokens data.txt 
[@0,0:9='ACOMAND   ',<KEYWORD>,1:0]
[@1,10:19='       1.0',<COLUMN>,1:10]
[@2,20:29='       1.0',<COLUMN>,1:20]
[@3,30:30='\n',<COLUMN>,1:30] 
[@4,31:38='ACOMAND\n',<COLUMN>,2:0] 

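Before 4.6, this token was displayed as [@3,30:30='\n',<n>,1:30], and you had to look in the <grammar>.tokens file to find out which token has number n. Now the type is translated to its name, and you see immediately that the newline has been tokenized as COLUMN, not as the NEWLINE you expected. This is because the lexer tries the rules in the order in which they are listed (when two rules match the same text, the earlier one wins):

  1. Does '\n' match [A-Z]? No, so it is not a KEYWORD; on to the next rule.
  2. Does '\n' match .+? ? Yes, so it is a COLUMN, and there is no chance
     of reaching the NEWLINE rule.

So you need to put the COLUMN rule after the NEWLINE rule.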
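To make the ordering effect concrete, here is a toy model in Ruby. It is only a sketch of the selection principle (longest match wins, ties go to the rule listed first), not of ANTLR's actual implementation; the regexes are simplified stand-ins for the grammar's rules, the column predicates are left out, and pick_rule is a hypothetical helper name.

```ruby
# Toy model of lexer rule selection: the longest match wins, and on a tie
# the rule listed first wins. Predicates from the grammar are ignored here.
def pick_rule(input, rules)
  best_name = nil
  best_len = 0
  rules.each do |name, pattern|
    m = pattern.match(input)
    next if m.nil?
    # Strictly greater: on equal length the earlier rule keeps the win.
    if m[0].length > best_len
      best_name = name
      best_len = m[0].length
    end
  end
  best_name
end

KEYWORD_RULE = [:KEYWORD, /\A[A-Z][A-Z-]*/].freeze
COLUMN_RULE  = [:COLUMN,  /\A./m].freeze      # with /m, '.' also matches "\n"
NEWLINE_RULE = [:NEWLINE, /\A\r?\n/].freeze

ORIGINAL_ORDER = [KEYWORD_RULE, COLUMN_RULE, NEWLINE_RULE].freeze
FIXED_ORDER    = [KEYWORD_RULE, NEWLINE_RULE, COLUMN_RULE].freeze

p pick_rule("\n", ORIGINAL_ORDER) # COLUMN shadows NEWLINE
p pick_rule("\n", FIXED_ORDER)    # NEWLINE is tried before COLUMN
```

With the original order the newline is claimed by the catch-all rule; moving that rule to the end lets the more specific one win.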

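You can also see that the second line of the input has been tokenized as [@4,31:38='ACOMAND\n',<COLUMN>,2:0], because it cannot be matched by

KEYWORD: [A-Z] ... WS*? 

since that rule expects whitespace and only a NL is found. So replace WS*? with (WS* | NEWLINE).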

Finally, simplify the redundant parser rule:

grammar Question; 

question 
    : KEYWORD COLUMN* NEWLINE 
    ; 

KEYWORD : [A-Z] {getCharPositionInLine() == 1}? ([A-Z]|'-')* (WS* | NEWLINE) {getCharPositionInLine() <= 10}? ; 
NEWLINE : '\r'? '\n' ; 
WS : [ \t] ; 
COLUMN: .+? {(getCharPositionInLine() % 10) == 0}? ; 

Now the lexer gives:

[@0,0:9='ACOMAND   ',<KEYWORD>,1:0]
[@1,10:19='       1.0',<COLUMN>,1:10]
[@2,20:29='       1.0',<COLUMN>,1:20]
[@3,30:30='\n',<NEWLINE>,1:30] 
[@4,31:38='ACOMAND\n',<KEYWORD>,2:0] 

2) But is all this really useful? Is a parser generator the right tool? Remove a single space and see what happens:

line 2:0 extraneous input 'ACOMAND\n' expecting {NEWLINE, COLUMN} 

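I think you should let the lexer do a simple job without these position constraints: create tokens for the non-whitespace data and discard the whitespace. Later, in the parser or in a listener, you can check the positions: each token has attributes such as start, stop, line, and so on.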
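As a rough illustration of that kind of after-the-fact position check, here is a small sketch. It is written in Ruby rather than as an ANTLR listener, and the names check_alignment, FIELD_WIDTH and LINE_WIDTH are made up for the example. It tokenizes on whitespace first and only then looks at where each token starts and stops.

```ruby
FIELD_WIDTH = 10
LINE_WIDTH  = 80

# Returns [text, start_col, stop_col, ok] for each run of non-blank characters.
# Columns are counted from 1. The first token must start in column 1 and fit
# in the first field; every other token must end on a multiple of 10.
def check_alignment(line)
  tokens = []
  line.chomp.scan(/\S+/) do |text|
    m = Regexp.last_match       # match data for the current token
    start_col = m.begin(0) + 1
    stop_col  = m.end(0)        # exclusive 0-based end == 1-based stop column
    ok = if tokens.empty?
           start_col == 1 && stop_col <= FIELD_WIDTH
         else
           stop_col > FIELD_WIDTH &&
             stop_col <= LINE_WIDTH &&
             (stop_col % FIELD_WIDTH).zero? &&
             text.length <= FIELD_WIDTH
         end
    tokens << [text, start_col, stop_col, ok]
  end
  tokens
end

aligned    = "ACOMAND".ljust(10) + "1.0".rjust(10) + "1.0".rjust(10)
misaligned = "ACOMAND".ljust(10) + "1.0".rjust(10) + "1.0".rjust(9)

p check_alignment(aligned)
p check_alignment(misaligned)
```

The tokenizing step knows nothing about columns; all the layout rules live in one place and are easy to change or to report on.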

Why not a Ruby script? :-)

# Split 80-column lines into tokens 10 columns wide, and associate each token
# with its stop position in the line (counting from 1) and an OK/WRONG flag
# telling whether it is aligned correctly.

tokens = []

IO.readlines("data.txt").each_with_index do |line, i|
  if i == 0 # the first line is the column ruler: echo it and skip it
    puts "   #{line}"
    next
  end

  line_tokens = []
  line = line.chomp # remove NL
  print "line #{i + 1} : "
  8.times do |n| # n = 0 to 7
    a = n * 10     # begin of split range, counting from 0
    b = n * 10 + 9 # end of range
    token = line.slice(a..b)
    next if token.nil? || token.length == 0 # nil if edge case
    print token
    good_position = 'OK'
    position = b + 1

    case n
    when 0 # first token must be at column 1
      good_position = 'WRONG' if token[0] == ' '
    else # other tokens must be right-aligned in their 10-column field
      if token[-1] == ' ' && token != ' ' * 10 # trailing blank, but not an empty field
        good_position = 'WRONG'
        unless (pos = token.rindex(' ')).nil?
          position = position - 10 + pos - 1
        end
      end
      if token.length != 10 # last in line
        good_position = 'WRONG'
        position = position - 10 + token.length
      end
    end

    line_tokens << [token.strip, position, good_position]
    break if b > line.length
  end
  puts # print a NL because print doesn't do it
  tokens << line_tokens
end

puts
puts "Lists of tokens : "
p tokens

Input data.txt:

....+....1....+....2....+....3....+....4....+....5....+....6....+....7....+....8 
ACOMAND   1.0  1.0 
ACOMAND 
ACOMAND   1.0 
ACOMAND   1.0  1.0 1300.2    .9  1.0 
ACOMAND   1.0  1.0 1300.2     .9 
ACOMAND   OKK  1.0 1300.2     .9  1.0  WOW 
ACOMAND   1.0  1.0 1300.2 

Output:

$ ruby -w split.rb 
     ....+....1....+....2....+....3....+....4....+....5....+....6....+....7....+....8 
line 2 : ACOMAND   1.0  1.0 
line 3 : ACOMAND 
line 4 : ACOMAND   1.0 
line 5 : ACOMAND   1.0  1.0 1300.2    .9  1.0 
line 6 : ACOMAND   1.0  1.0 1300.2     .9 
line 7 : ACOMAND   OKK  1.0 1300.2     .9  1.0  WOW 
line 8 : ACOMAND   1.0  1.0 1300.2 

Lists of tokens : 
[[["ACOMAND", 10, "OK"], ["1.0", 20, "OK"], ["1.0", 29, "WRONG"]], 
[["ACOMAND", 10, "OK"]], [["ACOMAND", 10, "OK"], ["1.0", 20, "OK"]], 
[["ACOMAND", 10, "OK"], ["1.0", 20, "OK"], ["1.0", 30, "OK"], ["1300.2", 
40, "OK"], ["", 50, "OK"], [".9", 58, "WRONG"], ["1.0", 68, "WRONG"]], 
[["ACOMAND", 10, "OK"], ["1.0", 20, "OK"], ["1.0", 30, "OK"], ["1300.2", 
40, "OK"], ["", 50, "OK"], [".9", 60, "OK"]], [["ACOMAND", 10, "OK"], 
["OKK", 20, "OK"], ["1.0", 30, "OK"], ["1300.2", 40, "OK"], ["", 50, 
"OK"], [".9", 60, "OK"], ["1.0", 70, "OK"], ["WOW", 80, "OK"]], 
[["ACOMAND", 10, "OK"], ["1.0", 20, "OK"], ["1.0", 30, "OK"], ["1300.2", 
40, "OK"]]]
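
The script above only checks alignment. If the goal is also to apply the "blank or missing column means zero" rule from the question, one possible follow-up step is sketched below. The names parse_record, COLUMN_WIDTH and COLUMN_COUNT are hypothetical, and the float pattern is an assumption that covers plain decimals such as 1.0, 1300.2 and .9 but not exponent notation.

```ruby
COLUMN_WIDTH = 10
COLUMN_COUNT = 8

# Plain decimal numbers only (assumption): optional sign, digits, optional
# fraction, or a leading-dot fraction such as ".9".
FLOAT_PATTERN = /\A[+-]?(\d+\.?\d*|\.\d+)\z/

# Splits one line into an ID plus seven values. Blank or missing fields
# become 0.0, numeric fields become Floats, anything else stays a String.
def parse_record(line)
  # Pad short lines so that every field slice exists.
  padded = line.chomp.ljust(COLUMN_WIDTH * COLUMN_COUNT)
  fields = Array.new(COLUMN_COUNT) do |n|
    padded[n * COLUMN_WIDTH, COLUMN_WIDTH].strip
  end
  id, *rest = fields
  values = rest.map do |field|
    if field.empty?
      0.0
    elsif field.match?(FLOAT_PATTERN)
      field.to_f
    else
      field
    end
  end
  [id, values]
end

sample = "ACOMAND".ljust(10) + "OKK".rjust(10) + "1.0".rjust(10) +
         "1300.2".rjust(10) + " " * 10 + ".9".rjust(10)
p parse_record(sample)
p parse_record("ACOMAND\n")
```

Every record then has the same shape (an ID and seven values), whatever the length of the input line was.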