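Most efficient way to parse this scripting language

I'm implementing an interpreter for the scripting language of an obsolete text editor, and I'm having some trouble getting a lexer to work properly.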

Here is an example of the problematic part of the language:

T 
L /LOCATE ME/ 
C /LOCATE ME/CHANGED ME/ * * 
C ;CHANGED ME;CHANGED ME AGAIN; 1 *

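The / characters seem to quote strings, and they also act as delimiters in the sed-style syntax of the C (CHANGE) command, although that command accepts any character as the delimiter.

I've implemented probably about half of the most common commands, and until now I've just used parse_tokens(line.split()). It was quick and dirty, but it worked surprisingly well.

To avoid writing my own lexer, I tried shlex.

It works pretty well, except for the CHANGE cases: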

import shlex 

def shlex_test(cmd_str): 
    lex = shlex.shlex(cmd_str) 
    lex.quotes = '/' 
    return list(lex) 

print(shlex_test('L /spaced string/')) 
# OK! gives: ['L', '/spaced string/'] 

print(shlex_test('C /spaced string/another string/ * *')) 
# gives : ['C', '/spaced string/', 'another', 'string/', '*', '*'] 
# desired : any format that doesn't split on a space between /'s 

print(shlex_test('C ;a b;b a;')) 
# gives : ['C', ';', 'a', 'b', ';', 'b', 'a', ';'] 
# desired : same format as CHANGE command above

Does anyone know an easy way to do this (with shlex or otherwise)?

EDIT:

If it helps, here is the CHANGE command syntax as given in the help file:

''' 
C [/stg1/stg2/ [n|n m]] 

    The CHANGE command replaces the m-th occurrence of "stg1" with "stg2" 
for the next n lines. The default value for m and n is 1.'''

The X and Y commands are just as difficult to tokenize:

''' 
X [/command/[command/[...]]n] 
Y [/command/[command/[...]]n] 

    The X and Y commands allow the execution of several commands contained 
in one command. To define an X or Y "command string", enter X (or Y) 
followed by a space, then individual commands, each separated by a 
delimiter (e.g. a period "."). An unlimited number of commands may be 
placed in the X or Y command string. Once the command string has been 
defined, entering X (or Y) followed optionally by a count n will execute 
the defined command string n times. If n is not specified, it will 
default to 1.'''
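
One way the X/Y syntax quoted above might be tokenized is sketched below. It is only an illustration under stated assumptions: the function name `parse_command_string` is made up for this example, the delimiter is assumed to be the first non-blank character after the command letter, and a bare number after the letter is assumed to be the repeat count n. The help text does not say how a sub-command that itself contains the delimiter should be handled, so this sketch does not handle that case.

```python
def parse_command_string(cmd):
    """Return (name, commands, count) for an X or Y command line.

    Hypothetical helper. An empty `commands` list means "execute the
    previously defined command string `count` times".
    """
    name, _, rest = cmd.strip().partition(" ")
    rest = rest.lstrip()
    if not rest:
        return name, [], 1            # bare "X": run once (default n = 1)
    if rest.isdigit():
        return name, [], int(rest)    # "X 3": run the stored string 3 times
    delim = rest[0]                   # assumption: first character is the delimiter
    *commands, tail = rest[1:].split(delim)
    # Whatever follows the last delimiter is treated as the optional count.
    count = int(tail) if tail.strip().isdigit() else 1
    return name, commands, count


print(parse_command_string("X .T.L /foo/.C /foo/bar/."))
print(parse_command_string("Y 5"))
```

Choosing a delimiter that does not occur in any sub-command (a period in the first call above) keeps the inner /.../ strings intact, so each sub-command can be passed on to the ordinary command parser.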

2012-07-19 Robbie Rosati

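Do you have access to the language definition? If so, a quotation of the relevant parts would probably be useful to all of us. – Marcin 2012-07-19 17:03:28

@Marcin I added some relevant information from the help file, which is all the documentation I have. – 2012-07-19 17:24:28

I don't know 'shlex', but I think 'regex' [(re)](http://docs.python.org/library/re.html) would also be useful. – machaku 2012-07-19 17:29:32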

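The problem is probably that / does not stand for a quote character but is only used as a separator. I'm guessing that the third character is always the one that defines the delimiter. Also, you don't need the / or ; in the output, do you?

I just did the following, using only split, for the L and C command cases: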

>>> def parse(cmd): 
...  delim = cmd[2] 
...  return cmd.split(delim) 
... 
>>> c_cmd = "C /LOCATE ME/CHANGED ME/ * *" 
>>> parse(c_cmd) 
['C ', 'LOCATE ME', 'CHANGED ME', ' * *'] 

>>> c_cmd2 = "C ;a b;b a;" 
>>> parse(c_cmd2) 
['C ', 'a b', 'b a', ''] 

>>> l_cmd = "L /spaced string/" 
>>> parse(l_cmd) 
['L ', 'spaced string', '']

For the optional " * *" part, you could use split(" ") on the last list element.

>>> parse(c_cmd)[-1].split(" ") 
['', '*', '*']
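
The same idea can be sketched without hard-coding index 2. This is only a sketch under assumptions: the name `parse_delimited` is invented for this example, the command name is assumed to end at the first space, the delimiter is assumed to be the first non-blank character after it, and the closing delimiter is assumed to be present.

```python
def parse_delimited(cmd):
    """Return (name, strings, args) for commands such as L and C.

    Hypothetical helper: the delimiter is taken to be the first
    non-blank character after the command name, not a fixed index.
    """
    name, _, rest = cmd.strip().partition(" ")
    rest = rest.lstrip()
    if not rest:
        return name, [], []           # e.g. "T": no arguments at all
    delim = rest[0]
    parts = rest[1:].split(delim)
    # Everything after the last delimiter is the optional argument tail;
    # str.split() with no argument drops the empty strings seen above.
    strings, tail = parts[:-1], parts[-1]
    return name, strings, tail.split()


print(parse_delimited("C /LOCATE ME/CHANGED ME/ * *"))
print(parse_delimited("C ;CHANGED ME;CHANGED ME AGAIN; 1 *"))
print(parse_delimited("L /spaced string/"))
```

Using split() without an argument on the tail avoids the leading '' element, and stripping the line first means extra spaces before the delimiter no longer shift its position.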

2012-07-19 21:19:08 sevenforce

Unfortunately, it's not *always* the third character, but I'll try this approach and report back, thanks. – 2012-07-20 13:04:56
