使用多態對正則表達式解析器建模

因此，我爲學校創建了一個正則表達式解析器，該解析器創建負責匹配的對象的層次結構。我決定做到面向對象，因爲我更容易想象這種語法的實現。所以，這些是我組成正則表達式的類。這些都是用Java編寫的，但是我認爲如果你精通任何面向對象的語言，你可以繼續學習。我們需要實現的唯一操作符是Union（+），Kleene-Star（*），表達式的連接（ab或者可能是（a + b）c），當然括括號中的例子。這就是我現在實施的方式，而且我的工作方式就像在主體中有一點額外開銷的魅力一樣。使用多態對正則表達式解析器建模

父類，Regexp.java

public abstract class Regexp { 

    //Print out the regular expression it's holding 
    //Used for debugging purposes 
    abstract public void print(); 

    //Checks if the string matches the expression it's holding 
    abstract public Boolean match(String text); 

    //Adds a regular expression to be operated upon by the operators 
    abstract public void add(Regexp regexp); 

    /* 
    *To help the main with the overhead to help it decide which regexp will 
    *hold the other 
    */ 
    abstract public Boolean isEmpty(); 

}

有最簡單的正則表達式，Base.java，持有一個字符，如果字符串的字符匹配返回true。

public class Base extends Regexp{ 
    char c; 

    public Base(char c){ 
     this.c = c; 
    } 

    public Base(){ 
     c = null; 
    } 

    @Override 
    public void print() { 
     System.out.println(c); 
    } 

    //If the string is the char, return true 
    @Override 
    public Boolean match(String text) { 
     if(text.length() > 1) return false; 
     return text.startsWith(""+c); 
    } 

    //Not utilized, since base is only contained and cannot contain 
    @Override 
    public void add(Regexp regexp) { 

    } 

    @Override 
    public Boolean isEmpty() { 
     return c == null; 
    } 

}

一個圓括號Paren.java，在裏面放置一個正則表達式。這裏沒有什麼特別的想法，但說明了匹配的工作原理。

public class Paren extends Regexp{ 
    //Member variables: What it's holding and if it's holding something 
    private Regexp regexp; 
    Boolean empty; 

    //Parenthesis starts out empty 
    public Paren(){ 
     empty = true; 
    } 

    //Unless you create it with something to hold 
    public Paren(Regexp regexp){ 
     this.regexp = regexp; 
     empty = false; 
    } 

    //Print out what it's holding 
    @Override 
    public void print() { 
     regexp.print(); 
    } 

    //Real simple; either what you're holding matches the string or it doesn't 
    @Override 
    public Boolean match(String text) { 
     return regexp.match(text); 
    } 

    //Pass something for it to hold, then it's not empty 
    @Override 
    public void add(Regexp regexp) { 
     this.regexp = regexp; 
     empty = false; 
    } 

    //Return if it's holding something 
    @Override 
    public Boolean isEmpty() { 
     return empty; 
    } 

}

Union.java，它是兩個可匹配的正則表達式。如果其中一個匹配，整個聯盟就是一場比賽。

public class Union extends Regexp{ 
    //Members 
    Regexp lhs; 
    Regexp rhs; 

    //Indicating if there's room to push more stuff in 
    private Boolean lhsEmpty; 
    private Boolean rhsEmpty; 

    public Union(){ 
     lhsEmpty = true; 
     rhsEmpty = true; 
    } 

    //Can start out with something on the left side 
    public Union(Regexp lhs){ 
     this.lhs = lhs; 

     lhsEmpty = false; 
     rhsEmpty = true; 
    } 

    //Or with both members set 
    public Union(Regexp lhs, Regexp rhs) { 
     this.lhs = lhs; 
     this.rhs = rhs; 

     lhsEmpty = false; 
     rhsEmpty = false; 
    } 

    //Some stuff to help me see the unions format when I'm debugging 
    @Override 
    public void print() { 
     System.out.println("("); 
     lhs.print(); 
     System.out.println("union"); 
     rhs.print(); 
     System.out.println(")"); 

    } 

    //If the string matches the left side or right side, it's a match 
    @Override 
    public Boolean match(String text) { 
     if(lhs.match(text) || rhs.match(text)) return true; 
     return false; 
    } 

    /* 
    *If the left side is not set, add the member there first 
    *If not, and right side is empty, add the member there 
    *If they're both full, merge it with the right side 
    *(This is a consequence of left-to-right parsing) 
    */ 
    @Override 
    public void add(Regexp regexp) { 
     if(lhsEmpty){ 
     lhs = regexp; 

     lhsEmpty = false; 
     }else if(rhsEmpty){ 
      rhs = regexp; 

      rhsEmpty = false; 
     }else{ 
      rhs.add(regexp); 
     } 
    } 

    //If it's not full, it's empty 
    @Override 
    public Boolean isEmpty() { 
     return (lhsEmpty || rhsEmpty); 
    } 
}

甲級聯，Concat.java，這基本上是鏈接在一起正則表達式的清單。這個很複雜。

public class Concat extends Regexp{ 
    /* 
    *The list of regexps is called product and the 
    *regexps inside called factors 
    */ 
    List<Regexp> product; 

    public Concat(){ 
     product = new ArrayList<Regexp>(); 
    } 

    public Concat(Regexp regexp){ 
     product = new ArrayList<Regexp>(); 
     pushRegexp(regexp); 
    } 

    public Concat(List<Regexp> product) { 
     this.product = product; 
    } 

    //Adding a new regexp pushes it into the list 
    public void pushRegexp(Regexp regexp){ 
     product.add(regexp); 
    } 
    //Loops over and prints them 
    @Override 
    public void print() { 
     for(Regexp factor: product){ 
      factor.print(); 
     } 
    } 

    /* 
    *Builds up a substring approaching the input string. 
    *When it matches, it builds another substring from where it 
    *stopped. If the entire string has been pushed, it checks if 
    *there's an equal amount of matches and factors. 
    */ 
    @Override 
    public Boolean match(String text) { 
     ArrayList<Boolean> bools = new ArrayList<Boolean>(); 

     int start = 0; 
     ListIterator<Regexp> itr = product.listIterator(); 

     Regexp factor = itr.next(); 

     for(int i = 0; i <= text.length(); i++){ 
      String test = text.substring(start, i); 

      if(factor.match(test)){ 
        start = i; 
        bools.add(true); 
        if(itr.hasNext()) 
         factor = itr.next(); 
      } 
     } 

     return (allTrue(bools) && (start == text.length())); 
    } 

    private Boolean allTrue(List<Boolean> bools){ 
     return product.size() == bools.size(); 
    } 

    @Override 
    public void add(Regexp regexp) { 
     pushRegexp(regexp); 
    } 

    @Override 
    public Boolean isEmpty() { 
     return product.isEmpty(); 
    } 
}

再一次，我已經得到了這些工作讓我滿意我的開銷，標記化和所有那些好東西。現在我想介紹一下Kleene-star的操作。它匹配文本中出現的任何數字，甚至是0。所以，ba *會匹配b，ba，baa，baaa等，而（ba）*會匹配ba，baba，bababa等。它甚至看起來可能擴展我的正則表達式到這個，或者你看到解決這個問題的另一種方式？ PS：有一些getters，setter和其他所有我沒有寫出來的支持函數，但這主要是爲了讓你快速瞭解這些類如何工作。

來源

2015-02-08 pimmen

您似乎試圖使用回退算法來執行解析。這可以起作用 - 儘管用高階函數做起來更容易 - 但它不是解析正則表達式的最佳方式（我的意思是指數學上正則表達式的東西，而不是解析全部的東西由各種語言的「正則表達式」庫實現的語言）。

這不是最好的方法，因爲解析時間與要匹配的字符串的大小不是線性的;實際上，它可以是指數型的。但要理解這一點，理解爲什麼當前的實現存在問題很重要。

考慮相當簡單的正則表達式(ab+a)(bb+a)。這可以完全匹配四個字符串：abbb,aba,abb,aa。所有這些字符串都以a開頭，所以您的級聯算法將匹配位置1上的第一個concatenand和（(ab+a)），然後繼續嘗試第二個concatenand（bb+a）。這將成功匹配abb和aa，但它將在aba和abbb上失敗。

現在，假設您修改了連接函數以選擇最長的匹配子串而不是最短的子串。在這種情況下，第一個子表達式將在三個可能的字符串（除aa之外）中匹配ab，並且匹配在abb的情況下將失敗。

總之，當你匹配的串聯R·S，你需要做的是這樣的：

找到相匹配R
看看S文本的其他部分相匹配的一些初始的字符串
如果不是，則用另一個匹配的起始字符串重複R

在這種情況下的完整正則表達式匹配，我們列出的順序與R匹配的順序並不重要，但通常我們試圖找到與正則表達式匹配的最長的子字符串，因此枚舉可能的匹配從最長到最短是很方便的。

這樣做意味着我們需要能夠在發生下游故障後重新啓動匹配，以找到「下一個匹配」。這並不複雜，但它確實使接口複雜化了，因爲所有複合正則表達式操作符都需要將失敗「傳遞」給子女，以便找到下一個選擇。也就是說，運營商R+S可能會首先找到與R匹配的內容。如果詢問是否有下一個可能性，則首先必須詢問R，如果有其他字符串可以匹配，則在轉到S之前。（這是關於如何讓+按長度順序列出匹配的問題。）

通過這樣的實現，很容易看到如何實現Kleene星（R^*），它也很容易看看它爲什麼會花費指數的時間。一種可能的實現：

首先，匹配儘可能多的R越好。
如果要求另一場比賽：問最後一個R另一場比賽
如果沒有更多的可能性，從列表中刪除最後R，請問現在最後R的是另一場比賽
如果沒有那工作，提出了空字符串作爲匹配
失敗

（這可以通過遞歸被簡化：匹配一個R，再搭配一個R^*對於下一場比賽，第一次嘗試下R^*;未能嘗試下一個R和第一個以下R^*;當所有其他都失敗時，嘗試空字符串。）

實現這是一個有趣的編程練習，所以我鼓勵你繼續。但請注意，有更好的算法。您可能想要閱讀Russ Cox's interesting essays on regular expression matching。

來源

2015-02-08 17:51:36 rici

讀者：Russ Cox的散文很漂亮。 – 2015-02-08 18:44:28

謝謝！這非常有幫助！ – pimmen 2015-02-08 19:30:45

我已經通過使用你的方法工作，但它非常緩慢並且隨着每一層呈指數級增長（例如，a（b（a + b））有兩層括號），並且我已將它交給並通過了它，但如果這是一個分級任務，我不會使用這個非常慢的算法。五分鐘需要一分鐘，七分鐘需要半小時。我會給這篇文章一看，也許會實現一個更好的。 – pimmen 2015-02-10 13:34:04

使用多態對正則表達式解析器建模

回答

相關問題