2017-09-29 92 views
1

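Parsing data from structured documents with ambiguous labels

I am trying to move legal documents from ancient SGML files into a database. Using regular expressions in Java, I have had pretty good luck. However, I have run into a small problem: the labels for each part of the document are not standard from file to file. For example, the most common labels are: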

(<numeric>) 
    (<alpha>) 
     (<ROMAN>) 
      (<ALPHA>) 

Ex. (1)(a)(I)(A)

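However, other files have variations, and the () may be dropped entirely. My current algorithm has a hard-coded regex that matches each element at each level. What I need is a way to set the label type for each level dynamically as I walk through the document.

Has anyone run into a problem like this? Does anyone have any suggestions?

Thanks in advance.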

Edit:

Here are the regexes I use to parse out the different items:

Section: ^<tab>(<b>)?\d{1,4}(\.\d+)?-((\d{1,4}(\.\d+)?)(-|\.)?){3} 
SubSection: \.?\s*(<\/b>|<tab>|^)\s*\(\d+(\.\d+)?\)\s+($|<b>|[A-Z"]|\([a-z](.\d+)?\)\s*(\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s*(\([A-Z](.\d+)?\))?)?\s*.) 
Paragraph: (^|<tab>|\s+|\(\d+(\.\d+)?\)\s+)\([a-z](.\d+)?\)(\s+$|\s+<b>|\s+[A-Z"]|\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)(\([A-Z](.\d+)?\))?\s*[A-Z"]?) 
SubParagraph: (\)|<tab>|<\/b>)\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s+($|[A-Z"<]|\([A-Z](.\d+)?\)\s*[A-Z"]) 
SubSubParagraph: (<tab>|\)\s*)\([A-Z](.\d+)?\)\s+([A-Z"]|$) 

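And here is some sample text. I misspoke earlier: although the ultimate source of the data is SGML, what I am parsing is slightly different. Apart from having style tags, it is more or less plain text.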

<tab><b>SECTION 5.</b> In Colorado Revised Statutes, 13-5-142, <b>amend</b> (1) 
introductory portion, (1)(b), and (3)(b)(II) as follows: 

<tab><b>13-5-142. National instant criminal background check system - reporting.</b> 
(1) On and after March 20, 2013, the state court administrator shall send electronically 
the following information to the Colorado bureau of investigation created pursuant to 
section 24-33.5-401, referred to in this section as the "bureau": 

<tab>(b) The name of each person who has been committed by order of the court to the 
custody of the office of behavioral health in the department of human services pursuant 
to section 27-81-112 or 27-82-108; and 

<tab>(3) The state court administrator shall take all necessary steps to cancel a record 
made by the state court administrator in the national instant criminal background check 
system if: 

<tab>(b) No less than three years before the date of the written request: 

<tab>(II) The period of commitment of the most recent order of commitment or 
recommitment expired, or a court entered an order terminating the person's incapacity or 
discharging the person from commitment in the nature of habeas corpus, if the record in 
the national instant criminal background check system is based on an order of 
commitment to the custody of the office of behavioral health in the department of human 
services; except that the state court administrator shall not cancel any record pertaining to 
a person with respect to whom two recommitment orders have been entered pursuant to 
section 27-81-112 (7) and (8), or who was discharged from treatment pursuant to section 
27-81-112 (11) on the grounds that further treatment is not likely to bring about 
significant improvement in the person's condition; or 

Does the SGML conform to a schema (DTD)? In general, when parsing structured data, it is better to use a standard parser than regular expressions. –


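I should have mentioned that the SGML is not well structured. From what I can tell, the developers of these documents used styles to define each item. None of the items has a descriptive tag. – Thomas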


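Can you provide more examples of well-formed and malformed SGML, along with a sample of the output you want? Also, could you post the regexes you have tried, so that we can 1) review them, 2) edit them to work (if possible), and 3) avoid suggesting things you have already tried? – ctwheels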

Answers

1

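Your statement of the problem is vague, so the only possible answer is a general approach. I deal with converting documents in imprecise formats like this all the time.

A tool from CS that can help is the state machine. It is appropriate if you can detect (e.g., with a regular expression) that the format is changing to a new convention. Detecting that changes the state, which in this case amounts to the translator used for the current and subsequent chunks of text. That translator stays in effect until the next state change. Overall, the algorithm looks like this: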

translator = DEFAULT 
while (chunks of input remain) { 
    chunk = GetNextChunkOfInput // a line, paragraph, etc. 
    new_translator = ScanChunkForStateChange(chunk, translator) 
    if (new_translator != null) translator = new_translator // found a state change! 
    print(translator.Translate(chunk)) // use the translator on the chunk 
} 

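Within this framework, designing the translators and the state-change predicates is a tedious process. All you can hope to do is try, inspect the output, fix the problems, and repeat until you can no longer improve. At that point you have probably found the maximum structure in the input, so an algorithm based on pattern matching alone (without attempting semantic modeling with AI) will not get you much further.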
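As a rough illustration of the loop above, here is a minimal Java sketch. It is not the asker's actual parser. The two conventions (labels in parentheses such as "(1)", and labels followed by a period such as "1."), the class name LabelStateMachine, and the bracketed output format are all assumptions made for the example. The "translator" here is simply the convention currently in effect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelStateMachine {

    // Hypothetical label shapes: a number, a single letter, or a Roman numeral.
    private static final String LABEL = "(\\d+|[A-Za-z]|[IVXL]+)";

    /** One labelling convention; the current one plays the role of the "translator". */
    enum Convention {
        PARENS(Pattern.compile("^(?:<tab>)?\\(" + LABEL + "\\)\\s+(.*)$")),
        DOTTED(Pattern.compile("^(?:<tab>)?" + LABEL + "\\.\\s+(.*)$"));

        final Pattern pattern;

        Convention(Pattern pattern) {
            this.pattern = pattern;
        }

        /** Rewrites a labelled chunk as "[label] text"; other chunks pass through unchanged. */
        String translate(String chunk) {
            Matcher m = pattern.matcher(chunk);
            return m.matches() ? "[" + m.group(1) + "] " + m.group(2) : chunk;
        }
    }

    /** Returns the new convention if the chunk signals a change, otherwise null. */
    static Convention scanForStateChange(String chunk, Convention current) {
        if (current.pattern.matcher(chunk).matches()) {
            return null; // the chunk still fits the current convention
        }
        for (Convention candidate : Convention.values()) {
            if (candidate != current && candidate.pattern.matcher(chunk).matches()) {
                return candidate;
            }
        }
        return null;
    }

    static List<String> run(List<String> chunks) {
        Convention translator = Convention.PARENS; // DEFAULT
        List<String> out = new ArrayList<>();
        for (String chunk : chunks) {
            Convention next = scanForStateChange(chunk, translator);
            if (next != null) {
                translator = next; // found a state change
            }
            out.add(translator.translate(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "(1) On and after March 20, 2013, ...",
                "<tab>(b) The name of each person ...",
                "1. A chunk from a file that drops the parentheses",
                "a. A nested item in the same file");
        run(input).forEach(System.out::println);
    }
}
```

In a real converter the state would likely be richer than a single enum value, for example one convention per nesting level, but the control flow stays the same.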


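Thanks, Gene. I adjusted my algorithm to follow your pseudocode more closely and am getting better results. Like you said, I should be able to keep tuning it to improve them further. – Thomas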

0

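The text excerpt you posted can be parsed and structured by an SGML parser using custom grammar rules in the DOCTYPE, also known as the DTD (assuming that <tab> in your example denotes an actual tab start-element tag rather than a TAB character). I took your snippet, stored it in a file named data.ent, and then created the following SGML file, doc.sgm, which references it: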

<!DOCTYPE doc [ 
    <!ELEMENT doc O O (tab)+> 
    <!ELEMENT tab - O (((b,c?)|c),text)> 
    <!ELEMENT text O O (#PCDATA|b)+> 
    <!ELEMENT b - - (#PCDATA)> 
    <!ELEMENT c - - (#PCDATA)> 
    <!ENTITY data SYSTEM "data.ent"> 
    <!ENTITY startc "<c>"> 
    <!ENTITY endc "</c>"> 
    <!SHORTREF intab "(" startc ")" endc> 
    <!USEMAP intab tab> 
    <!USEMAP #EMPTY text> 
]> 
&data 

The result of parsing your data with these DTD rules (by running osgmlnorm doc.sgm on the command line) is as follows:

<DOC> 
    <TAB> 
    <B>SECTION 5.</B> 
    <TEXT>In Colorado Revised Statutes, 13-5-142, <B>amend</B> (1) 
     introductory portion, (1)(b), and (3)(b)(II) as follows: 
    </TEXT> 
    </TAB> 
    <TAB> 
    <B>13-5-142. National instant criminal background check system 
     reporting.</B> 
    <C>1</C> 
    <TEXT>On and after March 20, 2013, the state court administrator 
     shall send electronically the following information to the 
     Colorado bureau of investigation created pursuant to section 
     24-33.5-401, referred to in this section as the "bureau": 
    </TEXT> 
    </TAB> 
    <TAB> 
    <C>b</C> 
    <TEXT>The name of each person who has been committed by order 
     of the court to the custody of the office of behavioral health 
     in the department of human services pursuant to section 27-81-112 
     or 27-82-108; and 
    </TEXT> 
    </TAB> 
    <TAB> 
    <C>3</C> 
    <TEXT>The state court administrator shall take all necessary steps 
     to cancel a record made by the state court administrator in the 
     national instant criminal background check system if: 
    </TEXT> 
    </TAB> 
    <TAB> 
    <C>b</C> 
    <TEXT>No less than three years before the date of the written 
     request: 
    </TEXT> 
    </TAB> 
    <TAB> 
    <C>II</C> 
    <TEXT>The period of commitment of the most recent order of 
     commitment or recommitment expired, or a court entered an order 
     terminating the person's incapacity or discharging the person 
     from commitment in the nature of habeas corpus, if the record in 
     the national instant criminal background check system is based on 
     an order of commitment to the custody of the office of behavioral 
     health in the department of human services; except that the state 
     court administrator shall not cancel any record pertaining to 
     a person with respect to whom two recommitment orders have been 
     entered pursuant to section 27-81-112 (7) and (8), or who was 
     discharged from treatment pursuant to section 27-81-112 (11) on 
     the grounds that further treatment is not likely to bring about 
     significant improvement in the person's condition; or 
    </TEXT> 
    </TAB> 
</DOC> 

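Explanation:

  • The SGML DTD I created uses SGML tag inference to infer an artificial DOC element as the document element, as well as artificial TEXT and C elements. Its main purpose is to impose a document structure made of a sequence of TAB elements, each containing a section identifier (such as <b>SECTION 5.</b> or (c)) followed by the section body text.
  • I also wrap section identifier text placed in parentheses (the ( and ) characters) in an ad-hoc element C. The start- and end-element tags for C are inserted automatically by the SGML processor because of the DTD's SHORTREF map rules. These tell SGML that, within a TAB element, it should replace every ( character with the value of the startc entity (which expands to <C>) and every ) character with the value of the endc entity (which expands to </C>).
  • <!USEMAP #EMPTY text> turns off the expansion of parentheses in the TEXT body portion of TAB sections, so that references such as (7) and (8) in the body text are left unchanged (although these could also be turned into something like HTML links using SGML).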

If you are using <tab> to denote a TAB (ASCII 9) character, SGML can handle that as well, for example by translating TAB characters into <TAB> tags with a SHORTREF rule.

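Note that you need to have the osgmlnorm program installed. On Ubuntu you can install it with sudo apt-get install opensp, and in a similar way on other Linux variants and on Mac OS. For your application, you will probably want to use the osx program (also part of OpenSP) to output the normalized parse result as XML (although the output shown above can already be parsed as XML), and then process the structured content with the Java XML APIs to suit your needs.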
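For the last step, here is a minimal sketch of reading the normalized output with the standard Java DOM API. It assumes the output looks like the listing above (upper-case DOC, TAB, C, B and TEXT elements). The class name NormalizedOutputReader and the "label | text" row format are made up for the example, and a real application would write database rows instead of strings.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class NormalizedOutputReader {

    /** Returns one "label | text" row per TAB element in the given XML. */
    static List<String> rows(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        NodeList tabs = doc.getElementsByTagName("TAB");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tabs.getLength(); i++) {
            Element tab = (Element) tabs.item(i);
            // C holds the parenthesised label; it is absent for heading-only sections.
            out.add(firstText(tab, "C") + " | " + firstText(tab, "TEXT"));
        }
        return out;
    }

    /** Text of the first descendant with the given tag, whitespace-collapsed, or "". */
    static String firstText(Element parent, String tag) {
        NodeList hits = parent.getElementsByTagName(tag);
        if (hits.getLength() == 0) {
            return "";
        }
        return hits.item(0).getTextContent().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<DOC>"
                + "<TAB><C>3</C><TEXT>The state court administrator shall take\n"
                + " all necessary steps</TEXT></TAB>"
                + "<TAB><C>b</C><TEXT>No less than three years</TEXT></TAB>"
                + "</DOC>";
        rows(xml).forEach(System.out::println);
    }
}
```

If the element names in your osx output differ, for example in case, adjust the tag names passed to getElementsByTagName accordingly.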