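Parsing data in a structured document with ambiguous labels
I am trying to move legal documents from old SGML files into a database. I have had good luck using regular expressions in Java, but I have run into a small problem: the labels for each part of a document are not standard from file to file. For example, the most common labeling is: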
(<numeric>)
(<alpha>)
(<ROMAN>)
(<ALPHA>)
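Ex. (1)(a)(I)(A)
However, other files have variations, and sometimes the () are dropped. My current algorithm has a hard-coded regex for each element at each level. What I need is a way to set the label type for each level dynamically as I work through the document.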
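To make the requirement concrete, here is a minimal sketch of one way the label type per level might be learned while reading, rather than hard-coded. It is not my actual code: the class `LabelLevels`, the `Kind` enum, and the `place` method are hypothetical names, and the successor-first heuristic for ambiguous labels such as (I) or (V) is an assumption that will not hold for every document.

```java
import java.util.ArrayList;
import java.util.List;

class LabelLevels {
    enum Kind { NUMERIC, LOWER_ALPHA, UPPER_ROMAN, UPPER_ALPHA }

    // Label kind learned for each depth (outermost first), kept for the whole document.
    private final List<Kind> kinds = new ArrayList<>();
    // Ordinal of the last label seen at each depth on the currently open path.
    private final List<Integer> last = new ArrayList<>();

    // Ordinal of the label if it is read as the given kind, or -1 if it does not fit.
    static int value(String core, Kind kind) {
        switch (kind) {
            case NUMERIC:     return core.matches("\\d{1,4}") ? Integer.parseInt(core) : -1;
            case LOWER_ALPHA: return core.matches("[a-z]") ? core.charAt(0) - 'a' + 1 : -1;
            case UPPER_ALPHA: return core.matches("[A-Z]") ? core.charAt(0) - 'A' + 1 : -1;
            default:          return roman(core);
        }
    }

    // Roman numerals up to 99, using the same shape as the regexes in this question.
    static int roman(String s) {
        if (s.isEmpty() || !s.matches("(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")) return -1;
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            int cur = digit(s.charAt(i));
            int next = i + 1 < s.length() ? digit(s.charAt(i + 1)) : 0;
            total += cur < next ? -cur : cur;
        }
        return total;
    }

    static int digit(char c) {
        return c == 'I' ? 1 : c == 'V' ? 5 : c == 'X' ? 10 : c == 'L' ? 50 : 100;
    }

    // Returns the 0-based depth of the label, or -1 if it is not a recognised label.
    int place(String label) {
        // Parentheses are optional; a ".n" suffix such as (1.5) is ignored.
        String core = label.replaceAll("[()]", "").replaceAll("\\.\\d+$", "").trim();
        // Pass 1: exact successor of an open level, deepest first.
        // Pass 2: any later item of an open level (excerpts may skip items).
        for (int pass = 1; pass <= 2; pass++) {
            for (int d = last.size() - 1; d >= 0; d--) {
                int v = value(core, kinds.get(d));
                boolean fits = pass == 1 ? v == last.get(d) + 1 : v > last.get(d);
                if (fits) {
                    last.subList(d + 1, last.size()).clear(); // close deeper levels
                    last.set(d, v);
                    return d;
                }
            }
        }
        // Otherwise open a new level below the current path.
        int d = last.size();
        if (d < kinds.size()) {
            int v = value(core, kinds.get(d)); // reuse the kind learned for this depth
            if (v > 0) { last.add(v); return d; }
        }
        for (Kind k : Kind.values()) {
            int v = value(core, k);
            if (v > 0) {
                if (d < kinds.size()) kinds.set(d, k); else kinds.add(k);
                last.add(v);
                return d;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        LabelLevels levels = new LabelLevels();
        // Prints depths 0, 1, 2, 3 for the common labeling shown above.
        for (String label : new String[] {"(1)", "(a)", "(I)", "(A)"}) {
            System.out.println(label + " -> depth " + levels.place(label));
        }
    }
}
```

The idea is that nothing about the order of label types is fixed in advance: a document that used (a)(1)(A)(I) would simply produce a different `kinds` list. Dropping the parentheses makes plain words such as "a" look like labels, so the caller would still need some positional check before calling `place`.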
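Has anyone run into a problem like this? Does anyone have any suggestions?
Thanks in advance.
EDIT:
Here are the regexes I use to parse out the different items: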
Section: ^<tab>(<b>)?\d{1,4}(\.\d+)?-((\d{1,4}(\.\d+)?)(-|\.)?){3}
SubSection: \.?\s*(<\/b>|<tab>|^)\s*\(\d+(\.\d+)?\)\s+($|<b>|[A-Z"]|\([a-z](.\d+)?\)\s*(\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s*(\([A-Z](.\d+)?\))?)?\s*.)
Paragraph: (^|<tab>|\s+|\(\d+(\.\d+)?\)\s+)\([a-z](.\d+)?\)(\s+$|\s+<b>|\s+[A-Z"]|\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)(\([A-Z](.\d+)?\))?\s*[A-Z"]?)
SubParagraph: (\)|<tab>|<\/b>)\s*\((XC|XL|L?X{0,3})(IX|IV|V?I{0,3})(\.\d+)?\)\s+($|[A-Z"<]|\([A-Z](.\d+)?\)\s*[A-Z"])
SubSubParagraph: (<tab>|\)\s*)\([A-Z](.\d+)?\)\s+([A-Z"]|$)
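For reference, a minimal sketch of how one of these expressions looks once written as a Java string literal, where every backslash is doubled and the double quote is escaped. Only the SubSubParagraph expression is taken from the list above; the class name, the method name, and the two test lines are made up for illustration.

```java
import java.util.regex.Pattern;

class SubSubParagraphDemo {
    // The SubSubParagraph expression, escaped for a Java string literal.
    static final Pattern SUB_SUB_PARAGRAPH =
        Pattern.compile("(<tab>|\\)\\s*)\\([A-Z](.\\d+)?\\)\\s+([A-Z\"]|$)");

    // True if the line contains a label that the expression accepts anywhere in it.
    static boolean hasSubSubParagraphLabel(String line) {
        return SUB_SUB_PARAGRAPH.matcher(line).find();
    }

    public static void main(String[] args) {
        // An upper-case label after <tab> matches: prints true
        System.out.println(hasSubSubParagraphLabel("<tab>(A) The name of each person"));
        // A lower-case label does not match this level: prints false
        System.out.println(hasSubSubParagraphLabel("<tab>(b) The name of each person"));
    }
}
```

Note that `(.\d+)?` is copied as posted, so the `.` there matches any character rather than a literal period.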
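And here is some sample text, which I should have included earlier. Although the ultimate source of the data is SGML, what I am actually parsing is slightly different: apart from the style tags, it is more or less plain text.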
<tab><b>SECTION 5.</b> In Colorado Revised Statutes, 13-5-142, <b>amend</b> (1)
introductory portion, (1)(b), and (3)(b)(II) as follows:
<tab><b>13-5-142. National instant criminal background check system - reporting.</b>
(1) On and after March 20, 2013, the state court administrator shall send electronically
the following information to the Colorado bureau of investigation created pursuant to
section 24-33.5-401, referred to in this section as the "bureau":
<tab>(b) The name of each person who has been committed by order of the court to the
custody of the office of behavioral health in the department of human services pursuant
to section 27-81-112 or 27-82-108; and
<tab>(3) The state court administrator shall take all necessary steps to cancel a record
made by the state court administrator in the national instant criminal background check
system if:
<tab>(b) No less than three years before the date of the written request:
<tab>(II) The period of commitment of the most recent order of commitment or
recommitment expired, or a court entered an order terminating the person's incapacity or
discharging the person from commitment in the nature of habeas corpus, if the record in
the national instant criminal background check system is based on an order of
commitment to the custody of the office of behavioral health in the department of human
services; except that the state court administrator shall not cancel any record pertaining to
a person with respect to whom two recommitment orders have been entered pursuant to
section 27-81-112 (7) and (8), or who was discharged from treatment pursuant to section
27-81-112 (11) on the grounds that further treatment is not likely to bring about
significant improvement in the person's condition; or
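Does the SGML conform to a schema (DTD)? In general, when parsing structured data it is better to use a standard parser than regular expressions. –
I should have mentioned that the SGML is not well structured. From what I can tell, the developers of these documents used styles to define each item. There are no descriptive tags for the individual items. – Thomas
Could you provide more examples of both well-formed and malformed SGML, along with an example of the output you want? Also, could you post the regexes you have tried, so that we can 1) check them, 2) edit them to work (if possible), and 3) avoid suggesting things you have already tried? – ctwheels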