UIMA Ruta out-of-memory issue in a Spark context

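I am running a UIMA application on Apache Spark. Millions of pages arrive in batches and are processed by UIMA Ruta. From time to time, however, I run into an out-of-memory exception. It is only thrown occasionally: most pages are processed successfully, but processing fails on some pages.

Application log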

Caused by: java.lang.OutOfMemoryError: Java heap space 
     at org.apache.uima.internal.util.IntArrayUtils.expand_size(IntArrayUtils.java:57) 
     at org.apache.uima.internal.util.IntArrayUtils.ensure_size(IntArrayUtils.java:39) 
     at org.apache.uima.cas.impl.Heap.grow(Heap.java:187) 
     at org.apache.uima.cas.impl.Heap.add(Heap.java:241) 
     at org.apache.uima.cas.impl.CASImpl.ll_createFS(CASImpl.java:2844) 
     at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:489) 
     at org.apache.uima.cas.impl.CASImpl.createAnnotation(CASImpl.java:3837) 
     at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotations(RuleMatch.java:172) 
     at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotationsOf(RuleMatch.java:68) 
     at org.apache.uima.ruta.rule.RuleMatch.getLastMatchedAnnotation(RuleMatch.java:73) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.mergeDisjunctiveRuleMatches(ComposedRuleElement.java:330) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:213) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362) 
     at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)

UIMA RUTA SCRIPT

WORDLIST EnglishStopWordList = 'stopWords.txt'; 
WORDLIST FiltersList = 'AnchorFilters.txt'; 
DECLARE Filters, EnglishStopWords; 
DECLARE Anchors, SpanStart,SpanClose; 

DocumentAnnotation{-> ADDRETAINTYPE(MARKUP)}; 

DocumentAnnotation{-> MARKFAST(Filters, FiltersList)}; 

STRING MixCharacterRegex = "[0-9]+[a-zA-Z]+"; 

DocumentAnnotation{-> MARKFAST(EnglishStopWords, EnglishStopWordList,true)}; 
(SW | CW | CAP) { -> MARK(Anchors, 1, 2)}; 
Anchors{CONTAINS(EnglishStopWords) -> UNMARK(Anchors)}; 

(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)}; 
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? (SW | CW | CAP) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)}; 
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? EnglishStopWords? { -> MARK(Anchors, 1, 4)}; 
(SW | CW | CAP) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 3)}; 

Anchors{CONTAINS(MARKUP) -> UNMARK(Anchors)}; 
MixCharacterRegex -> Anchors; 

"<Value>" -> SpanStart; 
"</Value>" -> SpanClose; 

Anchors{-> CREATE(ExtractedData, "type" = "ANCHOR", "value" = Anchors)}; 

SpanStart Filters? SPACE? ExtractedData SPACE? Filters? SpanClose{-> GATHER(Data, 2, 6, "ExtractedData" = 4)};

Source

2017-06-04 Gaurav

Can you add a link to a (mock) file that causes this problem?

Which Ruta version are you using?

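Usually, the cause of high memory usage in UIMA Ruta can be found in RutaBasic (many annotations, coverage information) or in RuleMatch (inefficient rules, many rule element matches).

In your example, the problem seems to originate elsewhere. The stack trace indicates that the memory is used up by some disjunctive rule elements, which need to create new annotations to store the match information.

The UIMA Ruta version seems to be rather old, since the line numbers do not match the sources I am looking at at all.

There are seven (!!!) calls of continueOwnMatch in the stack trace. I looked for a rule that could cause something like this, but found none. This could be a defect in an old version that has already been fixed, or some preprocessing adds additional CW/SW/CAP annotations.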
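To make the frame counting above reproducible, here is a minimal sketch that tallies how often each `Class.method` frame appears in a textual stack trace. The class name `StackFrameCounter`, the method `countFrames`, and the `SAMPLE` excerpt are hypothetical names chosen for illustration; the sample contains only a few frames copied from the log, not the full trace.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StackFrameCounter {
    // Short excerpt of the posted trace (illustration only, not the full log).
    static final String SAMPLE = String.join("\n",
        "Caused by: java.lang.OutOfMemoryError: Java heap space",
        "    at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:213)",
        "    at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)",
        "    at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)",
        "    at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)",
        "    at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)",
        "    at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)");

    /** Counts occurrences of each "SimpleClass.method" frame in a stack trace text. */
    static Map<String, Integer> countFrames(String trace) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : trace.split("\\R")) {
            String t = line.trim();
            if (!t.startsWith("at ")) {
                continue; // skip the exception header and any non-frame lines
            }
            int paren = t.indexOf('(');
            if (paren < 0) {
                continue; // not a well-formed frame line
            }
            String qualified = t.substring(3, paren); // package.Class.method
            int methodDot = qualified.lastIndexOf('.');
            int classDot = qualified.lastIndexOf('.', methodDot - 1);
            String key = qualified.substring(classDot + 1); // Class.method
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countFrames(SAMPLE).forEach((frame, n) -> System.out.println(n + " x " + frame));
    }
}
```

Run against the complete log, a high count for `ComposedRuleElement.continueOwnMatch` is the signal discussed above: deep recursion through composed (disjunctive) rule elements.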

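As a first step, I suggest two things:

Update to UIMA Ruta 2.6.0
Get rid of all disjunctive rule elements

Disjunctive rule elements are not really needed in your script. In general, they should not be used unless they are really required. I do not use them at all in my productive rules. You can simply write W instead of (SW | CW | CAP). You can write ANY{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} instead of (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM).

Using ANY as the matching condition can reduce runtime performance. In this example, writing two rules instead of rewriting the rule element is probably better, e.g., something like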

SPECIAL{REGEXP("['\"-=()\\[\\]]")} W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)}; 
PM W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};

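(An optional rule element at the start of a rule without any anchor is not optional.)

By the way, there is a lot of room for optimization in your rules. If I had to guess, I would say you can get rid of at least half of the rules and 90% of all created annotations, which would also greatly reduce memory usage.

DISCLAIMER: I am a developer of UIMA Ruta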
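A side observation on the character class shared by all of these rules: inside `[...]`, the sequence `"-=` is read as a range from `"` to `=`, not as three literal characters. The sketch below checks this with plain `java.util.regex`. It assumes Ruta evaluates REGEXP with Java regular-expression semantics against the whole covered text; the class name `CharClassCheck` and the sample tokens are hypothetical.

```java
import java.util.regex.Pattern;

class CharClassCheck {
    // Same escaped literal as in the rules; the resulting regex is ['"-=()\[\]]
    static final Pattern SPECIAL_CLASS = Pattern.compile("['\"-=()\\[\\]]");

    /** True if the whole token matches the character class. */
    static boolean matches(String token) {
        return SPECIAL_CLASS.matcher(token).matches();
    }

    public static void main(String[] args) {
        // '+' and '5' lie between '"' (0x22) and '=' (0x3D), so they match as well;
        // '!' (0x21) and '@' (0x40) lie outside the range.
        for (String token : new String[] {"-", "=", "[", "+", "5", "!", "@", "a"}) {
            System.out.println(token + " -> " + matches(token));
        }
    }
}
```

If only the listed characters are intended, placing the hyphen at the end of the class (or escaping it) avoids the range. Whether the wider class changes anything in practice depends on which characters the tokenizer assigns to SPECIAL.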

Source

2017-06-08 19:54:41

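I tried to change the rules as you suggested, but there is a 10-15% performance degradation – Gaurav

Okay, that is strange. Did you have overlapping anchors before? How do you evaluate performance (= accuracy?)? The rewrite should not change the results.

The rewritten rules give me exactly the same results. By performance I mean the time taken. I am using Ruta in Spark batches to extract the anchors from pages, and previously it took less time to get the anchors from the pages. No doubt the rewrite may use less memory, but I do not have such a benchmark right now. By increasing the executor memory I no longer get the out-of-memory exception, but since I have hardware limits I am looking for improvements in Ruta. Right now I do not have enough bandwidth to upgrade the Ruta version, because it may give – Gaurav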
