Converting unstructured OCR text into properly structured text

I am using Microsoft MODI from VB6 to OCR an image. (I know about other OCR tools such as Tesseract, but I find MODI more accurate than the others.)

The image to be OCRed looks like this:

[image]

and the text I get after OCR looks like this:

Text1 
Text2 
Text3 
Number1 
Number2 
Number3

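The problem is that the correspondence between a word and the text in the opposite column is not preserved. How can I map Number1 to Text1?

I can only think of a solution like this.

MODI provides the coordinates of every OCRed word, like this: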

LeftPos = Img.Layout.Words(0).Rects(0).Left 
TopPos = Img.Layout.Words(0).Rects(0).Top

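So, to line up the words that belong to the same row, we can match on each word's TopPos and then sort by LeftPos; that gives us the complete line. I loop over all the words and store their text together with Left and Top in a MySQL table, then run this query: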

SELECT group_concat(word ORDER BY `left` SEPARATOR ' ') 
FROM test_copy 
GROUP BY `top`

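My problem is that the Top position is not exactly the same for every word on a line; there will obviously be a difference of a few pixels.

I tried adding DIV 5 to merge words that fall within a 5-pixel range, but that does not work in some cases. I also tried doing it in node.js by computing a tolerance for each word and then sorting by LeftPos, but I still feel this is not the best approach.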
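To illustrate why fixed-size bucketing can miss some cases, here is a minimal Python sketch. The pixel values are made up for illustration, and `bucket` is a hypothetical helper that mimics `top DIV 5` for non-negative integers. Two words whose Top values differ by a single pixel still land in different buckets when they straddle a multiple of 5.

```python
def bucket(top, size=5):
    """Mimic MySQL's `top DIV size` for non-negative integers."""
    return top // size


# Hypothetical Top values for two words on the same visual line.
text1_top = 14
number1_top = 15

# Only one pixel apart, yet the buckets differ (2 versus 3),
# so GROUP BY on the bucket would split the line in two.
print(bucket(text1_top), bucket(number1_top))

# Words 4 pixels apart can share a bucket if they do not cross a boundary.
print(bucket(10) == bucket(14))
```

The bucket boundaries are fixed at multiples of 5, while the real lines can start at any pixel, which is why this approach works for some rows and fails for others.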

Update: the js code does the job, except in the case where Number1 is off by 5 pixels and Text2 has no counterpart on its line.

Is there a better way to do this?

Source

2014-02-26 Салман

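Are 'Text1' and 'Number1' always present (no gaps or missing values)? Does the OCR software put the "Words" in any particular order to begin with? – tcarvin

No, anything can be there: blanks, special characters, and so on. Once the words are arranged into lines, I have other logic to parse out the meaningful information. I am not sure about the order, but since we sort by LeftPos it does not matter anyway. The problem is with TopPos: words with a Top of 4-6 (allowing a tolerance of 3) should be placed on the same line. Thanks for reading the whole question :). –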

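I am not 100% sure how you would identify the words that sit in the "left" column, but once you have identified such a word, you can find the rest of its line by projecting not just the top coordinate but the whole rectangle (top and bottom) and determining the overlap (intersection) with the other words. Note the area marked in red below.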

[image: Horizontal projection]

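This is what you can use to detect, within a tolerance, whether things are on the same line. If something overlaps by only one pixel, it probably belongs to the line below or above. But if it overlaps 50% or more of Text1, then it is probably on the same line.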
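The overlap test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each word is reduced to a `(top, bottom)` pair, the function names are hypothetical, and the 50% threshold is the figure suggested above, measured against the height of the left-column word.

```python
def vertical_overlap_ratio(anchor, word):
    """Fraction of the anchor's height covered by the word's vertical span.

    Both arguments are (top, bottom) pairs in pixels, with top < bottom.
    """
    a_top, a_bottom = anchor
    w_top, w_bottom = word
    overlap = min(a_bottom, w_bottom) - max(a_top, w_top)
    height = a_bottom - a_top
    if height <= 0 or overlap <= 0:
        return 0.0
    return overlap / height


def on_same_line(anchor, word, threshold=0.5):
    """True when the word covers at least `threshold` of the anchor's height."""
    return vertical_overlap_ratio(anchor, word) >= threshold


text1 = (100, 120)        # hypothetical left-column word
number1 = (104, 123)      # shifted down by a few pixels
next_line = (119, 140)    # touches Text1 by a single pixel

print(on_same_line(text1, number1))    # True: 16 of 20 pixels overlap
print(on_same_line(text1, next_line))  # False: only 1 of 20 pixels overlaps
```

Measuring the overlap against the anchor's height keeps the test independent of where the line happens to start, unlike fixed buckets on the Top value.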

Example SQL to find all the words on a "line", based on the top and bottom coordinates:

select 
    word.id, word.Top, word.Left, word.Right, word.Bottom 
from 
    word 
where 
    -- two vertical spans overlap when each one starts before the other ends; 
    -- this also catches a word that is taller than the left-column word 
    word.Top <= @leftColWordBottom 
    and word.Bottom >= @leftColWordTop

Example VB6 pseudocode for computing the lines:

'assume words is a collection of WordInfo objects with Id, Top, 
' Left, Bottom and Right properties filled in, and a LineAnchorWordId 
' property that has not been set yet. 

'get the words in left-to-right order 
Set wordsLeftToRight = SortLeftToRight(words) 

'also get the words in top-to-bottom order 
Set wordsTopToBottom = SortTopToBottom(words) 

'pass through identifying a line "anchor", that being the left-most 
' word that starts (and defines) a line 
For Each anchorWord In wordsLeftToRight 

    'check whether the word has been mapped to a line yet by checking 
    ' whether its anchor property has been set. This assumes 0 is not 
    ' a valid id; use -1 instead if needed 
    If anchorWord.LineAnchorWordId = 0 Then 

        'now locate every word on this line, as bounded by the 
        ' anchorWord. Every word determined to be on this line 
        ' gets its LineAnchorWordId property set to the Id of the 
        ' anchorWord (the anchorWord overlaps itself, so it is included) 
        For Each lineWord In wordsTopToBottom 

            If lineWord.Bottom < anchorWord.Top Then 

                'skip it, it is above the line (but keep searching down 
                ' because we haven't reached the anchorWord location yet) 

            ElseIf lineWord.Top > anchorWord.Bottom Then 

                'it is below the line, so exit the search early: 
                ' all the remaining words will also be below the line 
                Exit For 

            ElseIf lineWord.LineAnchorWordId = 0 _ 
                And OverlapsWithinTolerance(anchorWord, lineWord) Then 

                'only claim words that are not already on another line 
                lineWord.LineAnchorWordId = anchorWord.Id 

            End If 

        Next lineWord 

    End If 

Next anchorWord 

'at this point, every word has been assigned a LineAnchorWordId, 
' and every word on the same line will have a matching LineAnchorWordId 
' value. If stored in a DB you can now group them by LineAnchorWordId 
' and sort them by their Left coord to get your output.
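For readers who want something they can run, here is a rough Python sketch of the same anchor-based grouping. It is not the code from this answer: the `Word` class, the function names, the 50% threshold default, and the sample coordinates are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: int
    top: int
    bottom: int


def overlaps(anchor, word, threshold=0.5):
    """True when the word covers at least `threshold` of the anchor's height."""
    overlap = min(anchor.bottom, word.bottom) - max(anchor.top, word.top)
    height = anchor.bottom - anchor.top
    return height > 0 and overlap / height >= threshold


def group_into_lines(words, threshold=0.5):
    """Group words into lines, each line anchored by its left-most free word."""
    by_left = sorted(range(len(words)), key=lambda i: words[i].left)
    by_top = sorted(range(len(words)), key=lambda i: words[i].top)
    assigned = set()
    lines = []
    for a in by_left:
        if a in assigned:
            continue  # already claimed by an earlier anchor
        anchor = words[a]
        assigned.add(a)
        line = [anchor]
        for i in by_top:
            word = words[i]
            if word.top > anchor.bottom:
                break  # every remaining word starts below this line
            if i not in assigned and overlaps(anchor, word, threshold):
                assigned.add(i)
                line.append(word)
        line.sort(key=lambda w: w.left)
        lines.append(line)
    lines.sort(key=lambda line: line[0].top)
    return [" ".join(w.text for w in line) for line in lines]


# Hypothetical coordinates: the right column is a few pixels off.
words = [
    Word("Text1", 10, 100, 120),
    Word("Text2", 10, 130, 150),
    Word("Text3", 10, 160, 180),
    Word("Number1", 200, 105, 124),
    Word("Number2", 200, 131, 151),
    Word("Number3", 200, 158, 179),
]
print(group_into_lines(words))
```

Because words are claimed only once, a word that has no counterpart in the other column simply ends up as a line of its own.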

Source

2014-02-26 13:28:27 tcarvin

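I understand the concept, and I have all the coordinates needed to project the rectangles, but how do I implement the logic? All I get is the X and Y of each word. Finding the overlaps between words would be too slow, I think. –

You can do it in code or in the database. I don't know your database, but take a look at the edit above. – tcarvin

Added another code example. – tcarvin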
