How to read a PDF file line by line with its spaces (as they actually are) in C#/.NET using iTextSharp

I am using iText (for .NET) to read a PDF file. It reads the document, but wherever there is a run of spaces it reads only a single space.

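This makes it impossible to extract the data by taking substrings. I want to read the data line by line with the spaces preserved, so that I know the actual position of the text, because I want to write the data to a database.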

The file is a bank statement, and I want to dump it into a database that will be used to design a reconciliation system.

Here is a sample file: FILE

Below is the code I am using:

  For page As Integer = 1 To pdfReader.NumberOfPages
      ' Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()

      Dim strategy As ITextExtractionStrategy = New iTextSharp.text.pdf.parser.LocationTextExtractionStrategy()
      Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)
      ' Round-trips the text through the system default code page;
      ' characters outside that code page are lost in the process.
      currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.[Default], Encoding.UTF8, Encoding.[Default].GetBytes(currentText)))

      ' Split the page text into lines.
      Dim delimiterChars As Char() = {ControlChars.Lf}
      Dim lines As String() = currentText.Split(delimiterChars)

      ' Flags tracking which field is expected next.
      Dim Bnk_Name As Boolean = True
      Dim Br_Name As Boolean = False
      Dim Name_acc As Boolean = False
      Dim statment As Boolean = False
      Dim Curr As Boolean = False
      Dim Open As Boolean = False

      ' Values extracted from the statement.
      Dim BankName = ""
      Dim Branch = ""
      Dim AccountNo = ""
      Dim CompName = ""
      Dim Currency = ""
      Dim Statement_from = ""
      Dim Statement_to = ""
      Dim Opening_Balance = ""
      Dim Closing_Balance = ""
      Dim Narration As String = ""

      For Each line As String In lines

          ' BANK NAME: only the first line of the page is considered.
          If Bnk_Name Then
              If line.Trim() <> "" Then
                  ' Guard against lines shorter than 21 characters.
                  BankName = line.Substring(0, Math.Min(21, line.Length))
              End If
              Bnk_Name = False
          End If

This picture shows a sample of how the code reads the file.

But I want it read as it is in the file, with the spaces preserving each position.

2017-10-05 Umair Ikhtiar

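You can implement a text extraction strategy that tries to reflect the horizontal layout of the text by inserting multiple space characters for large gaps. For iText/Java, one based on LocationTextExtractionStrategy has been described in [this answer](https://stackoverflow.com/a/24911617/1729265). – mkl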

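(Without being able to see your PDF, this explanation is the best I can come up with.)

Your document does not contain any spaces. That is, the content stream of the document contains no space characters. Instead, the instructions that render the characters simply position them so that the visual gap is there.

In that case iText has to "guess" where the spaces are. Its estimate is to insert one space whenever two characters are farther apart than the width of the space character in the font being used.

That is probably where things go wrong.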
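
The guessing described above can be pictured with a minimal sketch. This is not iText's actual code: the Glyph type, the JoinWithGuessedSpaces helper, the sample coordinates, and the exact threshold (gap compared against the full space width) are assumptions made for illustration. It only shows why a gap of any size collapses to a single space.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned piece of text; not an iText type.
public sealed class Glyph
{
    public string Text;
    public float StartX; // left edge, in user space units
    public float EndX;   // right edge, in user space units

    public Glyph(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class SpaceGuessing
{
    // Joins glyphs left to right, adding exactly one space when the gap
    // to the previous glyph is at least spaceWidth (assumed threshold).
    public static string JoinWithGuessedSpaces(IList<Glyph> glyphs, float spaceWidth)
    {
        var sb = new StringBuilder();
        Glyph previous = null;
        foreach (Glyph g in glyphs)
        {
            if (previous != null && g.StartX - previous.EndX >= spaceWidth)
            {
                sb.Append(' '); // one space, no matter how wide the gap is
            }
            sb.Append(g.Text);
            previous = g;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var line = new List<Glyph>
        {
            new Glyph("Date", 0f, 20f),
            new Glyph("Amount", 200f, 235f), // 180 units to the right of "Date"
        };
        Console.WriteLine(JoinWithGuessedSpaces(line, 3f)); // prints "Date Amount"
    }
}
```

Because the wide column gap and an ordinary word gap both produce one space, the column position of "Amount" cannot be recovered from the resulting string.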

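But just as importantly, you should never rely on text positions to extract data. That approach is very error-prone.

Try using regular expressions combined with a better ITextExtractionStrategy. There is an ITextExtractionStrategy implementation that lets you specify a Rectangle. If you do that, you can get the content out of the document much more precisely.

Since you are dealing with bank statements, it should be easy to extract the content by combining a rectangle-based search with regular expressions (for example, looking for things that match a bank account number).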
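
As a rough sketch of the regular-expression half of that suggestion: the label "Account No", the 8 to 16 digit length, the sample text, and the names StatementParsing and FindAccountNumber are all assumptions for illustration, since the real statement layout is not known. The rectangle restriction is not shown; in iTextSharp 5 it would typically be done by wrapping the strategy in a FilteredTextRenderListener with a RegionTextRenderFilter before the text reaches a method like this one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StatementParsing
{
    // Assumed format: the label "Account No", optional "." and ":",
    // then 8 to 16 digits. Adjust to the real statement.
    private static readonly Regex AccountNo =
        new Regex(@"Account\s+No\.?\s*:?\s*(\d{8,16})", RegexOptions.IgnoreCase);

    // Returns the first account number found in the text, or null if none.
    public static string FindAccountNumber(string pageText)
    {
        Match m = AccountNo.Match(pageText);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Hypothetical extracted page text.
        string sample = "EXAMPLE BANK LTD\nAccount No: 0012345678 Currency USD\n";
        Console.WriteLine(FindAccountNumber(sample)); // prints "0012345678"
    }
}
```

Matching on the label rather than on a character offset keeps the extraction working when the amount of whitespace between fields changes.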

2017-10-05 07:03:20

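You can use LocationTextExtractionStrategy. As @Joris already answered, that strategy adds at most one space character for a horizontal gap. You, on the other hand, need each gap filled with as many spaces as it takes for the result to represent the horizontal layout of the text lines in the PDF.

In this answer I once outlined how to build such a text extraction strategy. As (a) that answer is for iText/Java and (b) LocationTextExtractionStrategy has changed a lot since then, I do not consider the current question a duplicate, though.

A C# adaptation of the idea from that old answer to the current iTextSharp LocationTextExtractionStrategy, using reflection instead of copying the class, could look like this: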

using System;
using System.Collections.Generic;
using System.Reflection;
using System.Text;
using iTextSharp.text.pdf.parser;

// Pads horizontal gaps with as many spaces as needed so that each output
// line mirrors the horizontal layout of the text line on the page.
class LayoutTextExtractionStrategy : LocationTextExtractionStrategy
{
    public LayoutTextExtractionStrategy(float fixedCharWidth)
    {
        this.fixedCharWidth = fixedCharWidth;
    }

    // Non-public members of the base class, reached via reflection.
    MethodInfo DumpStateMethod = typeof(LocationTextExtractionStrategy).GetMethod("DumpState", BindingFlags.NonPublic | BindingFlags.Instance);
    MethodInfo FilterTextChunksMethod = typeof(LocationTextExtractionStrategy).GetMethod("filterTextChunks", BindingFlags.NonPublic | BindingFlags.Instance);
    FieldInfo LocationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.NonPublic | BindingFlags.Instance);

    public override string GetResultantText(ITextChunkFilter chunkFilter)
    {
        if (DUMP_STATE)
        {
            // DumpState();
            DumpStateMethod.Invoke(this, null);
        }

        // List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
        object locationalResult = LocationalResultField.GetValue(this);
        List<TextChunk> filteredTextChunks = (List<TextChunk>)FilterTextChunksMethod.Invoke(this, new object[] { locationalResult, chunkFilter });
        filteredTextChunks.Sort();

        int startOfLinePosition = 0; // index in sb where the current line starts
        StringBuilder sb = new StringBuilder();
        TextChunk lastChunk = null;
        foreach (TextChunk chunk in filteredTextChunks)
        {
            if (lastChunk == null)
            {
                // First chunk: indent it to its horizontal position.
                InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, false);
                sb.Append(chunk.Text);
            }
            else if (chunk.SameLine(lastChunk))
            {
                // At a word boundary, pad up to the chunk's column; at least one space
                // is forced unless one of the two strings already supplies it.
                if (IsChunkAtWordBoundary(chunk, lastChunk))
                {
                    InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text));
                }

                sb.Append(chunk.Text);
            }
            else
            {
                // New line: remember where it starts, then indent the chunk.
                sb.Append('\n');
                startOfLinePosition = sb.Length;
                InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, false);
                sb.Append(chunk.Text);
            }
            lastChunk = chunk;
        }

        return sb.ToString();
    }

    private bool StartsWithSpace(String str)
    {
        if (str.Length == 0) return false;
        return str[0] == ' ';
    }

    private bool EndsWithSpace(String str)
    {
        if (str.Length == 0) return false;
        return str[str.Length - 1] == ' ';
    }

    // Appends spaces until the current line reaches the column that
    // corresponds to chunkStart (one column per fixedCharWidth units).
    void InsertSpaces(StringBuilder sb, int startOfLinePosition, float chunkStart, bool spaceRequired)
    {
        int indexNow = sb.Length - startOfLinePosition;
        int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
        int spacesToInsert = indexToBe - indexNow;
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;
        for (; spacesToInsert > 0; spacesToInsert--)
        {
            sb.Append(' ');
        }
    }

    public float pageLeft = 0;        // x coordinate that maps to column 0
    public float fixedCharWidth = 6;  // page width covered by one output character
}

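As you can see, it requires a float constructor parameter, fixedCharWidth. This parameter is the width on the PDF page that one character in the result string should correspond to. It is given in PDF default user space units (such a unit is usually 1/72 in). For a PDF like the catalog PDF from the above-mentioned question (very small font size), a value of 3 is appropriate; a value of 6 seems appropriate for most common PDFs with larger font sizes.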
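
To make the role of fixedCharWidth concrete, here is a small self-contained sketch of the same column arithmetic that InsertSpaces performs. The ColumnLayout class, its method names, and the sample x coordinates are hypothetical, and the sketch leaves out the forced minimum of one space, so treat it as an illustration rather than a replacement for the class.

```csharp
using System;
using System.Text;

public static class ColumnLayout
{
    // Same arithmetic as InsertSpaces in LayoutTextExtractionStrategy:
    // maps an x coordinate on the page to a character column.
    public static int ColumnFor(float chunkStart, float pageLeft, float fixedCharWidth)
    {
        return (int)((chunkStart - pageLeft) / fixedCharWidth);
    }

    // Pads lineSoFar with spaces up to the chunk's column, then appends the text.
    public static string PlaceAt(string lineSoFar, string text, float chunkStart, float pageLeft, float fixedCharWidth)
    {
        var sb = new StringBuilder(lineSoFar);
        int column = ColumnFor(chunkStart, pageLeft, fixedCharWidth);
        if (column > sb.Length)
        {
            sb.Append(' ', column - sb.Length);
        }
        sb.Append(text);
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical chunks: a date starting at x = 36 and an amount at x = 300.
        string line = PlaceAt("", "01/10/2017", 36f, 0f, 6f); // lands in column 6
        line = PlaceAt(line, "1,250.00", 300f, 0f, 6f);       // lands in column 50
        Console.WriteLine(line);
        Console.WriteLine(line.Substring(50).Trim());         // prints "1,250.00"
    }
}
```

With a layout like this, fixed-offset Substring calls become meaningful again. The strategy itself would presumably be passed the same way as in the question's code, for example PdfTextExtractor.GetTextFromPage(pdfReader, page, new LayoutTextExtractionStrategy(6)).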

2017-10-05 12:47:06 mkl
