2011-09-22 67 views
1

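Adding text to a PDF and flattening it

I am developing a web application that displays PDFs and lets users order copies of the documents. We want to add text such as "Unpaid" or "Sample" on the fly when a PDF is displayed. I have already done this with iTextSharp. However, the page image can easily be separated from the watermark text and extracted with various free programs.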

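How can I add a watermark to the pages of a PDF and flatten the page image and the watermark together, so that the watermark becomes part of the page image itself and cannot be removed (unless the person is willing to use Photoshop)?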

Answers

2

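If I were you I would go down a different path. Use iTextSharp (or another library) to extract each page of a given document to a folder. Then use some program that you can batch (Ghostscript, Photoshop, maybe GIMP) to convert each page to an image. Then write your overlay text onto the images. Finally, use iTextSharp to combine all of the images in each folder into a single PDF.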

I know this sounds like a pain, but I assume you should only have to do it once per document.
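As a rough illustration of the "write your overlay text onto the images" step, here is a minimal sketch that uses only System.Drawing. The module name `WatermarkSketch`, the function `StampWatermark`, the font, the colour, and the rotation angle are all hypothetical choices of mine, not something from the question. The point is just that once the text has been drawn onto the bitmap and re-encoded, it is ordinary pixels and no longer a separate object.

```vbnet
Option Strict On

Imports System.IO

Public Module WatermarkSketch

    ' Hypothetical helper: draws watermark text diagonally across a page image
    ' and returns the result as JPEG bytes. Types are fully qualified so the
    ' names do not clash with iTextSharp.text.Image or iTextSharp.text.Font.
    Public Function StampWatermark(ByVal imageBytes As Byte(), ByVal text As String) As Byte()
        Using input As New MemoryStream(imageBytes)
            Using source As System.Drawing.Image = System.Drawing.Image.FromStream(input)
                ' Copy onto a 24bpp canvas because Graphics cannot draw on indexed
                ' formats such as 1bpp scans.
                Using canvas As New System.Drawing.Bitmap(source.Width, source.Height, System.Drawing.Imaging.PixelFormat.Format24bppRgb)
                    Using g As System.Drawing.Graphics = System.Drawing.Graphics.FromImage(canvas)
                        g.DrawImage(source, 0, 0, source.Width, source.Height)

                        ' Move the origin to the page centre and rotate, so the text runs diagonally
                        g.TranslateTransform(CSng(canvas.Width) / 2.0F, CSng(canvas.Height) / 2.0F)
                        g.RotateTransform(-45.0F)

                        ' Assumed styling: semi-transparent red, sized relative to the page width
                        Using f As New System.Drawing.Font("Arial", CSng(canvas.Width) / 10.0F, System.Drawing.FontStyle.Bold, System.Drawing.GraphicsUnit.Pixel)
                            Using b As New System.Drawing.SolidBrush(System.Drawing.Color.FromArgb(96, 255, 0, 0))
                                Using fmt As New System.Drawing.StringFormat()
                                    fmt.Alignment = System.Drawing.StringAlignment.Center
                                    fmt.LineAlignment = System.Drawing.StringAlignment.Center
                                    g.DrawString(text, f, b, 0.0F, 0.0F, fmt)
                                End Using
                            End Using
                        End Using
                    End Using

                    ' Re-encode; the watermark is now part of the pixel data
                    Using output As New MemoryStream()
                        canvas.Save(output, System.Drawing.Imaging.ImageFormat.Jpeg)
                        Return output.ToArray()
                    End Using
                End Using
            End Using
        End Using
    End Function

End Module
```

Usage would be along the lines of `Dim stamped = WatermarkSketch.StampWatermark(File.ReadAllBytes("page1.jpg"), "SAMPLE")`, where the file name is only an example. Re-encoding as JPEG costs some quality, so for 1-bit scans you may prefer a lossless format.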

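If you don't want to go down that route, let me get you started on what you need to do to extract the images. Most of the code below comes from this post. At the end of the code I save the images to the desktop. Since you already have the raw bytes, you could just as easily pump them into a System.Drawing.Image object and write them back into a new object, which it sounds like you are familiar with. Below is a full WinForms application targeting iTextSharp 5.1.1.0: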

Option Explicit On 
Option Strict On 

Imports iTextSharp.text 
Imports iTextSharp.text.pdf 
Imports System.IO 
Imports System.Runtime.InteropServices 

Public Class Form1 

    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load 
     ''//File to process 
     Dim InputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "SampleImage.pdf") 

     ''//Bind a reader to our PDF 
     Dim R As New PdfReader(InputFile) 

     ''//Set up some variables to use below 
     Dim bytes() As Byte 
     Dim obj As PdfObject 
     Dim pd As PdfDictionary 
     Dim filter, width, height, bpp As String 
     Dim pixelFormat As System.Drawing.Imaging.PixelFormat 
     Dim bmp As System.Drawing.Bitmap 
     Dim bmd As System.Drawing.Imaging.BitmapData 

     ''//Loop through all of the references in the file 
     Dim xo = R.XrefSize 
     For I = 0 To xo - 1 
      ''//Get the object 
      obj = R.GetPdfObject(I) 
      ''//Make sure we have something and that it is a stream 
      If (obj IsNot Nothing) AndAlso obj.IsStream() Then 
       ''//Cast it to a dictionary object 
       pd = DirectCast(obj, PdfDictionary) 
       ''//See if it has a subtype property that is set to /IMAGE 
       If pd.Contains(PdfName.SUBTYPE) AndAlso pd.Get(PdfName.SUBTYPE).ToString() = PdfName.IMAGE.ToString() Then 
        ''//Grab various properties of the image 
        filter = pd.Get(PdfName.FILTER).ToString() 
        width = pd.Get(PdfName.WIDTH).ToString() 
        height = pd.Get(PdfName.HEIGHT).ToString() 
        bpp = pd.Get(PdfName.BITSPERCOMPONENT).ToString() 

        ''//Grab the raw bytes of the image 
        bytes = PdfReader.GetStreamBytesRaw(DirectCast(obj, PRStream)) 

        ''//Images can be encoded in various ways. /DCTDECODE is the simplest because it's essentially JPEG and can be treated as such. 
        ''//If your PDFs contain the other types you will need to figure out how to handle those on your own 
        Select Case filter 
         Case PdfName.ASCII85DECODE.ToString() 
          Throw New NotImplementedException("Decoding this filter has not been implemented") 
         Case PdfName.ASCIIHEXDECODE.ToString() 
          Throw New NotImplementedException("Decoding this filter has not been implemented") 
         Case PdfName.FLATEDECODE.ToString() 
          ''//This code from https://stackoverflow.com/questions/802269/itextsharp-extract-images/1220959#1220959 
          bytes = PdfReader.FlateDecode(bytes, True) 
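          ''//Note: /BitsPerComponent counts bits per colour channel (1, 2, 4, 8 or 16 in the PDF spec), 
          ''//so a typical 8-bit RGB image reports 8 here and falls through to Case Else 
          Select Case Integer.Parse(bpp) 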
           Case 1 
            pixelFormat = Drawing.Imaging.PixelFormat.Format1bppIndexed 
           Case 24 
            pixelFormat = Drawing.Imaging.PixelFormat.Format24bppRgb 
           Case Else 
            Throw New Exception("Unknown pixel format " + bpp) 
          End Select 
          bmp = New System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat) 
          bmd = bmp.LockBits(New System.Drawing.Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat) 
          Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length) 
          bmp.UnlockBits(bmd) 
          Using ms As New MemoryStream 
           bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Jpeg) 
           ''//ToArray returns only the bytes written; GetBuffer would include unused trailing capacity 
           bytes = ms.ToArray() 
          End Using 
         Case PdfName.LZWDECODE.ToString() 
          Throw New NotImplementedException("Decoding this filter has not been implemented") 
         Case PdfName.RUNLENGTHDECODE.ToString() 
          Throw New NotImplementedException("Decoding this filter has not been implemented") 
         Case PdfName.DCTDECODE.ToString() 
          ''//Bytes should be raw JPEG so they should not need to be decoded, hopefully 
         Case PdfName.CCITTFAXDECODE.ToString() 
          Throw New NotImplementedException("Decoding this filter has not been implemented") 
         Case PdfName.JBIG2DECODE.ToString() 
          Throw New NotImplementedException("Decoding this filter has not been implemented") 
         Case PdfName.JPXDECODE.ToString() 
          Throw New NotImplementedException("Decoding this filter has not been implemented") 
         Case Else 
          Throw New ApplicationException("Unknown filter found : " & filter) 
        End Select 

        ''//At this point the byte array should contain valid JPEG data, so write it to disk 
        My.Computer.FileSystem.WriteAllBytes(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), I & ".jpg"), bytes, False) 
       End If 
      End If 

     Next 

     ''//Release the reader's hold on the input file 
     R.Close() 

     Me.Close() 
    End Sub 
End Class 
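
Once the extracted images carry the watermark, they still have to go back into a PDF. Here is a minimal sketch of that last step using iTextSharp 5's `Document` and `PdfWriter`. The module name `RebuildPdfSketch` and the method `ImagesToPdf` are hypothetical. It also assumes one image per page and sizes each page to the image's pixel dimensions (one pixel per point), so for a 300 dpi scan you would probably want to scale each image to the original page size instead.

```vbnet
Option Strict On

Imports System.Collections.Generic
Imports System.IO
Imports iTextSharp.text.pdf

Public Module RebuildPdfSketch

    ' Hypothetical helper: writes one full-page image per PDF page.
    ' Assumes every entry in pageImages is a complete, already watermarked page.
    Public Sub ImagesToPdf(ByVal pageImages As IEnumerable(Of Byte()), ByVal outputPath As String)
        Dim doc As New iTextSharp.text.Document()
        Using fs As New FileStream(outputPath, FileMode.Create, FileAccess.Write)
            PdfWriter.GetInstance(doc, fs)
            doc.SetMargins(0, 0, 0, 0)

            Dim opened As Boolean = False
            For Each imageBytes As Byte() In pageImages
                Dim img As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(imageBytes)

                ' The page size must be set before the page is started
                doc.SetPageSize(New iTextSharp.text.Rectangle(img.ScaledWidth, img.ScaledHeight))
                If opened Then
                    doc.NewPage()
                Else
                    doc.Open()
                    opened = True
                End If

                ' Pin the image to the lower-left corner so it covers the whole page
                img.SetAbsolutePosition(0, 0)
                doc.Add(img)
            Next

            ' Closing a document that was never opened would throw, so guard it
            If opened Then doc.Close()
        End Using
    End Sub

End Module
```

Because each page now contains a single image and nothing else, there is no separate text object left to strip out.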
1

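The entire page has to be rendered as an image. Otherwise you end up with "text objects" (individual words or letters of text) and a watermark object (the overlay image), which will always remain separate parts of the page.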

+0

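Because the documents were scanned, there are no text objects. The entire page is one image. In fact, the watermark is the text object. But if I can make the watermark an image object, how do I then merge the watermark image and the page image into a single image? – NYSystemsAnalyst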

+0

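Programmatically, you have to extract the page image, merge it with the watermark, and then replace the original page image with the new one. Note that some scanners OCR the text and embed it in the PDF, which would bypass the whole watermark business.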

+0

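Any tips on how to extract the page image and replace it? I know about OCR software, but they chose to scan the documents as images without OCR, and they have already scanned several hundred thousand of them. – NYSystemsAnalyst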