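PDF add text and flatten
I'm developing a web application that displays PDFs and lets users order copies of the documents. We want to add text such as "unpaid" or "sample" on the fly when the PDF is displayed. I have done this using iTextSharp. However, the page image is easily separated from the watermark text and extracted with a variety of free programs.
How can I add the watermark to the pages of the PDF, but flatten the page image and the watermark together so that the watermark becomes part of the PDF page image, thereby preventing the watermark from being removed (unless the person wants to use Photoshop)?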
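If I were you, I would go down a different path. Using iTextSharp (or another library), extract each page of a given document to a folder. Then use some program that you can batch (Ghostscript, Photoshop, maybe GIMP) to convert each page to an image. Then write your overlay text onto the images. Finally, use iTextSharp to combine all of the images in each folder into a single PDF.
I know this sounds like a pain, but I assume you should only have to do this once per document.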
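The last step of that pipeline (merging the watermarked page images back into one PDF) could look roughly like the sketch below. It assumes the images are JPEGs whose file names sort in page order (for example zero-padded names such as 0001.jpg), and it maps one image pixel to one PDF point. `ImagesToPdf`, `CombineImages`, and both parameters are hypothetical names used for illustration, not part of iTextSharp.

```vb
Option Strict On
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Module ImagesToPdf
    ' Hypothetical helper: writes every .jpg in imageFolder to outputPdf, one image per page.
    Public Sub CombineImages(ByVal imageFolder As String, ByVal outputPdf As String)
        ' Assumption: file names sort in page order (e.g. 0001.jpg, 0002.jpg, ...)
        Dim imagePaths() As String = Directory.GetFiles(imageFolder, "*.jpg")
        Array.Sort(imagePaths, StringComparer.OrdinalIgnoreCase)
        If imagePaths.Length = 0 Then
            Throw New ArgumentException("No .jpg files found in " & imageFolder)
        End If

        Using fs As New FileStream(outputPdf, FileMode.Create, FileAccess.Write, FileShare.None)
            Dim doc As Document = Nothing
            For Each imagePath As String In imagePaths
                Dim img As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(imagePath)
                ' Size the page to the image (assumption: 1 pixel = 1 point)
                Dim pageSize As New iTextSharp.text.Rectangle(img.Width, img.Height)
                If doc Is Nothing Then
                    ' The first page size has to be set before the document is opened
                    doc = New Document(pageSize, 0, 0, 0, 0)
                    PdfWriter.GetInstance(doc, fs)
                    doc.Open()
                Else
                    ' Later pages: change the size, then start a new page
                    doc.SetPageSize(pageSize)
                    doc.NewPage()
                End If
                ' Place the image in the lower-left corner so it covers the whole page
                img.SetAbsolutePosition(0, 0)
                doc.Add(img)
            Next
            doc.Close()
        End Using
    End Sub
End Module
```

Because every page in the resulting file is a single image with the overlay text already painted into it, there is no separate watermark object left to strip out.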
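If you don't want to go this route, let me get you going on what you'll need to do to extract the images. Much of the code below comes from this post. At the end of the code I save the images to the desktop. Since you already have the raw bytes, you could also easily pump them into a System.Drawing.Image object and write them back into a new PDF, which sounds like something you're already familiar with. Below is a complete WinForms application targeting iTextSharp 5.1.1.0.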
Option Explicit On
Option Strict On
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Imports System.IO
Imports System.Runtime.InteropServices
Public Class Form1
    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        'File to process
        Dim InputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "SampleImage.pdf")
        'Bind a reader to our PDF
        Dim R As New PdfReader(InputFile)
        'Set up some variables to use below
        Dim bytes() As Byte
        Dim obj As PdfObject
        Dim pd As PdfDictionary
        Dim filter, width, height, bpp As String
        Dim pixelFormat As System.Drawing.Imaging.PixelFormat
        Dim bmp As System.Drawing.Bitmap
        Dim bmd As System.Drawing.Imaging.BitmapData
        'Loop through all of the references in the file
        Dim xo = R.XrefSize
        For I = 0 To xo - 1
            'Get the object
            obj = R.GetPdfObject(I)
            'Make sure we have something and that it is a stream
            If (obj IsNot Nothing) AndAlso obj.IsStream() Then
                'Cast it to a dictionary object
                pd = DirectCast(obj, PdfDictionary)
                'See if it has a subtype property that is set to /IMAGE
                If pd.Contains(PdfName.SUBTYPE) AndAlso pd.Get(PdfName.SUBTYPE).ToString() = PdfName.IMAGE.ToString() Then
                    'Grab various properties of the image
                    filter = pd.Get(PdfName.FILTER).ToString()
                    width = pd.Get(PdfName.WIDTH).ToString()
                    height = pd.Get(PdfName.HEIGHT).ToString()
                    bpp = pd.Get(PdfName.BITSPERCOMPONENT).ToString()
                    'Grab the raw bytes of the image
                    bytes = PdfReader.GetStreamBytesRaw(DirectCast(obj, PRStream))
                    'Images can be encoded in various ways. /DCTDECODE is the simplest because it's essentially JPEG and can be treated as such.
                    'If your PDFs contain the other types you will need to figure out how to handle those on your own
                    Select Case filter
                        Case PdfName.ASCII85DECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.ASCIIHEXDECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.FLATEDECODE.ToString()
                            'This code from https://stackoverflow.com/questions/802269/itextsharp-extract-images/1220959#1220959
                            bytes = PdfReader.FlateDecode(bytes, True)
                            'Note: /BitsPerComponent is per color channel, so RGB images usually report 8 here; extend these cases for your files
                            Select Case Integer.Parse(bpp)
                                Case 1
                                    pixelFormat = Drawing.Imaging.PixelFormat.Format1bppIndexed
                                Case 24
                                    pixelFormat = Drawing.Imaging.PixelFormat.Format24bppRgb
                                Case Else
                                    Throw New Exception("Unknown pixel format " & bpp)
                            End Select
                            bmp = New System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat)
                            bmd = bmp.LockBits(New System.Drawing.Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat)
                            Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length)
                            bmp.UnlockBits(bmd)
                            Using ms As New MemoryStream
                                bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Jpeg)
                                'ToArray returns only the bytes written; GetBuffer would include unused padding
                                bytes = ms.ToArray()
                            End Using
                        Case PdfName.LZWDECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.RUNLENGTHDECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.DCTDECODE.ToString()
                            'Bytes should be raw JPEG so they should not need to be decoded, hopefully
                        Case PdfName.CCITTFAXDECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.JBIG2DECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case PdfName.JPXDECODE.ToString()
                            Throw New NotImplementedException("Decoding this filter has not been implemented")
                        Case Else
                            Throw New ApplicationException("Unknown filter found : " & filter)
                    End Select
                    'At this point the byte array should contain valid JPEG data, write it to disk
                    My.Computer.FileSystem.WriteAllBytes(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), I & ".jpg"), bytes, False)
                End If
            End If
        Next
        'Release the reader before closing the form
        R.Close()
        Me.Close()
    End Sub
End Class
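Once you have the JPEG bytes from the loop above, burning the overlay text into the pixels might look something like this. It is only a sketch: `WatermarkSketch` and `StampWatermark` are hypothetical names, and the font ("Arial"), the semi-transparent red color, the -45 degree angle, and the text size are arbitrary assumptions you would tune for your scans.

```vb
Option Strict On
Imports System.Drawing
Imports System.Drawing.Imaging
Imports System.IO

Module WatermarkSketch
    ' Hypothetical helper: returns a new JPEG with the text painted into the pixels.
    Public Function StampWatermark(ByVal jpegBytes() As Byte, ByVal text As String) As Byte()
        Using inputStream As New MemoryStream(jpegBytes)
            Using source As Image = Image.FromStream(inputStream)
                ' Copy onto a 24bpp canvas: Graphics cannot be created for indexed (e.g. 1bpp) bitmaps
                Using canvas As New Bitmap(source.Width, source.Height, PixelFormat.Format24bppRgb)
                    canvas.SetResolution(source.HorizontalResolution, source.VerticalResolution)
                    Using g As Graphics = Graphics.FromImage(canvas)
                        ' Paint the scanned page first
                        g.DrawImage(source, 0, 0, source.Width, source.Height)
                        ' Assumption: text height of roughly 1/8 of the page width
                        Dim fontSize As Single = CSng(canvas.Width) / 8.0F
                        Using watermarkFont As New Font("Arial", fontSize, FontStyle.Bold, GraphicsUnit.Pixel)
                            Using brush As New SolidBrush(Color.FromArgb(96, Color.Red))
                                Using centered As New StringFormat()
                                    centered.Alignment = StringAlignment.Center
                                    centered.LineAlignment = StringAlignment.Center
                                    ' Move the origin to the page center and rotate, then draw the text centered on it
                                    g.TranslateTransform(CSng(canvas.Width) / 2.0F, CSng(canvas.Height) / 2.0F)
                                    g.RotateTransform(-45.0F)
                                    g.DrawString(text, watermarkFont, brush, 0.0F, 0.0F, centered)
                                End Using
                            End Using
                        End Using
                    End Using
                    ' Re-encode the flattened result as JPEG
                    Using output As New MemoryStream()
                        canvas.Save(output, ImageFormat.Jpeg)
                        Return output.ToArray()
                    End Using
                End Using
            End Using
        End Using
    End Function
End Module
```

In the extraction loop you could then call something like bytes = WatermarkSketch.StampWatermark(bytes, "SAMPLE") just before the bytes are written to disk, so the saved image already contains the watermark.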
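The entire page has to be rendered as an image. Otherwise you end up with "text objects" (individual words/letters of text) and a watermark object (the overlay image), which will always be separate parts of the page.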
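Because the documents are scanned, there are no text objects. The whole page is one image. The watermark, in fact, is a text object. But if I can make the watermark an image object, how do I then merge the watermark image and the page image into a single image? – NYSystemsAnalyst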
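Programmatically, you would have to extract the page image, merge it with the watermark, and then replace the original page image with the new one. Note that some scanners OCR the text and embed it in the PDF, which would get around the whole watermark business.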
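Any tips on how to extract the page image and replace it? I'm aware of OCR software, but they chose to scan the documents as images without OCR, and they have already scanned several hundred thousand. – NYSystemsAnalyst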