Scala PDFBox error in code

I wrote a function that reads text from a PDF document, using Scala, Selenium, and PDFBox 2.0.1.

Here is the code:

import java.io.BufferedInputStream
import java.net.URL
import java.util.concurrent.TimeUnit
import org.apache.pdfbox.pdfparser.PDFParser
import org.apache.pdfbox.text.PDFTextStripper
import org.openqa.selenium.firefox.{FirefoxBinary, FirefoxDriver, FirefoxProfile}

// `driver` is assumed to be a Selenium WebDriver created elsewhere in the enclosing class.
def pdfreaddata {
    driver.get("https://www.....pdf")
    driver.manage.timeouts.implicitlyWait(50, TimeUnit.SECONDS)
    val url: URL = new URL(driver.getCurrentUrl)
    println(url)
    val fileToParse: BufferedInputStream = new BufferedInputStream(url.openStream())
    val parser: PDFParser = new PDFParser(fileToParse)  // the error is reported on this line
    parser.parse()
    val output: String = new PDFTextStripper().getText(parser.getPDDocument)
    println("pdf Value" + output)
    parser.getPDDocument.close()

    driver.manage.timeouts.implicitlyWait(100, TimeUnit.SECONDS)
}

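The error is flagged on the PDFParser line: val parser: PDFParser = new PDFParser(fileToParse)

Error message:

Cannot resolve constructor

I also tried the Java version of this code and got the same error.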

2016-05-17 Sera

The correct call is PDDocument doc = PDDocument.load(stream). Using new PDFParser() is an outdated approach. However, I don't know whether that is the cause of your problem.

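You are using PDFBox 2.x, but you are apparently following documentation written for 1.x. There is no such constructor in 2.0. Several things have changed, including parsing. Either follow the migration guide or fall back to 1.8, since it appears to be better documented and has more material available online.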
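For readers staying on PDFBox 2.0.x, the following is a minimal sketch of the PDDocument.load approach mentioned in the comment above, not the asker's final code. The object name PdfText and both helper names are hypothetical, introduced only for illustration. It assumes PDFBox 2.0.x on the classpath; the static PDDocument.load overloads used here are specific to the 2.x line.

```scala
import java.io.InputStream
import java.net.URL
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Hypothetical helper object; assumes PDFBox 2.0.x.
object PdfText {
  // Load the PDF from a stream and return its text, always closing the document.
  def extractText(in: InputStream): String = {
    val doc = PDDocument.load(in) // replaces new PDFParser(stream) + parse() from 1.x
    try new PDFTextStripper().getText(doc)
    finally doc.close()
  }

  // Convenience wrapper: open the URL, extract the text, close the stream.
  def extractTextFromUrl(pdfUrl: String): String = {
    val in = new URL(pdfUrl).openStream()
    try extractText(in)
    finally in.close()
  }
}
```

In the question's function, this would replace the parser lines with something like val output = PdfText.extractTextFromUrl(driver.getCurrentUrl).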

2016-05-17 07:49:24

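Switching to pdfbox 1.8.12 resolved the constructor problem. However, even though the PDF is not password protected, it is reported as encrypted. Below is the final code for extracting text from an encrypted PDF document using Scala. It may be useful to someone in the future.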

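import java.io.BufferedInputStream
import java.net.URL
import java.util.concurrent.TimeUnit
import org.apache.pdfbox.cos.COSDocument
import org.apache.pdfbox.pdfparser.PDFParser
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial
import org.apache.pdfbox.util.PDFTextStripper  // PDFBox 1.8 package; 2.x moved it to org.apache.pdfbox.text

def pdfreaddata {
    val PDF_OWNER_PASSWORD = ""  // the document is encrypted with an empty password
    driver.get("https://www....combo.pdf")
    driver.manage.timeouts.implicitlyWait(50, TimeUnit.SECONDS)
    val url: URL = new URL(driver.getCurrentUrl)
    println(url)
    val fileToParse: BufferedInputStream = new BufferedInputStream(url.openStream())
    val parser: PDFParser = new PDFParser(fileToParse)  // this constructor exists in 1.8
    parser.parse()
    val cosDocument: COSDocument = parser.getDocument()
    val pdDocument: PDDocument = new PDDocument(cosDocument)
    // Decrypt with the empty password before extracting text.
    if (pdDocument.isEncrypted()) {
        val sdm: StandardDecryptionMaterial = new StandardDecryptionMaterial(PDF_OWNER_PASSWORD)
        pdDocument.openProtection(sdm)
    }
    val output: String = new PDFTextStripper().getText(pdDocument)
    println("pdf Value" + output)
    pdDocument.close()  // close the document that was actually used

    driver.manage.timeouts.implicitlyWait(100, TimeUnit.SECONDS)
}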

2016-05-18 02:03:36 Sera

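That is because a file can be encrypted with an empty user password. This happens often, and it is used to restrict permissions (for example, to forbid text extraction or printing).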
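To make the empty-password case concrete, here is a hedged sketch for PDFBox 2.0.x (not the 1.8 code from the answer above). It loads a document with an empty password, reports whether it is encrypted, and extracts text only if the document's permissions allow it. The names EncryptedPdfText, Extraction, and extract are hypothetical and exist only for this illustration.

```scala
import java.io.InputStream
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Hypothetical helper object; assumes PDFBox 2.0.x.
object EncryptedPdfText {
  // Hypothetical result type: what the document reported, plus the text if permitted.
  final case class Extraction(encrypted: Boolean, mayExtract: Boolean, text: Option[String])

  // An empty password is the default, matching the "empty user password" case.
  def extract(in: InputStream, password: String = ""): Extraction = {
    val doc = PDDocument.load(in, password)
    try {
      val permission = doc.getCurrentAccessPermission
      val mayExtract = permission.canExtractContent
      // Only run the stripper when the document grants text extraction.
      val text = if (mayExtract) Some(new PDFTextStripper().getText(doc)) else None
      Extraction(doc.isEncrypted, mayExtract, text)
    } finally doc.close()
  }
}
```

A caller can then check the encrypted and mayExtract fields to tell a document that is merely permission-restricted apart from one that needs a real password, which would fail at load time instead.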
