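High-resolution images from a PDF

I'm working on a project where I need to extract one TIFF per page from multi-page PDFs. The PDFs contain only images, one image per page (I believe they were made with some kind of copier/scanner, but I haven't confirmed that). The TIFFs are then used to create other derivative versions of the documents, so the higher the resolution, the better.

I've found two recipes. Both have helpful aspects, but neither is ideal. I'm hoping someone can help me tweak one of them, or offer a third option.

Recipe 1, pdfimages and ImageMagick:

First do: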

$ pdfimages $MY_PDF.pdf foo

This results in several .pbm files (named foo-000.pbm, foo-001.pbm, etc.).

Then for each *.pbm do:

$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif
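A minimal sketch of the "for each *.pbm" step as a loop, assuming the foo prefix from the pdfimages call above and that the TIFFs should sit next to the .pbm files. The helper name tif_name_for is made up for illustration.

```shell
#!/bin/sh
# Map a pdfimages output name to a TIFF name: foo-000.pbm -> foo-000.tif
# (tif_name_for is a hypothetical helper, not part of any tool.)
tif_name_for() {
    printf '%s\n' "${1%.pbm}.tif"
}

# Convert every foo-*.pbm in the current directory.
for each in foo-*.pbm; do
    [ -e "$each" ] || continue   # the glob matched nothing
    # '3200x3200>' only shrinks images larger than 3200 px; quoting it keeps
    # the shell from treating '>' as a redirection.
    convert "$each" -resize '3200x3200>' -quality 100 "$(tif_name_for "$each")"
done
```

Quoting the geometry does the same job as the backslash in the command above.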

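Pro: the resulting TIFFs are a healthy 3300+ px on the long dimension (the resize is only there to normalize everything).

Con: the orientation of the pages is lost; they come out rotated in different directions (they follow logical patterns, so it is probably the orientation in which they were fed to the scanner?).

Recipe 2, ImageMagick solo: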

convert +adjoin $MY_PDF.pdf pages.tif

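This gives me one TIFF per page (pages-0.tif, pages-1.tif, etc.).

Pro: the orientation stays!

Con: the long dimension of the resulting files is < 800 px, which is too small to be useful, and it looks as if some compression has been applied.

How can I ditch the scaling of the image stream in the PDF, but retain the orientation? Is there some more magick in ImageMagick that I'm missing? Something else entirely?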


2012-01-11 JStroop

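Are you willing to use a non-free solution? – BitBank 2012-01-12 00:35:16

Maybe. It would need to have an API (no GUI) and be reasonably easy to integrate; I'm processing tens of thousands of documents. What do you have in mind? – JStroop 2012-01-12 03:03:23

Write to me with the details and I'll see if I can help ([email protected]). – BitBank 2012-01-12 03:28:57

Sorry for the noise on this old topic, but Google brought me here as one of the top results and someone else may need this, so I thought I'd post the solution to the original poster's problem that I found here: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick

In short: you have to tell ImageMagick at which density it should scan the PDF.

So convert -density 600x600 foo.pdf foo.png will tell ImageMagick to treat the PDF as if it had a resolution of 600 dpi, and thus output a much larger PNG. In my case the resulting foo.png was 5000x6600 px. You can optionally add -resize 3000x3000, or whatever size you need, and it will be scaled down.

Note that as long as your PDF contains only vector images or text, the density can be set as high as you need. If the PDF contains rasterized images, it won't look good if you set the density higher than the dpi of those images. Surprise! :)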
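The arithmetic behind the density setting can be sketched in a few lines of shell. This assumes a US Letter page (612x792 pt), which is a guess and not something the thread states, and uses ImageMagick's default density of 72 dpi for the first call; both function names are made up for illustration.

```shell
#!/bin/sh
# Pixels produced when rendering a length given in PDF points (1/72 inch).
# Integer arithmetic, so results are truncated.
px_at_density() {   # usage: px_at_density POINTS DPI
    echo $(( $1 * $2 / 72 ))
}

# Native resolution of a scan: image pixels spread over a length in points.
native_dpi() {      # usage: native_dpi PIXELS POINTS
    echo $(( $1 * 72 / $2 ))
}

px_at_density 792 72    # prints 792  (11 in page height at the 72 dpi default)
px_at_density 792 600   # prints 6600 (same page at -density 600)
native_dpi 3300 792     # prints 300  (a 3300 px scan on an 11 in page)
```

Under that page-size assumption, a 3300 px scan has a native resolution of about 300 dpi, so a density above 300 would only upsample it.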

Chris


2013-01-02 09:22:06 Betagan

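Awesome, thanks! Good to hear, since I never got an answer. For completeness, here is my final recipe for making single-page TIFFs, normalizing the size, and converting to grayscale: 'convert +adjoin -density 300x300 -depth 8 -resize 3200x3200\> in.pdf out_prefix.tif' – JStroop 2013-01-02 14:36:35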

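I thought I'd share my solution. It may not work for everyone, but since nothing else has come along it may help others. I went with the first option in my question, which was to use pdfimages to get large images that were rotated every which way. I then found a way to use OCR and word counts to guess the orientation, which took me from (an estimated) 25% correctly rotated to above 90%.

The flow is as follows:

Use pdfimages (apt-get install poppler-utils) to get a set of pbm files (not shown below).
For each file:
1. Make four versions, rotated 0, 90, 180, and 270 degrees (I refer to them as "north", "east", "south", and "west" in my code).
2. OCR each. The two with the lowest word count are likely the right-side-up and upside-down versions. This was over 99% accurate in the set of images I have processed to date.
3. From the two with the lowest word count, run the OCR output through a spell checker. The file with the fewest spelling errors (i.e. the most recognizable words) is likely correct. For my set this was about 93% accurate (up from 25%), based on a sample of 500.

Your mileage may vary. My files are bitonal and highly textual. The source images average 3300 px on the long side. I can't speak to grayscale or color, or to files with a lot of images. Most of my source PDFs are bad scans of old photocopies, so accuracy might be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5x). I chose ocrad for speed rather than accuracy, since I only need rough numbers and throw the OCR away. Re: performance, my nothing-special Linux desktop machine can run the whole script at about 2-3 files per second.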
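The selection logic in steps 2 and 3 can be tried out on its own with made-up numbers. The counts below are purely illustrative, not measurements from any real file.

```shell
#!/bin/sh
# Made-up OCR word counts per orientation (illustrative only).
counts='412 east
118 north
131 south
397 west'

# Step 2: keep the two orientations with the fewest OCR "words".
candidates=$(printf '%s\n' "$counts" | sort -n | head -n 2)

# Made-up misspelling counts for those two candidates (illustrative only).
misspelled='57 south
9 north'

# Step 3: the candidate with the fewest misspellings wins.
winner=$(printf '%s\n' "$misspelled" | sort -n | head -n 1 | awk '{ print $2 }')

printf '%s\n' "$candidates"   # prints "118 north" then "131 south"
echo "winner: $winner"        # prints "winner: north"
```

The sideways orientations drop out in step 2 because OCR on vertical text produces a lot of whitespace-separated junk, which inflates the word count.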

Here is the implementation as a simple bash script:

#!/bin/bash
# Rotates a pbm file in place, guessing the upright orientation via OCR.
# Note: paths containing whitespace will break the awk field parsing below.

# Pass a .pbm as the only arg.
file=$1

# Private scratch directory, so stale or concurrent runs cannot collide.
TMP=$(mktemp -d /tmp/rotation-calc.XXXXXX) || exit 1

# Dependencies:
# convert: apt-get install imagemagick
# ocrad: sudo apt-get install ocrad
# aspell (with an English dictionary) is also required.
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"

# Make copies in all four orientations (the src file is north; copy it to make
# things less confusing)
file_name=$("$BASENAME" "$file")
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"

cp "$file" "$north_file"
"$CONVERT" "$file" -rotate 90 "$east_file"
"$CONVERT" "$file" -rotate 180 "$south_file"
"$CONVERT" "$file" -rotate 270 "$west_file"

# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"

"$OCRAD" -f -F utf8 "$north_file" -o "$north_text"
"$OCRAD" -f -F utf8 "$east_file" -o "$east_text"
"$OCRAD" -f -F utf8 "$south_file" -o "$south_text"
"$OCRAD" -f -F utf8 "$west_file" -o "$west_text"

# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
# Each record is: <word count> <txt file> <image file>
wc_table="$TMP/wc_table"
echo "$("$WC" -w "$north_text") $north_file" > "$wc_table"
echo "$("$WC" -w "$east_text") $east_file" >> "$wc_table"
echo "$("$WC" -w "$south_text") $south_file" >> "$wc_table"
echo "$("$WC" -w "$west_text") $west_file" >> "$wc_table"

# Take the two lowest counts; these are likely right side up and upside down,
# but generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
"$SORT" -n "$wc_table" | "$HEAD" -n 2 > "$bottom_two_wc_table"

# Spellcheck. The lowest number of misspelled words is most likely the
# correct orientation.
misspelled_words_table="$TMP/misspelled_words_table"
while read -r record; do
    txt=$(echo "$record" | "$AWK" '{ print $2 }')
    misspelled_word_count=$("$ASPELL" -l en list < "$txt" | "$WC" -w)
    echo "$misspelled_word_count $record" >> "$misspelled_words_table"
done < "$bottom_two_wc_table"

# Do the sort and overwrite the input file with the winning orientation.
# Each record is now: <misspelled> <word count> <txt file> <image file>
winner=$("$SORT" -n "$misspelled_words_table" | "$HEAD" -n 1)
rotated_file=$(echo "$winner" | "$AWK" '{ print $4 }')

mv "$rotated_file" "$file"

# Clean up.
if [ -d "$TMP" ]; then
    rm -r "$TMP"
fi
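To run the script over a whole directory of pdfimages output, a small driver loop is enough. This is only a sketch: rotate_all is a made-up function name, and rotate-pbm.sh in the usage line is a hypothetical file name for the script above.

```shell
#!/bin/sh
# Run a per-file rotation script over every .pbm in a directory.
# usage: rotate_all DIR SCRIPT
# (rotate_all is a hypothetical helper; SCRIPT is wherever the script was saved.)
rotate_all() {
    dir=$1
    script=$2
    for pbm in "$dir"/*.pbm; do
        [ -e "$pbm" ] || continue               # no .pbm files in DIR
        # Report a failure on stderr and carry on with the next file.
        "$script" "$pbm" || echo "failed: $pbm" >&2
    done
}

# Example (hypothetical paths): rotate_all ./pages ./rotate-pbm.sh
```

Files are handled one at a time, so a single bad page is reported and does not stop the rest of the batch.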


2012-03-19 21:31:43 JStroop
