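Parsing OCR text output is an exercise in frustration and corner cases. You have to actually implement a parser that handles each kind of data you may encounter. There is no good way to know that your parser is correct, because more edge cases can turn up in the future, which makes the solution brittle and potentially buggy.
With that explanation out of the way, here is one way you could proceed: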
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;

my @fields;
my @extra_descriptions;
my @results;

# Three-argument open keeps the mode separate from the file name
open my $fh, '<', 'input' or die "Unable to open 'input' : $!";
while (<$fh>) {
    my @data;
    chomp;                   # Remove newline
    s|^\s+||;                # Remove leading spaces
    s|\s+$||;                # Remove trailing spaces
    next unless m|\w|;       # Skip empty lines
    @data = split /\s\s+/;   # Split on 2 or more spaces

    # Parse Header
    if ($. == 1) {
        @fields = @data;
        next;
    }

    if (1 == scalar @data) {
        # Extra Size Description; held until the next full row
        push @extra_descriptions, shift @data;
        next;
    } elsif (4 == scalar @data or 5 == scalar @data) {
        my $sn   = shift @data;
        my $desc = '';
        # Deal with possibly missing Size info
        if (4 == scalar @data) {
            # 4 columns left, so the first one is the Size
            my $size = shift @data;
            $desc = join(', ', $size, @extra_descriptions);
        } else {
            # 3 columns, so missing Size info
            # Reverse because now main description is last
            $desc = join(', ', reverse @extra_descriptions);
        }
        unshift(@data, $sn, $desc);

        # Data should be 5 columns
        (5 == scalar @data) or die "Something went wrong with data: " . join("\n", @data);
        # Size (description) should be column 1 (second column)
        $data[1] =~ m|[FWx]| or die "Could not figure out size! $data[1]";

        my %row;
        my @field_names = qw(serialno size quantity unit_price total_price);
        for my $i (0 .. $#field_names) {
            my $name = $field_names[$i];
            my $desc = $name . "_desc";
            $row{$name} = $data[$i];    # Parsed value
            $row{$desc} = $fields[$i];  # Matching header text
        }

        # TODO: Insert data into database here
        print Dumper(\%row);

        # Reset
        @extra_descriptions = ();
    } else {
        # Not 1, 4 or 5 columns
        die "Do not know what to do about this row: '$_'";
    }
}
close $fh;
Output:
$VAR1 = {
          'quantity' => '1 SET',
          'quantity_desc' => 'QTY',
          'serialno' => '01',
          'serialno_desc' => 'S/NO',
          'size' => 'FW-50(S) (5 x 5 x 2 MH)',
          'size_desc' => 'INSULATED TANK SIZE',
          'total_price' => '131,592.00',
          'total_price_desc' => 'TOTAL PRICE (Qr.)',
          'unit_price' => '131,592.00',
          'unit_price_desc' => 'U.PRICE(Qr.)'
        };
$VAR1 = {
          'quantity' => '1 SET',
          'quantity_desc' => 'QTY',
          'serialno' => '02',
          'serialno_desc' => 'S/NO',
          'size' => 'FW-120(S) (10 x 6 x 2 MH) w/p, w/p(3+2)',
          'size_desc' => 'INSULATED TANK SIZE',
          'total_price' => '252,330.00',
          'total_price_desc' => 'TOTAL PRICE (Qr.)',
          'unit_price' => '252,330.00',
          'unit_price_desc' => 'U.PRICE(Qr.)'
        };
$VAR1 = {
          'quantity' => '1 SET',
          'quantity_desc' => 'QTY',
          'serialno' => '03',
          'serialno_desc' => 'S/NO',
          'size' => 'FW-2(S) (1 x 2 x 1 MH) w/p (1+1), (5+5)',
          'size_desc' => 'INSULATED TANK SIZE',
          'total_price' => '14,471.00',
          'total_price_desc' => 'TOTAL PRICE (Qr.)',
          'unit_price' => '14,471.00',
          'unit_price_desc' => 'U.PRICE(Qr.)'
        };
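Because new edge cases can turn up later, it may help to sanity-check each parsed row before inserting it into the database. The following is only a minimal sketch, not part of the script above: the helpers `parse_price` and `row_problems` are hypothetical names, and the sketch assumes that the quantity always starts with an integer (as in '1 SET') and that the total price should equal quantity times unit price, which holds for the three rows shown but may not hold for every invoice.

```perl
#!/usr/bin/env perl
use warnings;
use strict;

# Hypothetical helper: turn a price string such as '131,592.00'
# into a plain number, or return undef if it does not look like one.
sub parse_price {
    my ($text) = @_;
    return undef unless defined $text;
    (my $clean = $text) =~ tr/,//d;    # Drop thousands separators
    return $clean =~ m/^\d+(?:\.\d+)?$/ ? $clean + 0 : undef;
}

# Hypothetical check: return a list of problems found in one parsed row.
# Assumption: total_price == quantity * unit_price.
sub row_problems {
    my ($row) = @_;
    my @problems;

    my ($qty) = ($row->{quantity} // '') =~ m/^(\d+)\b/;
    my $unit  = parse_price($row->{unit_price});
    my $total = parse_price($row->{total_price});

    push @problems, "bad serial number"
        unless ($row->{serialno} // '') =~ m/^\d+$/;
    push @problems, "bad quantity"    unless defined $qty;
    push @problems, "bad unit price"  unless defined $unit;
    push @problems, "bad total price" unless defined $total;

    if (defined $qty and defined $unit and defined $total) {
        push @problems, "total does not match quantity * unit price"
            if abs($qty * $unit - $total) > 0.005;
    }
    return @problems;
}

# Usage with values taken from the first row of the output above
my %row = (
    serialno    => '01',
    quantity    => '1 SET',
    unit_price  => '131,592.00',
    total_price => '131,592.00',
);
my @problems = row_problems(\%row);
print @problems
    ? "Row $row{serialno}: " . join('; ', @problems) . "\n"
    : "Row $row{serialno} looks consistent\n";
```

A check like this will not prove the parser correct, but a misread digit or a shifted column is more likely to be reported than silently stored.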
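Welcome to Stack Overflow. We are ready to help you, but unfortunately we cannot do your work for you. Have you tried anything in PHP or Perl? – ssr1012
Is this a CSV file (tab-delimited), or are the columns separated only by spaces? – ssr1012
Yes, I tried some regular expressions to read this data from the PDF, but I could not read the data from the INSULATED TANK SIZE column. The columns are space-separated. –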