2010-10-14 97 views
6

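I'm looking for something like HTML::TableExtract, only not for HTML input but for plain-text input that contains "tables" formatted with indentation and spacing. How can I extract/parse tabular data from a text file in Perl?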

The data looks like this:

Here is some header text. 

Column One  Column Two  Column Three 
a           b 
a     b      c 


Some more text 

Another Table  Another Column 
abdbdbdb   aaaa 

Please provide an example. – DVK 2010-10-14 04:39:37

I've provided a solution, but it produces six columns. Are you assuming that a column separator must be more than one space? – DVK 2010-10-14 04:49:14

No, but we can assume that I know the column header strings and that the column data is properly aligned under the headers. – Thilo 2010-10-14 04:51:38

Answers

1

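I'm not aware of any packaged solution, but something not very flexible is fairly simple to do, assuming you can make two passes over the file (what follows is partially Perlish pseudocode):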

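  • Assumption: the data may contain spaces and is NOT quoted a la CSV when it does. If that is not the case, just use Text::CSV(_XS).
  • Assumption: no tabs are used for formatting.
  • The logic defines a "column separator" as any run of consecutive character positions that contain a space on 100% of the lines.
  • If by accident every line has a space at offset M that is actually part of the data, the logic will treat offset M as a column separator, since it cannot know any better. The only way it can know better is if you require columns to be separated by at least X spaces, where X > 1; see the second code fragment for that.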

Sample code:

use strict;
use warnings;

# Hypothetical: the input path is taken from the command line
# (the original elides the arguments to open()).
my $file = shift @ARGV;

my $INFER_FROM_N_LINES = 10; # Infer columns from this # of lines
                             # 0 means from entire file
my $lines_scanned = 0;
my @non_spaces; # $non_spaces[$i] is true if any scanned line has a non-space at offset $i

# First pass - find which character columns in the file have all spaces and which don't
open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (<$fh>) {
    last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
    chomp;
    my @chars = split(//, $_);
    for (my $i = 0; $i < @chars; $i++) {
        $non_spaces[$i] = 1 if $chars[$i] ne " ";
    }
}
close $fh or die;

# Find columns, defined as consecutive "non-spaces" slices.
my (@starts, @ends); # Index at which columns start and end
my $state = " ";     # Not inside a column
for (my $i = 0; $i < @non_spaces; $i++) {
    next if $state eq " " && !$non_spaces[$i];
    next if $state eq "c" && $non_spaces[$i];
    if ($state eq " ") { # && $non_spaces[$i] of course => start column
        $state = "c";
        push @starts, $i;
    } else { # meaning $state eq "c" && !$non_spaces[$i] => end column
        $state = " ";
        push @ends, $i - 1;
    }
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}

# Second pass - now split lines
open($fh, '<', $file) or die "Cannot open $file: $!";
my @rows = ();
while (<$fh>) {
    chomp;
    my $line = $_;
    my @columns = ();
    push @rows, \@columns;
    for (my $col_num = 0; $col_num < @starts; $col_num++) {
        # A line shorter than the column start yields an empty field
        # instead of a "substr outside of string" warning.
        $columns[$col_num] = $starts[$col_num] < length($line)
            ? substr($line, $starts[$col_num], $ends[$col_num] - $starts[$col_num] + 1)
            : "";
    }
}
close $fh or die;
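Note that the first pass looks at every scanned line, so free text and several tables in one file all feed the same column inference. One possible way around that, assuming blocks are separated by truly empty lines as in the question's sample, is to read the input in Perl's paragraph mode and handle each block on its own. This is only a sketch: `read_blocks` and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Split the input into blank-line-separated blocks.
# Returns a list of array refs, one per block, each holding that block's lines.
sub read_blocks {
    my ($fh) = @_;
    local $/ = '';    # paragraph mode: one record per run of non-blank lines
    my @blocks;
    while (my $para = <$fh>) {
        my @lines = grep { /\S/ } split /\n/, $para;
        push @blocks, \@lines if @lines;
    }
    return @blocks;
}

# Hypothetical sample input modelled on the layout in the question.
my $text = <<'END';
Here is some header text.

Column One  Column Two  Column Three
a           b
a           b           c


Some more text

Another Table  Another Column
abdbdbdb       aaaa
END

open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";
my @blocks = read_blocks($in);
close $in;

printf "%d line(s), first: %s\n", scalar @$_, $_->[0] for @blocks;
```

Each block could then be run through the column inference separately, so that a line of prose does not fill in the gaps between a table's columns.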

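Now, if you require columns to be separated by at least X spaces, where X > 1, that is also doable, but the parser that locates the columns needs to be a bit more complex: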

# Find columns, defined as consecutive "non-spaces" slices separated by at least 3 spaces.
# Uses @non_spaces from the first pass and replaces the simpler "find columns" step.
my $min_col_separator_is_X_spaces = 3;
my (@starts, @ends); # Index at which columns start and end
my $state = "S";     # inside a separator
NEXT_CHAR: for (my $i = 0; $i < @non_spaces; $i++) {
    if ($state eq "S") { # done with last column, inside a separator
        if ($non_spaces[$i]) { # start a new column
            $state = "c";
            push @starts, $i;
        }
        next;
    }
    # $state eq "c": processing a column
    next if $non_spaces[$i];
    # First space after a non-space.
    # Could be the beginning of a separator? Check the next X chars!
    for (my $j = $i + 1;
         $j < @non_spaces && $j < $i + $min_col_separator_is_X_spaces;
         $j++) {
        if ($non_spaces[$j]) {
            $i = $j;        # Gap is too narrow, so it is data; no need to re-scan it
            next NEXT_CHAR; # OUTER loop
        }
    }
    # If we reach here, the next X chars are spaces! Column ended!
    push @ends, $i - 1;
    $state = "S";
    $i += $min_col_separator_is_X_spaces - 1; # the loop's $i++ skips the last one
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}
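The same idea can be sketched without a hand-written state machine. This is a swapped-in technique, not the code above: build a mask string with an `x` at every offset where some line has a non-space, then let a regex find runs of `x` that tolerate gaps narrower than the minimum separator. The name `infer_columns` and the sample rows are assumptions made for illustration.

```perl
use strict;
use warnings;

# Return [start, end] character offsets for each inferred column.
# Gaps narrower than $min_sep all-space positions are treated as data.
sub infer_columns {
    my ($lines, $min_sep) = @_;

    my $width = 0;
    for my $line (@$lines) {
        $width = length($line) if length($line) > $width;
    }

    # Mask: 'x' where any line has a non-space character, ' ' elsewhere.
    my $mask = ' ' x $width;
    for my $line (@$lines) {
        while ($line =~ /[^ ]/g) {
            substr($mask, $-[0], 1) = 'x';
        }
    }

    # A column is an 'x' followed by any number of (short gap, 'x') pairs.
    my $max_gap = $min_sep - 1;
    my $column  = qr/x(?:[ ]{0,$max_gap}x)*/;

    my @columns;
    while ($mask =~ /$column/g) {
        push @columns, [ $-[0], $+[0] - 1 ];
    }
    return @columns;
}

# Hypothetical sample: the question's header row with aligned data rows.
my @lines = (
    'Column One  Column Two  Column Three',
    sprintf('%-12s%s',      'a', 'b'),
    sprintf('%-12s%-12s%s', 'a', 'b', 'c'),
);

my @by_one = infer_columns(\@lines, 1);
my @by_two = infer_columns(\@lines, 2);
printf "min 1 space: %d columns; min 2 spaces: %d columns\n",
    scalar @by_one, scalar @by_two;
```

On this sample a minimum separator of one space should give six columns, because every header is split at its internal space, which matches the six columns mentioned in the comments. Requiring two spaces should give the expected three.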
1

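Here is a very quick solution, with comments giving an overview. Basically, if a "word" starts after the start of column header n, it ends up in column n, unless most of its body falls into column n + 1, in which case it ends up there instead. Tidying it up, extending it to support multiple different tables, and so on are left as an exercise. You could also use something other than the left offset of the column header as the boundary mark, such as its centre, or some value determined by the column number.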

#!/usr/bin/perl 


use warnings; 
use strict; 


# Just plug your headers in here... 
my @headers = ('Column One', 'Column Two', 'Column Three'); 

# ...and get your results as an array of arrays of strings. 
my @result =(); 


my $all_headers = '(' . (join ').*(', @headers) . ')'; 
my $found = 0; 
my @header_positions; 
my $line = ''; 
my $row = -1; # Index of the current data row; incremented before each row is filled
push @result, [] for (1 .. @headers); 


# Get lines from file until a line matching the headers is found. 

while (defined($line = <DATA>)) { 

    # Get the positions of each header within that line. 

    if ($line =~ /$all_headers/) { 
     @header_positions = @-[1 .. @headers]; 
     $found = 1; 
     last; 
    } 

} 


$found or die "Table not found! :<\n"; 


# For each subsequent nonblank line: 

while (defined($line = <DATA>)) { 
    last if $line =~ /^$/; 

    push @{$_}, "" for (@result); 
    ++$row; 

    # For each word in line: 

    while ($line =~ /(\S+)/g) { 

     my $word = $1; 
     my $position = $-[1]; 
     my $length = $+[1] - $position; 
     my $column = -1; 

     # Get column in which word starts. 

     while ($column < $#headers && 
      $position >= $header_positions[$column + 1]) { 
      ++$column; 
     } 

     # If word is not fully within that column, 
     # and more of it is in the next one, put it in the next one. 

     if (!($column == $#headers || 
      $position + $length < $header_positions[$column + 1]) && 
      $header_positions[$column + 1] - $position < 
      $position + $length - $header_positions[$column + 1]) { 

      my $element = \$result[$column + 1]->[$row]; 
      $$element .= " $word"; 

     # Otherwise, put it in the one it started in. 

     } else { 

      my $element = \$result[$column]->[$row]; 
      $$element .= " $word"; 

     } 

    } 

} 


# Output! Eight-column tabs work best for this demonstration. :P 

foreach my $i (0 .. $#headers) { 
    print $headers[$i] . ": "; 
    foreach my $c (@{$result[$i]}) { 
     print "$c\t"; 
    } 
    print "\n"; 
} 


__DATA__ 

This line ought to be ignored. 

Column One  Column Two  Column Three 
These lines are part of the tabular data to be processed. 
The data are split based on how much words overlap columns. 

This line ought to be ignored also. 

Sample output:

 
Column One:  These lines are   The data are split 
Column Two:  part of the tabular  based on how 
Column Three: data to be processed. much words overlap columns. 
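
If the data really is aligned under known headers, as the asker says in the comments, a stricter variant is possible: find each header's offset once and cut every row at those offsets. The following is only a minimal sketch under that assumption; `header_offsets` and `slice_row` are hypothetical helper names, and the sample row is made up.

```perl
use strict;
use warnings;

# Offsets of each known header within the header line,
# or an empty list if any header is missing.
sub header_offsets {
    my ($header_line, @headers) = @_;
    my @offsets;
    my $from = 0;
    for my $header (@headers) {
        my $at = index($header_line, $header, $from);
        return if $at < 0;
        push @offsets, $at;
        $from = $at + length($header);
    }
    return @offsets;
}

# Cut one data line at the header offsets and trim each field.
sub slice_row {
    my ($line, @offsets) = @_;
    my @fields;
    for my $i (0 .. $#offsets) {
        my $start = $offsets[$i];
        my $field = '';
        if ($start < length($line)) {
            $field = $i < $#offsets
                ? substr($line, $start, $offsets[$i + 1] - $start)
                : substr($line, $start);
        }
        $field =~ s/^\s+|\s+$//g;
        push @fields, $field;
    }
    return @fields;
}

my @headers = ('Column One', 'Column Two', 'Column Three');
my @offsets = header_offsets('Column One  Column Two  Column Three', @headers);
my @row     = slice_row(sprintf('%-12s%s', 'a', 'b'), @offsets);
print join('|', @row), "\n";
```

Unlike the word-overlap rule above, this cuts at fixed offsets, so a value that drifts left of its header would be split across two fields. It only suits input where the alignment can be trusted.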