2009-04-21 65 views
2

比方說,我有一個製表符分隔的文本文件,其中包含按列(帶標題)排列的數據。有解析柱狀文本的Perl模塊嗎?

有可能不同的列可以「堆疊」成「工作表」狀排列,即存在允許不同列垂直排列的一些分隔符(可能提前或可能未提前知道)。

是否有一個Perl模塊,有利於將此文本文件中的列數據解析爲數據結構(例如,鍵爲列標題,值爲列數據標量數組的哈希表)?通過「堆疊」,我的意思是一列文本可能包括多個單獨的數據「矢量」,每個數據都有不同的標題和不同的長度。無可否認,這使解析變得複雜。

編輯我老實不確定混淆的地方。儘管如此,這裏是一個例子:

header_one\theader_three 
data_1\tdata_7 
data_2\tdata_8 
data_3\tdata_9 
\tdata_10 
header_two\tdata_11 
data_4\theader_four 
data_5\tdata_12 
data_6\tdata_13 
\tdata_14 

該腳本將變成一個哈希表這個具有四個鍵:header_oneheader_twoheader_three,和header_four,每個鍵引用的數組引用指向頭部下方的data_n元件。

+0

你可能要展示一個例子......我很難形象化。 – Tanktalus 2009-04-21 22:40:28

回答

2

我認爲這是接近你在說什麼。如果列數發生變化,則將輸入視爲不同的表格。這個代碼可以很容易地修改,以識別一些其他標記(例如一行等號)而不是使用列計數。

#!/usr/bin/perl 

use strict; 
use warnings; 

use Text::CSV_XS; 

#setup the parser, here we want tab separated and we allow 
#loose quoting, so qq/foo\t"bar\tbaz"\tquux/ is 
#("foo", "bar\tbaz", "quux") 
my $p = Text::CSV_XS->new(
    { 
     sep_char   => "\t", 
     allow_loose_quotes => 1, 
    } 
); 

my @stacked; 
my $cur = 0; 
while (<>) { 
    $p->parse($_) or die $p->error_input; 
    my @rec = $p->fields; 
    #normal case, just add the record to the last 
    #section in @stacked 
    if (@rec == $cur) { 
     push @{$stacked[-1]}, \@rec; 
     next; 
    } 
    #if the number of columns don't match then 
    #we have a new section 
    push @stacked, [\@rec]; 
    $cur = @rec; #set the new number of columns 
} 

for my $table (@stacked) { 
    print "header: ", join("::", @{$table->[0]}), "\n"; 
    for my $i (1 .. $#$table) { 
     print "data: ", join("::", @{$table->[$i]}), "\n"; 
    } 
    print "\n"; 
} 
2

如果可能的話,我會從DBD::CSV開始,儘管您的「堆棧」要求(我不完全理解)可能需要一些手動解析,使用Text::CSV_XS

不要被他們的名字所愚弄 - 他們可以用任何分隔符來解析,而不僅僅是逗號。

-1

不是很順利,但我一直在做這樣的:

 my $recordType = unpack("A3", $_); 

     if ($recordType eq "APT") 
     { 
      $currentKey = parseFAAAirportAirportRecord($_); 
     } 
     elsif ($recordType eq "ATT") 
     { 
      parseFAAAirportAttendenceRecord($currentKey, $_); 
     } 
     elsif ($recordType eq "RWY") 
     { 
      parseFAAAirportRunwayRecord($currentKey, $_); 
     } 
     elsif ($recordType eq "RMK") 
     { 
      parseFAAAirportRemarkRecord($currentKey, $_); 
     } 
... 
sub parseFAAAirportAirportRecord($) 
{ 
    my ($line) = @_; 

    my ($recordType, $datasource_key, $type, $id, $effDate, $faaRegion, 
     $faaFieldOffice, $state, $stateName, $county, $countyState, 
     $city, $name, $ownershipType, $facilityUse, $ownersName, 
     $ownersAddress, $ownersCityStateZip, $ownersPhone, $facilitiesManager, 
     $managersAddress, $managersCityStateZip, $managersPhone, 
     $formattedLat, $secondsLat, $formattedLong, $secondsLong, 
     $refDetermined, $elev, $elevDetermined, $magVar, $magVarEpoch, $tph, 
     $sectional, $distFromTown, $dirFromTown, $acres, 
     $bndryARTCC, $bndryARTCCid, 
     $bndryARTCCname, $respARTCC, $respARTCCid, $respARTCCname, 
     $fssOnAirport, $fssId, $fssName, $fssPhone, $fssTollFreePhone, 
     $altFss, $altFssName, 
     $altFssPhone, $notamFacility, $notamD, $arptActDate, 
     $arptStatusCode, $arptCert, 
     $naspAgreementCode, $arptAirspcAnalysed, $aoe, $custLandRights, 
     $militaryJoint, $militaryRights, $nationalEmergency, $milUse, 
     $inspMeth, $inspAgency, $lastInsp, $lastInfo, $fuel, $airframeRepairs 
, 
     $engineRepairs, $bottledOyxgen, $bulkOxygen, 
     $lightingSchedule, $tower, $unicomFreqs, $ctafFreq, $segmentedCircle, 
     $lens, $landingFee, $isMedical, 
     $numBasedSEL, $numBasedMEL, $numBasedJet, 
     $numBasedHelo, $numBasedGliders, $numBasedMilitary, 
     $numBasedUltraLight, 
     $numScheduledOperation, $numCommuter, $numAirTaxi, 
     $numGAlocal, $numGAItinerant, 
     $numMil, $countEndingDate, 
     $aptPosSrc, $aptPosSrcDate, $aptElevSrc, $aptElevSrcDate, 
     $contractFuel, $transientStorage, $otherServices, $windIndicator, 
     $icaoId) = 
     unpack("A3 A11 A13 A4 A10 A3 A4 A2 A20 A21 A2 A40 " . 
     "A42 A2 A2 A35 A72 A45 A16 A35 A72 A45 A16 A15 A12 A15 A12 A1 A5 A1 " . 
     "A3 A4 A4 A30 A2 A3 A5 A4 A3 A30 A4 A3 A30 A1 A4 A30 A16 A16 " . 
     "A4 A30 A16 A4 " . 
     "A1 A7 A2 A15 A7 A13 A1 A1 A1 A1 A18 A6 A2 A1 A8 A8 A40 A5 A5 A8 " . 
     "A8 A9 A1 A42 A7 A4 A3 A1 A1 A3 A3 A3 A3 A3 A3 A3 " . 
     "A6 A6 A6 A6 A6 A6 A10" . 
     "A16 A10 A16 A10 A1 A12 A71 A3 A7", $line);