Remove/extract rows based on unique/duplicate Id in a CSV file

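Depending on how you look at it, I need either to delete rows whose Id is unique, or to extract rows whose Id has duplicates (keeping all of the duplicates). I am not sure how to do this and do not know enough Perl to accomplish it. I have found similar topics, but did not have much success with them. These are the examples I worked with: example 1, example 2 and example 3. In a previous question someone showed me a solution using the List::MoreUtils module, so that I could merge values sharing a common Id. That is not the case here: this time a row must be deleted if its Id is unique. I know I could do this with List::MoreUtils, but I would like to do without it. Here is my dummy data (sample data copied from the other question, because the data itself is irrelevant), where you can see what I am after. Order is not important.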

Before:

Cat_id;Cat_name;Id;Name;Amount;Colour;Bla 
101;Fruits;50010;Grape;500;Red;1 
101;Fruits;50020;Strawberry;500;Red;1 
201;Vegetables;60010;Carrot;500;White;1 
101;Fruits;50060;Apple;1000;Red;1 
101;Fruits;50030;Banana;1000;Green;1 
101;Fruits;50060;Apple;500;Green;1 
101;Fruits;50020;Strawberry;1000;Red;1 
201;Vegetables;60010;Carrot;100;Purple;1 
101;Fruits;50020;Strawberry;200;Red;1

After:

Cat_id;Cat_name;Id;Name;Amount;Colour;Bla 
101;Fruits;50020;Strawberry;500;Red;1 
201;Vegetables;60010;Carrot;500;White;1 
101;Fruits;50060;Apple;1000;Red;1 
101;Fruits;50060;Apple;500;Green;1 
101;Fruits;50020;Strawberry;1000;Red;1 
201;Vegetables;60010;Carrot;100;Purple;1 
101;Fruits;50020;Strawberry;200;Red;1

As you can see, the Grape and Banana rows with Ids 50010 and 50030 have been removed, because each of those Ids has only one entry.

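Here is my script. I am struggling to pick the unique values out of the hash and output the rest (using the Text::CSV_XS module). Can someone show me how to do this?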

#!/usr/bin/perl -w 
use strict; 
use warnings; 
use Text::CSV_XS; 

my $inputfile = shift || die "Give input and output names!\n"; 
my $outputfile = shift || die "Give output name!\n"; 

open (my $infile, '<:encoding(iso-8859-1)', $inputfile) or die "Sourcefile in use/not found :$!\n"; 
open (my $outfile, '>:encoding(UTF-8)', $outputfile) or die "Outputfile in use :$!\n"; 

my $csv_in = Text::CSV_XS->new({binary => 1,sep_char => ";",auto_diag => 1,always_quote => 1,eol => $/}); 
my $csv_out = Text::CSV_XS->new({binary => 1,sep_char => "|",auto_diag => 1,always_quote => 1,eol => $/}); 

my $header = $csv_in->getline($infile); 
$csv_out->print($outfile, $header); 

my %data; 

while (my $elements = $csv_in->getline($infile)){ 
    my @columns = @{ $elements };  
    my $id = $columns[2]; 
    push @{ $data{$id} }, \@columns; 
} 

for my $id (sort keys %data){     # Sort not important 
    if @{ $data{$id} } > 1      # Here I have no idea anymore.. 
     $csv_out->print($outfile, \@columns); # 
}
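
The final loop above does not compile as posted: the `if` condition has no parentheses or block, and `@columns` is no longer in scope there. One way the hash-of-arrays approach could be completed is sketched below. It is a minimal illustration rather than a definitive fix: the in-memory sample, the `$input` and `$output` buffers, and the `@rows` variable are assumptions made so that the example runs without real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical in-memory sample standing in for the real input file.
my $input = join "\n",
    'Cat_id;Cat_name;Id;Name;Amount;Colour;Bla',
    '101;Fruits;50010;Grape;500;Red;1',
    '101;Fruits;50020;Strawberry;500;Red;1',
    '101;Fruits;50030;Banana;1000;Green;1',
    '101;Fruits;50020;Strawberry;1000;Red;1',
    '';

my $output = '';
open my $infile,  '<', \$input  or die "Cannot read sample: $!\n";
open my $outfile, '>', \$output or die "Cannot write buffer: $!\n";

my $csv_in  = Text::CSV_XS->new({ binary => 1, sep_char => ';', auto_diag => 1 });
my $csv_out = Text::CSV_XS->new({ binary => 1, sep_char => '|', auto_diag => 1,
                                  always_quote => 1, eol => "\n" });

# Copy the header row straight through.
$csv_out->print($outfile, $csv_in->getline($infile));

# Group the parsed rows by Id (third column).
my %data;
while (my $row = $csv_in->getline($infile)) {
    push @{ $data{ $row->[2] } }, $row;
}

# Print only the groups that hold more than one row.
for my $id (sort keys %data) {
    my @rows = @{ $data{$id} };
    next unless @rows > 1;                     # Id occurs once: drop it
    $csv_out->print($outfile, $_) for @rows;   # Id is duplicated: keep every row
}
close $outfile;

print $output;
```

Because each hash value is a list of row references, the inner statement has to print every row of the group, not a single `@columns` array.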

Source

2015-10-13 Jan

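This question looks awfully familiar. http://stackoverflow.com/questions/28627669/merge-csv-rows-based-on-duplicate-key-and-combine-unique-values-using-perl-text/28673012#28673012 – Sobrique

@Sobrique Agreed, it is almost the same. I tried to work from that one, but that was merging fields if the Id is the same; this is deleting rows if the Id is unique. – Jan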

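Rather than loading the hash with the entire dataset, I think I would read the file twice and load the hash with just the Id values. It will certainly take longer, but as the file grows, holding all of that data in memory could become a problem.

That said, I did not use Text::CSV_XS here; this is just a notional sketch of what I had in mind.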

my %count; 

open (my $infile, '<:encoding(iso-8859-1)', $inputfile) or die; 
open (my $outfile, '>:encoding(UTF-8)', $outputfile) or die; 

while (<$infile>) { 
    next if $. == 1; 
    my ($id) = (split /;/, $_, 4)[2]; 
    $count{$id}++; 
} 

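seek $infile, 0, 0;
$. = 0;    # seek does not reset the line counter, so restart it by hand

while (<$infile>) {
    my @fields = split /;/;
    # Keep the header (line 1) and every row whose Id occurs more than once.
    print $outfile join '|', @fields if $. == 1 or $count{$fields[2]} > 1;
}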

close $infile; 
close $outfile;

The $. == 1 test is there so that you do not lose your header row.
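
One detail to watch with a `$. == 1` test in a second pass: Perl keeps counting lines across a `seek`, so `$.` has to be set back to 0 by hand before rereading. The sketch below illustrates this on a hypothetical three-line in-memory file; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical three-line input held in memory instead of a real file.
my $text = "header\nrow1\nrow2\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!\n";

while (my $line = <$fh>) { }     # first pass: read to the end
my $after_first_pass = $.;       # lines read so far

seek $fh, 0, 0;                  # rewind; this alone leaves $. untouched
my $after_seek = $.;

$. = 0;                          # restart the numbering by hand
my $first_line = <$fh>;          # reread the header
my $after_reset = $.;            # the header counts as line 1 again

print "after first pass: $after_first_pass\n";
print "after seek: $after_seek\n";
print "after reset and one read: $after_reset\n";
```

Without the reset, the header would be seen as line 4 on the second pass and a `$. == 1` check would never match it.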

-- EDIT --

#!/usr/bin/perl -w 

use strict; 
use warnings; 
use Text::CSV_XS; 

my $inputfile = shift || die "Give input and output names!\n"; 
my $outputfile = shift || die "Give output name!\n"; 

open (my $infile, '<:encoding(iso-8859-1)', $inputfile) or die; 
open (my $outfile, '>:encoding(UTF-8)', $outputfile) or die; 

my $csv_in = Text::CSV_XS->new({binary => 1,sep_char => ";", 
    auto_diag => 1,always_quote => 1,eol => $/}); 
my $csv_out = Text::CSV_XS->new({binary => 1,sep_char => "|", 
    auto_diag => 1,always_quote => 1,eol => $/}); 

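my ($count, %count) = (1);    # $count flags the header row; %count tallies rows per Id

while (my $elements = $csv_in->getline($infile)){    # first pass: count each Id
    $count{$$elements[2]}++;
}

seek $infile, 0, 0;    # rewind for the second pass

while (my $elements = $csv_in->getline($infile)){
    $csv_out->print($outfile, $elements)
    if $count{$$elements[2]} > 1 or $count++ == 1;    # duplicated Id, or the header row
}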

close $infile; 
close $outfile;

Source

2015-10-13 12:03:55 Hambone

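Thank you for your answer, but I have to use the Text::CSV_XS module (it is a big file with delimiters inside the data). Do you have a suggestion for how to do this with the module? – Jan

I think the rest of your code is fine. I was just being lazy and describing conceptually how I would go about it. You can keep your existing Text::CSV_XS code exactly as it is. I have amended my answer. – Hambone

Thank you for the edit, but now it says "Use of uninitialized value in hash element at line 27" and nothing is printed. What am I missing? – Jan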
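
The thread does not say what caused that warning. One possible cause, purely as an assumption, is a record with fewer than three fields (a blank line, for example), for which the Id element is undefined. The sketch below is a defensive variant of the two-pass idea that skips such records. The sample data, the `@kept` list and the blank line are all hypothetical.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample containing a blank line (the assumed trigger).
my $input = "Cat_id;Cat_name;Id;Name\n"
          . "101;Fruits;50020;Strawberry\n"
          . "\n"
          . "101;Fruits;50010;Grape\n"
          . "101;Fruits;50020;Strawberry\n";

open my $infile, '<', \$input or die "Cannot read sample: $!\n";
my $csv_in = Text::CSV_XS->new({ binary => 1, sep_char => ';', auto_diag => 1 });

# Read the header once, so it never takes part in the counting.
my $header = $csv_in->getline($infile);

# First pass: count Ids, ignoring records that have no third field.
my %count;
while (my $row = $csv_in->getline($infile)) {
    next unless defined $row->[2];
    $count{ $row->[2] }++;
}

# Second pass: rewind, skip the header, keep rows with a duplicated Id.
seek $infile, 0, 0;
$csv_in->getline($infile);
my @kept;
while (my $row = $csv_in->getline($infile)) {
    next unless defined $row->[2];
    push @kept, $row if $count{ $row->[2] } > 1;
}
close $infile;

print join('|', @$_), "\n" for $header, @kept;
```

If the warning persists with a guard like this in place, the cause lies elsewhere, for instance in a different separator or column position in the real file.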
