Parsing a CSV file, finding columns and remembering them

I'd like to figure out a way to do this, and I know it should be possible. First, a little background.

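I want to automate the process of creating NCBI Sequin blocks for submitting DNA sequences to GenBank. I always end up creating a table that lists the species name, the specimen ID value, the type of sequence, and finally the collection location. It is easy for me to export that table to a tab-delimited file. Right now I do something like this: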

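# Assumes CSV is a filehandle opened for reading the tab-delimited file
# and NEWFILE is a filehandle opened for writing.
while (my $line = <CSV>) {
    chomp $line;
    # Skip the "Table X" title line and the column-heading line.
    next if $line =~ m/table|species|accession/i;
    my @csv = split /\t/, $line;
    print NEWFILE ">[species=$csv[0]] [molecule=DNA] [moltype=genomic] [country=$csv[2]] [spec-id=$csv[1]]\n";
}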

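I know this is messy; I just typed something similar to what I have from memory (I don't have any of my scripts on my home computer, only at work).

This works fine for me at the moment, because I know which columns hold the information I need (species, location, and ID number).

But is there a way (there must be) for me to find the columns holding the desired information dynamically? That is, regardless of the order of the columns, the right information from the right column should go to the right place.

The first line will usually be "Table X" (where X is the number of the table in the publication). The next line will usually hold the column headings of interest, and the heading names are nearly universal. Nearly all tables will have standard headings to search for, and I can use | in my pattern matching.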

2013-04-30 AlphaA

Create a hash that maps column headings to column numbers:

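my %columns;
...

if (/table|species|accession/i) {
    chomp;                        # keep the newline out of the last heading
    my @headings = split /\t/;
    my $i = 0;
    foreach my $heading (@headings) {
        # Lowercased heading name => zero-based column index.
        $columns{"\L$heading"} = $i++;
    }
}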

Then you can use $csv[$columns{'species'}].
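As a rough, self-contained sketch of that lookup: the heading line and data row below are made up for illustration, and `build_column_map` is a hypothetical helper name that does not appear in the question or the answer. The point is only that the output stays the same whatever order the columns arrive in.

```perl
use strict;
use warnings;

# Hypothetical helper: map each lowercased heading to its zero-based index.
sub build_column_map {
    my ($header_line) = @_;
    chomp $header_line;
    my %map;
    my $i = 0;
    $map{ lc $_ } = $i++ for split /\t/, $header_line;
    return %map;
}

# Made-up sample data; the column order differs from the one in the question.
my $header = "Country\tSpecies\tAccession\n";
my @rows   = ("Peru\tHomo sapiens\tAB123\n");

my %columns = build_column_map($header);

for my $row (@rows) {
    chomp $row;
    my @csv = split /\t/, $row;
    # Look each field up by heading name rather than by a fixed position.
    printf ">[species=%s] [country=%s] [spec-id=%s]\n",
        $csv[ $columns{'species'} ],
        $csv[ $columns{'country'} ],
        $csv[ $columns{'accession'} ];
}
```

Lowercasing the headings once, when the map is built, means later lookups do not have to care whether a table says "Species" or "species".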

2013-04-30 02:34:20 Barmar

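First, I'd be remiss if I didn't recommend the excellent Text::CSV_XS module; it does a much more reliable job of reading CSV files, and it can even handle the column-mapping scheme Barmar mentioned.

That said, Barmar has the right approach, although it completely ignores the fact that the "Table X" line is a separate line. I'd recommend taking an explicit approach, perhaps something like this (it is somewhat verbose just to make things clear; I'd probably write it more tightly in production code):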
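For reference, a minimal sketch of what the Text::CSV_XS route might look like. It assumes the module is installed from CPAN, that the file is tab-delimited with a single "Table X" title line before the headings, and it uses made-up sample data read from an in-memory filehandle.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Made-up sample input, read through an in-memory filehandle for illustration.
my $data = "Table 1\nCountry\tSpecies\tAccession\nPeru\tHomo sapiens\tAB123\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1, auto_diag => 1 });

# Read and discard the "Table X" title line.
$csv->getline($fh);

# Use the lowercased headings as the hash keys for every following row.
$csv->column_names(map { lc($_) } @{ $csv->getline($fh) });

# getline_hr returns each row as a hash reference keyed by column name.
while (my $row = $csv->getline_hr($fh)) {
    printf ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
        @{$row}{qw(species country accession)};
}
```

Here `column_names` and `getline_hr` take over the bookkeeping that a hand-built heading map would otherwise do.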

# Assumes the file has been opened and that the filehandle is stored in $csv_fh.
# Get header information first.

my $hdr_data = {};

while(<$csv_fh>) {
    chomp;  # keep the trailing newline out of the last column name
    if(! $hdr_data->{'table'} && /Table (\d+)/) {
        $hdr_data->{'table'} = $1;
        next;
    }
    # Match case-insensitively, since the column names are lowercased below.
    if(! $hdr_data->{'species'} && /species/i) {
        my $n = 0;
        # Takes the column headers as they come, creating
        # a map between the column name and column number.
        # Assumes that column names are case-insensitively
        # unique.
        my %columns = map { lc($_) => $n++ } split(/\t/);
        # Now pick out exactly the columns we want.
        foreach my $thingy (qw{ species accession country }) {
            $hdr_data->{$thingy} = $columns{$thingy};
        }
        last;
    }
}

# Now process the rest of the lines.

while(<$csv_fh>) {
    chomp;
    my @col = split(/\t/);  # split the line into an array of fields
    printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
        $col[$hdr_data->{'species'}],
        $col[$hdr_data->{'country'}],
        $col[$hdr_data->{'accession'}];
}

Some variation of this should get you close to what you need.

2013-04-30 03:31:24 mcglk

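The main reason for keeping %columns and %{$hdr_data} separate is that it gives you more flexibility with your headings. For example, 'keys %{$hdr_data}' will always tell you the names of the columns you're interested in, and $hdr_data->{'bogus'} will always return undef, even if there is a 'bogus' column in the data. Paring your data down to what you need is always best. – mcglk 2013-04-30 03:35:51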
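A small sketch of that distinction, using a made-up heading line that includes an extra "Bogus" column. The `map` expression builds the full heading-to-index hash in one pass, and only the wanted keys are then copied into `$hdr_data`.

```perl
use strict;
use warnings;

# Made-up heading line with one column we do not care about.
my $heading_line = "Bogus\tCountry\tSpecies\tAccession";

# Full map: for each heading, map returns a (name, index) pair,
# and the resulting flat list is assigned to the hash.
my $n = 0;
my %columns = map { lc($_) => $n++ } split /\t/, $heading_line;

# Trimmed map: copy over only the columns of interest.
my $hdr_data = {};
$hdr_data->{$_} = $columns{$_} for qw(species accession country);

print "all headings: ", join(', ', sort keys %columns), "\n";
print "kept: ", join(', ', sort keys %{$hdr_data}), "\n";
print "bogus is ", (defined $hdr_data->{'bogus'} ? 'defined' : 'undef'), "\n";
```

The `map` function is documented under `perldoc -f map`, and the hash-reference syntax (`$hdr_data->{...}`, `%{$hdr_data}`) under `perldoc perlreftut`.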

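Text::CSV is great if you need to handle quoting or escaping, but it's overkill if you're sure you don't. Tab-delimited files usually don't use either; they simply don't allow fields that contain tabs. – cjm 2013-04-30 06:04:56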

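Can you point me to a website or book that has a good explanation of the mapping mechanism? I'm a biologist who knows just enough Perl to barely get done what I want, but I lack a deep understanding of these issues. I have O'Reilly's Learning Perl, Mastering Perl, Programming Perl, Beginning Perl for Bioinformatics, Mastering Perl for Bioinformatics, and a few other books. ~alphaa – AlphaA 2013-05-01 18:33:13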
