Getting the count of unique values


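I have some text files with two columns. The first column is the position of an amino acid and the second column is the name of the amino acid. I want to get the total count of each amino acid across all files, counting only unique values within each file. In the example below, the total for LEU is 2 (one from file1 and one from file2). Any suggestions would be appreciated!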

file1

54 LEU 
54 LEU 
78 VAL 
112 ALA 
78 VAL

file2

54 LEU 
113 ALA 
113 ALA 
12 ALA 
112 ALA

Desired output

total no:of LEU - 2 
total no:of VAL - 1 
total no:of ALA - 4

Source

2013-04-07 user2253688

If you only have two files, just use awk:

awk '{ a[$2]++ } END { for (i in a) print "total no:of", i, a[i] }' <(awk '!a[$1,$2]++' file1) <(awk '!a[$1,$2]++' file2)

If you have many, many files, try this awk script. Run it like this:

awk -f script.awk file{1..200}

Contents of script.awk:

{ 
    a[FILENAME,$1,$2] 
} 

END { 
    for (i in a) { 
     split (i,x,SUBSEP) 
     b[x[3]]++ 
    } 
    for (j in b) { 
     print "total no:of", j, b[j] 
    } 
}

Alternatively, here is the one-liner:

awk '{ a[FILENAME,$1,$2] } END { for (i in a) { split (i,x,SUBSEP); b[x[3]]++ } for (j in b) print "total no:of", j, b[j] }' file{1..200}

Results:

total no:of LEU 2 
total no:of ALA 4 
total no:of VAL 1
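For readers more comfortable with Python, the composite-key idea used by the awk script can be sketched roughly as follows. This is only an illustration, not part of the awk answer: the function name `count_unique` and the in-memory `sample` dictionary are assumptions standing in for real files.

```python
from collections import Counter

def count_unique(files):
    """files maps a file name to an iterable of 'position name' lines.

    Hypothetical helper: mirrors a[FILENAME,$1,$2] from the awk script.
    """
    seen = set()
    for filename, lines in files.items():
        for line in lines:
            fields = line.split()
            if len(fields) >= 2:
                # composite key: the same pair in the same file is stored once
                seen.add((filename, fields[0], fields[1]))
    # count the third component of each unique key, like b[x[3]]++
    return Counter(name for _, _, name in seen)

# Sample data from the question, held in memory instead of on disk
sample = {
    "file1": ["54 LEU", "54 LEU", "78 VAL", "112 ALA", "78 VAL"],
    "file2": ["54 LEU", "113 ALA", "113 ALA", "12 ALA", "112 ALA"],
}
totals = count_unique(sample)
for name, total in sorted(totals.items()):
    print("total no:of", name, total)
```

Because the file name is part of the key, a pair that appears in two different files is counted twice, which is what the question asks for.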

Source

2013-04-07 05:01:07 Steve

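filenames = ['file1', 'file2']  # your list of files
name_dict = {}
for filename in filenames:
    with open(filename, 'r') as fsock:
        lines = fsock.readlines()
    seen = set()  # (position, name) pairs already counted in this file
    for line in lines:
        a = line.split()
        if len(a) < 2 or tuple(a) in seen:
            continue  # skip blank lines and repeated pairs
        seen.add(tuple(a))
        key = a[-1]
        if key in name_dict:  # name_dict[key] would raise KeyError for a new key
            name_dict[key] += 1
        else:
            name_dict[key] = 1

for name, count in name_dict.items():
    print("total no:of", name, "-", count)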

Source

2013-04-07 04:05:32

with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2:
    # open both files, then close afterwards
    data = []
    for f in (f1, f2):
        # split each line into fields; the set keeps each (position, name) pair once per file
        pairs = set(tuple(line.split()) for line in f if line.strip())
        data += [pair[-1] for pair in pairs]  # keep only the amino acid names
d = {elem: data.count(elem) for elem in set(data)}
for i in d:
    print('total no:of {} - {}'.format(i, d[i]))

Source

2013-04-07 04:05:52

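Open the file, read the lines, and get the protein name. If it already exists in the dictionary, add 1 to its count; otherwise add it to the dictionary.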

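protien_dict = {}
openfile = open(filename)  # filename: path of one input file
while True:
    line = openfile.readline()
    if line == "":  # readline() returns "" only at end of file
        break
    values = line.split()
    if len(values) < 2:
        continue  # skip blank or malformed lines
    # note: this counts every line; repeated pairs are not removed
    if values[1] in protien_dict:
        protien_dict[values[1]] = protien_dict[values[1]] + 1
    else:
        protien_dict[values[1]] = 1
openfile.close()
for elem in protien_dict:
    print("total no. of " + elem + " = " + str(protien_dict[elem]))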

Source

2013-04-07 04:11:28 scottydelta

'while True': you need a 'break' statement or you will get an infinite loop. – 2013-04-07 04:12:02

collections.Counter is especially useful for (you guessed it) counting things:

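from collections import Counter
counts = Counter()
for filename in filenames:  # filenames: your list of input files
    with open(filename) as f:
        # de-duplicate (position, name) pairs within this file, then count the names
        pairs = set(tuple(line.split()) for line in f if line.strip())
        counts.update(name for _, name in pairs)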
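As a rough check of the Counter approach against the sample data in the question, here is a minimal sketch. It uses `io.StringIO` objects in place of real files, and the `files` dictionary is an assumption made for illustration; pairs are de-duplicated per file before the names are counted.

```python
import io
from collections import Counter

# In-memory stand-ins for file1 and file2 (sample data from the question)
files = {
    "file1": "54 LEU\n54 LEU\n78 VAL\n112 ALA\n78 VAL\n",
    "file2": "54 LEU\n113 ALA\n113 ALA\n12 ALA\n112 ALA\n",
}

counts = Counter()
for text in files.values():
    f = io.StringIO(text)  # behaves like an open text file
    # each (position, name) pair is kept once per file
    unique_pairs = {tuple(line.split()) for line in f if line.strip()}
    # Counter.update adds 1 for every name yielded by the generator
    counts.update(name for _, name in unique_pairs)

for name, total in counts.most_common():
    print("total no:of {} - {}".format(name, total))
```

`most_common()` lists the names from the highest count to the lowest, so no separate sort is needed for a ranked report.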

Source

2013-04-07 04:15:46

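You mentioned Python, Perl, and awk.

In all three, the idea is the same: use a hash to store the values.

A hash is like an array, except that each entry is indexed by a key rather than by position. In a hash, each key can occur only once. Hashes are therefore used to check whether a value has been seen before. Here is a simple Perl example: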

my %value_hash; 
for my $value (qw(one two three one three four)) { 
    if (exists $value_hash{$value}) { 
     print "I've seen the value $value before\n"; 
    } 
    else { 
     print "The value of $value is new\n"; 
     $value_hash{$value} = 1; 
    } 
}

This will print out:

The value of one is new 
The value of two is new 
The value of three is new 
I've seen the value one before 
I've seen the value three before 
The value of four is new
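The same "have I seen this before?" check can be sketched in Python, where a `set` (or a `dict`) plays the role of the Perl hash. The function name `classify` is an assumption made for this illustration; the messages follow the strings in the Perl code.

```python
def classify(values):
    """Return one message per value, saying whether it is new or already seen."""
    seen = set()  # plays the role of %value_hash
    messages = []
    for value in values:
        if value in seen:
            messages.append("I've seen the value {} before".format(value))
        else:
            messages.append("The value of {} is new".format(value))
            seen.add(value)
    return messages

messages = classify("one two three one three four".split())
print("\n".join(messages))
```

Membership tests on a set take roughly constant time on average, which is why this pattern scales better than searching a list for each value.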

First, you need two loops: one that goes through all the files, and another that goes through each line of a particular file.

for my $file_name (@file_list) { 
    open my $file_fh, "<", $file_name 
     or die qq(File $file_name doesn't exist); 
    while (my $line = <$file_fh>) { 
     chomp $line; 
     ... 
    } 
}

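Next, we introduce a hash that tracks the total for each amino acid, and a hash that tracks which amino acids have already been seen in the current file: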

use strict; 
use warnings; 
use autodie; 

my %total_amino_acids; 
my @file_list = qw(file1 file2); #Your list of files 

for my $file_name (@file_list) { 
    open my $file_fh, "<", $file_name; 
    my %seen_amino_acid_before; # "Initialize" hash which tracks seen 
    while (my $line = <$file_fh>) { 
     chomp $line; 
     my ($location, $amino_acid) = split ' ', $line; 
     if (not $seen_amino_acid_before{$amino_acid}) { 
      $total_amino_acids{$amino_acid} += 1; 
      $seen_amino_acid_before{$amino_acid} = 1; # remember it for this file 
     } 
    } 
}

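Now, I am assuming that when you say unique, you are only talking about the amino acid and not the position. The split separates the two values, and I only look at the amino acid. If the position also matters, you have to include it in the key of the %seen_amino_acid_before hash. This is tricky, because I can imagine the following: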

54 LEU 
54 LEU 
054.00 LEU

These are different strings, but they all carry the same information. You want to make sure you normalize the position/amino acid key.

while (my $line = <$file_fh>) { 
     chomp $line; 
     my ($location, $amino_acid) = split ' ', $line; 
     my $amino_acid_key = sprintf "%04d-%s", $location, uc $amino_acid; 
     if (not $seen_amino_acid_before{$amino_acid_key}) { 
      $total_amino_acids{$amino_acid} += 1; 
      $seen_amino_acid_before{$amino_acid_key} = 1; # remember the normalized key 
     } 
    }

In the above, I create an $amino_acid_key. I use sprintf to format the numeric part as a zero-padded decimal and the amino acid as uppercase. That way:

54 LEU 
54 leu 
054.00 Leu

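will all have the key 0054-LEU. That way, how the data was entered into the file does not affect the results. This may be a completely unnecessary step, but it should always be considered. For example, if your data is generated by a computer, this is probably not an issue. If your data was typed in at midnight by a bunch of overworked grad students, you may need to worry about the format.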
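The normalization step can be sketched in Python as well. The helper name `normalize_key` is hypothetical, and converting through `float` first is an assumption made here so that a position written as 054.00 is accepted.

```python
def normalize_key(location, amino_acid):
    """Build a key such as '0054-LEU' from loosely formatted input.

    Hypothetical helper: the position is zero-padded to four digits and
    the amino acid name is upper-cased.
    """
    position = int(float(location))  # "054.00" -> 54.0 -> 54
    return "{:04d}-{}".format(position, amino_acid.upper())

lines = ["54 LEU", "54 leu", "054.00 Leu"]
keys = [normalize_key(*line.split()) for line in lines]
print(keys)
```

All three input lines map to the same key, so a seen-before check on the key treats them as one entry.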

Now, all you need is a loop to print out the data:

for my $amino_acid (sort keys %total_amino_acids) { 
    printf "total no:of %4s - %4d\n", $amino_acid, $total_amino_acids{$amino_acid}; 
}

Notice that I used printf to format the totals, so they will line up.

Source

2013-04-07 04:42:07

Another option:

use strict; 
use warnings; 

my ($argv, %hash, %seen) = ''; 

while (<>) { 
    $argv ne $ARGV and $argv = $ARGV and undef %seen; 
    !$seen{ $1 . $2 }++ and $hash{$2}++ if /(.+)\s+(.+)/; 
} 

print "total no:of $_ - $hash{$_}\n" for keys %hash;

Output on your dataset:

total no:of ALA - 4 
total no:of VAL - 1 
total no:of LEU - 2

Source

2013-04-07 06:48:26 Kenosis

Just a perl one-liner:

perl -anE'$h{$F[1]}++}{say"total no:of $_ - $h{$_}"for keys%h'

Source

2013-04-07 07:28:27

ls file* | parallel 'sort -u {} >> tmp' ; awk '{print $2}' tmp | sort | uniq -c

This gives the output:

4 ALA
2 LEU
1 VAL

Source

2013-04-07 09:23:53
