2016-11-04 45 views
2

How can I extract a unique list of words from a tab-separated text file?

Given a tab-separated file like this:

$ head train.txt 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN NP$ JJ NN NN VBD `` AT NN '' CS DTI NNS VBD NN . 
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . AT NN RBR VBD IN NN NNS CS AT NN-TL JJ-TL NN-TL , WDT HVD JJ NN IN AT NN , `` VBZ AT NN CC NNS IN AT NN-TL IN-TL NP-TL '' IN AT NN IN WDT AT NN BEDZ VBN . 
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. . AT NP NN NN HVD BEN VBN IN NP-TL JJ-TL NN-TL NN-TL NP NP TO VB NNS IN JJ `` NNS '' IN AT JJ NN WDT BEDZ VBN IN NN-TL NP NP NP . 
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' . `` RB AT JJ NN IN JJ NNS BEDZ VBN '' , AT NN VBD , `` IN AT JJ NN IN AT NN , AT NN IN NNS CC AT NN IN DT NN '' . 
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' . AT NN VBD PPS DOD VB CS AP IN NP$ NN CC NN NNS `` BER JJ CC JJ CC RB JJ '' . 
It recommended that Fulton legislators act `` to have these laws studied and revised to the end of modernizing and improving them '' . PPS VBD CS NP NNS VB `` TO HV DTS NNS VBN CC VBN IN AT NN IN VBG CC VBG PPO '' . 
The grand jury commented on a number of other topics , among them the Atlanta and Fulton County purchasing departments which it said `` are well operated and follow generally accepted practices which inure to the best interest of both governments '' . AT JJ NN VBD IN AT NN IN AP NNS , IN PPO AT NP CC NP-TL NN-TL VBG NNS WDT PPS VBD `` BER QL VBN CC VB RB VBN NNS WDT VB IN AT JJT NN IN ABX NNS '' . 
Merger proposed NN-HL VBN-HL 
However , the jury said it believes `` these two offices should be combined to achieve greater efficiency and reduce the cost of administration '' . WRB , AT NN VBD PPS VBZ `` DTS CD NNS MD BE VBN TO VB JJR NN CC VB AT NN IN NN '' . 
The City Purchasing Department , the jury said , `` is lacking in experienced clerical personnel as a result of city personnel policies '' . AT NN-TL VBG-TL NN-TL , AT NN VBD , `` BEZ VBG IN VBN JJ NNS CS AT NN IN NN NNS NNS '' . 

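Only the first column (delimited by a tab) matters. I want to extract the unique list of words (including punctuation) from that first column and write it to a file. Assume the words are separated by spaces, i.e.: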

$ head train.txt | cut -f1 
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . 
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . 
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. . 
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' . 
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' . 
It recommended that Fulton legislators act `` to have these laws studied and revised to the end of modernizing and improving them '' . 
The grand jury commented on a number of other topics , among them the Atlanta and Fulton County purchasing departments which it said `` are well operated and follow generally accepted practices which inure to the best interest of both governments '' . 
Merger proposed 
However , the jury said it believes `` these two offices should be combined to achieve greater efficiency and reduce the cost of administration '' . 
The City Purchasing Department , the jury said , `` is lacking in experienced clerical personnel as a result of city personnel policies '' . 

$ head train.txt | cut -f2 
AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN NP$ JJ NN NN VBD `` AT NN '' CS DTI NNS VBD NN . 
AT NN RBR VBD IN NN NNS CS AT NN-TL JJ-TL NN-TL , WDT HVD JJ NN IN AT NN , `` VBZ AT NN CC NNS IN AT NN-TL IN-TL NP-TL '' IN AT NN IN WDT AT NN BEDZ VBN . 
AT NP NN NN HVD BEN VBN IN NP-TL JJ-TL NN-TL NN-TL NP NP TO VB NNS IN JJ `` NNS '' IN AT JJ NN WDT BEDZ VBN IN NN-TL NP NP NP . 
`` RB AT JJ NN IN JJ NNS BEDZ VBN '' , AT NN VBD , `` IN AT JJ NN IN AT NN , AT NN IN NNS CC AT NN IN DT NN '' . 
AT NN VBD PPS DOD VB CS AP IN NP$ NN CC NN NNS `` BER JJ CC JJ CC RB JJ '' . 
PPS VBD CS NP NNS VB `` TO HV DTS NNS VBN CC VBN IN AT NN IN VBG CC VBG PPO '' . 
AT JJ NN VBD IN AT NN IN AP NNS , IN PPO AT NP CC NP-TL NN-TL VBG NNS WDT PPS VBD `` BER QL VBN CC VB RB VBN NNS WDT VB IN AT JJT NN IN ABX NNS '' . 
NN-HL VBN-HL 
WRB , AT NN VBD PPS VBZ `` DTS CD NNS MD BE VBN TO VB JJR NN CC VB AT NN IN NN '' . 
AT NN-TL VBG-TL NN-TL , AT NN VBD , `` BEZ VBG IN VBN JJ NNS CS AT NN IN NN NNS NNS '' . 

I could do it like this:

$ python 
>>> fout = open('word.dict', 'w')                                  
>>> fout.write('\n'.join(list(set(zip(*[line.split('\t')[0].lower().split() for line in open('train.txt')])[0])))) 
>>> exit() 
$ head word.dict 
trenton 
brevet 
secondly 
fig. 
magnetic 
doubts 
monte 
elisabeth 
four 
facilities 

But is there a way to extract the same word list in shell/bash?

+1

I'm confused about what you mean by "first column". Do you want every word? Or the first word of each line? Or something else I'm not understanding? –

Answers

3

Try this:

cut -f1 file | tr -s '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | sort -u 
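  • cut -f1 extracts the 1st tab-separated column.

  • tr -s '[:space:]' '\n' replaces each run of whitespace with a single newline, effectively producing a list of words, each on its own line.

  • tr '[:upper:]' '[:lower:]' converts the lines to all lowercase.

  • sort -u sorts the resulting words, omitting duplicates (-u).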
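
As a quick check, the pipeline above can be tried on a tiny made-up sample. This is only a sketch: the two-line input, the function name unique_words, and the LC_ALL=C setting (added so the sort order is reproducible across locales) are assumptions for illustration and are not part of the original answer.

```shell
# Hypothetical helper wrapping the pipeline; reads tab-separated text on stdin.
unique_words() {
  cut -f1 |                        # keep only the first tab-separated column
    tr -s '[:space:]' '\n' |       # put each word on its own line
    tr '[:upper:]' '[:lower:]' |   # lowercase everything
    LC_ALL=C sort -u               # sort in byte order (assumption) and drop duplicates
}

# Made-up sample: a sentence, a tab, then its tags.
printf 'The jury said .\tAT NN VBD .\nThe Jury met .\tAT NN VBD .\n' | unique_words
# prints, one per line: .  jury  met  said  the
```

To write the list to a file, as the question asks, redirecting the output should be enough, e.g. unique_words < train.txt > word.dict.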

+0

The lowercasing is missing =) – alvas

+1

Actually 'sort -u -f' is pretty neat too, but that way we don't know which variant gets picked. Thanks @mklement0 – alvas

+1

@alvas: Good point. The lowercasing requirement was buried in your Python code, though - I suggest you also state it in prose. – mklement0

3

I can't tell where the tabs are in the sample input you posted, so this is untested, but it should do what you want:

awk '{sub(/\t.*/,""); for (i=1; i<=NF; i++) if (!seen[tolower($i)]++) print $i}' file 

Or, if you want all of the output in lowercase:

awk '{sub(/\t.*/,""); $0=tolower($0); for (i=1; i<=NF; i++) if (!seen[$i]++) print $i}' file
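
To see how the two awk commands differ, here is a small sketch on made-up input. The wrapper names first_seen_words and first_seen_words_lower and the two-line sample are hypothetical, introduced only for this illustration.

```shell
# Hypothetical wrappers around the two awk commands; both read stdin.
first_seen_words() {
  # Deduplicate case-insensitively, but print each word as first written.
  awk '{sub(/\t.*/,""); for (i=1; i<=NF; i++) if (!seen[tolower($i)]++) print $i}'
}

first_seen_words_lower() {
  # Lowercase the line first, so the printed words are lowercase too.
  awk '{sub(/\t.*/,""); $0=tolower($0); for (i=1; i<=NF; i++) if (!seen[$i]++) print $i}'
}

# Made-up sample: a sentence, a tab, then its tags.
sample='The jury said .\tAT NN VBD .\nthe Jury met .\tAT NN VBD .\n'
printf "$sample" | first_seen_words        # The jury said . met (one per line)
printf "$sample" | first_seen_words_lower  # the jury said . met (one per line)
```

Unlike a sort-based pipeline, both variants emit words in order of first appearance rather than sorted order.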