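I am currently filtering a lexicon, but my filter is not producing the expected output file. The lexicon is structured like this.
Snippet of lexicon_a: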
<oov> <oov>
A AH0
A EY1
A''S EY1 Z
A'BODY EY1 B AA2 D IY0
A'COURT EY1 K AO2 R T
A'D EY1 D
A'GHA EY1 G AH0
A'GOIN EY1 G OY1 N
A'LL EY1 L
A'M EY1 M
A'MIGHTY EY1 M AY1 T IY0
A'MIGHTY'S EY1 M AY1 T IY0 Z
A'MOST EY1 M OW2 S T
A'N'T EY1 AH0 N T
A'PENNY EY1 P EH2 N IY0
A'READY EY1 R IY1 D IY0
A'RIGHT EY1 R AY2 T
A'RONY EY1 R OW1 N IY0
A'S EY1 Z
A'TER EY1 T ER0
A'TERNOON EY1 T ER0 N UW1 N
A'TERWARDS EY1 T ER0 W ER0 D Z
A'THEGITHER EY1 DH AH0 JH IH1 DH ER0
A'THING EY1 DH IH0 NG
A'TIM EY1 T IH2 M
A'VE AH0 V
AA AA1
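I want to turn this into a file of the non-silence phones to be used, so basically a file that contains every phoneme in the lexicon. Each phoneme should appear only once in the file.
I thought something like this would work: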
cut -f 2- lexicon.txt | sed 's/ /\n/g' | sort -u > nonsilence_phones.txt
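However, this seems to give a somewhat messed-up output: a mix of words and phonemes. How can I extract only the phonemes, with each one appearing just once? The messed-up output: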
<oov>
A
A'S
AA1
AA2
AH0
AO2
AY1
AY2
B
D
DH
EH2
ER0
EY1
G
IH0
IH1
IH2
IY0
IY1
JH
K
L
M
N
NG
OW1
OW2
OY1
P
R
S
T
UW1
V
W
Z
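One possible explanation for the words showing up in this output, assuming the entries in lexicon_a contain no tab character: cut splits on tabs by default, and it prints any line that lacks the delimiter unchanged unless -s is given. The sample strings below are illustrative only, not taken from the real file.

```shell
# A line with no tab in it: cut has no field boundary to work with,
# so the whole line (word included) is printed unchanged.
printf 'A EY1\n' | cut -f 2-

# The same entry with a tab after the word: only the phones remain.
printf 'A\tEY1\n' | cut -f 2-

# With -s, lines that lack the delimiter are suppressed instead.
printf 'A EY1\n' | cut -s -f 2-
```

If that is what is happening, the sed step then splits the untouched line on spaces, so the word ends up in the list next to its phones.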
The lexicon entries are listed like this:
word '\t' phonemes
I tried
cut -d ' ' -f 2- lexicon.txt | sed 's/ /\n/g' | sort -u > nonsilence_phones.txt
on a different lexicon_b.txt:
<oov> <oov>
A AH
AND AH N D
APOSTROPHE AH P AA S T R AH F IY
APRIL EY P R AH L
AREA EH R IY AH
AUGUST AA G AH S T
B B IY
C S IY
CODE K OW D
D D IY
DECEMBER D IH S EH M B ER
E IY
EIGHT EY T
EIGHTEEN EY T IY N
EIGHTEENTH EY T IY N TH
EIGHTH EY T TH
EIGHTY EY T IY
ELEVEN IH L EH V AH N
ELEVENTH IH L EH V AH N TH
ENTER EH N T ER
ERASE IH R EY S
F EH F
FEBRUARY F EH B Y AH W EH R IY
FIFTEEN F IH F T IY N
FIFTEENTH F IH F T IY N TH
FIFTH F IH F TH
FIFTY F IH F T IY
FIRST F ER S T
FIVE F AY V
FORTY F AO R T IY
FOUR F AO R
FOURTEEN F AO R T IY N
FOURTH F AO R TH
G JH IY
GO G OW
H EY CH
HALF HH AE F
HELP HH EH L P
HUNDRED HH AH N D R AH D
I AY
J JH EY
JANUARY JH AE N Y UW EH R IY
JULY JH UW L AY
JUNE JH UW N
K K EY
L EH L
M EH M
MARCH M AA R CH
MAY M EY
N EH N
NINE N AY N
NINETEEN N AY N T IY N
NINETY N AY N T IY
NINTH N AY N TH
NO N OW
NOVEMBER N OW V EH M B ER
O OW
OCTOBER AA K T OW B ER
OF AH V
OH OW
ONE W AH N
P P IY
Q K Y UW
R AA R
REPEAT R IH P IY T
RUBOUT R AH B AW T
S EH S
SECOND S EH K AH N D
SEPTEMBER S EH P T EH M B ER
SEVEN S EH V AH N
SEVENTEEN S EH V AH N T IY N
SEVENTH S EH V AH N TH
SEVENTY S EH V AH N T IY
SIX S IH K S
SIXTEEN S IH K S T IY N
SIXTEENTH S IH K S T IY N TH
SIXTH S IH K S TH
SIXTY S IH K S T IY
START S T AA R T
STOP S T AA P
T T IY
TEN T EH N
THIRD TH ER D
THIRTEEN TH ER T IY N
THIRTIETH TH ER T IY AH TH
THIRTY TH ER D IY
THOUSAND TH AW Z AH N D
THREE TH R IY
TWELFTH T W EH L F TH
TWELVE T W EH L V
TWENTIETH T W EH N T IY AH TH
TWENTY T W EH N T IY
TWO T UW
U Y UW
V V IY
W D AH B AH L Y UW
X EH K S
Y W AY
YES Y EH S
Z Z IY
ZERO Z IH R OW
which produces the correct output:
AA
AE
AH
AO
AW
AY
B
CH
D
EH
ER
EY
F
G
HH
IH
IY
JH
K
L
M
N
<oov>
OW
P
R
S
T
TH
UW
V
W
Y
Z
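The only difference between lexicon_a and lexicon_b is that the word and the phonemes are separated by a tab in lexicon_b and by a space in lexicon_a.
That is why I thought changing the delimiter passed to cut would be enough.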
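A minimal sketch of a separator-agnostic alternative, assuming the first whitespace-separated field on every line is the word and everything after it is a phone. awk splits on any run of spaces or tabs by default, so the same command would apply to both files. The function name extract_phones and the sample data are hypothetical, introduced only for illustration.

```shell
# Hypothetical helper: print every unique phone in a lexicon, one per line.
# Assumes field 1 is the word and all remaining fields are phones;
# tabs and spaces are treated alike by awk's default field splitting.
extract_phones() {
    awk '{ for (i = 2; i <= NF; i++) print $i }' "$1" | LC_ALL=C sort -u
}

# Small sample mixing both separator styles (illustrative data only).
sample=$(mktemp)
printf 'A\tAH\n' > "$sample"
printf 'AND AH N D\n' >> "$sample"
printf 'B\tB IY\n' >> "$sample"

extract_phones "$sample"
rm -f "$sample"
```

Note that a line such as `<oov> <oov>` would still contribute `<oov>` to the list under this approach, and LC_ALL=C is used here only to make the sort order predictable.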
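The messed-up output is the actual output for the snippet... – bash
But it is not clear what the expected output is... and there was a typo in my previous comment: I meant 'small sample'... – Sundeep
I changed it a bit, hopefully it makes more sense – bash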