2013-03-01 104 views

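sed language translation script: improving efficiency on long texts

Here is my problem. I am a Spanish translator, and I have a very lengthy Spanish-English glossary file: 50K entries. I also have a stoplist of over 1K entries. I want to strip these entries from the text I intend to translate. So I built a sed script which in turn builds two sed scripts from the glossaries; those scripts do the stripping and leave only the untranslated text (so I don't have to solve the same problem twice). This works well, but long texts take a long time, sometimes as long as 15 minutes. Is this unavoidable, or is there a more efficient way to do this?

Here is the main script: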

#!/bin/sh 
before="$(date +%s)" 

#wordstxt=$(wc -w < $1) 
#mintime=$(expr "$wordstxt/200" |bc -l) 
#maxtime=$(expr "$wordstxt/175" |bc -l) 
#echo "Estimated time to process: between $mintime and $maxtime seconds." 
sed ' 
s/\,/\n/g   # strip all commas 
s/\?/\n/g  # strip question marks 
s/\*/\n/g  # strip asterisks 
s/\!/\n/g   # strip exclamation marks 
s/:/\n/g   # strip colons 
s/\-/\n/g   # strip hyphens 
s/\./\n/g   # strip periods 
s/«/\n/g   # strip left Euro-quotes 
s/»/\n/g   # strip right Euro-quotes 
s/”/\n/g   # strip slanted US quotes
s/\"/\n/g  # strip left quotes 
s/(/\n/g   # strip left paren 
s/)/\n/g   # strip right paren 
s/\[/\n/g   # strip left bracket 
s/\]/\n/g   # strip right bracket 
s/¿/\n/g   # "¿" 
s/—/\n/g  # m-dash 
s/\ –\ /\n/g  # n-dash 
s/…/\n/g  # strip ellipsis as a single character, not three periods
s/;/\n/g   # strip semicolon 
s/[0-9]/\n/g  # strip out all numbers, replace with returns 
' $1 > $1.z.tmp 
#echo "Punctuation eliminated." 

#cp ../../Spanish\ to\ English\ projects/glossary/stoplist.txt . 
sed ' 
s/^\ //g  # strip leading spaces 
s/\ $//   # strip trailing spaces 
/^$/d   # delete blank lines 
s/\./\n/g  # strip periods 
s/\ /\\ /g  # make spaces into literals 
s/^/s\//  # begins the substitution 
s/$/\/\\n\/g/ # concludes the substitution 

1 s/^/#!\ \/bin\/sed\ \-f\n\ns\/\[0\-9\]\/\/g\ns\/\\\ \\\ \/\\\ \/g\ns\/\\\.\\\ \/\\n\/g\n\n/ 

' stoplist.txt > stoplist.sed 
chmod +x stoplist.sed 
echo "Eliminating stopwords." 
./stoplist.sed $1.z.tmp > $1.0.tmp 

sed 's/\([A-Za-z\ ]*\t\).*/\1/' SpanishGlossary.utf8 > tempgloss.2.txt 
#echo "Target phrases stripped." 

sort -u tempgloss.2.txt > tempgloss.3.txt 

awk '{ print length(), $0 | "sort -rn" }' tempgloss.3.txt > tempgloss.4.txt 
#echo "List ordered by length." 

#echo "Now creating new sed script." # THIS AFFECTS THE SED SCRIPT, NOT THE OUTPUT FILE. 

sed ' 
s/[0-9]//g  # strip out all numbers 
s/^\ //g  # strip leading spaces -- all lines have them due to the sort 
/^$/d   # delete blank lines 
s/\//\\\//g  # make text slashes into literals 
s/"/\n/g   # strip quotes 
s/\t//g   # strip tabs 
s/\./\n/g  # strip periods 
s/'\''/\\'\''/g  # make straight apostrophes into literals 
s/'\’'/\\'\’'/g  # make curly apostrophes into literals 
s/\ /\\ /g  # make spaces into literals 
/^.\{0,5\}$/d  # delete lines with five or fewer characters
s/^/s\/\\b/  # begins the substitution 
s/$/\\b\/\\n\/g/ # concludes the substitution 

1 s/^/#!\ \/bin\/sed\ \-f\n\ns\/\[0\-9\]\/\/g\ns\/\\\ \\\ \/\\\ \/g\ns\/\\\.\\\ \/\\n\/g\n\n/ 

' tempgloss.4.txt > glossy.sed 

#echo "glossy.sed created." 
chmod +x glossy.sed 

echo "Eliminating existing entries. This may take a while." 
./glossy.sed $1.0.tmp > $1.1.tmp 

echo "Now cleaning up lines." 
sed -e ' 
s/\ $//   # strip trailing spaces 
s/^\ *//g  # strip any and all leading spaces 
s/\ el$//g  # strip "el" from the end 
s/\ la$//g  # strip "la" from the end 
s/\ los$//g  # strip "los" from the end
s/\ las$//g  # strip "las" from the end
s/\ o$//g  # strip "o" from the end 
s/\ y$//g  # strip "y" from the end 
s/\ $//   # strip trailing spaces (yes, again) 
' $1.1.tmp > $1.2.tmp 

echo "Creating ngrams." 
./ngrams 5 < $1.2.tmp > $1.3.tmp 2> /dev/null 

linecount="$(wc -l < $1.3.tmp)" 
#echo $linecount "lines." 
if [ "$linecount" -gt "1000" ] 
then 
    echo "Eliminating single instances." 
    sed '/^1\t/d' $1.3.tmp > $1.4.tmp 
else 
    echo "Fewer than 1000 entries, so keeping all." 
    cp $1.3.tmp $1.4.tmp 
fi 

sed -e ' 
s/[0-9]//g  # strip out all numbers 
s/^\t//g   # strip leading tab 
s/^\ *//g  # strip any and all leading spaces 
/^.\{0,7\}$/d  # delete lines with seven or fewer characters
s/\ $//   # strip trailing spaces (yes, again) 
#s/$/\t/   # add in the tab 
' $1.4.tmp > $1.csv 

echo "Looking for duplicates." 
sh ./dedupe $1.csv 

wordstxt=$(wc -w < $1) 
#echo $wordstxt 
wordslist=$(wc -w < $1.csv) 
#echo $wordslist 
wordspercent=$(echo "scale=4; $wordslist/$wordstxt" |bc -l) 
wordspercentage=$(echo "$wordspercent * 100" |bc -l) 


after="$(date +%s)" 
elapsed_seconds="$(expr $after - $before)" 
rate=$(echo "scale=3; $wordstxt/$elapsed_seconds" |bc -l) 
echo "Created "$1.csv", with $wordspercentage% left, in" $elapsed_seconds "seconds." #, for an effective rate of" $rate "words per second." 

rm tempgloss.*.txt 
rm *.tmp 
rm glossy.sed 

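Interesting problem, but I don't have time to rewrite your script. Others may. You can combine word replacements like 's/\ el$|\ los|\ la$//'. Using '/g' on strings that include the end-of-line marker '$' probably won't cost any extra time, but it makes your code harder for others to understand. You could also split on many single characters at once, like 's/[,?\*!:-\.]/\n/g', but using '[character-class]' ranges can be confusing. Good luck. – shellter 2013-03-02 02:10:44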

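Thanks for the tips. Since posting this, I have pulled the punctuation out of the top of the script and put it into the stoplist. Is there any advantage to the combining you're talking about? Having one ultra-huge line instead of thousands of little lines? – user1889034 2013-03-02 02:44:33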

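Yes, each scan of a line costs you x time. Using a regex with, for example, 5 ORed expressions (using '|') reduces the time to about x/5. I wouldn't try to spell out every possible word on one 's/wd1|wd2/' line; you'll reach the point of diminishing returns in the time needed to debug sed error messages. Combine related words in each substitution so that your code is easier to maintain. There are probably other tricks to reduce the overall run time. Sometimes more commands in a pipeline are better, but I can't say right now. Good luck. – shellter 2013-03-02 02:53:11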
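
The combining described in these comments could look like the sketch below. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do); in basic syntax the alternation would have to be written '\|', which is a GNU extension. The function name strip_trailing_words is hypothetical, and the word list is taken from the cleanup stage of the script in the question.

```shell
#!/bin/sh
# Hypothetical helper: one alternation replaces the six separate
# "strip trailing word" substitutions from the cleanup stage.
strip_trailing_words() {
    # -E enables extended regexes, so ( ) and | need no backslashes.
    # The $ anchor applies to every alternative in the group.
    sed -E 's/ (el|la|los|las|o|y)$//'
}

# Only a trailing article or conjunction is removed:
printf '%s\n' 'casa de la' 'perro y' 'los gatos' | strip_trailing_words
# prints:
#   casa de
#   perro
#   los gatos
```

Each input line is scanned once for the whole group instead of once per word, which is the saving the comment is pointing at.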

Answers

0

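Rewrite the script in awk; it will run in seconds instead of minutes, and be simpler and clearer. sed is an excellent tool for simple substitutions on a single line. For anything else, just use awk.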

I really like this idea, but I can't figure out how to apply awk to prose. I'm going to ask this as a separate question. Thanks! – user1889034 2013-03-03 00:54:51
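
One way awk might be applied to prose is sketched below; it is an illustration under stated assumptions, not the answerer's code. It loads a stoplist (assumed to hold one single-word entry per line) into an awk array and then drops matching words from the text with one array lookup per word, rather than running one substitution per stoplist entry over every line. The function name strip_stopwords is hypothetical. Multi-word glossary phrases are not handled, and dropped words are simply removed instead of being replaced by line breaks as in the original script.

```shell
#!/bin/sh
# Hypothetical helper: drop every word of TEXT that appears in STOPLIST.
# Usage: strip_stopwords STOPLIST TEXT
strip_stopwords() {
    awk '
        # First file: remember each stopword (one per line) as an array key.
        NR == FNR { stop[tolower($0)] = 1; next }
        # Second file: rebuild each line from the words that are not stopwords.
        {
            out = ""
            for (i = 1; i <= NF; i++)
                if (!(tolower($i) in stop))
                    out = out (out == "" ? "" : " ") $i
            # Lines that consisted only of stopwords are skipped.
            if (out != "") print out
        }
    ' "$1" "$2"
}
```

A call such as `strip_stopwords stoplist.txt input.txt > input.stripped.txt` would then stand in for the generated stoplist.sed step; the file names here are placeholders.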

0

You can combine many of these, perhaps for more speed:

s/[,?*!:.-]/\n/g
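
A small sketch of that combined substitution in use, assuming GNU sed (the '\n' in the replacement is a GNU extension). Inside a bracket expression POSIX treats a backslash as an ordinary character, so the sketch lists the punctuation unescaped and puts the hyphen last, where it is taken literally and not as a range. The function name split_on_punctuation is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn each listed punctuation mark into a line break.
split_on_punctuation() {
    # The hyphen sits last inside the brackets so it is a literal hyphen.
    sed 's/[,?*!:.-]/\n/g'
}

# One pass over the line replaces all of the listed characters.
printf '%s\n' 'hola, mundo-feliz!' | split_on_punctuation
```

The remaining single-character substitutions from the first stage of the script (quotes, brackets, dashes, digits) could be folded into the same bracket expression in the same way.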