Unicode-aware strings(1) program

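Does anyone have a code sample for a Unicode-aware strings program? The programming language doesn't matter. What I want is essentially the same functionality as the Unix command "strings", but one that also works on Unicode text (UTF-16 or UTF-8), pulling out runs of English characters and punctuation. (I only care about English characters, not any other alphabet.)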

Thanks!

Source

2009-02-23 Evan

For English-only text in UTF-8, strings(1) should already work. – mouviciel 2009-02-23 16:01:23

If the language doesn't matter, why not look at the source of the strings utility itself? – 2009-02-23 16:06:36

Do you just want to use it, or do you need the code for some reason?

On my Debian system, it looks like the strings command can do this out of the box. See this excerpt from the man page:

--encoding=encoding 
     Select the character encoding of the strings that are to be found. Possible values for encoding are: s = single-7-bit-byte characters (ASCII, ISO 8859, 
     etc., default), S = single-8-bit-byte characters, b = 16-bit bigendian, l = 16-bit littleendian, B = 32-bit bigendian, L = 32-bit littleendian. Useful 
     for finding wide character strings.
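
To see why the choice of encoding matters, here is a small Python sketch (not part of the original answer) showing how the same English text is laid out in memory under the encodings those options target. The helper name `layouts` and the sample string are made up for illustration.

```python
def layouts(text):
    """Return the raw bytes of `text`, keyed by the closest --encoding value.

    "s" is shown with UTF-8, which is byte-identical to ASCII for English text.
    """
    return {
        "s": text.encode("utf-8"),      # single 7-bit bytes
        "l": text.encode("utf-16-le"),  # 16-bit little-endian
        "b": text.encode("utf-16-be"),  # 16-bit big-endian
        "L": text.encode("utf-32-le"),  # 32-bit little-endian
        "B": text.encode("utf-32-be"),  # 32-bit big-endian
    }


if __name__ == "__main__":
    for option, raw in layouts("Hi").items():
        print(f"-e {option}: {raw!r}")
```

For English text, the wide encodings are just the ASCII bytes padded with zero bytes, which is why a scanner has to know the width and byte order to find them.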

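Edit: OK. I don't know C#, so this may be a bit hairy, but basically you need to search for sequences of alternating English characters and zero bytes.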

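byte b;
int i = 0;                          // length of the current run, in characters
while (!endOfInput()) {
    b = getNextByte();
LoopBegin:
    if (!isEnglish(b)) {
        if (i > 0) { /* report successful match of length i */ }
        i = 0;
        continue;
    }
    if (endOfInput()) break;
    if ((b = getNextByte()) != 0) {
        // high byte is not zero: the run ends here; re-examine b as a low byte
        if (i > 0) { /* report successful match of length i */ }
        i = 0;
        goto LoopBegin;
    }
    i++; // found another character
}
if (i > 0) { /* report the final match of length i */ }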

This should work for little-endian (UTF-16LE) input.
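
The same idea can be sketched in Python with a regular expression over bytes instead of an explicit loop. This is a minimal illustration, not the answer author's code: the function name `utf16le_english_strings` and the `min_len` default of 4 are assumptions, and "English" is taken here to mean printable ASCII plus tab.

```python
import re


def utf16le_english_strings(data, min_len=4):
    """Yield (offset, text) for runs of printable ASCII stored as UTF-16LE.

    A run is at least `min_len` repetitions of one printable ASCII byte
    (0x20-0x7E) or tab, each followed by a zero byte.
    """
    pattern = re.compile(rb"(?:[\t\x20-\x7e]\x00){%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("utf-16-le")


if __name__ == "__main__":
    sample = b"\x01\x02" + "Hello, world".encode("utf-16-le") + b"\xff\xff"
    for offset, text in utf16le_english_strings(sample):
        print(offset, text)
```

For big-endian input the zero byte comes first, so the pair inside the pattern would be reversed and the match decoded as "utf-16-be".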

Source

2009-02-23 16:02:16 jpalecek

I had a similar problem and tried "strings -e ...", but I only found options for fixed-width character encodings. (UTF-8 is a variable-width encoding.)

Remember that, by default, characters outside ASCII need extra strings options. This covers almost all non-English strings.

Nevertheless, the output of "-e S" (single 8-bit characters) does include UTF-8 characters.

I wrote a very simple (opinionated) Perl script that applies "strings -e S ... | iconv ..." to the input files.

I believe it is easy to tune it to specific restrictions. Usage: utf8strings [options] file*

#!/usr/bin/perl -s 

our ($all,$windows,$enc); ## set by perl -s; use -all to ignore the "3 letters word" restriction 
use strict; 
use utf8::all; 

$enc = "ms-ansi" if  $windows; ## -windows: read the input as ms-ansi 
$enc = "utf8" unless $enc ; ## default encoding=utf8 
my $iconv = "iconv -c -f $enc -t utf8 |"; 

for (@ARGV){ s/(.*)/strings -e S '$1'| $iconv/;} 

my $word=qr/[a-zçáéíóúâêôàèìòùüãõ]{3}/i; # adapt this to your case 

while(<>){ 
    # next if /regular expressions for common garbage/; 
    print if ($all or /$word/); 
}

In some cases, this approach produces some extra garbage.
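
For comparison, here is a rough pure-Python sketch of the same pipeline. It is not a port of the Perl script and does not call strings or iconv: it swaps in a byte-level regular expression for "strings -e S" and a lenient UTF-8 decode for "iconv -c". The names `utf8_strings`, `CANDIDATE`, and `WORD` and the 4-byte minimum are assumptions made for illustration.

```python
import re

# Runs of at least 4 bytes that are tab, printable ASCII, or >= 0x80
# (bytes that may belong to a multi-byte UTF-8 sequence).
CANDIDATE = re.compile(rb"[\t\x20-\x7e\x80-\xff]{4,}")
# Same filter as the Perl $word: three letters in a row. Adapt to your case.
WORD = re.compile(r"[a-zçáéíóúâêôàèìòùüãõ]{3}", re.IGNORECASE)


def utf8_strings(data, keep_all=False):
    """Yield text runs found in `data`, silently dropping invalid UTF-8 bytes."""
    for match in CANDIDATE.finditer(data):
        text = match.group().decode("utf-8", errors="ignore")
        if text and (keep_all or WORD.search(text)):
            yield text


if __name__ == "__main__":
    blob = b"\x00\x01" + "ação concluída".encode("utf-8") + b"\x00" + b"12 34\x00"
    for line in utf8_strings(blob):
        print(line)
```

As with the Perl version, stray high bytes next to real text can still leak through as garbage, so the word filter is worth tuning to the data at hand.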

Source

2014-02-18 12:17:56 JJoao
