How to remove lines containing Private Use Area characters?

Given a file that contains Private Use Area (PUA) characters, such as:

$ cat textfile.txt | less 
10 翴 30 <U+E4D1>  ten-thirty in ... three ... two ... one . 
- 10 翴 45だи<U+E145>砆 秂 <U+E18E>  it 's a slam-dunk . 
<U+E707> 10 翴 <U+E6C4>ㄓ ?  so you will be home by 10:00 ? 
10 翴 牧 よ<U+E6BC>ㄓ<U+E5EC> bogey at 10 o'clock . 
- 10 翴 牧 よ<U+E6BC>い盠  - ten o'clock , lieutenant , 10 o'clock ! 
10 翴 牧 よ<U+E6BC>綽玭 i see it , 8 o'clock , heading south ! 
10 翴 筁<U+E5EC>  it 's past 10:00 . 
<U+E80B>ぱ 10 翴 非<U+E1A0>筁ㄓ be here tomorrow , 10:00 sharp . 
- 10 ， 老搭檔 有 人 開槍 ， 疑犯 拒捕 shots firing . suspect 's fleeing . 
- 1 -0 而已  - only 1-0 . 
- 1 -0 而已  - only 1-0 .

How can I remove a line if it contains any character in the Unicode private use code points?

I have tried this:

# ord(u'\uE000') == 57344
for line in open('test.txt'):
    if any(ord(i) > 57344 for i in line):
        pass
    else:
        print (line)

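But I can't seem to get rid of the lines that contain PUA characters.

How can I achieve the same thing in unix bash using sed/awk or some other tool, rather than Python?

Note that I still want to keep lines that are valid Unicode, not only lines made up of ASCII characters. E.g. I want to keep the Chinese characters in the last three lines, "... shots firing . suspect 's fleeing .". (For some reason I can't type the Chinese part in the question, because SO displays the Chinese characters incorrectly.)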

Source

2016-11-09 alvas

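Please note that I'm not trying to remove non-English characters. I'm trying to remove the whole line if any character falls in the PUA. I still want to keep lines like '- 10 ， 老搭檔 有 人 開槍 ， 疑犯 拒捕 shots firing . suspect 's fleeing .' – alvas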

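Your criterion (ord(i) > 57344) for checking whether a character belongs to the private use area is incorrect:

Currently, three private use areas are defined: one in the Basic Multilingual Plane (U+E000–U+F8FF), and one each in, and nearly covering, planes 15 and 16 (U+F0000–U+FFFFD, U+100000–U+10FFFD).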
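To make the difference concrete, here is a small sketch that compares the threshold test from the question with an explicit range test. The helper names `threshold_test` and `range_test` and the sample characters are chosen for illustration only; they do not come from the original posts.

```python
# Compare the question's threshold test with an explicit range test.
PUA_RANGES = ((0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))

def threshold_test(ch):
    """The check from the question: any code point above 57344 counts as PUA."""
    return ord(ch) > 57344

def range_test(ch):
    """True only if ch lies inside one of the three Private Use Areas."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

# Illustrative samples (hypothetical selection, not an exhaustive test set).
samples = {
    "first PUA code point": "\ue000",
    "PUA character from the sample": "\ue4d1",
    "fullwidth comma": "\uff0c",
    "CJK ideograph": "\u8001",
}

results = {name: (threshold_test(ch), range_test(ch)) for name, ch in samples.items()}
for name, (by_threshold, by_range) in results.items():
    print(f"{name}: threshold={by_threshold}, range={by_range}")
```

The threshold test misses U+E000 itself and wrongly flags the fullwidth comma U+FF0C, which appears in one of the lines the question wants to keep.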

Here is the fixed Python 3 code:

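pua_ranges = ((0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))

def is_pua_codepoint(c):
    # True if code point c falls inside any of the three Private Use Areas
    return any(a <= c <= b for (a, b) in pua_ranges)

# Assumes test.txt is UTF-8 encoded
with open('test.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # Print only lines with no PUA character; each line keeps its own newline
        if not any(is_pua_codepoint(ord(i)) for i in line):
            print(line, end='')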
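As an alternative sketch, the same filter can be written without hard-coded ranges by asking Python's `unicodedata` module for each character's general category; private-use code points have category `Co`. The function names `has_private_use` and `drop_pua_lines` and the sample lines are made up for this example, and the result depends on the Unicode database shipped with the interpreter.

```python
import unicodedata

def has_private_use(text):
    """True if any character has general category 'Co' (Private Use)."""
    return any(unicodedata.category(ch) == "Co" for ch in text)

def drop_pua_lines(lines):
    """Return only the lines that contain no private-use character."""
    return [line for line in lines if not has_private_use(line)]

# Hypothetical sample modelled on the lines in the question.
sample = [
    "10 \ue4d1 ten-thirty\n",
    "- 10 ， 老搭檔 有 人 開槍 shots firing .\n",
    "- only 1-0 .\n",
]
kept = drop_pua_lines(sample)
print("".join(kept), end="")
```

This keeps the line with Chinese text and the fullwidth comma, and drops only the line that contains U+E4D1.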

Source

2016-11-09 14:09:40 Leon

Nice answer! – alvas

This grep command will match any line that does not contain a PUA character in the U+E000–U+F8FF range:

grep -Pv "[\xe0\x00-\xf8\xff]"

Source

2016-11-09 07:32:47

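Though this specifically requires GNU 'grep'. Mac used to have 'grep -P', but it was removed. – tripleee

I am using GNU grep 2.25, but it doesn't work for me – Leon

You may also need to use a different locale, but I'm not sure. Maybe try 'LC_ALL=POSIX' and/or 'LC_ALL=C'? – tripleee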
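One possible reason a class written as `[\xe0\x00-\xf8\xff]` misbehaves can be sketched in Python's `re` module: read literally, it is the union of `\xe0`, the range `\x00-\xf8`, and `\xff`, so it matches almost anything rather than the code points U+E000–U+F8FF. This demonstrates Python's regex engine only; whether GNU `grep -P` parses the class the same way in a given locale is an assumption here, not something verified in the thread. The variable names are illustrative.

```python
import re

# The class from the grep command, read as a bytes pattern: it is the union of
# \xe0, the range \x00-\xf8, and \xff, so nearly every byte matches.
byte_class = re.compile(rb"[\xe0\x00-\xf8\xff]")

# A class built from code points instead, covering all three PUA ranges.
pua_class = re.compile("[\ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd]")

plain = "it 's past 10:00 ."      # no PUA character
with_pua = "10 \ue4d1 ten-thirty"  # contains U+E4D1

byte_hits_plain = bool(byte_class.search(plain.encode("utf-8")))
pua_hits_plain = bool(pua_class.search(plain))
pua_hits_pua = bool(pua_class.search(with_pua))
print(byte_hits_plain, pua_hits_plain, pua_hits_pua)
```

With an inverted match, a byte-oriented class like the first one would discard every non-empty line, while the code-point class only flags lines that really contain a PUA character.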
