2014-10-03 71 views
-1

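Cleaning messy strings in Python

I have a messy list (about 10K items) to clean up, and I'm having some trouble doing it with regular expressions in Python. Here is a small sample of my list: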

product_pool=["#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA", 
       "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss", 
       "(Archived)S.O.S. Steel Wool Soap Pads", 
       "(ARCHIVED) HTH Spa pH Increaser", 
       "****GLUE STICKS", 
       "-20°F Splash Windshield Washer Fluid", 
       "01127 â€「 Fing’rs Mighty Drop, 3g", 
       "10-01130-Brush On Nail Glue (Three Bond TB1743)", 
       "Aveeno® Continuous Protection Sunblock Spray Products"] 

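Ideally, I would like to remove symbols such as #, *, ®, â€「, and °F; parenthesized text such as (Archived) and (Three Bond TB1743); numeric codes such as 101, 10-01130-, and 01127; and words. The final output should look like this: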

product_pool=["BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA", 
       "Cell phone, Triangle wand 5 sections lip gloss", 
       "S.O.S. Steel Wool Soap Pads", 
       "HTH Spa pH Increaser", 
       "GLUE STICKS", 
       "Splash Windshield Washer Fluid", 
       "Fing'rs Mighty Drop", 
       "Brush On Nail Glue", 
       "Aveeno Continuous Protection Sunblock Spray Products"] 

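My approach is to split each product name on the symbols I don't want to keep, and then keep only the tokens that are entirely alphabetic. This approach doesn't seem quite right, though, so I'd appreciate any suggestions!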

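import re 

for product in product_pool: 
    # Split on a space, on ", ", or on any single character in the set ) | * - 
    product_split = re.split(' |, |[) |* |-]', product) 
    # Keep only the tokens made up entirely of letters 
    print(' '.join(ch for ch in product_split if ch.isalpha())) 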

The output looks like this:

BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA 
Cell phone Triangle wand sections lip gloss 
Steel Wool Soap Pads (S.O.S. is missing) 
HTH Spa pH Increaser 
GLUE STICKS 
Splash Windshield Washer Fluid 
Mighty Drop (Fing'rs is missing) 
Brush On Nail Glue Bond 
Continuous Protection Sunblock Spray Products (Aveeno is missing) 
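
The missing words in the output above can be traced to str.isalpha(), which returns True only when every character in a token is a letter. A minimal sketch to see which tokens get discarded, reusing the split pattern from the question; the helper name dropped_tokens is hypothetical and only here for illustration:

```python
import re

# The split pattern from the question, unchanged.
SPLIT_PATTERN = r' |, |[) |* |-]'

def dropped_tokens(product):
    """Return the non-empty tokens that the isalpha() filter discards."""
    tokens = re.split(SPLIT_PATTERN, product)
    return [tok for tok in tokens if tok and not tok.isalpha()]

# "S.O.S." contains periods and "Aveeno®" contains a symbol, so neither
# token passes isalpha() and the whole word is lost, not just the symbol.
print(dropped_tokens("(Archived)S.O.S. Steel Wool Soap Pads"))
print(dropped_tokens("Aveeno® Continuous Protection Sunblock Spray Products"))
```

This suggests that filtering whole tokens is too coarse here; removing unwanted characters inside each token keeps the rest of the word.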

Answers

3
product_pool=["#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA", 
       "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss", 
       "(Archived)S.O.S. Steel Wool Soap Pads", 
       "(ARCHIVED) HTH Spa pH Increaser", 
       "****GLUE STICKS", 
       "-20°F Splash Windshield Washer Fluid", 
       "01127 â€「 Fing’rs Mighty Drop, 3g", 
       "10-01130-Brush On Nail Glue (Three Bond TB1743)", 
       "Aveeno® Continuous Protection Sunblock Spray Products"] 

There is still some extra whitespace left over, but this may be one way to go about it.

import string 
goodChars = string.ascii_letters + '.' + ' ' 
cleaned = [''.join(i for i in word if i in goodChars) for word in product_pool] 

>>> cleaned 
[' BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA', 
'WCS  Cell phone Triangle wand   sections lip gloss', 
'ArchivedS.O.S. Steel Wool Soap Pads', 
'ARCHIVED HTH Spa pH Increaser', 
'GLUE STICKS', 
'F Splash Windshield Washer Fluid', 
'  Fingrs Mighty Drop g', 
'Brush On Nail Glue Three Bond TB', 
'Aveeno Continuous Protection Sunblock Spray Products'] 

You can play around with which characters you want to keep; see the string constants for things like string.punctuation, string.ascii_letters, and so on.
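
As a rough sketch of that idea, the whitelist can be built from the string constants and the leftover whitespace collapsed in the same pass. The function name keep_only and the whitespace-collapsing step are additions for illustration, not part of the answer above:

```python
import string

# Characters to keep: ASCII letters, the period, and the space
# (the same set as goodChars in the answer above).
KEEP = set(string.ascii_letters + ". ")

def keep_only(text, allowed=KEEP):
    """Drop every character not in `allowed`, then collapse runs of spaces."""
    filtered = "".join(ch for ch in text if ch in allowed)
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading/trailing whitespace, so re-joining normalizes spacing.
    return " ".join(filtered.split())

sample = ["#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA",
          "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss",
          "****GLUE STICKS"]
cleaned = [keep_only(s) for s in sample]
print(cleaned)
```

This only fixes the spacing; fragments of product codes such as WCS still survive, because the filter works character by character and has no notion of tokens.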

+0

Thanks for the idea. It looks like this cleanup needs multiple steps. I'll take a look at string.ascii_letters + '.' + ' ' – 2014-10-04 00:25:45

1

You can use regex substitution with re.sub:

import re 

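# (?i) has to sit at the very start of the pattern: Python 3.11+ raises an 
# error when a global inline flag appears in the middle of an expression. 
pattern = r'(?i)[^a-zA-Z\s]|archived' 
results = [re.sub(pattern, '', s).strip() for s in product_pool] 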
# ['BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA', 
# 'WCS  Cell phone Triangle wand   sections lip gloss', 
# 'SOS Steel Wool Soap Pads', 
# 'HTH Spa pH Increaser', 
# 'GLUE STICKS', 
# 'F Splash Windshield Washer Fluid', 
# 'Fingrs Mighty Drop g', 
# 'Brush On Nail Glue Three Bond TB', 
# 'Aveeno Continuous Protection Sunblock Spray Products'] 

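[^...] matches any character that is not in the set .... You can then use re.sub to replace all of those matches with an empty string, effectively removing them. The second term of the pattern matches the word archived, and (?i) tells the regex to ignore case when matching.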
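
Since a single substitution still leaves fragments such as WCS, F, and Three Bond TB, one possible multi-step variant is sketched below. The function name clean and the specific rules (drop parenthesized text, drop any token containing a digit, keep periods, commas, and apostrophes) are assumptions chosen to fit the nine sample strings, not a general solution:

```python
import re

product_pool = ["#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA",
                "#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss",
                "(Archived)S.O.S. Steel Wool Soap Pads",
                "(ARCHIVED) HTH Spa pH Increaser",
                "****GLUE STICKS",
                "-20°F Splash Windshield Washer Fluid",
                "01127 â€「 Fing’rs Mighty Drop, 3g",
                "10-01130-Brush On Nail Glue (Three Bond TB1743)",
                "Aveeno® Continuous Protection Sunblock Spray Products"]

def clean(text):
    """Hypothetical multi-step cleaner tuned to the sample data above."""
    # 1. Normalize the typographic apostrophe (U+2019) to a plain one.
    text = text.replace("\u2019", "'")
    # 2. Remove parenthesized text such as (Archived) or (Three Bond TB1743).
    text = re.sub(r"\([^)]*\)", " ", text)
    # 3. Break hyphenated codes apart so "10-01130-Brush" keeps "Brush".
    text = re.sub(r"(?<=\d)-", " ", text)
    # 4. Drop every whitespace-delimited token that contains a digit.
    text = re.sub(r"\S*\d\S*", " ", text)
    # 5. Replace remaining symbols, keeping letters, . , ' and whitespace.
    text = re.sub(r"[^A-Za-z.,'\s]", " ", text)
    # 6. Collapse whitespace and trim stray commas or spaces at the ends.
    return " ".join(text.split()).strip(" ,")

cleaned = [clean(p) for p in product_pool]
for item in cleaned:
    print(item)
```

On the sample list this reproduces the desired output from the question, with one difference: the standalone 5 in the second item is removed as well, because step 4 cannot tell it apart from a product code. Real data with digits that should be kept would need a narrower rule.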