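Cleaning messy strings in Python
I have a messy list (about 10K items) to clean up, and I'm running into some problems doing it with regular expressions in Python. Here is a small sample of my list: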
product_pool=["#101 BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA",
"#W65066CS - Cell phone, Triangle wand & 5 sections lip gloss",
"(Archived)S.O.S. Steel Wool Soap Pads",
"(ARCHIVED) HTH Spa pH Increaser",
"****GLUE STICKS",
"-20°F Splash Windshield Washer Fluid",
"01127 â€「 Fing’rs Mighty Drop, 3g",
"10-01130-Brush On Nail Glue (Three Bond TB1743)",
"Aveeno® Continuous Protection Sunblock Spray Products"]
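Ideally, I want to remove symbols such as #, *, ®, â€「, °F; parenthesized notes such as (Archived) and (Three Bond TB1743); numeric codes such as 101, 10-01130-, 01127; and the like. The final output should look like this: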
product_pool=["BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA",
"Cell phone, Triangle wand 5 sections lip gloss",
"S.O.S. Steel Wool Soap Pads",
"HTH Spa pH Increaser",
"GLUE STICKS",
"Splash Windshield Washer Fluid",
"Fing'rs Mighty Drop",
"Brush On Nail Glue",
"Aveeno Continuous Protection Sunblock Spray Products"]
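My approach is to split each product on the symbols I don't want to keep, and then keep only the purely alphabetic pieces. This doesn't seem to be quite the right approach, though, so I'd appreciate any suggestions!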
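import re
for product in product_pool:
    # Split on spaces, ", " and the characters ) | * - then keep alphabetic tokens only
    product_split = re.split(r' |, |[) |* |-]', product)
    print(' '.join(token for token in product_split if token.isalpha()))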
The output looks like this:
BUMP STOPPER RAZOR BUMP TREATMENT SENSITIVE SKIN FORMULA
Cell phone Triangle wand sections lip gloss
Steel Wool Soap Pads (S.O.S. is missing)
HTH Spa pH Increaser
GLUE STICKS
Splash Windshield Washer Fluid
Mighty Drop (Fing'rs is missing)
Brush On Nail Glue Bond
Continuous Protection Sunblock Spray Products (Aveeno is missing)
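The missing names appear to come from the `isalpha()` filter: a token is kept only if every character in it is a letter, so `S.O.S.`, `Fing’rs` and `Aveeno®` are all rejected as a whole. A small sketch to see which tokens get dropped (the helper name `rejected_tokens` is made up for illustration; the split pattern is the one from the question):

```python
import re

def rejected_tokens(product):
    """Return the non-empty tokens that the isalpha() filter throws away."""
    tokens = re.split(r' |, |[) |* |-]', product)
    return [token for token in tokens if token and not token.isalpha()]

# Punctuation or symbols inside a token make isalpha() return False
print(rejected_tokens("(Archived)S.O.S. Steel Wool Soap Pads"))
print(rejected_tokens("Aveeno® Continuous Protection Sunblock Spray Products"))
```

So the split-then-filter idea works only for tokens that are already clean; anything with a dot, apostrophe or trademark sign attached is lost.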
Thanks for your ideas. It looks like this cleanup needs multiple steps. I will check out string.ascii_letters + '.' + "'". – 2014-10-04 00:25:45
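A multi-step cleanup along those lines could be sketched roughly as follows. This is only one possible arrangement, not a definitive solution: the function name `clean_product`, the order of the steps, the whitelist (ASCII letters and digits plus space, `.`, `,`, `'`, `-`) and the list of trailing size units are all assumptions made for the sample data in the question.

```python
import re
import string

# Assumption: characters worth keeping; everything else is turned into a space.
ALLOWED = set(string.ascii_letters + string.digits + " .,'-")
# Leading item codes: any run of tokens at the start that contain a digit.
LEADING_CODES = re.compile(r"^\W*(?:[A-Za-z]*\d[A-Za-z0-9]*\W*)*")
# Assumption: trailing sizes such as ", 3g"; extend the unit list as needed.
TRAILING_SIZE = re.compile(r",?\s*\d+\s*(?:kg|g|oz|ml|l)\s*$", re.IGNORECASE)

def clean_product(name):
    # 1. Normalise the curly apostrophe so "Fing’rs" becomes "Fing'rs".
    name = name.replace("\u2019", "'")
    # 2. Drop parenthesised notes such as "(Archived)".
    name = re.sub(r"\([^)]*\)", " ", name)
    # 3. Drop temperatures such as "-20°F" before the degree sign is filtered out.
    name = re.sub(r"-?\d+\s*°\s*[FC]\b", " ", name)
    # 4. Replace every character outside the whitelist with a space.
    name = "".join(ch if ch in ALLOWED else " " for ch in name)
    # 5. Strip leading codes and a trailing size, then collapse whitespace.
    name = LEADING_CODES.sub("", name)
    name = TRAILING_SIZE.sub("", name)
    return " ".join(name.split())

print(clean_product("10-01130-Brush On Nail Glue (Three Bond TB1743)"))
print(clean_product("01127 â€「 Fing’rs Mighty Drop, 3g"))
```

Only codes at the start of a name are removed, so a number in the middle, like the `5` in `5 sections lip gloss`, is left alone. On a 10K list there will almost certainly be patterns this does not cover, so it is worth spot-checking the results.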