Check whether a string in a pandas DataFrame contains a substring and remove it

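I am cleaning the PERCENTAGE_AFFECTED column of a pandas DataFrame. It contains integer ranges (for example "70-80", "70 and 80", "65 to 70").

I want to create a function that cleans all of these up and produces an integer average.

This works: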

from functools import reduce  # built in on Python 2; must be imported on Python 3

def clean_split_range(row):
    # Initial value contains the current value for the PERCENTAGE_AFFECTED column
    initial_perc = str(row['PERCENTAGE_AFFECTED'])
    chars = '<>!,?":;() '

    # Remove chars in initial value
    if any(c in chars for c in initial_perc):
        split_range = []
        cleanWord = ""
        for char in initial_perc:
            if char in chars:
                char = ""
            cleanWord += char
        split_range.append(cleanWord)
        initial_perc = ''.join(split_range)

    # Split initial_perc into two elements if "-" is found
    split_range = initial_perc.split('-')
    # If a "-" is found, split_range will contain a list with two items
    if len(split_range) > 1:
        try:
            final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_range)))/(len(split_range)))
        except ValueError:
            split_range = split_range[0].split('+')
            final_perc = split_range[0]
        finally:
            if str(final_perc).isalpha():
                final_perc = 0

    elif initial_perc.find('and') != -1:
        split_other = initial_perc.split('and')
        if len(split_other) > 1:
            try:
                final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_other)))/(len(split_other)))
            except ValueError:
                split_other = split_other[0].split('+')
                final_perc = split_other[0]
            finally:
                if str(final_perc).isalpha():
                    final_perc = 0

    elif initial_perc.find('to') != -1:
        split_other = initial_perc.split('to')
        if len(split_other) > 1:
            try:
                final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_other)))/(len(split_other)))
            except ValueError:
                split_other = split_other[0].split('+')
                final_perc = split_other[0]
            finally:
                if str(final_perc).isalpha():
                    final_perc = 0

    elif initial_perc.find('±') != -1:
        split_other = initial_perc.split('±')
        final_perc = split_other[0]

    elif initial_perc.startswith('over'):
        split_other = initial_perc.split('over')
        final_perc = split_other[1]

    elif initial_perc.find('around') != -1:
        split_other = initial_perc.split('around')
        final_perc = split_other[1]

    elif initial_perc.isalpha():
        final_perc = 0

    # If no separator is found, keep the cleaned initial_perc as it is
    else:
        final_perc = initial_perc

    return final_perc

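However, I tried to simplify this so that the same handling applies whenever the entry contains a "-", "and", or "to" string. I created a list (split_list) of the substrings I want to split on and remove: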

from functools import reduce  # built in on Python 2; must be imported on Python 3

def new_clean_split_range(row):
    # Initial value contains the current value for the PERCENTAGE_AFFECTED column
    initial_perc = str(row['PERCENTAGE_AFFECTED'])
    chars = '<>!,?":;() '
    split_list = ['-', 'and']

    # Split initial_perc into two elements if a separator from split_list is found
    if any(a in initial_perc for a in split_list):
        for a in split_list:
            split_range = initial_perc.split(a)
            # If the separator is found, split_range will contain a list with two items
            if len(split_range) > 1:
                try:
                    final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_range)))/(len(split_range)))
                except ValueError:
                    split_range = split_range[0].split('+')
                    final_perc = split_range[0]
                finally:
                    if str(final_perc).isalpha():
                        final_perc = 0
            else:
                final_perc = initial_perc

    # Remove chars in initial value
    if any(c in chars for c in initial_perc):
        split_range = []
        cleanWord = ""
        for char in initial_perc:
            if char in chars:
                char = ""
            cleanWord += char
        split_range.append(cleanWord)
        initial_perc = ''.join(split_range)
        split_range = ''

    elif initial_perc.find('±') != -1:
        split_other = initial_perc.split('±')
        final_perc = split_other[0]

    elif initial_perc.startswith('over'):
        split_other = initial_perc.split('over')
        final_perc = split_other[1]

    elif initial_perc.find('around') != -1:
        split_other = initial_perc.split('around')
        final_perc = split_other[1]

    elif initial_perc.isalpha():
        final_perc = 0

    # If no separator is found, keep initial_perc as it is
    else:
        final_perc = initial_perc

    return final_perc

Any help would be great :)

Source

2017-06-15 Ryu Lippmann

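Please provide sample values of "initial_perc", with all the inputs and the expected outputs (you only mentioned the ones that match) – DexJ

I'm not sure how to attach it for you, but it contains integers and ranges such as "70-80", "70 and 80", "65 to 70", and also values such as "<1", "12.2 +- 5.2", "over 95", "around 50". The expected output is just a reasonable integer estimate. "12.2±5.2" can be 12.2; "over 95" can simply be 95 –

Then may I suggest a different solution from yours? Yours is a bit complicated and error-prone – DexJ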

I would suggest using a regular expression.

Check this out.

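import re

x = '70.5894-80.9894'  # x is the input string
# findall returns a list of (first, second) tuples; pop() takes the last one
results = re.findall(r"(\d{2,3}\.?\d*).*?(\d{2,3}\.?\d*)", x).pop()
print(results)
# results will be a tuple, which you can handle easily.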

With the following inputs and outputs:

Input
'70.5894-80.9894',
'70 and 85',
'65 to 70',
'72 <> 75'

Output
('70.5894', '80.9894')
('70', '85')
('65', '70')
('72', '75')
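To get from that tuple to the integer average the question asks for, something along these lines might work. This is only a sketch: the helper name `range_average` is made up for illustration, and it reuses the pattern above unchanged, so it still only recognises numbers with two or three leading digits.

```python
import re

# Same pattern as in the answer: two numbers with 2-3 leading digits
# and optional decimals, separated by anything.
RANGE_PATTERN = re.compile(r"(\d{2,3}\.?\d*).*?(\d{2,3}\.?\d*)")


def range_average(text):
    """Hypothetical helper: integer mean of the two numbers in text, or None."""
    match = RANGE_PATTERN.search(text)
    if match is None:
        # Nothing that looks like a range; let the caller decide what to do.
        return None
    low, high = (float(group) for group in match.groups())
    # int() truncates, as in the original function's int(sum / count).
    return int((low + high) / 2)


for sample in ['70.5894-80.9894', '70 and 85', '65 to 70', '72 <> 75']:
    print(sample, '->', range_average(sample))
```

Returning None instead of calling `.pop()` on an empty list avoids an IndexError for rows that do not contain a range at all.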

Source

2017-06-16 05:14:40 DexJ

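So how do I avoid the type error? Can I use a list comprehension / for loop to apply this regex approach across the DataFrame column? –

Which type error do you mean? I don't follow. And yes, you can use a for loop with this regex approach – DexJ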
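As a rough illustration of the loop idea, here is a sketch that runs a regex-based cleaner over a list of values with a list comprehension. The function name `estimate_percentage` and the sample values are assumptions for illustration (the samples come from the asker's comment), and the pattern is a looser one than in the answer so that single-digit values such as "<1" are also picked up. Since the exact error was never described, the sketch simply falls back to 0 when no number is found rather than raising.

```python
import re

# Any integer or decimal number; "-" is treated as a separator, not a sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def estimate_percentage(value):
    """Hypothetical cleaner: turn one raw entry into an integer estimate."""
    text = str(value)
    numbers = [float(n) for n in NUMBER.findall(text)]
    if not numbers:
        # Purely alphabetic or empty entries become 0, like the isalpha() branch.
        return 0
    if '±' in text or '+' in text:
        # "12.2 +- 5.2" style: keep the central value, ignore the tolerance.
        return int(numbers[0])
    # Ranges ("70-80", "70 and 80", "65 to 70") are averaged; a single
    # number ("<1", "over 95", "around 50") averages to itself.
    return int(sum(numbers) / len(numbers))


column = ['70-80', '70 and 80', '65 to 70', '<1',
          '12.2 +- 5.2', 'over 95', 'around 50', 'unknown']
cleaned = [estimate_percentage(v) for v in column]
print(cleaned)
```

On a real DataFrame the same function could presumably be applied with `df['PERCENTAGE_AFFECTED'].apply(estimate_percentage)`, since it takes a single value rather than a whole row.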
