2017-06-15 58 views

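Check if a string in a pandas DataFrame contains a substring and remove it

I am cleaning the "PERCENTAGE_AFFECTED" column of a pandas DataFrame. It contains integer ranges (for example: "70-80", "70 and 80", "65 to 70").

I want to write a function that cleans all of these and produces the integer mean of each range.

This works: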

from functools import reduce  # built in on Python 2; must be imported on Python 3

def clean_split_range(row):
    # Current value of the PERCENTAGE_AFFECTED column, as a string
    initial_perc = str(row['PERCENTAGE_AFFECTED'])
    chars = '<>!,?":;() '

    # Remove unwanted characters from the initial value
    if any(c in chars for c in initial_perc):
        split_range = []
        cleanWord = ""
        for char in initial_perc:
            if char in chars:
                char = ""
            cleanWord += char
        split_range.append(cleanWord)
        initial_perc = ''.join(split_range)

    # Split initial_perc on "-"
    split_range = initial_perc.split('-')
    # If a "-" is found, split_range contains more than one item
    if len(split_range) > 1:
        try:
            final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_range))) / (len(split_range)))
        except ValueError:
            split_range = split_range[0].split('+')
            final_perc = split_range[0]
        finally:
            if str(final_perc).isalpha():
                final_perc = 0

    elif initial_perc.find('and') != -1:
        split_other = initial_perc.split('and')
        if len(split_other) > 1:
            try:
                final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_other))) / (len(split_other)))
            except ValueError:
                split_other = split_other[0].split('+')
                final_perc = split_other[0]
            finally:
                if str(final_perc).isalpha():
                    final_perc = 0

    elif initial_perc.find('to') != -1:
        split_other = initial_perc.split('to')
        if len(split_other) > 1:
            try:
                final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_other))) / (len(split_other)))
            except ValueError:
                split_other = split_other[0].split('+')
                final_perc = split_other[0]
            finally:
                if str(final_perc).isalpha():
                    final_perc = 0

    elif initial_perc.find('±') != -1:
        # "12.2±5.2" -> keep the central value
        split_other = initial_perc.split('±')
        final_perc = split_other[0]

    elif initial_perc.startswith('over'):
        split_other = initial_perc.split('over')
        final_perc = split_other[1]

    elif initial_perc.find('around') != -1:
        split_other = initial_perc.split('around')
        final_perc = split_other[1]

    elif initial_perc.isalpha():
        final_perc = 0

    # No separator or keyword found: keep the cleaned value as it is
    else:
        final_perc = initial_perc

    return final_perc

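But: I am trying to simplify this so that a single branch handles entries containing the strings "-", "and", or "to". I created a list (split_list) of the substrings I want to split on and remove: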

def new_clean_split_range(row):
    # Current value of the PERCENTAGE_AFFECTED column, as a string
    initial_perc = str(row['PERCENTAGE_AFFECTED'])
    chars = '<>!,?":;() '
    split_list = ['-', 'and']

    # Split initial_perc into two elements if a separator from split_list is found
    if any(a in initial_perc for a in split_list):
        for a in split_list:
            split_range = initial_perc.split(a)
            # If the separator is found, split_range contains more than one item
            if len(split_range) > 1:
                try:
                    final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_range))) / (len(split_range)))
                except ValueError:
                    split_range = split_range[0].split('+')
                    final_perc = split_range[0]
                finally:
                    if str(final_perc).isalpha():
                        final_perc = 0
            else:
                final_perc = initial_perc

    # Remove unwanted characters from the initial value
    if any(c in chars for c in initial_perc):
        split_range = []
        cleanWord = ""
        for char in initial_perc:
            if char in chars:
                char = ""
            cleanWord += char
        split_range.append(cleanWord)
        initial_perc = ''.join(split_range)
        split_range = ''

    elif initial_perc.find('±') != -1:
        split_other = initial_perc.split('±')
        final_perc = split_other[0]

    elif initial_perc.startswith('over'):
        split_other = initial_perc.split('over')
        final_perc = split_other[1]

    elif initial_perc.find('around') != -1:
        split_other = initial_perc.split('around')
        final_perc = split_other[1]

    elif initial_perc.isalpha():
        final_perc = 0

    # Otherwise, keep the value as it is
    else:
        final_perc = initial_perc

    return final_perc

Any help would be great :)
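The split_list idea from the question (split on whichever separator is present) could be expressed with a single `re.split` call instead of a loop. This is only a rough sketch: the helper name `split_on_any` is hypothetical, and it assumes the separators never occur inside other words in the data (for example, "to" would also split "total").

```python
import re

split_list = ['-', 'and', 'to']  # separators named in the question

# One alternation pattern; re.escape protects separators that are regex metacharacters
separator = re.compile('|'.join(re.escape(s) for s in split_list))

def split_on_any(text):
    """Split text on any separator in split_list and strip surrounding spaces."""
    return [part.strip() for part in separator.split(text) if part.strip()]

print(split_on_any('70-80'))
print(split_on_any('70 and 80'))
print(split_on_any('65 to 70'))
```

Each call returns the list of pieces, which can then be converted to numbers and averaged.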

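Please provide "initial_perc" together with all the inputs and expected outputs (you mentioned only some of them). – DexJ

Not sure how to attach it for you, but it contains integers and ranges such as "70-80", "70 and 80", "65 to 70", and also entries like "<1", "12.2 +-5.2", "over 95", "around 50". The expected output is just an integer estimate that fits the entry. "12.2±5.2" can be 12.2; "over 95" can simply be 95. –

In that case, may I suggest a different solution from yours? Yours is a bit complicated and buggy. – DexJ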

Answer

I would suggest using a regular expression.

Check this out.

import re

x = '70.5894-80.9894'  # x is the input string
results = re.findall(r"(\d{2,3}\.?\d*).*?(\d{2,3}\.?\d*)", x).pop()
print(results)
# results is a tuple of the two matched numbers (as strings), so it is easy to handle.

With the following inputs and outputs:

Input
'70.5894-80.9894'
'70 and 85'
'65 to 70'
'72 <> 75'

Output
('70.5894', '80.9894')
('70', '85')
('65', '70')
('72', '75')
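To get from the tuple to the integer average the question asks for, the two matches can be converted to floats and averaged. A minimal sketch, assuming the same pattern as above; the helper name `range_mean` is hypothetical and not part of the original answer:

```python
import re

# Same pattern as in the answer: two numbers, each with 2-3 integer digits
PATTERN = re.compile(r"(\d{2,3}\.?\d*).*?(\d{2,3}\.?\d*)")

def range_mean(text):
    """Return the integer mean of the two numbers found in text, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None  # no pair of numbers found
    low, high = (float(group) for group in match.groups())
    return int((low + high) / 2)  # int() truncates toward zero

for value in ['70.5894-80.9894', '70 and 85', '65 to 70', '72 <> 75']:
    print(value, '->', range_mean(value))
```

Note that this pattern needs two numbers with at least two digits each, so entries such as "<1" or "over 95" produce no match and return None here.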

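So how do I avoid the type error? Can I use a list comprehension or a for loop to run this regex method over the DataFrame column? –

What do you mean by a type error? I didn't get that. Yes, you can use a for loop with this regex method. – DexJ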
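One way to run a regex over every entry, including the single-number cases mentioned in the comments ("<1", "over 95", "12.2 +-5.2"), is to extract all numbers and decide per entry. This is a sketch under assumptions, not the answerer's method: the function name `clean_percentage` and the sample list are made up for illustration, and the rules (take the first number for "±" entries, return 0 when no number is present) follow the expected outputs described in the question.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")  # integers or decimals

def clean_percentage(value):
    """Estimate an integer percentage from a free-form entry (sketch)."""
    text = str(value)
    numbers = [float(n) for n in NUMBER.findall(text)]
    if not numbers:
        return 0  # entries without any number become 0, as in the question's code
    if '±' in text or '+-' in text:
        return int(numbers[0])  # "12.2 +-5.2" -> keep the central value only
    return int(sum(numbers) / len(numbers))  # mean of a range, or the single number

column = ['70-80', '70 and 80', '65 to 70', '<1',
          '12.2 +-5.2', 'over 95', 'around 50', 'unknown']
cleaned = [clean_percentage(v) for v in column]
print(cleaned)  # [75, 75, 67, 1, 12, 95, 50, 0]
```

With pandas, the same function could be applied to the column with `df['PERCENTAGE_AFFECTED'].apply(clean_percentage)`, where `df` stands for the asker's DataFrame. Because every branch returns an int, no strings are left to be mixed with numbers later on.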
