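FuzzyWuzzy: iterate through values, match against accepted values, and return a DataFrame
PURPOSE
- Given an Excel file (full of typos), use FuzzyWuzzy to compare and match the typos against an `accepted` list.
- Correct the typo-filled Excel file with the closest `accepted` match.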
APPROACH
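- Import the Excel file with pandas
- Convert the original, typo-filled Excel file into a DataFrame
- Create an `accepted` DataFrame
- Compare the typo DataFrame with the `accepted` DataFrame using FuzzyWuzzy
- Return the original spelling, the accepted spelling, and the match score
- Append the associated accepted spelling to the original Excel file/row for every spelling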
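The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the sample rows are made up, the column names `Accepted` and `Score` are hypothetical, and `best_match` is a stand-in built on the standard library's `difflib` so that the snippet runs without FuzzyWuzzy installed. Its scores will not equal those of `fuzz.token_set_ratio`.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical sample rows standing in for the typo-filled Excel sheet.
typos = pd.DataFrame({"Expense": ["Legal Fee", "Board Fees", "Severence"]})
accepted = ["Severance", "Legal Fees", "Board Fees"]


def best_match(text, choices):
    """Return (choice, score on a 0-100 scale) for the most similar choice.

    Stand-in scorer (assumption): difflib similarity, not FuzzyWuzzy.
    """
    scored = [
        (choice, round(100 * SequenceMatcher(None, text.lower(), choice.lower()).ratio()))
        for choice in choices
    ]
    return max(scored, key=lambda pair: pair[1])


# Match every cell in the column, then append the results as new columns
# so each row keeps its original spelling next to the accepted one.
matches = [best_match(text, accepted) for text in typos["Expense"]]
typos["Accepted"] = [choice for choice, _ in matches]
typos["Score"] = [score for _, score in matches]
print(typos)
```

Writing the result back to Excel would then be a matter of calling `to_excel` on the DataFrame, which is left out here.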
CODE
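#Imports the snippets below rely on (sqldf is assumed to come from pandasql)
import pandas as pd
from pandasql import sqldf
from fuzzywuzzy import fuzz, process

#Load Excel File into dataframe
xl = pd.read_excel(open("/../data/expenses.xlsx",'rb'))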
#Let's clarify how many similar categories exist...
q = """
SELECT DISTINCT Expense
FROM xl
ORDER BY Expense ASC
"""
expenses = sqldf(q)
print(expenses)
#Let's add some acceptable categories and use fuzzywuzzy to match
accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']
#select from the list of accepted values and return the closest match
process.extractOne("Company Acquired",accepted,scorer=fuzz.token_set_ratio)
('Acquisition Fees', 38)
Not a high score, but high enough that it returns the expected output.
!!!!! ISSUE !!!!!
#Time to loop through all the expenses and use FuzzyWuzzy to generate and return the closest matches.
def correct_expense(expense):
    for expense in expenses:
        return expense, process.extractOne(expense,accepted,scorer = fuzz.token_set_ratio)

correct_expense(expenses)
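('Expense', ('Legal Fees', 47))
QUESTION
- As you can see, process.extractOne runs correctly when tested on a case-by-case basis. However, when run in a loop, the returned value is unexpected. I believe I may be grabbing the first or last column, but even if that is the case, I would expect "Board Fees" or "Acquisition" to pop up (please see the original Excel file).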
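A likely reason the string 'Expense' shows up in that output: iterating directly over a pandas DataFrame walks its column labels, not its cell values. The small sketch below illustrates this with a made-up one-column frame standing in for the query result; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the one-column result of the DISTINCT query.
expenses = pd.DataFrame({"Expense": ["Board Fee", "Aquisition"]})

# Iterating over the DataFrame itself yields the column labels.
labels = [item for item in expenses]

# Iterating over a single column yields the cell values.
values = [item for item in expenses["Expense"]]

print(labels)
print(values)
```

So a loop written as `for x in expenses` sees the label of the only column first, whereas `for x in expenses["Expense"]` sees each misspelled category.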
There are at least two problems in 'correct_expense()': you return inside the loop, and the parameter name is the same as the loop variable.
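One way to address both points is sketched below. It is not the asker's final code: `correct_expenses`, `closest`, and the `match_one` parameter are hypothetical names, the misspellings are made up, and `closest` is a stand-in matcher built on `difflib` so the snippet runs without FuzzyWuzzy installed.

```python
from difflib import SequenceMatcher


def closest(text, choices):
    """Stand-in matcher (assumption): best (choice, score) using difflib."""
    def score(choice):
        return round(100 * SequenceMatcher(None, text.lower(), choice.lower()).ratio())

    best = max(choices, key=score)
    return best, score(best)


def correct_expenses(values, choices, match_one=closest):
    """Match every value and collect the results instead of returning early."""
    results = []
    for value in values:  # loop variable is distinct from the parameter name
        results.append((value, match_one(value, choices)))
    return results  # return once, after the loop has finished


accepted = ['Severance', 'Legal Fees', 'Import & Export Fees',
            'I.T. Fees', 'Board Fees', 'Acquisition Fees']

# Hypothetical misspellings; in the question these would come from the
# values of the Expense column rather than from the DataFrame itself.
for row in correct_expenses(["Legal Fee", "Board Fees", "Severence"], accepted):
    print(row)
```

To use FuzzyWuzzy's scorer from the question instead of the stand-in, the same function can be called with `match_one=lambda value, choices: process.extractOne(value, choices, scorer=fuzz.token_set_ratio)`.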