4 Answers

2021 experience points contributed · more than 8 upvotes
import pandas as pd
from io import StringIO
from fuzzywuzzy import process
s = """full_name,dob
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012"""
df = pd.read_csv(StringIO(s))
# 1 - use fuzzywuzzy.process.extract with list comprehension
# 2 - You still have to iterate once but this method avoids the use of apply, which can be very slow
# 3 - convert the list comprehension results to a dataframe
# Note that I am limiting the results to one match. You can adjust the code as you see fit
df2 = pd.DataFrame(
    [process.extract(df['full_name'][i], df[~df.index.isin([i])]['full_name'], limit=1)[0] for i in range(len(df))],
    index=df.index, columns=['match_name', 'match_percent', 'match_index'])
# join the new dataframe to the original
final = df.join(df2)
Output:
      full_name         dob   match_name  match_percent  match_index
0   Jerry Smith  21/01/2010   Jery Smith             95            3
1   Morty Smith  18/06/2008  Morti Smith             91            4
2  Rick Sanchez  27/04/1993  Morti Smith             43            4
3    Jery Smith  27/12/2012  Jerry Smith             95            0
4   Morti Smith  13/03/2012  Morty Smith             91            1
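
To make the "best match excluding the row itself" step explicit, here is a minimal standard-library sketch. It assumes difflib's SequenceMatcher as a stand-in scorer, so its scores are not guaranteed to equal fuzzywuzzy's, and the helper names score and best_matches are hypothetical.

```python
from difflib import SequenceMatcher

def score(a: str, b: str) -> int:
    """Stand-in scorer (assumption): difflib ratio as a 0-100 integer."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_matches(names):
    """For each name, find the best-scoring other name, skipping the name's own position."""
    results = []
    for i, name in enumerate(names):
        # every candidate except the entry at position i
        others = [(other, j) for j, other in enumerate(names) if j != i]
        best_name, best_index = max(others, key=lambda pair: score(name, pair[0]))
        results.append((name, best_name, score(name, best_name), best_index))
    return results

names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]
results = best_matches(names)
for row in results:
    print(row)
```

Excluding by position rather than by value means that two identical names in the list can still match each other.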

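1827 experience points contributed · more than 4 upvotes
There are generally two things that help improve the performance:
Reduce the number of comparisons
Use a faster way to match the strings
In your implementation you perform many unnecessary comparisons, because you always compare A <-> B and later B <-> A. You also compare A <-> A, which always scores 100. Skipping these reduces the number of comparisons by more than 50%. And since you only want to keep matches that score above 90, that threshold can be used to speed up each comparison.
Your code could implement both changes as follows, which should be much faster. When tested on my machine, your code ran for about 12 seconds, while this improved version needed only 1.7 seconds.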
import pandas as pd
from io import StringIO
from rapidfuzz import fuzz
# generate a bigger list of examples to show the performance benefits
s = "fullname,dob"
s+='''
Jerry Smith,21/01/2010
Morty Smith,18/06/2008
Rick Sanchez,27/04/1993
Jery Smith,27/12/2012
Morti Smith,13/03/2012'''*500
dataframe = pd.read_csv(StringIO(s))
# only create the data series once
full_names = dataframe['fullname']
# collect the matching pairs here
_list = []
for index, row1 in full_names.items():
    # skip elements that are already compared
    for row2 in full_names.iloc[index+1:]:
        # use a score_cutoff to improve the runtime for bad matches
        score = fuzz.ratio(row1, row2, score_cutoff=90)
        if score:
            _list.append([row1, row2, score])
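
Why a cutoff helps can be illustrated with the standard library alone. The sketch below is not how rapidfuzz implements score_cutoff; it is an assumption-laden analogy using difflib, whose real_quick_ratio() and quick_ratio() methods are cheap upper bounds on ratio(). The function name ratio_with_cutoff is hypothetical.

```python
from difflib import SequenceMatcher

def ratio_with_cutoff(a: str, b: str, cutoff: float = 90.0) -> float:
    """Return the 0-100 difflib ratio, or 0.0 when the pair cannot reach `cutoff`."""
    matcher = SequenceMatcher(None, a, b)
    # Bound based only on the two lengths: very cheap, rejects badly mismatched lengths.
    if 100 * matcher.real_quick_ratio() < cutoff:
        return 0.0
    # Bound based on shared characters regardless of order: still cheaper than ratio().
    if 100 * matcher.quick_ratio() < cutoff:
        return 0.0
    # Only pairs that survive both bounds pay for the full comparison.
    score = 100 * matcher.ratio()
    return score if score >= cutoff else 0.0

print(ratio_with_cutoff("Jerry Smith", "Jery Smith"))
print(ratio_with_cutoff("Jerry Smith", "Rick Sanchez"))
print(ratio_with_cutoff("Al", "Alexander the Great"))
```

The idea is the same in spirit: when only scores above a threshold matter, most bad pairs can be discarded before the expensive part of the computation runs.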

1803 experience points contributed · more than 6 upvotes
You can first build a table of fuzzy scores for every pair of names:
import pandas as pd
from io import StringIO
from fuzzywuzzy import fuzz
data = StringIO("""
Jerry Smith
Morty Smith
Rick Sanchez
Jery Smith
Morti Smith
""")
df = pd.read_csv(data, names=['full_name'])
for index, row in df.iterrows():
    df[row['full_name']] = df['full_name'].apply(lambda x: fuzz.ratio(row['full_name'], x))
print(df.to_string())
Output:
      full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
0   Jerry Smith          100           73            26          95           64
1   Morty Smith           73          100            26          76           91
2  Rick Sanchez           26           26           100          27           35
3    Jery Smith           95           76            27         100           67
4   Morti Smith           64           91            35          67          100
Then find the best matches for a chosen name:
data_rows = df[df['Jerry Smith'] > 90]
print(data_rows)
Output:
     full_name  Jerry Smith  Morty Smith  Rick Sanchez  Jery Smith  Morti Smith
0  Jerry Smith          100           73            26          95           64
3   Jery Smith           95           76            27         100           67
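
The table above is symmetric, so it can be filled by scoring only the pairs above the diagonal and mirroring them. The following is a rough sketch under two assumptions: difflib stands in for fuzz.ratio, and the score is treated as symmetric (difflib does not strictly guarantee this for every input). The names score, score_matrix and matches_above are hypothetical.

```python
from difflib import SequenceMatcher

def score(a: str, b: str) -> int:
    """Stand-in scorer (assumption): difflib ratio as a 0-100 integer."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def score_matrix(names):
    """Build an n x n score table, computing each unordered pair only once."""
    n = len(names)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 100  # a name always matches itself
        for j in range(i + 1, n):
            # assumed symmetric: reuse the same score for (i, j) and (j, i)
            matrix[i][j] = matrix[j][i] = score(names[i], names[j])
    return matrix

def matches_above(names, matrix, target, threshold=90):
    """Return the names whose score against `target` is above `threshold`."""
    column = names.index(target)
    return [names[row] for row in range(len(names)) if matrix[row][column] > threshold]

names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]
matrix = score_matrix(names)
print(matches_above(names, matrix, "Jerry Smith"))
```

As in the filtered output above, the chosen name itself is included in the result because its own score is 100.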

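1784 experience points contributed · more than 9 upvotes
This way of comparing does double work, because running fuzz.ratio between "Jerry Smith" and "Morti Smith" is the same as running it between "Morti Smith" and "Jerry Smith".
If you iterate over only the remaining part of the column (the rows after the current one), you can get this done faster.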
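import pandas as pd
from fuzzywuzzy import fuzz, process

dataframe = pd.read_csv('datafile.csv')
_list = []
for i_dataframe in range(len(dataframe) - 1):
    comparison_fullname = dataframe['fullname'][i_dataframe]
    # compare only against the rows after the current one
    remaining = dataframe['fullname'].iloc[i_dataframe + 1:]
    # a Series is treated as dict-like, so each result is (match, score, index);
    # limit=None returns every candidate instead of only the default top 5
    for entry_fullname, entry_score, _ in process.extract(comparison_fullname, remaining, scorer=fuzz.ratio, limit=None):
        if entry_score >= 90:
            _list.append((comparison_fullname, entry_fullname, entry_score))
print(_list)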
This avoids any duplicate work.
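
The same "each pair only once" idea can also be written with itertools.combinations, which yields every unordered pair exactly once. This is a minimal sketch with difflib as an assumed stand-in scorer; the function names similarity and similar_pairs are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Stand-in scorer (assumption): difflib ratio scaled to 0-100, not fuzz.ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def similar_pairs(names, threshold=90):
    """Score each unordered pair once and keep the pairs at or above `threshold`."""
    matches = []
    comparisons = 0
    for a, b in combinations(names, 2):
        comparisons += 1
        pair_score = similarity(a, b)
        if pair_score >= threshold:
            matches.append((a, b, round(pair_score)))
    return matches, comparisons

names = ["Jerry Smith", "Morty Smith", "Rick Sanchez", "Jery Smith", "Morti Smith"]
matches, comparisons = similar_pairs(names)
print(comparisons)  # n * (n - 1) / 2 pairs instead of n * n
print(matches)
```

For n names this performs n * (n - 1) / 2 comparisons instead of n * n, which is where the saving of more than half comes from.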