
Removing duplicates in pandas based on fuzzy matching

Python

HUH函數 2021-11-16 15:45:02

I have a DataFrame containing information about people, but it has duplicate rows whose addresses differ slightly. How can I remove duplicates based on fuzzy matching, or some other way of detecting similarity, while making sure that rows with similar addresses are removed only when the first name and last name match?

Sample data:

   First name Last name               Address
0        John       Doe                 ABC 9
1        John       Doe                 KFT 2
2     Michael      John                 ABC 9
3        Mary      Jane               PEP 9/2
4        Mary      Jane              PEP, 9-2
5        Gary     Young  verylongstreetname 1
6        Gary     Young   1 verylongstretname

(The street name in row 6 is misspelled on purpose.)

Code for the sample data:

import pandas as pd

df = pd.DataFrame([
    ['John', 'Doe', 'ABC 9'],
    ['John', 'Doe', 'KFT 2'],
    ['Michael', 'John', 'ABC 9'],
    ['Mary', 'Jane', 'PEP 9/2'],
    ['Mary', 'Jane', 'PEP, 9-2'],
    ['Gary', 'Young', 'verylongstreetname 1'],
    ['Gary', 'Young', '1 verylongstretname']
], columns=['First name', 'Last name', 'Address'])

Expected output:

   First name Last name               Address
0        John       Doe                 ABC 9
1        John       Doe                 KFT 2
2     Michael      John                 ABC 9
3        Mary      Jane               PEP 9/2
4        Gary     Young  verylongstreetname 1


2 answers

九州編程


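Use str.replace to remove all non-word characters, then call drop_duplicates:

# Keep a copy of the original addresses before normalizing them
temp_address = df['Address'].copy()

# Strip every non-word character (regex=True is required in pandas 2.0+)
df['Address'] = df['Address'].str.replace(r'\W', '', regex=True)

df.drop_duplicates(inplace=True)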

Output:

   First name Last name Address
0        John       Doe    ABC9
1        John       Doe    KFT2
2     Michael      John    ABC9
3        Mary      Jane   PEP92

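Restore the original addresses:

# For each stripped address, take the first original address whose leading token appears in it
df['Address'] = df['Address'].apply(lambda x: [w for w in temp_address if w.split(' ')[0] in x][0])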

Output:

   First name Last name  Address
0        John       Doe    ABC 9
1        John       Doe    KFT 2
2     Michael      John    ABC 9
3        Mary      Jane  PEP 9/2
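The strip-then-restore steps above can also be sketched without overwriting the Address column at all. This is only an illustration, not the answerer's code: the `address_key` series and the `_key` column name are hypothetical, and lower-casing the key is an extra assumption. The normalized key is used purely for comparison, so the original addresses never need to be restored.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["John", "Doe", "ABC 9"],
        ["John", "Doe", "KFT 2"],
        ["Michael", "John", "ABC 9"],
        ["Mary", "Jane", "PEP 9/2"],
        ["Mary", "Jane", "PEP, 9-2"],
        ["Gary", "Young", "verylongstreetname 1"],
        ["Gary", "Young", "1 verylongstretname"],
    ],
    columns=["First name", "Last name", "Address"],
)

# Hypothetical comparison key: strip non-word characters and lower-case.
# The "Address" column itself is left untouched.
address_key = df["Address"].str.replace(r"\W", "", regex=True).str.lower()

# Flag rows whose (first name, last name, key) combination was already seen.
is_dup = df.assign(_key=address_key).duplicated(
    subset=["First name", "Last name", "_key"], keep="first"
)

deduped = df[~is_dup]
print(deduped)
```

With this data only the second Mary Jane row is removed. Both Gary Young rows survive, because exact comparison of a normalized key cannot see through the typo and the reordered tokens, which is where fuzzy matching becomes necessary.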

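OK, here is one approach:

# Replace every non-word character with a space (regex=True is required in pandas 2.0+)
df['Address'] = df['Address'].str.replace(r'\W', ' ', regex=True)

def check_simi(d):
    temp = []
    flag = 0
    # Collect all non-empty tokens from every address in the group
    for w in d:
        temp.extend(w.split(' '))
    temp = [t for t in temp if t]
    flag = len(temp) / 2
    # If exactly half of the tokens are distinct, treat the group as a duplicated pair
    if len(set(temp)) == flag:
        return int(d.index[0])
    else:
        return -1  # sentinel for "nothing to drop"; filtered out by i >= 0 below

indexes = df.groupby(['First name','Last name'])['Address'].apply(check_simi)
indexes = [int(i) for i in indexes if i >= 0]
df.drop(indexes)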

Output:

   First name Last name               Address
0        John       Doe                 ABC 9
1        John       Doe                 KFT 2
2     Michael      John                 ABC 9
4        Mary      Jane               PEP 9 2
6        Gary     Young  1 verylongstreetname

PS: take a look at https://github.com/seatgeek/fuzzywuzzy for a cleaner approach. I did not use it because my network does not allow it.
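When installing fuzzywuzzy is not an option, a rough stand-in can be sketched with the standard library alone. The function below is hypothetical (the name `token_sort_similarity` is not part of any library): it lower-cases the text, extracts the word tokens, sorts them, and scores the result with difflib.SequenceMatcher on a 0 to 100 scale. It follows the same idea as a token-sort comparison, but its scores should not be expected to match fuzzywuzzy's exactly.

```python
import re
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score that ignores word order and punctuation."""

    def normalize(text: str) -> str:
        # Keep only word tokens, lower-cased and sorted, joined by single spaces.
        tokens = re.findall(r"\w+", text.lower())
        return " ".join(sorted(tokens))

    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(token_sort_similarity("PEP 9/2", "PEP, 9-2"))
print(token_sort_similarity("verylongstreetname 1", "1 verylongstretname"))
print(token_sort_similarity("ABC 9", "KFT 2"))
```

On the sample addresses, the two Mary Jane variants and the two Gary Young variants score well above a threshold such as 85, while the two unrelated John Doe addresses score far below it.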

2021-11-16

holdtom


Solved.

Based on @iamklaus's answer, I wrote this code:

from fuzzywuzzy import fuzz

def remove_duplicates_inplace(df, groupby=[], similarity_field='', similar_level=85):
    def check_simi(d):
        # Collect the index labels of rows that resemble an earlier row in the group
        dupl_indexes = []
        for i in range(len(d.values) - 1):
            for j in range(i + 1, len(d.values)):
                if fuzz.token_sort_ratio(d.values[i], d.values[j]) >= similar_level:
                    dupl_indexes.append(d.index[j])
        return dupl_indexes

    # One list of duplicate index labels per (groupby) group
    indexes = df.groupby(groupby)[similarity_field].apply(check_simi)
    for index_list in indexes:
        df.drop(index_list, inplace=True)

remove_duplicates_inplace(df, groupby=['firstname', 'lastname'], similarity_field='address')

Output:

  firstname lastname               address
0      John      Doe                 ABC 9
1      John      Doe                 KFT 2
2   Michael     John                 ABC 9
3      Mary     Jane               PEP 9/2
5      Gary    Young  verylongstreetname 1
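One detail of the pairwise loop above is that every row is compared with every earlier row in its group, including rows already flagged for removal. A possible variant, sketched here under assumptions, compares each row only against the rows kept so far and returns a new DataFrame instead of modifying the input. The names `similarity` and `drop_similar` are hypothetical, and the scorer is a standard-library stand-in built on difflib rather than fuzzywuzzy, so the block runs without extra packages.

```python
import re
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    # Stand-in scorer (assumption): word-order-insensitive ratio on a 0-100 scale.
    def norm(s):
        return " ".join(sorted(re.findall(r"\w+", s.lower())))

    return 100 * SequenceMatcher(None, norm(a), norm(b)).ratio()


def drop_similar(df, group_cols, field, threshold=85):
    """Return a copy of df without rows whose `field` resembles an earlier kept row of the same group."""
    to_drop = []
    for _, group in df.groupby(group_cols):
        kept = []  # field values retained so far within this group
        for idx, value in group[field].items():
            if any(similarity(value, k) >= threshold for k in kept):
                to_drop.append(idx)
            else:
                kept.append(value)
    return df.drop(index=to_drop)


df = pd.DataFrame(
    [
        ["John", "Doe", "ABC 9"],
        ["John", "Doe", "KFT 2"],
        ["Michael", "John", "ABC 9"],
        ["Mary", "Jane", "PEP 9/2"],
        ["Mary", "Jane", "PEP, 9-2"],
        ["Gary", "Young", "verylongstreetname 1"],
        ["Gary", "Young", "1 verylongstretname"],
    ],
    columns=["First name", "Last name", "Address"],
)

result = drop_similar(df, ["First name", "Last name"], "Address")
print(result)
```

On the sample data this keeps rows 0, 1, 2, 3 and 5. The threshold of 85 mirrors the default used in the answer; a suitable value depends on the data and on the scorer used.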

2021-11-16
