How to extract the substring of a string in a column that matches another string from a list in Python

Python

楊__羊羊 2022-12-20 14:45:35

I have a DataFrame that looks like this:

         col 1  col 2
    0    59     538 Walton Avenue, Chester, FY6 7NP
    1    62     42 Chesterton Road, Peterborough, FR7 2NY
    2    179    3 Wallbridge Street, Essex, 4HG 3HT
    3    180    6 Stevenage Avenue, Coventry, 7PY 9NP

and a list similar to: [Stevenage, Essex, Coventry, Chester]

Following the solution in "How to check if Pandas rows contain any full string or substring of a list?", I have this:

    city_list = list(cities["name"])
    df["col3"] = np.where(df["col2"].str.contains('|'.join(city_list)), df["col2"], '')

This finds the rows where col 2 matches one of the strings in the list, but col3 ends up identical to col2. I want col3 to hold the matching value from the list, not a copy of col2. That would be:

         col 1  col 2                                        col3
    0    59     538 Walton Avenue, Chester, FY6 7NP          Chester
    1    62     42 Chesterton Road, Peterborough, FR7 2NY
    2    179    3 Wallbridge Street, Essex, 4HG 3HT          Essex
    3    180    6 Stevenage Avenue, Coventry, 7PY 9NP        Coventry

I tried:

    pat = "|".join(cities.name)
    df.insert(0, "name", df["col2"].str.extract('(' + pat + ')', expand = False))

but this raised an error saying there were 456 inputs when 1 was expected. I also tried:

    df["col2"] = df["col2"].apply(lambda x: difflib.get_close_matches(x, cities["name"])[0])
    df.merge(cities)

but this raised "list index out of range".

Is there a way to do this? df1 has about 160,000 entries, and each address in col2 comes from a different country, so there is no standard way of writing it. The city list has about 170,000 entries.
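The `str.extract` error described above would be consistent with some city names containing regex metacharacters such as parentheses, which create extra capture groups; that is an assumption, since the real `cities` data is not shown. A minimal sketch of a pattern-based approach under that assumption: each name is escaped with `re.escape`, the alternation is wrapped in a single group, and `\b` word boundaries keep "Chester" from matching inside "Chesterton". Taking the last match in each address is a heuristic chosen here (the city usually follows the street), not something the question specifies.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "col1": [59, 62, 179, 180],
    "col2": [
        "538 Walton Avenue, Chester, FY6 7NP",
        "42 Chesterton Road, Peterborough, FR7 2NY",
        "3 Wallbridge Street, Essex, 4HG 3HT",
        "6 Stevenage Avenue, Coventry, 7PY 9NP",
    ],
})
city_list = ["Stevenage", "Essex", "Coventry", "Chester"]

# Escape each name so characters such as "(" are matched literally
# instead of opening additional capture groups.
escaped = [re.escape(city) for city in city_list]

# Exactly one capture group; \b stops "Chester" matching inside "Chesterton".
pat = r"\b(" + "|".join(escaped) + r")\b"

# findall returns every match per row; .str[-1] keeps the last one
# (assumption: the city comes after the street name) and yields NaN
# for rows with no match, which fillna turns into "".
df["col3"] = df["col2"].str.findall(pat).str[-1].fillna("")
```

With the same pattern, `str.extract(pat, expand=False)` would return the leftmost match instead, which for the last row is "Stevenage" from the street name rather than "Coventry".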

4 Answers

嗶嗶one

Contributed 1854 experience points, received more than 8 likes

You can do something like this:

    import difflib

    city_list = ["Stevenage", "Essex", "Coventry", "Chester"]

    def get_match(row):
        # Split the address into words; adapt this preprocessing to your data
        col_2 = row["col 2"].replace(",", " ").split()
        for c in city_list:
            # get_close_matches(word, possibilities) returns a list of close matches
            if difflib.get_close_matches(c, col_2):
                return c
        return ""

    df["col 3"] = df.apply(get_match, axis=1)
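For reference, `difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)` takes a single string first and a list of candidates second, and returns a (possibly empty) list. That is why indexing `[0]` on its result can raise "list index out of range". A small sketch of fuzzy matching per address token follows; the helper name `closest_city` and the cutoff of 0.85 are illustrative assumptions, not part of the answer above.

```python
import difflib

city_list = ["Stevenage", "Essex", "Coventry", "Chester"]

def closest_city(address, cutoff=0.85):
    """Return the first city that closely matches a word of the address, or ""."""
    tokens = address.replace(",", " ").split()
    for token in tokens:
        # n=1 asks for the single best candidate; the result may be empty
        matches = difflib.get_close_matches(token, city_list, n=1, cutoff=cutoff)
        if matches:
            return matches[0]
    return ""

print(closest_city("12 High Street, Coventri, 7PY 9NP"))
```

The cutoff needs tuning against real data: "Chesterton" scores about 0.82 against "Chester", so a cutoff of 0.8 or the default 0.6 would report a match for that street name. Comparing every token against a list of roughly 170,000 cities is also likely to be slow.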

Answered 2022-12-20

慕的地10843

Contributed 1785 experience points, received more than 8 likes

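Take a look at the str.contains function, which tests whether a pattern matches each element of a Series:

    import pandas as pd

    df = pd.DataFrame([[59, '538 Walton Avenue, Chester,', 'FY6 7NP'],
                       [62, '42 Chesterton Road, Peterborough', '4HG 3HT'],
                       [179, '3 Wallbridge Street, Essex', '4HG 3HT'],
                       [180, '6 Stevenage Avenue, Coventry', '7PY 9NP']])

    city_list = ["Stevenage", "Essex", "Coventry", "Chester"]

    for city in city_list:
        # Plain substring test: 'Chester' also matches 'Chesterton', and a
        # city later in the list overwrites an earlier match on the same row
        df.loc[df[1].str.contains(city), 'match'] = city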

Answered 2022-12-20

慕慕森

Contributed 1856 experience points, received more than 17 likes

Try this:

    def aux_func(address):
        aux_list = ['Stevenage', 'Essex', 'Coventry', 'Chester']
        # split the address on commas
        address = address.split(',')
        # avoid matches with the first part of the address (the street)
        if len(address) > 1:
            # remove the first element of the address
            address = address[1:]
        for v in aux_list:
            for chunk in address:
                if v in chunk:
                    return v
        return ""

    df['col 3'] = [aux_func(address) for address in df['col 2']]

Answered 2022-12-20

波斯汪

Contributed 1811 experience points, received more than 4 likes

Rely on a helper function like this:

    import pandas as pd

    df = pd.DataFrame({'col 1': [59, 62, 179, 180],
                       'col 2': ['538 Walton Avenue, Chester, FY6 7NP',
                                 '42 Chesterton Road, Peterborough, FR7 2NY',
                                 '3 Wallbridge Street, Essex, 4HG 3HT',
                                 '6 Stevenage Avenue, Coventry, 7PY 9NP'
                                 ]})

    def aux_func(x):
        # split by comma and select the interesting part ([1]);
        # this assumes every address contains at least one comma
        x = x.split(',')
        x = x[1]
        aux_list = ['Stevenage', 'Essex', 'Coventry', 'Chester']
        for v in aux_list:
            if v in x:
                return v
        return ""

    df['col 3'] = [aux_func(name) for name in df['col 2']]
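Both helper-function answers loop over the whole city list for every address, which may be slow with around 170,000 cities and 160,000 rows as mentioned in the question. One possible variation, sketched below, stores the cities in a `set` and checks each comma-separated part of the address against it. It assumes the city appears as a whole part between commas, exactly as spelled in the list; the names `city_set` and `find_city` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col 1": [59, 62, 179, 180],
    "col 2": [
        "538 Walton Avenue, Chester, FY6 7NP",
        "42 Chesterton Road, Peterborough, FR7 2NY",
        "3 Wallbridge Street, Essex, 4HG 3HT",
        "6 Stevenage Avenue, Coventry, 7PY 9NP",
    ],
})

# A set gives constant-time membership checks on average
city_set = {"Stevenage", "Essex", "Coventry", "Chester"}

def find_city(address):
    """Return the first comma-separated part after the street that is a known city."""
    parts = [part.strip() for part in address.split(",")]
    # Skip the first part (the street) so street names are never matched
    for part in parts[1:]:
        if part in city_set:
            return part
    return ""

df["col 3"] = df["col 2"].map(find_city)
```

Because the comparison is exact, "Chesterton Road" and misspelled city names produce no match; addresses without commas also return an empty string instead of raising an error.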

Answered 2022-12-20
