
Finding neighbors in a pandas DataFrame

Python

ITMISS 2022-12-20 15:16:06

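I have a bicycle dataset with columns for the store, the location where each bike is sold, and some information about the bike model. I need to compare the number of units each model sells within each store. To do that, I need to:

1. Group the bikes by store: groups = df.groupby('store_id')
2. For each model in a store, find the models with similar characteristics, i.e. similar height, length, weight, and so on. I set a 10% difference bound for this: if two models differ in weight by less than 10%, the other model counts as a comparable neighbor.
3. For each model, see how it ranks among its competitors, and label it a "top seller" if it outperforms 50% of them.

The problem is that I don't know how to do steps 2 and 3. Does anyone have an idea? I looked at GroupBy.transform in the pandas documentation, but I don't see how it fits into the whole picture. Any help is much appreciated!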
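The 10% bound in step 2 can be read in more than one way, which matters when choosing a comparison function. The sketch below shows that np.isclose measures its relative tolerance against its second argument only, so "A is a neighbor of B" does not always imply "B is a neighbor of A". The helper within_bound is a hypothetical symmetric alternative, not something from the question, and it assumes positive weights.

```python
import numpy as np

# np.isclose(a, b, rtol=r) tests |a - b| <= atol + r * |b|, so the
# tolerance is measured against the second argument only.
forward = bool(np.isclose(100.0, 110.5, rtol=0.10))   # tolerance is 11.05
backward = bool(np.isclose(110.5, 100.0, rtol=0.10))  # tolerance is 10.0
print(forward, backward)  # True False

def within_bound(a, b, bound=0.10):
    """Hypothetical symmetric check: difference below `bound` of the larger value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.abs(a - b) < bound * np.maximum(np.abs(a), np.abs(b))

print(within_bound(100.0, 110.5), within_bound(110.5, 100.0))  # True True
```

Whether the asymmetric or the symmetric reading is the right one depends on how "10% difference" is meant for the dataset at hand.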


1 answer

侃侃無極


Try this:

import pandas as pd
import numpy as np

def sales_rank(x, df):
    # Sales of every model in x's neighborhood, indexed by model id
    df_ns = df.set_index('id').loc[x.neighbors, 'sales']
    # Highest sales first; the order among equal sales is not guaranteed
    df_ns = df_ns.sort_values(ascending=False).reset_index()
    # Position of x in that ordering (0 = best seller in the neighborhood)
    return df_ns[df_ns.id == x.id].index[0]

df = pd.DataFrame(data={'id': range(5), 'weight': [20, 21, 23, 43, 22], 'sales': [200, 100, 140, 100, 100]})

# Neighbors: ids of all models whose weight is within 10% of this model's weight
df['neighbors'] = df.weight.apply(lambda x: df.id[np.isclose(df.weight.values, x, rtol=0.10)].values)
df['sales_rank_in_neighborhood'] = df.apply(lambda x: sales_rank(x, df), axis=1)
# Top seller: ranked in the better half of its neighborhood
df['top_seller'] = df.apply(lambda x: x.sales_rank_in_neighborhood < len(x.neighbors)//2, axis=1)

print(df)

Output

   id  weight  sales     neighbors  sales_rank_in_neighborhood  top_seller
0   0      20    200     [0, 1, 4]                           0        True
1   1      21    100  [0, 1, 2, 4]                           3       False
2   2      23    140     [1, 2, 4]                           0        True
3   3      43    100           [3]                           0       False
4   4      22    100  [0, 1, 2, 4]                           2       False

Note that a single-element neighborhood never has a top seller. Adjust the rule to suit your purposes.
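As one possible adjustment, here is a minimal sketch that scores each model by the share of its competitors it strictly outsells, so that models with equal sales are treated identically regardless of sort order. The helper share_outsold and the column of the same name are hypothetical, and leaving a model with no competitors unlabeled is an assumption, not a requirement from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': range(5),
    'weight': [20, 21, 23, 43, 22],
    'sales': [200, 100, 140, 100, 100],
})

weights = df['weight'].to_numpy()
sales = df['sales'].to_numpy()

def share_outsold(i):
    """Fraction of row i's competitors that sold strictly less than row i."""
    is_neighbor = np.isclose(weights, weights[i], rtol=0.10)
    is_neighbor[i] = False          # a model does not compete with itself
    rivals = sales[is_neighbor]
    if rivals.size == 0:
        return np.nan               # assumption: no competitors, no score
    return float((rivals < sales[i]).mean())

df['share_outsold'] = [share_outsold(i) for i in range(len(df))]
# NaN > 0.5 evaluates to False, so models without competitors stay unlabeled
df['top_seller'] = df['share_outsold'] > 0.5
print(df)
```

On this sample data the labels match the output above; the two rules can differ on other data, for example when several models in a neighborhood have equal sales.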

I hope this helps!

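Edit

I added a solution that works per group, uses multiple rules to define the neighborhood, and includes a fixed sales-rank implementation: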

import pandas as pd
import numpy as np

def ns(x, df):
    # A neighbor must satisfy all three rules: weight within 10%, same gear, same type
    weight_rule = np.isclose(df.weight.values, x.weight, rtol=0.10)
    gear_rule = df.gear == x.gear
    type_rule = df.type == x.type
    return df.id[np.logical_and.reduce((weight_rule, gear_rule, type_rule))].values

def sales_rank(x, df):
    # Sales of every model in x's neighborhood, indexed by model id
    df_ns = df.set_index('id').loc[x.neighbors, 'sales']
    # Highest sales first; the order among equal sales is not guaranteed
    df_ns = df_ns.sort_values(ascending=False).reset_index()
    # Position of x in that ordering (0 = best seller in the neighborhood)
    return df_ns[df_ns.id == x.id].index[0]

df = pd.DataFrame(data={'store_id': [0, 1, 0, 1, 0], 'id': range(5), 'weight': [20, 21, 23, 43, 22], 'gear': [3, 3, 3, 7, 3], 'type': ['mountain', 'mountain', 'mountain', 'bmx', 'mountain'], 'sales': [200, 100, 140, 100, 100]})

# Columns for results
df['neighbors'] = ''
df['sales_rank_in_neighborhood'] = ''
df['top_seller'] = ''

groups = df.groupby('store_id')

for _, g in groups:
    # Work on a copy of the store's rows, then write the results back
    df_temp = df.loc[g.index, :].copy()
    df_temp.neighbors = df_temp.apply(lambda x: ns(x, df_temp), axis=1)
    df_temp.sales_rank_in_neighborhood = df_temp.apply(lambda x: sales_rank(x, df_temp), axis=1)
    df_temp.top_seller = df_temp.apply(lambda x: x.sales_rank_in_neighborhood < len(x.neighbors)//2, axis=1)
    df.loc[g.index, :] = df_temp

print(df)

Output

   store_id  id  weight  gear      type  sales  neighbors  sales_rank_in_neighborhood  top_seller
0         0   0      20     3  mountain    200     [0, 4]                           0        True
1         1   1      21     3  mountain    100        [1]                           0       False
2         0   2      23     3  mountain    140     [2, 4]                           0        True
3         1   3      43     7       bmx    100        [3]                           0       False
4         0   4      22     3  mountain    100  [0, 2, 4]                           2       False

I suppose there is a way to avoid looping over the groups, but this seems to do the trick.
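One loop-free possibility, offered as a sketch rather than a drop-in replacement, is a self-merge: joining the frame to itself on store_id, gear and type yields every candidate pair, and the weight rule then filters the pairs. The _nb suffix and the outsold and share_outsold columns are names chosen here for illustration. Note that this scores a model by the share of competitors it strictly outsells, which is a slightly different rule from the rank comparison above, and that the merge grows quadratically with group size.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'store_id': [0, 1, 0, 1, 0],
    'id': range(5),
    'weight': [20, 21, 23, 43, 22],
    'gear': [3, 3, 3, 7, 3],
    'type': ['mountain', 'mountain', 'mountain', 'bmx', 'mountain'],
    'sales': [200, 100, 140, 100, 100],
})

# Every pair of models sharing store, gear and type; '_nb' marks the neighbor side
pairs = df.merge(df, on=['store_id', 'gear', 'type'], suffixes=('', '_nb'))

# Keep pairs of distinct models whose weights are within 10% of the model's own weight
distinct = (pairs['id'] != pairs['id_nb']).to_numpy()
close = np.isclose(pairs['weight_nb'].to_numpy(), pairs['weight'].to_numpy(), rtol=0.10)
pairs = pairs[distinct & close].assign(outsold=lambda p: p['sales_nb'] < p['sales'])

# Share of competitors each model strictly outsells; models without competitors get NaN
share = pairs.groupby('id')['outsold'].mean()
df['share_outsold'] = df['id'].map(share)
df['top_seller'] = df['share_outsold'] > 0.5
print(df)
```

On the sample data this labels models 0 and 2 as top sellers, the same as the loop-based output above.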

Answered 2022-12-20
