2 回答

TA貢獻1790條經驗 獲得超9個贊
方法1:使用pivot_tablewith join(矢量化解決方案)
我們可以pivot將您的quantity值每shopid, productid列,然后將它們連接回您的原始數據框。這應該比你的要快,forloops因為這是一種矢量化方法:
piv = df.pivot_table(index=['ShopID', 'ProductID'], columns=['ShopID', 'ProductID'], values='Quantity')
piv2 = piv.ffill().bfill()
piv3 = piv2.mask(piv2.eq(piv))
final = df.set_index(['ShopID', 'ProductID']).join(piv3).reset_index()
輸出
ShopID ProductID dt Quantity (1, 1) (1, 2) (2, 1) (2, 2)
0 1 1 2019-10-01 3 NaN 4.0 5.0 6.0
1 1 1 2019-10-02 3 NaN 4.0 5.0 6.0
2 1 2 2019-10-01 4 3.0 NaN 5.0 6.0
3 1 2 2019-10-02 4 3.0 NaN 5.0 6.0
4 2 1 2019-10-01 5 3.0 4.0 NaN 6.0
5 2 1 2019-10-02 5 3.0 4.0 NaN 6.0
6 2 2 2019-10-01 6 3.0 4.0 5.0 NaN
7 2 2 2019-10-02 6 3.0 4.0 5.0 NaN
方法 2,使用GroupBy, mask, 其中:
GroupBy我們可以使用and mask+where而不是 two來加速你的代碼for-loops:
groups = df.groupby(['ShopID', 'ProductID'])
for grp, data in groups:
m = df['ShopID'].eq(grp[0]) & df['ProductID'].eq(grp[1])
values = df['Quantity'].where(m).ffill().bfill()
df[f'Quantity_{grp[0]}_{grp[1]}'] = values.mask(m)
輸出
dt ShopID ProductID Quantity Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
0 2019-10-01 1 1 3 NaN 4.0 5.0 6.0
1 2019-10-02 1 1 3 NaN 4.0 5.0 6.0
2 2019-10-01 1 2 4 3.0 NaN 5.0 6.0
3 2019-10-02 1 2 4 3.0 NaN 5.0 6.0
4 2019-10-01 2 1 5 3.0 4.0 NaN 6.0
5 2019-10-02 2 1 5 3.0 4.0 NaN 6.0
6 2019-10-01 2 2 6 3.0 4.0 5.0 NaN
7 2019-10-02 2 2 6 3.0 4.0 5.0

TA貢獻1831條經驗 獲得超10個贊
這是一個額外的問題pivot:merge
# somehow merge only works with pandas datetime
df['Date'] = pd.to_datetime(df['Date'])
# define the new column names
df['new_col'] = 'Quantity_'+df['ShopID'].astype(str) + '_' + df['ProductID'].astype(str)
# new data to merge:
pivot = df.pivot_table(index='Date',
columns='new_col',
values='Quantity')
# merge
new_df = df.merge(pivot, left_on='Date', right_index=True)
# mask
mask = new_df['new_col'].values[:,None] == pivot.columns.values
# adding the None the values:
new_df[pivot.columns] = new_df[pivot.columns].mask(mask)
輸出:
Date ShopID ProductID Quantity new_col Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
-- ------------------- -------- ----------- ---------- ------------ -------------- -------------- -------------- --------------
0 2019-10-01 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
1 2019-10-02 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
2 2019-10-01 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
3 2019-10-02 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
4 2019-10-01 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
5 2019-10-02 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
6 2019-10-01 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
7 2019-10-02 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
測試與實際數據大小相似的數據:
# 3 years dates
dates = pd.date_range('2015-01-01', '2018-12-31', freq='D')
# 12 Shops and 9 products
idx = pd.MultiIndex.from_product((dates, range(1,13), range(1,10)),
names=('Date','ShopID', 'ProductID'))
# the test data
np.random.seed(1)
df = pd.DataFrame({'Quantity':np.random.randint(0,10, len(idx))},
index=idx).reset_index()
上面的代碼在 i5 筆記本電腦上大約需要 10 秒 :-)
添加回答
舉報