
Pandas: index not preserved when running Groupby Rolling on a datetime column

Python

撒科打諢 2023-04-25 17:21:20

I have a DataFrame in which some dates are identical. To illustrate the problem, I created an example df where every date is the same.

import pandas as pd

df = pd.DataFrame({"column1": range(6),
                   "column2": range(6),
                   'group': 3*['A','B'],
                   'date': pd.date_range("20190101", periods=6)})
df.loc[:,'date'] = df.loc[0,'date']
df

# Output of DF
   column1  column2 group       date
0        0        0     A 2019-01-01
1        1        1     B 2019-01-01
2        2        2     A 2019-01-01
3        3        3     B 2019-01-01
4        4        4     A 2019-01-01
5        5        5     B 2019-01-01

The problem appears when I run a groupby rolling operation on the datetime column: the index is not preserved. When the dates are identical this matters, because the result cannot be merged back into the original DataFrame (which is my goal).

df.groupby('group').rolling('2D',on='date')['column1'].sum()

# Output of Groupby Rolling
group  date
A      2019-01-01    0.0
       2019-01-01    2.0
       2019-01-01    6.0
B      2019-01-01    1.0
       2019-01-01    4.0
       2019-01-01    9.0
Name: column1, dtype: float64

I have a working alternative, but it is much slower.

df.groupby('group').apply(lambda x: x.rolling('2D',on='date')['column1'].sum())

# Output of Groupby Apply Rolling
group
A      0    0.0
       2    2.0
       4    6.0
B      1    1.0
       3    4.0
       5    9.0
Name: column1, dtype: float64

I am hoping for something more efficient than the above.


2 Answers

莫回無

Contributed 1865 experience points, received over 7 upvotes

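For anyone interested, I built a more complex example df to test the efficiency of each of the proposed solutions.

My original approach (the slowest here, though it is efficient when there are only a few groups):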

%%timeit
# Assumes `import pandas as pd` was run in an earlier cell.
df = pd.DataFrame({"column1": range(600),
                   "column2": range(600),
                   "column3": range(600),
                   "column4": range(600),
                   "column5": range(600),
                   "column6": range(600),
                   "column7": range(600),
                   "column8": range(600),
                   'group': 5*['l'+str(i) for i in range(120)],
                   'date': pd.date_range("20190101", periods=600)})

### Set the date the same
df.loc[:,'date'] = df.loc[0,'date']

cols = ['column1','column2','column3','column4','column5','column6','column7','column8']
newcols = ['col1','col2','col3','col4','col5','col6','col7','col8']

if newcols[0] not in df.columns:
    df = df.reindex(columns=df.columns.tolist()+newcols)

# Roll inside each group, then sort by the original row label (level 1)
# so the values line up positionally with df.
df[newcols] = df.groupby('group').apply(
    lambda x: x.rolling('2D',on='date')[cols].sum()
).sort_index(level=1).drop('date',axis=1).values

# timeit output
345 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

David Erickson's solution. It is efficient when there are many groups with few observations in each.

%%timeit
# Assumes `import pandas as pd` was run in an earlier cell.
df = pd.DataFrame({"column1": range(600),
                   "column2": range(600),
                   "column3": range(600),
                   "column4": range(600),
                   "column5": range(600),
                   "column6": range(600),
                   "column7": range(600),
                   "column8": range(600),
                   'group': 5*['l'+str(i) for i in range(120)],
                   'date': pd.date_range("20190101", periods=600)})

### Set the date the same
df.loc[:,'date'] = df.loc[0,'date']

cols = ['column1','column2','column3','column4','column5','column6','column7','column8']
newcols = ['col1','col2','col3','col4','col5','col6','col7','col8']

if newcols[0] not in df.columns:
    df = df.reindex(columns=df.columns.tolist()+newcols)

# Carry the original row label through the rolling window as an 'index' column.
my_dict = {}
my_dict["index"] = "max"
my_dict.update(dict.fromkeys(cols, "sum"))

df[newcols] = df.reset_index().groupby('group').rolling(
    '2D', on='date'
).agg(my_dict).sort_values('index').drop('index',axis=1).values

# timeit output
110 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The fastest solution, proposed by RichieV:

%%timeit
# Assumes `import pandas as pd` was run in an earlier cell.
df = pd.DataFrame({"column1": range(600),
                   "column2": range(600),
                   "column3": range(600),
                   "column4": range(600),
                   "column5": range(600),
                   "column6": range(600),
                   "column7": range(600),
                   "column8": range(600),
                   'group': 5*['l'+str(i) for i in range(120)],
                   'date': pd.date_range("20190101", periods=600)})

### Set the date the same
df.loc[:,'date'] = df.loc[0,'date']

cols = ['column1','column2','column3','column4','column5','column6','column7','column8']
newcols = ['col1','col2','col3','col4','col5','col6','col7','col8']

if newcols[0] not in df.columns:
    df = df.reindex(columns=df.columns.tolist()+newcols)

# Stable sort so the row order matches the order of the groupby rolling output.
df = df.sort_values(['group','date'],kind='mergesort').reset_index(drop=True)
df[newcols] = df.groupby('group').rolling('2D',on='date')[cols].sum().values
# Restore the original row order (column1 holds 0..599 in that order).
df = df.sort_values('column1',kind='mergesort').reset_index(drop=True)

# timeit output
40 ms ± 6.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
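The sort-then-assign idea can be sketched on the six-row frame from the question. This is a variation, not the exact code above: instead of resetting the index and re-sorting by column1 afterwards, it keeps the original row labels of the sorted frame and lets pandas align on them. It assumes groupby().rolling() emits its results group by group, in the row order of each group, which is what the stable sort reproduces. The names sorted_df and col1 are illustrative.

```python
import pandas as pd

# Six rows, two groups, every date identical (as in the question).
df = pd.DataFrame({
    "column1": range(6),
    "group": 3 * ["A", "B"],
    "date": pd.to_datetime(["2019-01-01"] * 6),
})

# Stable sort: rows are laid out group by group, in date order, and ties
# keep their original relative order.
sorted_df = df.sort_values(["group", "date"], kind="mergesort")

# The result carries a (group, date) style index, so only its values are used.
rolled = sorted_df.groupby("group").rolling("2D", on="date")["column1"].sum()

# Re-attach the original row labels by position, then align on them.
df["col1"] = pd.Series(rolled.to_numpy(), index=sorted_df.index)
print(df)
```

Because the assignment aligns on labels, the frame never has to be sorted back, which avoids relying on a column such as column1 happening to encode the original order.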

2023-04-25

青春有我

Contributed 1784 experience points, received over 8 upvotes

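You can call .reset_index() and then carry the resulting index column through .groupby and .agg together with the other columns. I would expect this to be much faster than the lambda x approach.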

import pandas as pd

df = pd.DataFrame({"column1": range(6),
                   "column2": range(6),
                   'group': 3*['A','B'],
                   'date': pd.date_range("20190101", periods=6)})

# 'index' holds the original row label; its rolling max is the current row's label.
df = df.reset_index().groupby('group').rolling('5D',on='date').agg({'index' : 'max', 'column1' : 'sum'})

                  index  column1
group date
A     2019-01-01    0.0      0.0
      2019-01-03    2.0      2.0
      2019-01-05    4.0      6.0
B     2019-01-02    1.0      1.0
      2019-01-04    3.0      4.0
      2019-01-06    5.0      9.0

From there, if you want the final output without the date, you can do:

df = df.reset_index().groupby(['group','index'])['column1'].sum()

group  index
A      0.0      0.0
       2.0      2.0
       4.0      6.0
B      1.0      1.0
       3.0      4.0
       5.0      9.0
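To finish the round trip, the carried index can be used to write the rolling sums back onto the original frame. The following is a minimal sketch of that step, under a few assumptions: it computes the max and the sum as two separate column selections rather than a single .agg call, it relies on the row labels increasing within each group (so the rolling max of the index column is the current row's own label), and the names roller, row_labels, sums and column1_roll are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "column1": range(6),
    "group": 3 * ["A", "B"],
    "date": pd.date_range("20190101", periods=6),
})

# reset_index() exposes the original row labels as an ordinary 'index' column.
roller = df.reset_index().groupby("group").rolling("5D", on="date")

# Rolling max of the label column recovers each row's own label (as float).
row_labels = roller["index"].max().to_numpy().astype(int)
sums = roller["column1"].sum().to_numpy()

# Build a Series keyed by the original labels and let assignment align it.
df["column1_roll"] = pd.Series(sums, index=row_labels)
print(df)
```

If the labels are not increasing within a group, the rolling max no longer identifies the current row, and a positional approach would be the safer choice.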

2023-04-25
