
How can I apply a function to a DataFrame column using data from previous rows?

Python

喵喔喔 2022-12-20 16:42:04

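I have a DataFrame with three columns: nums, which holds the values to work with; b, which is always either 1 or 0; and result, which is currently zero everywhere except in the first row (because we need an initial value to start from). The DataFrame looks like this:

       nums  b  result
    0  20.0  1    20.0
    1  22.0  0       0
    2  30.0  1       0
    3  29.1  1       0
    4  20.0  0       0
    ...

The question

I want to go over each row of the DataFrame, starting from the second row, do some calculation, and store the output in the result column. Because I'm working with large files, I need a way to make this operation fast, which is why I want something like apply.

The calculation takes the values of nums and result from the previous row. If the b column in the current row is 0, I want (for example) to add the num and result from that previous row. If b in that row is 1, I want to subtract them instead.

What have I tried?

I tried using apply, but I couldn't access the previous row, and sadly, even if I do manage to access the previous row, the DataFrame doesn't update the result column until the end.

I also tried a loop like this, but it's too slow for the large files I'm working with:

    for i in range(1, len(df.index)):
        row = df.index[i]
        new_row = df.index[i - 1]  # get index of previous row for "nums" and "result"
        df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'],
                                               prev_num=df.loc[new_row, 'nums'],
                                               current_b=df.loc[row, 'b'])

some_calc_func looks like this (just a general example):

    def some_calc_func(prev_result, prev_num, current_b):
        if current_b == 1:
            return prev_result * prev_num / 2
        else:
            return prev_num + 17

Please answer with regard to some_calc_func.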
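To make the question reproducible, the five sample rows can be built as a small DataFrame and run through the question's loop to get a reference result. This is only a minimal sketch: the wrapper name slow_reference is hypothetical (it is not part of the question), and the data is just the five rows shown.

```python
import pandas as pd

def some_calc_func(prev_result, prev_num, current_b):
    # Example rule from the question; any function of these three values works.
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

def slow_reference(df):
    # Hypothetical wrapper around the question's row-by-row loop.
    out = df.copy()  # leave the input untouched
    for i in range(1, len(out.index)):
        row = out.index[i]
        prev_row = out.index[i - 1]
        out.loc[row, 'result'] = some_calc_func(
            prev_result=out.loc[prev_row, 'result'],
            prev_num=out.loc[prev_row, 'nums'],
            current_b=out.loc[row, 'b'],
        )
    return out

df = pd.DataFrame({
    'nums': [20.0, 22.0, 30.0, 29.1, 20.0],
    'b': [1, 0, 1, 1, 0],
    'result': [20.0, 0.0, 0.0, 0.0, 0.0],
})
reference = slow_reference(df)
print(reference)
```

A reference like this is handy for checking that any faster approach produces the same result column.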


5 answers

呼如林

Contributed 1798 experience points · received more than 3 upvotes

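If you want to keep the function some_calc_func and not use another library, you should not try to access each element at every iteration. Instead, you can use zip on the columns nums and b, offset by one row because you need nums from the previous row, and keep prev_res in memory between iterations. Also, append to a list instead of to the DataFrame, and assign the list to the column after the loop.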

    prev_res = df.loc[0, 'result']  # get first result
    l_res = [prev_res]  # initialize the list of results

    # loop with zip to get both values at same time,
    # use loc to start b at second row but not num
    for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
        # use your function to calculate the new prev_res
        prev_res = some_calc_func(prev_res, prev_num, current_b)
        # add to the list of results
        l_res.append(prev_res)

    # assign to the column
    df['result'] = l_res
    print(df)  # same result as with your method

       nums  b  result
    0  20.0  1    20.0
    1  22.0  0    37.0
    2  30.0  1   407.0
    3  29.1  1  6105.0
    4  20.0  0    46.1

Now, with a DataFrame df of 5000 rows, I get:

    %%timeit
    prev_res = df.loc[0, 'result']
    l_res = [prev_res]
    for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
        prev_res = some_calc_func(prev_res, prev_num, current_b)
        l_res.append(prev_res)
    df['result'] = l_res
    # 4.42 ms ± 695 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Your original solution is ~750 times slower:

    %%timeit
    for i in range(1, len(df.index)):
        row = df.index[i]
        new_row = df.index[i - 1]  # get index of previous row for "nums" and "result"
        df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'],
                                               prev_num=df.loc[new_row, 'nums'],
                                               current_b=df.loc[row, 'b'])
    # 3.25 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

EDIT with another library called numba, which applies if the function some_calc_func can easily be used with the Numba decorator.

    import numpy as np
    from numba import jit

    # decorate your function
    @jit
    def some_calc_func(prev_result, prev_num, current_b):
        if current_b == 1:
            return prev_result * prev_num / 2
        else:
            return prev_num + 17

    # create a function to do your job
    # numba likes numpy arrays
    @jit
    def with_numba(prev_res, arr_nums, arr_b):
        # array for results and initialize
        arr_res = np.zeros_like(arr_nums)
        arr_res[0] = prev_res
        # loop on the length of arr_b
        for i in range(len(arr_b)):
            # do the calculation and set the value in result array
            prev_res = some_calc_func(prev_res, arr_nums[i], arr_b[i])
            arr_res[i + 1] = prev_res
        return arr_res

Finally, call it like this:

    df['result'] = with_numba(df.loc[0, 'result'],
                              df['nums'].to_numpy(),
                              df.loc[1:, 'b'].to_numpy())

With timeit, this is about 9 times faster than the zip approach, and the speed-up can grow with the size of the data:

    %timeit df['result'] = with_numba(df.loc[0, 'result'],
                                      df['nums'].to_numpy(),
                                      df.loc[1:, 'b'].to_numpy())
    # 526 μs ± 45.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note that using Numba may be problematic, depending on what your actual some_calc_func does.
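Since Numba may not be installed everywhere, one option is to make the decorator optional so the same code still runs as plain Python. The following is a rough sketch under that assumption: njit is Numba's own decorator, while the fallback stub and the name run_recurrence are hypothetical and introduced here only for illustration.

```python
import numpy as np

try:
    from numba import njit  # compiled path, if Numba is available
except ImportError:
    def njit(func):
        # Fallback stub (assumption): run the plain Python function unchanged.
        return func

@njit
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

@njit
def run_recurrence(first_result, arr_nums, arr_b):
    # arr_nums: every row of nums; arr_b: b from the second row onward.
    arr_res = np.empty(len(arr_nums), dtype=np.float64)
    arr_res[0] = first_result
    prev_res = first_result
    for i in range(len(arr_b)):
        prev_res = some_calc_func(prev_res, arr_nums[i], arr_b[i])
        arr_res[i + 1] = prev_res
    return arr_res

nums = np.array([20.0, 22.0, 30.0, 29.1, 20.0])
b = np.array([1, 0, 1, 1, 0])
result = run_recurrence(20.0, nums, b[1:])
print(result)
```

The fallback only covers a missing install. If your real some_calc_func uses features Numba cannot compile, the decorated version would still fail, so it is worth testing both paths on a small sample.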

2022-12-20

慕田峪9158850

Contributed 1794 experience points · received more than 8 upvotes

IIUC:

    >>> df['result'] = (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums
    ...                 ).fillna(df.result).cumsum()
    >>> df
       nums  b  result
    0  20.0  1    20.0
    1  22.0  0    42.0
    2  30.0  1    12.0
    3  29.1  1   -17.1
    4  20.0  0     2.9

Explanation:

    # replace 0 with 1 and 1 with -1 in column `b` for rows where result==0
    >>> df[df.result.eq(0)].b.replace({0: 1, 1: -1})
    1    1
    2   -1
    3   -1
    4    1
    Name: b, dtype: int64

    # multiply with nums
    >>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums)
    0     NaN
    1    22.0
    2   -30.0
    3   -29.1
    4    20.0
    dtype: float64

    # fill the 'NaN' with the corresponding value from df.result (which is 20 here)
    >>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result)
    0    20.0
    1    22.0
    2   -30.0
    3   -29.1
    4    20.0
    dtype: float64

    # take the cumulative sum (cumsum)
    >>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result).cumsum()
    0    20.0
    1    42.0
    2    12.0
    3   -17.1
    4     2.9
    dtype: float64

For the requirement you described in the comments, I cannot think of a way without a loop:

    c1, c2 = 2, 1
    l = [df.loc[0, 'result']]  # store the first result in a list

    # then loop over the series (df.b * df.nums)
    for i, val in (df.b * df.nums).items():
        if i:  # except for 0th index
            if val == 0:  # (df.b * df.nums) == 0 if df.b == 0
                l.append(l[-1])  # append the last result
            else:  # otherwise apply the rule
                t = l[-1] * c2 + val * c1
                l.append(t)

    >>> l
    [20.0, 20.0, 80.0, 138.2, 138.2]
    >>> df['result'] = l
       nums  b  result
    0  20.0  1    20.0
    1  22.0  0    20.0
    2  30.0  1    80.0  # [ 20 * 1 + 30 * 2]
    3  29.1  1   138.2  # [ 80 * 1 + 29.1 * 2]
    4  20.0  0   138.2

This may not be fast enough; I have not tested it on a large sample.
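The same running rule can also be written with itertools.accumulate from the standard library (the initial argument needs Python 3.8 or later), which carries the previous result forward for you. This is a sketch of an alternative spelling of the loop, not a claim that it is faster. The helper name step is hypothetical, and plain lists stand in for the b and nums columns.

```python
from itertools import accumulate

c1, c2 = 2, 1
nums = [20.0, 22.0, 30.0, 29.1, 20.0]  # stands in for df.nums
b = [1, 0, 1, 1, 0]                    # stands in for df.b
first_result = 20.0                    # stands in for df.loc[0, 'result']

# Same product as (df.b * df.nums): zero wherever b == 0.
vals = [cur_b * cur_num for cur_b, cur_num in zip(b, nums)]

def step(prev, val):
    # prev is the last result, val is b * nums for the current row.
    if val == 0:
        return prev  # carry the last result forward
    return prev * c2 + val * c1

# Skip the first row: its result is the initial value.
results = list(accumulate(vals[1:], step, initial=first_result))
print(results)
```

With a DataFrame, the resulting list could be assigned back with df['result'] = results, just as with the list l.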

2022-12-20

回首憶惘然

Contributed 1847 experience points · received more than 11 upvotes

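You have an f(...) to apply, but cannot because you need to keep a memory of the (previous) row. You can do this with either a closure or a class. Below is a class implementation: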

    import pandas as pd

    class Func():

        def __init__(self, value):
            self._prev = value
            self._init = True

        def __call__(self, x):
            if self._init:
                res = self._prev
                self._init = False
            elif x.b == 0:
                res = x.nums - self._prev
            else:
                res = x.nums + self._prev

            self._prev = res
            return res

    #df = pd.read_clipboard()
    f = Func(20)
    df['result'] = df.apply(f, axis=1)

You can replace the body of __call__ with whatever you want as the some_calc_func logic.
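The closure variant mentioned above could look roughly like the following. This is a minimal sketch that mirrors the class's rule; the names make_func and step are hypothetical. Like the class, it assumes that apply calls the function exactly once per row, in order, which I believe holds for recent pandas versions but is worth verifying on yours.

```python
import pandas as pd

def make_func(initial):
    # State captured by the closure instead of instance attributes.
    prev = initial
    first = True

    def step(row):
        nonlocal prev, first
        if first:
            first = False
            return prev  # first row keeps the initial value
        if row['b'] == 0:
            res = row['nums'] - prev
        else:
            res = row['nums'] + prev
        prev = res  # remember the result for the next row
        return res

    return step

df = pd.DataFrame({
    'nums': [20.0, 22.0, 30.0, 29.1, 20.0],
    'b': [1, 0, 1, 1, 0],
})
df['result'] = df.apply(make_func(20.0), axis=1)
print(df)
```

Because the state lives inside the closure, a fresh make_func(...) call is needed for every run. Reusing an already used step would continue from the last stored value.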

2022-12-20

守著一只汪

Contributed 1872 experience points · received more than 4 upvotes

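I realize this is what @Prodipta's answer is getting at, but this approach uses the global keyword to remember the previous result on each iteration of apply: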

    prev_result = 20

    def my_calc(row):
        global prev_result
        i = int(row.name)  # the index of the current row
        if i == 0:
            return prev_result
        elif row['b'] == 1:
            out = prev_result * df.loc[i - 1, 'nums'] / 2  # loc to get prev_num
        else:
            out = df.loc[i - 1, 'nums'] + 17
        prev_result = out
        return out

    df['result'] = df.apply(my_calc, axis=1)

Result for your example data:

       nums  b  result
    0  20.0  1    20.0
    1  22.0  0    37.0
    2  30.0  1   407.0
    3  29.1  1  6105.0
    4  20.0  0    46.1

And here is a speed test in the style of @Ben T's answer. Not the best, but not the worst either?

    In[0]
    df = pd.DataFrame({'nums': np.random.randint(0, 100, 5000), 'b': np.random.choice([0, 1], 5000)})
    prev_result = 20

    %%timeit
    df['result'] = df.apply(my_calc, axis=1)

    Out[0]
    117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2022-12-20

臨摹微笑

Contributed 1982 experience points · received more than 2 upvotes

Reusing your loop and some_calc_func

I am using your loop, reduced to the bare minimum, as follows:

    for i in range(1, len(df)):
        df.loc[i, 'result'] = some_calc_func(df.loc[i, 'b'], df.loc[i - 1, 'result'], df.loc[i, 'nums'])

and some_calc_func is implemented as follows:

    def some_calc_func(bval, prev_result, curr_num):
        if bval == 0:
            return prev_result + curr_num
        else:
            return prev_result - curr_num

The result is as follows:

       nums  b  result
    0  20.0  1    20.0
    1  22.0  0    42.0
    2  30.0  1    12.0
    3  29.1  1   -17.1
    4  20.0  0     2.9

2022-12-20
