5 Answers

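Contributor: 1798 experience points, 3+ likes
If you want to keep the function some_calc_func and not use another library, you should not access each element by label on every iteration. Instead, use zip on the columns nums and b, offset by one row because you need nums from the previous row, and keep prev_res in memory between iterations. Also, append to a list rather than to the dataframe, and assign the list to the column after the loop.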
prev_res = df.loc[0, 'result']  # get the first result
l_res = [prev_res]  # initialize the list of results
# loop with zip to get both values at the same time;
# use loc to start b at the second row, but not nums
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    # use your function to calculate the new prev_res
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    # add to the list of results
    l_res.append(prev_res)
# assign to the column
df['result'] = l_res
print(df)  # same result as with your method
   nums  b  result
0  20.0  1    20.0
1  22.0  0    37.0
2  30.0  1   407.0
3  29.1  1  6105.0
4  20.0  0    46.1
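The snippet above assumes that df and some_calc_func already exist. As a minimal sketch for reproducing the printed output, the following builds the five-row sample frame from the values in the table and uses the some_calc_func body shown later in this answer. The starting value 20.0 in result is read off the first output row, and the zero-filled remainder of that column is an assumption.

```python
import pandas as pd


def some_calc_func(prev_result, prev_num, current_b):
    # Same rule as the function defined later in this answer.
    if current_b == 1:
        return prev_result * prev_num / 2
    return prev_num + 17


# Sample data taken from the printed table; only row 0 of result is meaningful.
df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0], "b": [1, 0, 1, 1, 0]})
df["result"] = 0.0
df.loc[0, "result"] = 20.0

prev_res = df.loc[0, "result"]
l_res = [prev_res]
# zip stops at the shorter input, so the last value of nums is never used.
for prev_num, current_b in zip(df["nums"], df.loc[1:, "b"]):
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    l_res.append(prev_res)

df["result"] = l_res
print(df)
```

Because zip pairs nums from row i with b from row i + 1, no explicit shift of the column is needed.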
Now, with a dataframe df of 5000 rows, I get:
%%timeit
prev_res = df.loc[0, 'result']
l_res = [prev_res]
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    l_res.append(prev_res)
df['result'] = l_res
# 4.42 ms ± 695 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
With your original solution, it is ~750 times slower:
%%timeit
for i in range(1, len(df.index)):
    row = df.index[i]
    new_row = df.index[i - 1]  # get index of previous row for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'],
                                           prev_num=df.loc[new_row, 'nums'],
                                           current_b=df.loc[row, 'b'])
# 3.25 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
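The %%timeit cells only run inside IPython or Jupyter. As a rough sketch of the same comparison with the standard-library timeit module, the code below wraps both loops in functions and times them on random data. The helper names make_df, zip_version and loc_version, the seed, and the small row count are all assumptions made for illustration; the original 5000-row frame is not shown, so the timings will not match the figures above.

```python
import timeit

import numpy as np
import pandas as pd


def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    return prev_num + 17


def make_df(n, seed=0):
    # Hypothetical test data standing in for the unseen 5000-row frame.
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({"nums": rng.integers(0, 100, n).astype(float),
                       "b": rng.integers(0, 2, n)})
    df["result"] = 0.0
    df.loc[0, "result"] = 20.0
    return df


def zip_version(df):
    prev_res = df.loc[0, "result"]
    l_res = [prev_res]
    for prev_num, current_b in zip(df["nums"], df.loc[1:, "b"]):
        prev_res = some_calc_func(prev_res, prev_num, current_b)
        l_res.append(prev_res)
    df["result"] = l_res
    return df


def loc_version(df):
    for i in range(1, len(df.index)):
        row = df.index[i]
        prev_row = df.index[i - 1]
        df.loc[row, "result"] = some_calc_func(
            prev_result=df.loc[prev_row, "result"],
            prev_num=df.loc[prev_row, "nums"],
            current_b=df.loc[row, "b"],
        )
    return df


n = 200  # kept small so the slow version finishes quickly
df_zip, df_loc = make_df(n), make_df(n)
# Both functions leave row 0 unchanged, so repeated runs give the same result.
t_zip = timeit.timeit(lambda: zip_version(df_zip), number=3)
t_loc = timeit.timeit(lambda: loc_version(df_loc), number=3)
print(f"zip: {t_zip:.4f} s, loc: {t_loc:.4f} s for 3 runs of {n} rows")
```

Raising n toward 5000 should bring the ratio closer to the one reported above, though the exact numbers depend on the machine and the pandas version.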
EDIT: with another library called numba, if the function some_calc_func can easily be used with the Numba decorator.
import numpy as np
from numba import jit

# decorate your function
@jit
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

# create a function to do your job
# numba likes numpy arrays
@jit
def with_numba(prev_res, arr_nums, arr_b):
    # array for results and initialize
    # (zeros_like inherits the dtype of arr_nums, so nums should be float)
    arr_res = np.zeros_like(arr_nums)
    arr_res[0] = prev_res
    # loop on the length of arr_b
    for i in range(len(arr_b)):
        # do the calculation and set the value in the result array
        prev_res = some_calc_func(prev_res, arr_nums[i], arr_b[i])
        arr_res[i + 1] = prev_res
    return arr_res
Finally, call it like this:
df['result'] = with_numba(df.loc[0, 'result'],
                          df['nums'].to_numpy(),
                          df.loc[1:, 'b'].to_numpy())
Timing it with timeit, this is about 9 times faster than my zip-based method, and the speedup may grow with the size of the data:
%timeit df['result'] = with_numba(df.loc[0, 'result'],
                                  df['nums'].to_numpy(),
                                  df.loc[1:, 'b'].to_numpy())
# 526 μs ± 45.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that using Numba may be problematic, depending on what your actual some_calc_func does.
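If Numba cannot compile your real some_calc_func, one possible fallback is to keep the same array-based structure and simply drop the decorators. The sketch below is an illustration rather than part of the original answer: the name with_plain_loop is hypothetical, the sample values come from the table printed earlier, and the result array is created as float explicitly so that an integer nums column would not truncate the results.

```python
import numpy as np
import pandas as pd


def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    return prev_num + 17


def with_plain_loop(prev_res, arr_nums, arr_b):
    # Same shape as with_numba, but interpreted as ordinary Python.
    arr_res = np.zeros(len(arr_nums), dtype=float)
    arr_res[0] = prev_res
    # arr_b starts at the second row, so it is one element shorter than arr_nums.
    for i in range(len(arr_b)):
        prev_res = some_calc_func(prev_res, arr_nums[i], arr_b[i])
        arr_res[i + 1] = prev_res
    return arr_res


df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0], "b": [1, 0, 1, 1, 0]})
df["result"] = with_plain_loop(20.0,
                               df["nums"].to_numpy(),
                               df.loc[1:, "b"].to_numpy())
print(df)
```

Without compilation this will not reach the Numba timing, but it keeps the data access on NumPy arrays and avoids per-row .loc lookups.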

Contributor: 1794 experience points, 8+ likes
IIUC:
>>> df['result'] = (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums
...                 ).fillna(df.result).cumsum()
>>> df
   nums  b  result
0  20.0  1    20.0
1  22.0  0    42.0
2  30.0  1    12.0
3  29.1  1   -17.1
4  20.0  0     2.9
Explanation:
# replace 0 with 1 and 1 with -1 in column `b` for rows where result==0
>>> df[df.result.eq(0)].b.replace({0: 1, 1: -1})
1    1
2   -1
3   -1
4    1
Name: b, dtype: int64
# multiply with nums
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums)
0     NaN
1    22.0
2   -30.0
3   -29.1
4    20.0
dtype: float64
# fill the 'NaN' with the corresponding value from df.result (which is 20 here)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result)
0    20.0
1    22.0
2   -30.0
3   -29.1
4    20.0
dtype: float64
# take the cumulative sum (cumsum)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result).cumsum()
0    20.0
1    42.0
2    12.0
3   -17.1
4     2.9
dtype: float64
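The steps above amount to a signed cumulative sum: add nums where b is 0, subtract it where b is 1, and start from the first result. A compact sketch of that idea with np.where follows. It assumes, as in the example, that only row 0 carries a starting value; the variable names seed and signed are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0], "b": [1, 0, 1, 1, 0]})
seed = 20.0  # starting value of result in row 0 (taken from the example)

# +nums where b == 0, -nums where b == 1
signed = np.where(df["b"].eq(0), df["nums"], -df["nums"])
signed[0] = seed  # row 0 contributes the starting value, not its own nums
df["result"] = np.cumsum(signed)
print(df)
```

This only works because the rule is additive; a recurrence that multiplies the previous result, as in the first answer, cannot be reduced to cumsum this way.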
Based on what you asked for in the comments, I can't think of a way to do it without a loop:
c1, c2 = 2, 1
l = [df.loc[0, 'result']]  # store the first result in a list
# then loop over the series (df.b * df.nums)
# (.items() replaces .iteritems(), which was removed in pandas 2.0)
for i, val in (df.b * df.nums).items():
    if i:  # except for 0th index
        if val == 0:  # (df.b * df.nums) == 0 if df.b == 0
            l.append(l[-1])  # append the last result
        else:  # otherwise apply the rule
            t = l[-1] * c2 + val * c1
            l.append(t)
>>> l
[20.0, 20.0, 80.0, 138.2, 138.2]
>>> df['result'] = l
>>> df
   nums  b  result
0  20.0  1    20.0
1  22.0  0    20.0
2  30.0  1    80.0  # [ 20 * 1 + 30 * 2]
3  29.1  1   138.2  # [ 80 * 1 + 29.1 * 2]
4  20.0  0   138.2
It doesn't seem fast enough, though; I haven't tested it on a large sample.
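The recurrence in that loop (keep the previous result when b is 0, otherwise take previous * c2 + nums * c1) can also be expressed with itertools.accumulate, which still iterates in Python but removes the manual list handling. This is a sketch of an alternative, not the answerer's code. It tests b directly rather than the product b * nums, and the helper name step is hypothetical.

```python
from itertools import accumulate

import pandas as pd

c1, c2 = 2, 1
df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0], "b": [1, 0, 1, 1, 0]})


def step(prev, row):
    # row is a (nums, b) pair for the current row
    nums, b = row
    return prev if b == 0 else prev * c2 + nums * c1


# Skip row 0: its result is the initial value passed to accumulate.
rows = zip(df["nums"].iloc[1:], df["b"].iloc[1:])
df["result"] = list(accumulate(rows, step, initial=20.0))
print(df)
```

The initial keyword requires Python 3.8 or later.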

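Contributor: 1847 experience points, 11+ likes
You have a function f(...) that you would like to apply, but you cannot, because it needs to remember the (previous) row. You can do this with either a closure or a class. Below is a class implementation: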
import pandas as pd

class Func():

    def __init__(self, value):
        self._prev = value
        self._init = True

    def __call__(self, x):
        if self._init:
            res = self._prev
            self._init = False
        elif x.b == 0:
            res = x.nums - self._prev
        else:
            res = x.nums + self._prev

        self._prev = res
        return res

#df = pd.read_clipboard()
f = Func(20)
df['result'] = df.apply(f, axis=1)
You can replace the body of __call__ with whatever you want from the body of some_calc_func.
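The answer mentions a closure as the other option but only shows the class. A minimal sketch of the closure variant follows, using the same rule as Func.__call__. The name make_func and the sample data are assumptions, and the approach relies on apply calling the function once per row, in order.

```python
import pandas as pd


def make_func(initial):
    prev = initial
    first = True

    def f(row):
        # nonlocal lets the inner function update the enclosing state
        nonlocal prev, first
        if first:
            first = False
            return prev
        if row["b"] == 0:
            prev = row["nums"] - prev
        else:
            prev = row["nums"] + prev
        return prev

    return f


df = pd.DataFrame({"nums": [20.0, 22.0, 30.0, 29.1, 20.0], "b": [1, 0, 1, 1, 0]})
df["result"] = df.apply(make_func(20), axis=1)
print(df)
```

Calling make_func again gives a fresh function with its own state, which avoids stale values if the column is recomputed.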

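Contributor: 1872 experience points, 4+ likes
I realize this is what @Prodipta's answer was getting at, but this approach uses the global keyword to remember the previous result on each iteration of apply: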
prev_result = 20

def my_calc(row):
    global prev_result
    i = int(row.name)  # the index of the current row
    if i == 0:
        return prev_result
    elif row['b'] == 1:
        out = prev_result * df.loc[i-1, 'nums'] / 2  # loc to get prev_num
    else:
        out = df.loc[i-1, 'nums'] + 17
    prev_result = out
    return out

df['result'] = df.apply(my_calc, axis=1)
Result for your example data:
   nums  b  result
0  20.0  1    20.0
1  22.0  0    37.0
2  30.0  1   407.0
3  29.1  1  6105.0
4  20.0  0    46.1
Here is a speed test in the style of @Ben T's answer. Not the best, but not the worst either?
In[0]
import numpy as np
import pandas as pd
df = pd.DataFrame({'nums': np.random.randint(0, 100, 5000), 'b': np.random.choice([0, 1], 5000)})
prev_result = 20
%timeit df['result'] = df.apply(my_calc, axis=1)
Out[0]
117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Contributor: 1982 experience points, 2+ likes
Reusing your loop and some_calc_func
I am using your loop, reduced to the bare minimum, as shown below:
for i in range(1, len(df)):
    df.loc[i, 'result'] = some_calc_func(df.loc[i, 'b'], df.loc[i - 1, 'result'], df.loc[i, 'nums'])
and some_calc_func is implemented as follows:
def some_calc_func(bval, prev_result, curr_num):
    if bval == 0:
        return prev_result + curr_num
    else:
        return prev_result - curr_num
The result is as follows:
   nums  b  result
0  20.0  1    20.0
1  22.0  0    42.0
2  30.0  1    12.0
3  29.1  1   -17.1
4  20.0  0     2.9