2 Answers

Contributor with 1785 experience points · more than 4 upvotes
Would counting the sequence starts work? Then simply set the ignored rows (flag 4) to NaN. Like this:
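import numpy

sequence_starts = df.sequence == 2
sequence_ignore = df.sequence == 4
# The running count of starts is an integer Series; cast to float so it can hold NaN.
sequence_id = sequence_starts.cumsum().astype(float)
sequence_id[sequence_ignore] = numpy.nan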
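
To see what this produces, here is a minimal sketch on a small made-up DataFrame. The sample values are an assumption (only the flags 2 for "start" and 4 for "ignore" come from the answer; the 1s stand in for any other value), and the output column name `desired_Id_output` is borrowed from the other answer. It uses `Series.mask` instead of item assignment, which should give the same result.

```python
import pandas as pd

# Hypothetical sample data: 2 marks a sequence start, 4 marks a row to ignore,
# and 1 stands in for any other value.
df = pd.DataFrame({"sequence": [2, 1, 1, 4, 2, 1, 4, 2]})

sequence_starts = df["sequence"] == 2
sequence_ignore = df["sequence"] == 4

# Running count of starts, cast to float so the column can hold NaN.
sequence_id = sequence_starts.cumsum().astype("float64")
# Replace the ids of ignored rows with NaN.
sequence_id = sequence_id.mask(sequence_ignore)

df["desired_Id_output"] = sequence_id
print(df)
```

Note that any rows before the first start flag would get id 0 with this approach, since no start has been counted yet.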

Contributor with 1829 experience points · more than 6 upvotes
I can't think of anything better than the "dumb" solution of looping over the whole thing, for example:
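import numpy as np

counter = 0
sequence = df['sequence'].values
# np.float was removed from NumPy; use np.float64 so the array can hold NaN.
tmp = np.empty_like(sequence, dtype=np.float64)
for i in range(len(tmp)):
    if sequence[i] == 4:
        # Flag 4: ignored row, no sequence id.
        tmp[i] = np.nan
    else:
        if sequence[i] == 2:
            # Flag 2: a new sequence starts here.
            counter += 1
        tmp[i] = counter
df['desired_Id_output'] = tmp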
Of course, this will be slow for a DataFrame with 20M rows. One way to speed it up is just-in-time compilation with numba:
import numba

@numba.njit
def foo(sequence):
    # put in appropriate modification of the above code block
    return tmp
and call it with df['sequence'].values as the argument.
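
As a rough sketch of what the filled-in function might look like, assuming the same flags as in the loop (2 starts a sequence, 4 is ignored): the name `assign_sequence_ids` and the sample array are hypothetical, and the fallback decorator is only there so the sketch still runs as plain Python when numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to an uncompiled function if numba is unavailable.
    def njit(func):
        return func


@njit
def assign_sequence_ids(sequence):
    # Hypothetical name; same logic as the plain loop, written against a NumPy array.
    out = np.empty(sequence.shape[0], dtype=np.float64)
    counter = 0
    for i in range(sequence.shape[0]):
        if sequence[i] == 4:
            # Ignored row: no sequence id.
            out[i] = np.nan
        else:
            if sequence[i] == 2:
                # A new sequence starts here.
                counter += 1
            out[i] = counter
    return out


# Hypothetical sample input standing in for df['sequence'].values.
sample = np.array([2, 1, 1, 4, 2, 1, 4, 2], dtype=np.int64)
ids = assign_sequence_ids(sample)
print(ids)
```

With a DataFrame, the call would be `df['desired_Id_output'] = assign_sequence_ids(df['sequence'].values)`. When numba is used, the first call includes the compilation time, so the speedup shows on large inputs or repeated calls.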