
How to count values in a DataFrame under two conditions

Python

元芳怎么了 2023-05-23 15:55:23

I am new to pandas and I have an initial data frame with some data, for example a table of size M x N holding numbers from 0 to 999.

import numpy as np
import pandas as pd

# initial dataframe with random numbers
np.random.seed(123)
M = 100
N = 1000
raw_df = pd.DataFrame(
    np.array([(np.random.choice([f'index_{i}' for i in range(1, 5)]),
               *[np.random.randint(1000) for i in range(M)])
              for n in range(N)]),
    columns=['index', *range(M)])
raw_df.set_index('index', inplace=True)

It looks like this:

index      0    1    2    3    4  ...   95   96   97   98   99
index_3  365  382  322  988   98  ...  980  824  305  780  530
index_2  513   51  940  885  745  ...  493   77    8  206  390
index_2  222  198  552  887  970  ...  791  731  695  290  293
index_2  855  853  665  401  186  ...  803  881   83  350  583
index_4  855  501  851  886  334  ...  771  735  233  219  247

I want to count the occurrences of each value for each index label, like this:

index      0    1    2    3    4  ...  995  996  997  998  999
index_1   19   19   29   30   19  ...   21   16   19   24   31
index_2   26   29   32   18   18  ...   22   26   38   38   19
index_3   24   23   32   36   22  ...   23   17   23   24   22
index_4   41   21   24   28   26  ...   26   30   33   33   37

My code finishes in 12 seconds. Is there a way to make it faster, for example twice as fast?

# create new df
df = pd.DataFrame(raw_df.index.unique(), columns=['index']).set_index('index')
df.sort_index(inplace=True)

# create new columns
unique_values = set()
for column in raw_df.columns:
    unique_values.update(raw_df[column].unique())
df_rows = sorted(unique_values, key=lambda x: int(x))

# fill all dataframe by zeros
for row in df_rows:
    df.loc[:, str(row)] = 0

# fill new dataframe
for column in raw_df.columns:
    small_df = raw_df.groupby(by=['index', column])[column].count().to_frame(name='count').reset_index()
    small_df.drop_duplicates()
    for index in small_df.index:
        name = small_df.at[index, 'index']       # index_1
        raw_column = small_df.at[index, column]  # 6943
        count = small_df.at[index, 'count']      # 1
        df[raw_column][name] += count


4 Answers

ITMISS

Contributed 1871 experience points · 8+ likes

Here is one way. I start from the data frame you created.

t = (raw_df
     .unstack()      # move column labels down to row labels
     .squeeze()      # convert from data frame to series
     .reset_index()  # convert Index (row labels) to ordinary columns
     .rename(columns={0: 'x', 'level_0': 'val'})
     .pivot_table(index='x', columns='index', values='val', aggfunc='count')
)
print(t)

index  index_1  index_2  index_3  index_4
0           19       26       24       41
1           19       29       23       21
10          24       31       25       29
100         17       28       15       18
101         25       16       27       19
..         ...      ...      ...      ...

I simply transposed your expected output so that it fits the screen better.
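For anyone who wants to see the reshape-then-count idea on a small scale, here is a minimal sketch. It does not use the exact calls from the answer above: it swaps in `melt` and `pd.crosstab` to reach the same kind of table. The frame `small` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for raw_df: string values, repeated row labels.
small = pd.DataFrame(
    [["5", "7"], ["5", "5"], ["7", "7"]],
    index=pd.Index(["index_1", "index_1", "index_2"], name="index"),
    columns=[0, 1],
)

# Long format: one row per (index label, original column, value).
long = small.reset_index().melt(id_vars="index", var_name="col", value_name="x")

# Count how often each value x occurs under each index label.
t = pd.crosstab(long["x"], long["index"])
print(t)
```

Unlike `pivot_table` with `aggfunc='count'`, `pd.crosstab` fills combinations that never occur with 0 rather than NaN.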

2023-05-23

鳳凰求蠱

Contributed 1825 experience points · 4+ likes

Update, a faster version:

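def f(x):
    # count occurrences of every integer value in this group
    y = np.bincount(x.to_numpy(dtype='int').flatten())
    # keep only the values that actually occur
    ii = np.nonzero(y)[0]
    # y[ii] keeps data and index the same length even if some value is missing
    return pd.Series(y[ii], index=ii)

raw_df.groupby(level=0).apply(f)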

Output:

           0   1   2   3   4   5   6   7   8   9 ... 990 991 992 993 994 995 996 997 998 999
index                                            ...
index_1   19  19  29  30  19  25  20  17  22  24 ...  23  21  23  25  22  21  16  19  24  31
index_2   26  29  32  18  18  22  24  22  22  24 ...  24  31  28  17  34  22  26  38  38  19
index_3   24  23  32  36  22  18  24  23  28  30 ...  29  23  25  21  25  23  17  23  24  22
index_4   41  21  24  28  26  33  28  29  31  19 ...  25  26  36  29  24  26  30  33  33  37

[4 rows x 1000 columns]
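One thing worth knowing about `np.bincount`: its result only extends to the largest value seen in a group, so groups can end up with different lengths. Below is a minimal sketch of a variant that pins the length with `minlength`. It assumes the values are already non-negative integers (for the question's data, something like `raw_df.astype(int)` first). The names `small`, `count_values` and `N_VALUES` are hypothetical.

```python
import numpy as np
import pandas as pd

N_VALUES = 3  # assumption: every value lies in range(N_VALUES)

# Hypothetical stand-in for raw_df, already holding integers.
small = pd.DataFrame(
    [[0, 2], [2, 2], [1, 1]],
    index=pd.Index(["index_1", "index_1", "index_2"], name="index"),
)

def count_values(group):
    # Flatten the group's cells into one 1-D integer array.
    flat = group.to_numpy().ravel()
    # minlength pads with zeros so every group yields N_VALUES counts.
    counts = np.bincount(flat, minlength=N_VALUES)
    return pd.Series(counts, index=range(N_VALUES))

result = small.groupby(level=0).apply(count_values)
print(result)
```

Because every group returns a Series with the same index, the result comes back as one row per index label and one column per value, with explicit zeros for values that never occur.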

Try this:

raw_df.groupby(level=0).apply(lambda x: pd.Series(dict(zip(*np.unique(x, return_counts=True)))))

Output:

           0   1  10 100 101 102 103 104 105 106 ... 990 991 992 993 994 995 996 997 998 999
index                                            ...
index_1   19  19  24  17  25  32  25  17  21  22 ...  23  21  23  25  22  21  16  19  24  31
index_2   26  29  31  28  16  24  15  18  19  29 ...  24  31  28  17  34  22  26  38  38  19
index_3   24  23  25  15  27  21  22  31  24  21 ...  29  23  25  21  25  23  17  23  24  22
index_4   41  21  29  18  19  16  30  26  28  17 ...  25  26  36  29  24  26  30  33  33  37

[4 rows x 1000 columns]

2023-05-23

胡說叔叔

Contributed 1804 experience points · 8+ likes

df1 = raw_df.stack().groupby(level=[0]).value_counts().unstack(1, fill_value=0)
df1

Output:

0 1 10 100 101 102 103 104 105 106 107 108 109 11 110 111 112 113 114 115 116 117 118 119 12 120 121 122 123 124 125 126 127 128 129 13 130 131 132 133 ... 963 964 965 966 967 968 969 97 970 971 972 973 974 975 976 977 978 979 98 980 981 982 983 984 985 986 987 988 989 99 990 991 992 993 994 995 996 997 998 999

index

index_1 19 19 24 17 25 32 25 17 21 22 26 29 26 16 22 23 23 22 25 12 22 29 23 26 20 27 20 27 21 29 29 21 25 19 21 19 37 25 23 20 ... 18 23 24 31 31 19 27 29 21 25 24 27 27 33 22 26 26 17 24 27 23 24 21 20 24 31 20 22 24 28 23 21 23 25 22 21 16 19 24 31

index_2 26 29 31 28 16 24 15 18 19 29 24 20 18 18 29 21 20 27 20 27 22 22 27 16 27 17 25 24 18 28 23 32 23 38 25 21 22 27 24 19 ... 22 23 24 18 25 27 28 20 32 38 19 26 27 19 23 25 23 23 25 23 16 21 15 29 23 24 16 26 22 29 24 31 28 17 34 22 26 38 38 19

index_3 24 23 25 15 27 21 22 31 24 21 24 24 29 23 18 20 21 23 25 22 24 31 22 30 17 28 33 26 33 28 20 24 23 26 32 23 28 21 18 48 ... 22 26 23 26 27 15 25 29 29 25 34 21 38 24 18 19 22 30 25 21 23 23 29 38 29 20 26 26 19 30 29 23 25 21 25 23 17 23 24 22

index_4 41 21 29 18 19 16 30 26 28 17 22 18 33 30 33 22 30 25 26 36 25 28 25 23 20 28 35 36 31 28 17 31 30 32 31 20 28 15 28 21 ... 24 27 31 28 33 25 31 21 18 28 27 30 27 27 30 36 24 24 30 27 29 33 20 27 25 29 31 18 27 27 25 26 36 29 24 26 30 33 33 37

To sort the columns:

# column labels are strings, so build '0'..'999' in numeric order
p = [str(i) for i in range(1000)]
df1 = df1.reindex(columns=p)
df1

Result:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999

index

index_1 19 19 29 30 19 25 20 17 22 24 24 16 20 19 25 26 24 25 22 26 23 20 33 12 17 22 21 28 24 17 26 20 22 24 35 22 23 23 23 28 ... 27 23 25 18 23 24 31 31 19 27 21 25 24 27 27 33 22 26 26 17 27 23 24 21 20 24 31 20 22 24 23 21 23 25 22 21 16 19 24 31

index_2 26 29 32 18 18 22 24 22 22 24 31 18 27 21 21 25 26 32 23 21 31 22 29 31 18 39 21 19 30 29 17 23 24 26 22 26 26 27 28 22 ... 22 21 27 22 23 24 18 25 27 28 32 38 19 26 27 19 23 25 23 23 23 16 21 15 29 23 24 16 26 22 24 31 28 17 34 22 26 38 38 19

index_3 24 23 32 36 22 18 24 23 28 30 25 23 17 23 39 23 41 32 14 21 34 23 26 22 27 21 27 16 27 25 27 19 28 23 24 33 26 15 22 19 ... 26 41 22 22 26 23 26 27 15 25 29 25 34 21 38 24 18 19 22 30 21 23 23 29 38 29 20 26 26 19 29 23 25 21 25 23 17 23 24 22

index_4 41 21 24 28 26 33 28 29 31 19 29 30 20 20 34 36 29 34 27 29 27 22 25 33 25 23 29 28 27 26 29 31 27 30 28 13 29 16 30 31 ... 25 27 23 24 27 31 28 33 25 31 18 28 27 30 27 27 30 36 24 24 27 29 33 20 27 25 29 31 18 27 25 26 36 29 24 26 30 33 33 37
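The columns come out as 0, 1, 10, 100, ... because the labels are strings and strings sort lexicographically. If you would rather not hard-code the range, a minimal sketch of sorting the existing labels numerically might look like this. The frame `counts` is a made-up stand-in for `df1`.

```python
import pandas as pd

# Hypothetical stand-in for df1: string column labels in lexicographic order.
counts = pd.DataFrame(
    [[1, 2, 3, 4]],
    index=["index_1"],
    columns=["0", "10", "100", "2"],
)

# Order the labels by their integer value instead of as text.
ordered = sorted(counts.columns, key=int)
counts_sorted = counts[ordered]
print(counts_sorted)
```

This reorders only the labels that are already present, whereas `reindex(columns=p)` also adds a column for any label in `p` that is missing.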

2023-05-23

慕碼人2483693

Contributed 1860 experience points · 9+ likes

On my laptop your solution takes about 43 seconds; this one solves it in 0.16 seconds.

df = raw_df.groupby('index').apply(lambda x: x.values.flatten()).explode()
df = df.groupby(['index', df]).size().unstack()
df.columns = [int(i) for i in df.columns]
df.sort_index(axis=1, inplace=True)

Output:

           0   1   2   3   4   5   6   7   8 ... 991 992 993 994 995 996 997 998 999
index                                        ...
index_1   19  19  29  30  19  25  20  17  22 ...  21  23  25  22  21  16  19  24  31
index_2   26  29  32  18  18  22  24  22  22 ...  31  28  17  34  22  26  38  38  19
index_3   24  23  32  36  22  18  24  23  28 ...  23  25  21  25  23  17  23  24  22
index_4   41  21  24  28  26  33  28  29  31 ...  26  36  29  24  26  30  33  33  37

[4 rows x 1000 columns]

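Update

In the name of science, and with the sole goal of understanding all the proposed approaches, here is a timing test: one loop per option, with time.process_time() as the benchmark.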

scottboston2   0.08s
richiev        0.14s
atanucse       0.16s
scottboston    0.30s
jsmart         0.39s
razor1ty      36.69s

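As you can see, by avoiding loops every answer is roughly 90 to 460 times faster than the original code. In general, all the answers take the same approach: reshape raw_df, then aggregate by count/size.

ScottBoston's updated version does all the heavy lifting in numpy and uses pandas only for the grouping; it is in the lead so far.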
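The timing script itself was not posted, so the following is only a sketch of how a one-pass comparison with `time.process_time()` could be set up. The helper `time_once`, the two candidate functions and the small random frame are all hypothetical, and the candidates are simplified variants rather than the answers' exact code. Note that `time.process_time()` reports CPU time of the current process, not wall-clock time.

```python
import time
import numpy as np
import pandas as pd

def time_once(func, df):
    # Run func a single time and return (CPU seconds, result).
    start = time.process_time()
    result = func(df)
    return time.process_time() - start, result

# Hypothetical data: 200 rows, 5 columns, integer values 0..9.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.integers(0, 10, size=(200, 5)),
    index=pd.Index(rng.choice(["index_1", "index_2"], size=200), name="index"),
)

def via_crosstab(df):
    # Reshape to long format, then cross-tabulate label against value.
    long = df.reset_index().melt(id_vars="index", value_name="x")
    table = pd.crosstab(long["index"], long["x"])
    return table.reindex(columns=range(10), fill_value=0)

def via_bincount(df):
    # Group in pandas, count in numpy.
    return df.groupby(level=0).apply(
        lambda g: pd.Series(np.bincount(g.to_numpy().ravel(), minlength=10))
    )

t_crosstab, out_crosstab = time_once(via_crosstab, data)
t_bincount, out_bincount = time_once(via_bincount, data)
print(f"crosstab {t_crosstab:.4f}s, bincount {t_bincount:.4f}s")
```

A single pass on data this small is noisy, so treat the numbers as a rough ordering at best; checking that both candidates return the same counts is the more useful part.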

2023-05-23
