Balancing the number of rows with specific pairs

Python

拉丁的傳說 2023-10-18 15:56:46

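So, I have a pandas DataFrame that looks like this:

data | Flag  | Set
-----------------------------
0    | True  | A
30   | True  | A
-1   | False | A
20   | True  | B
5    | False | B
19   | False | B
7    | False | C
8    | False | c

How can I (elegantly) drop rows so that every set ends up with the same number of True and False flags? The output should look like this:

data | Flag  | Set
-----------------------------
0    | True  | A
-1   | False | A
20   | True  | B
5    | False | B

That is because A has only 1 False flag, B has only 1 True flag, and C has 0 True flags. I know how to brute-force it, but I have a feeling there is an elegant way I don't know about.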
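For anyone who wants to reproduce the question, here is a minimal sketch that rebuilds the example frame and counts the flags per set. The names n_true and n_false are my own, and the lowercase "c" in the last row is kept exactly as posted.

```python
import pandas as pd

# Example data as posted in the question (the last Set value is a lowercase "c").
df = pd.DataFrame({
    "data": [0, 30, -1, 20, 5, 19, 7, 8],
    "Flag": [True, True, False, True, False, False, False, False],
    "Set": ["A", "A", "A", "B", "B", "B", "C", "c"],
})

# Count True and False rows per set to see where the imbalance is.
n_true = df[df["Flag"]].groupby("Set").size()
n_false = df[~df["Flag"]].groupby("Set").size()
print(n_true.to_dict())
print(n_false.to_dict())
```

Sets that appear in only one of the two counts have no partner rows at all, so they should disappear from the balanced result.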


3 answers

慕村225694

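First, get the counts of Flag per Set with crosstab, filter out the rows containing 0 (they mean a set has only True or only False values), and take the per-set minimum as a dictionary d: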

df1 = pd.crosstab(df['Set'], df['Flag'])
d = df1[df1.ne(0).all(axis=1)].min(axis=1).to_dict()
print(d)

{'A': 1, 'B': 1}

Then filter the rows whose Set is among the dictionary's keys, and use DataFrame.head per group with the counts from the dict:

df1 = (df[df['Set'].isin(d.keys())]
           .groupby(['Set', 'Flag'], group_keys=False)
           .apply(lambda x: x.head(d[x.name[0]])))
print(df1)

   data   Flag Set
2    -1  False   A
0     0   True   A
4     5  False   B
3    20   True   B

Edit: To verify the solution, here is what it returns when set A has both True and False 2 times each:

print(df)

   data   Flag Set
0     0   True   A
1     8   True   A
2    30   True   A
3    -1  False   A
4   -14  False   A
5    20   True   B
6     5  False   B
7    19  False   B
8     7  False   C
9     8  False   c

df1 = pd.crosstab(df['Set'], df['Flag'])
d = df1[df1.ne(0).all(axis=1)].min(axis=1).to_dict()
print(d)

{'A': 2, 'B': 1}

df1 = (df[df['Set'].isin(d.keys())]
           .groupby(['Set', 'Flag'], group_keys=False)
           .apply(lambda x: x.head(d[x.name[0]])))
print(df1)

   data   Flag Set
3    -1  False   A
4   -14  False   A
0     0   True   A
1     8   True   A
6     5  False   B
5    20   True   B
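The same idea can probably be written without groupby.apply, which also keeps the rows in their original order. The following is only a sketch under my own assumptions: the names counts, limit, rank and balanced are hypothetical, and it assumes both True and False occur somewhere in the frame so that crosstab produces two columns.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "data": [0, 30, -1, 20, 5, 19, 7, 8],
    "Flag": [True, True, False, True, False, False, False, False],
    "Set": ["A", "A", "A", "B", "B", "B", "C", "c"],
})

# Per-set count of each flag value; a set missing one value gets a 0 there.
counts = pd.crosstab(df["Set"], df["Flag"])

# Rows to keep per (Set, Flag) group: the smaller of the two counts.
# A limit of 0 drops the whole set.
limit = df["Set"].map(counts.min(axis=1))

# Position of each row inside its (Set, Flag) group, starting at 0.
rank = df.groupby(["Set", "Flag"]).cumcount()

# Keep only the first `limit` rows of every group.
balanced = df[rank < limit]
print(balanced)
```

Because this is a plain boolean mask, no dictionary lookup inside a lambda is needed, and sets with a single flag value are removed by the 0 limit rather than by a separate filtering step.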

Answered 2023-10-18

叮當貓咪

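This may be one possible solution, in 3 steps:

1. Remove all sets that do not have both True and False flags (here, C).
2. Count the number of rows needed for each set-flag combination.
3. Drop all rows beyond that count.

This leads to the following code: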

import pandas as pd

df = pd.DataFrame(data={"data": [0, 30, -1, 20, 5, 19, 7, 8],
                        "Flag": [True, True, False, True, False, False, False, False],
                        "Set": ["A", "A", "A", "B", "B", "B", "C", "C"]})

# 1. removing sets that contain only one of the two flags
reducer = df.groupby("Set")["Flag"].transform("nunique") > 1
df_reduced = df.loc[reducer]

# 2. counting the minimum number of rows per set
counts = df_reduced.groupby(["Set", "Flag"]).count().groupby("Set").min()

# 3. reducing each set and flag to the minimum number of rows
#    (.iloc[0] reads the count positionally instead of relying on [0])
df_equal = df_reduced.groupby(["Set", "Flag"]) \
    .apply(lambda x: x.head(counts.loc[x["Set"].values[0]].iloc[0])) \
    .reset_index(drop=True)
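Whichever approach is used, it may be worth checking that the result really is balanced. Below is a small sketch of such a check; is_balanced is a hypothetical helper of my own, not part of pandas, and it assumes the flag column is boolean.

```python
import pandas as pd


def is_balanced(frame, set_col="Set", flag_col="Flag"):
    """Return True if every set has as many True flags as False flags."""
    flags = frame[flag_col].astype(int)
    n_true = flags.groupby(frame[set_col]).sum()
    n_total = flags.groupby(frame[set_col]).size()
    # Balanced means exactly half of the rows in each set are True.
    return bool((2 * n_true == n_total).all())


original = pd.DataFrame({
    "data": [0, 30, -1, 20, 5, 19, 7, 8],
    "Flag": [True, True, False, True, False, False, False, False],
    "Set": ["A", "A", "A", "B", "B", "B", "C", "C"],
})

# The rows the question expects to survive.
expected = original.loc[[0, 2, 3, 4]]

print(is_balanced(original))
print(is_balanced(expected))
```

The check only confirms that the counts match within each set; it does not verify that as many rows as possible were kept.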

Answered 2023-10-18

ITMISS

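Edit: I came up with an easy-to-understand, concise solution:

1. Just take .cumcount() grouped by Set and Flag.
2. Check whether the combination of Set and the cumcount result from above (cc in the code below) is duplicated. If a combination has no duplicate, that row needs to be removed.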

In[1]:

    data   Flag Set
0      0   True   A
1      8   True   A
2     30   True   A
3      0   True   A
4      8   True   A
5     30   True   A
6     -1  False   A
7    -14  False   A
8     -1  False   A
9    -14  False   A
10    20   True   B
11     5  False   B
12    19  False   B
13     7  False   C
14     8  False   c

Edit 2: Per @Jezrael, the three lines of code further below can be simplified into:

df = (df[df.assign(cc = df.groupby(['Set', 'Flag'])
                   .cumcount()).duplicated(['Set','cc'], keep=False)])

The code below breaks it down further.

df['cc'] = df.groupby(['Set', 'Flag']).cumcount()
s = df.duplicated(['Set','cc'], keep=False)
df = df[s].drop('cc', axis=1)

Out[1]:

    data   Flag Set
0      0   True   A
1      8   True   A
2     30   True   A
3      0   True   A
6     -1  False   A
7    -14  False   A
8     -1  False   A
9    -14  False   A
10    20   True   B
11     5  False   B

Before the rows are dropped, the data looks like this:

df['cc'] = df.groupby(['Set', 'Flag']).cumcount()
df['s'] = df.duplicated(['Set','cc'], keep=False)
# df = df[df['s']].drop('cc', axis=1)

Out[1]:

    data   Flag Set  cc      s
0      0   True   A   0   True
1      8   True   A   1   True
2     30   True   A   2   True
3      0   True   A   3   True
4      8   True   A   4  False
5     30   True   A   5  False
6     -1  False   A   0   True
7    -14  False   A   1   True
8     -1  False   A   2   True
9    -14  False   A   3   True
10    20   True   B   0   True
11     5  False   B   0   True
12    19  False   B   1  False
13     7  False   C   0  False
14     8  False   c   0  False

Then the rows where column s is False are dropped with df = df[df['s']].
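For reuse, the cumcount and duplicated steps could be wrapped in a function. This is a sketch only: balance_by_cumcount and its parameter names are hypothetical, and the trick relies on the flag column holding exactly two distinct values, since a (Set, cc) pair can then appear at most twice.

```python
import pandas as pd


def balance_by_cumcount(frame, set_col="Set", flag_col="Flag"):
    """Keep only rows that have a partner with the opposite flag in the same set."""
    # Number each row within its (set, flag) group: 0, 1, 2, ...
    cc = frame.groupby([set_col, flag_col]).cumcount()
    # A (set, cc) pair that occurs twice has one True row and one False row.
    keep = frame.assign(cc=cc).duplicated([set_col, "cc"], keep=False)
    return frame[keep]


df = pd.DataFrame({
    "data": [0, 30, -1, 20, 5, 19, 7, 8],
    "Flag": [True, True, False, True, False, False, False, False],
    "Set": ["A", "A", "A", "B", "B", "B", "C", "c"],
})

result = balance_by_cumcount(df)
print(result)
```

The helper column cc only lives on the temporary copy made by assign, so the input frame is left unchanged and nothing has to be dropped afterwards.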

Answered 2023-10-18
