
Trying to filter a column in a data frame by applying conditions on other columns

Python

梵蒂岡之花 2023-09-05 17:15:10

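My csv file has 3 columns: account_id, game_variant, no_of_games. The table looks like this:

account_id  game_variant  no_of_games
130         a             2
145         c             1
130         b             4
130         c             1
142         a             3
140         c             2
145         b             5

I want to extract the number of games played in the variants a, b, c, a∩b, b∩c, a∩c and a∩b∩c. I was able to extract the games played in a, b and c individually by grouping by game_variant and summing no_of_games, but I cannot work out the logic for the intersections. Please help me solve this.

data_agg = df.groupby(['game_variant']).agg({'no_of_games':[np.sum]})

Thanks in advance.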


1 Answer

一只甜甜圈

Contributed 1836 experience points, received more than 5 likes

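The solution here returns the intersections on a per-player level. It also uses a defaultdict, which is very convenient in this case. The explanation is given in the inline comments of the code.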

from collections import defaultdict
from itertools import combinations
from pprint import pprint  # only needed for pretty printing of the dictionary

import pandas

# assuming the data frame is in a whitespace-separated file df.csv
df = pandas.read_csv('df.csv', sep=r'\s+')

# group by account_id to get sub-frames that each refer to one account only
data_agg2 = df.groupby('account_id')

# a defaultdict is a dictionary that, when a key is missing, calls the given
# function to create the element. This removes the need to check whether a key
# is already present, or to set up all combinations in advance.
games_played_2 = defaultdict(int)

# iterate over all accounts
for el in data_agg2.groups:
    # extract the sub-dataframe for this account from the grouped object
    tmp = data_agg2.get_group(el)
    # print(tmp)  # you can uncomment this to see each account

    # This is in principle the same loop as in the simple per-variant version.
    # However, as not every player has played all variants, one only has to
    # create the number of combinations necessary for that player.
    for i in range(len(tmp.loc[:, 'no_of_games'])):
        # As game_variant is now a column and not the index, the first part of
        # zip is slightly adapted. This loops over all combinations of variants
        # for the current account.
        for comb, combsum in zip(combinations(tmp.loc[:, 'game_variant'], i+1), combinations(tmp.loc[:, 'no_of_games'].values, i+1)):
            # Each variant combination gets a unique key. comb is sorted, as
            # the variants might not be in alphabetical order. The games this
            # player played in each variant are added to the running total
            # of all players before.
            games_played_2['_'.join(sorted(comb))] += sum(combsum)

pprint(games_played_2)

# returns
>> defaultdict(<class 'int'>,
               {'a': 5,
                'a_b': 6,
                'a_b_c': 7,
                'a_c': 3,
                'b': 9,
                'b_c': 11,
                'c': 4})
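As a small aside on why defaultdict is convenient here, the sketch below accumulates the same totals once with a plain dict and once with defaultdict(int). The (key, games) pairs are made up for illustration and are not taken from the answer's data.

```python
from collections import defaultdict

# Hypothetical (key, games) pairs standing in for per-account combinations.
pairs = [("a", 2), ("b", 4), ("a", 3), ("b_c", 5), ("b_c", 6)]

# With a plain dict, every update needs an explicit "is the key there?" check.
plain = {}
for key, games in pairs:
    if key not in plain:
        plain[key] = 0
    plain[key] += games

# defaultdict(int) calls int() (which returns 0) for a missing key,
# so the accumulation needs no check.
totals = defaultdict(int)
for key, games in pairs:
    totals[key] += games

print(dict(totals))  # {'a': 5, 'b': 4, 'b_c': 11}
```

Both dictionaries end up with the same content; the defaultdict version just drops the membership test.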

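Since you have already extracted the number of games played per variant, you can simply add them up. If you want to do this automatically, you can use itertools.combinations in a loop that iterates over all possible combination lengths: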

from itertools import combinations
from pprint import pprint  # only needed for pretty printing of the dictionary

import numpy as np
import pandas

# assuming the data frame is in a whitespace-separated file df.csv
df = pandas.read_csv('df.csv', sep=r'\s+')

# total number of games per variant
data_agg = df.groupby(['game_variant']).agg({'no_of_games': [np.sum]})

games_played = {}

# i+1 is the combination length: 1, 2, ..., number of variants
for i in range(len(data_agg.loc[:, 'no_of_games'])):
    # pair each combination of variants with the matching combination of totals
    for comb, combsum in zip(combinations(data_agg.index, i+1), combinations(data_agg.loc[:, 'no_of_games'].values, i+1)):
        games_played['_'.join(comb)] = sum(combsum)

pprint(games_played)

>> {'a': array([5], dtype=int64),
>>  'a_b': array([14], dtype=int64),
>>  'a_b_c': array([18], dtype=int64),
>>  'a_c': array([9], dtype=int64),
>>  'b': array([9], dtype=int64),
>>  'b_c': array([13], dtype=int64),
>>  'c': array([4], dtype=int64)}

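combinations(sequence, number) returns an iterator over all combinations of number elements from sequence. So, to get all possible combinations, you have to iterate over all numbers from 1 to len(sequence). That is what the first for loop does.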
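To make the role of the outer loop concrete, here is a minimal sketch that collects every combination of the three variant names for all lengths from 1 to len(sequence). The names variants and all_combos are chosen for this illustration only.

```python
from itertools import combinations

variants = ["a", "b", "c"]

# Collect the combinations of every length from 1 up to len(variants).
all_combos = []
for size in range(1, len(variants) + 1):
    all_combos.extend(combinations(variants, size))

print(all_combos)
# [('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c'), ('b', 'c'), ('a', 'b', 'c')]
```

For three variants this gives 2**3 - 1 = 7 combinations, which matches the seven keys in the result dictionaries.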

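The next for loop is built from two iterators: one over the index of the aggregated data (combinations(data_agg.index, i+1)) and one over the number of games actually played in each variant (combinations(data_agg.loc[:, 'no_of_games'].values, i+1)). Because both produce their combinations in the same order, comb is always a tuple of variants and combsum is the matching tuple of games played in each of those variants. Note that to get all values you have to use .loc[:, 'no_of_games'] and not .loc['no_of_games'], because the latter searches for an index label named 'no_of_games', whereas it is a column name.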
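The two points in this paragraph can be checked on a small frame. The sketch below assumes a simplified one-column data frame holding the per-variant totals from the question's sample data (the frame in the answer has an extra column level from agg); the names totals, games and pairs are illustrative.

```python
from itertools import combinations

import pandas as pd

# Per-variant totals from the sample data, as a simplified one-column frame.
totals = pd.DataFrame(
    {"no_of_games": [5, 9, 4]},
    index=pd.Index(["a", "b", "c"], name="game_variant"),
)

# .loc[:, name] selects a column ...
games = totals.loc[:, "no_of_games"]

# ... whereas .loc[name] looks for a row label and fails here.
try:
    totals.loc["no_of_games"]
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

# Both combinations() calls walk the positions in the same order, so zip
# pairs each tuple of variants with the tuple of their game counts.
pairs = list(zip(combinations(totals.index, 2), combinations(games.values, 2)))
for variant_names, counts in pairs:
    print("_".join(variant_names), sum(counts))
# a_b 14
# a_c 9
# b_c 13
```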

I then set the dictionary key to the joined string of the variant tuple, and the value to the sum of the elements of the games-played tuple.
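The values in the dictionary above are one-element arrays because the aggregated object is a data frame. As a possible variation (not the answer's own code), selecting the column before summing gives a Series, and the values become plain integers. The sketch inlines the question's sample table so that it runs without df.csv; raw and per_variant are names assumed for this example.

```python
from io import StringIO
from itertools import combinations

import pandas as pd

# The question's sample table, inlined so the sketch runs without df.csv.
raw = """account_id game_variant no_of_games
130 a 2
145 c 1
130 b 4
130 c 1
142 a 3
140 c 2
145 b 5
"""
df = pd.read_csv(StringIO(raw), sep=r"\s+")

# Selecting the column before summing yields a Series indexed by variant.
per_variant = df.groupby("game_variant")["no_of_games"].sum()

games_played = {}
for size in range(1, len(per_variant) + 1):
    for comb in combinations(per_variant.index, size):
        # Add up the totals of the variants in this combination.
        games_played["_".join(comb)] = int(per_variant.loc[list(comb)].sum())

print(games_played)
# {'a': 5, 'b': 9, 'c': 4, 'a_b': 14, 'a_c': 9, 'b_c': 13, 'a_b_c': 18}
```

As in the snippet it mirrors, each value is the sum of the per-variant totals over all players, not a per-player intersection; the per-account version earlier in the answer covers that case.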

2023-09-05
