How do I create a box plot from weighted data?

Python

HUX布斯 2022-06-02 12:13:26

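I have the following data: a Name, the number of times that name occurs (Count), and a Score for each name. I want to create a box-and-whisker plot of Score in which each name's Score is weighted by its Count.

The result should be the same as if I had the data in raw (rather than frequency) form. However, I do not want to actually convert the data to that form, because it would balloon in size very quickly.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    data = {
        "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
        "Count": [20, 10, 5, 2, 5],
        "Score": [2, 4, 7, 8, 7]
    }
    df = pd.DataFrame(data)
    print(df)

       Count   Name  Score
    0     20   Sara      2
    1     10   John      4
    2      5   Mark      7
    3      2  Peter      8
    4      5   Kate      7

I am not sure how to tackle this in Python. Any help is appreciated!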

2 Answers

紅顏莎娜

Contributed 1842 experience points, received more than 13 likes

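This answer comes late, but in case it is useful to anyone who lands here:

When your weights are integers, you can use reindex to expand the rows by their counts and then call boxplot directly. I have done this on dataframes of a few thousand rows that expand to a few hundred thousand without memory problems, especially when the reindexing is wrapped in a second function so that the expanded dataframe is never assigned to a variable and does not stay in memory.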

    import pandas as pd
    import seaborn as sns

    data = {
        "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
        "Count": [20, 10, 5, 2, 5],
        "Score": [2, 4, 7, 8, 7]
    }
    df = pd.DataFrame(data)

    def reindex_df(df, weight_col):
        """Expand the dataframe to prepare for resampling.

        The result has one row per count per sample.
        """
        # Repeat each index label as many times as its weight, then renumber.
        df = df.reindex(df.index.repeat(df[weight_col]))
        df.reset_index(drop=True, inplace=True)
        return df

    df = reindex_df(df, weight_col='Count')
    sns.boxplot(x='Name', y='Score', data=df)

Or, if you are worried about memory:

    def weighted_boxplot(df, weight_col):
        # The expanded dataframe exists only for the duration of this call.
        sns.boxplot(x='Name',
                    y='Score',
                    data=reindex_df(df, weight_col=weight_col))

    weighted_boxplot(df, 'Count')
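If even a temporary expanded dataframe is too large, the quartiles can be computed straight from the (value, count) pairs. The following is only a minimal sketch and is not part of seaborn or pandas: `weighted_quantile` is a hypothetical helper, and it uses the inverted-CDF rule (the smallest value whose cumulative count reaches q times the total), which can differ slightly from the linearly interpolated percentiles that matplotlib computes.

```python
from bisect import bisect_left
from itertools import accumulate

def weighted_quantile(values, counts, q):
    """Hypothetical helper: quantile of frequency data, no expansion needed.

    Returns the smallest value whose cumulative count reaches q * total.
    """
    pairs = sorted(zip(values, counts))            # sort by value
    sorted_values = [v for v, _ in pairs]
    cumulative = list(accumulate(c for _, c in pairs))
    target = q * cumulative[-1]                    # position in the raw data
    return sorted_values[bisect_left(cumulative, target)]

scores = [2, 4, 7, 8, 7]
counts = [20, 10, 5, 2, 5]
q1, med, q3 = (weighted_quantile(scores, counts, q) for q in (0.25, 0.5, 0.75))
```

These three numbers could then be supplied as the `q1`, `med` and `q3` entries of a stats dictionary for `Axes.bxp`, so that memory use depends on the number of distinct rows and not on the sum of the counts.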

2022-06-02

白豬掌柜的

Contributed 1893 experience points, received more than 10 likes

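Here are two ways to answer the question. The first is probably the one you would expect, but it is not a good solution for computing the confidence intervals of the median, which matplotlib does with the following code operating on the sample data (see matplotlib/cbook/__init__.py). The second is therefore much better than any custom code, because it is well tested and can serve as a reference to compare custom code against.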

    def boxplot_stats(X, whis=1.5, bootstrap=None, labels=None,
                      autorange=False):

        def _bootstrap_median(data, N=5000):
            # determine 95% confidence intervals of the median
            M = len(data)
            percentiles = [2.5, 97.5]

            bs_index = np.random.randint(M, size=(N, M))
            bsData = data[bs_index]
            estimate = np.median(bsData, axis=1, overwrite_input=True)
            # (excerpt ends here)
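The same bootstrap idea can be applied to frequency data without expanding it first, because drawing values with probability proportional to their counts is equivalent to resampling the expanded data with replacement. What follows is a rough sketch under that assumption, using only the standard library; `weighted_bootstrap_median_ci`, its parameters, and the simple index-based percentile pick are illustrative choices, not matplotlib's implementation.

```python
import random
import statistics

def weighted_bootstrap_median_ci(values, counts, n_boot=2000, seed=0):
    """Hypothetical helper: ~95% percentile-bootstrap interval for the median."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    total = sum(counts)                # size of the (never built) raw sample
    medians = sorted(
        statistics.median(rng.choices(values, weights=counts, k=total))
        for _ in range(n_boot)
    )
    # Pick the bootstrap medians near the 2.5th and 97.5th percentiles.
    lo = medians[int(0.025 * (n_boot - 1))]
    hi = medians[int(0.975 * (n_boot - 1))]
    return lo, hi

ci_low, ci_high = weighted_bootstrap_median_ci([2, 4, 7, 8, 7],
                                               [20, 10, 5, 2, 5])
```

Each resample still holds `total` values at a time, but only one resample is in memory at once, so the full N-by-M index array of the excerpt above is avoided.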

First:

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np

    data = {
        "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
        "Count": [20, 10, 5, 2, 5],
        "Score": [2, 4, 7, 8, 7]
    }
    df = pd.DataFrame(data)
    print(df)

    def boxplot(values, freqs):
        values = np.array(values)
        freqs = np.array(freqs)

        # Sort values and keep the frequencies aligned with them.
        arg_sorted = np.argsort(values)
        values = values[arg_sorted]
        freqs = freqs[arg_sorted]

        count = freqs.sum()
        fx = values * freqs
        mean = fx.sum() / count
        variance = ((freqs * values ** 2).sum() / count) - mean ** 2
        variance = count / (count - 1) * variance  # dof correction for sample variance
        std = np.sqrt(variance)
        minimum = np.min(values)
        maximum = np.max(values)
        cumcount = np.cumsum(freqs)
        print([std, variance])

        # Quartiles: first value whose cumulative count reaches the target.
        Q1 = values[np.searchsorted(cumcount, 0.25 * count)]
        Q2 = values[np.searchsorted(cumcount, 0.50 * count)]
        Q3 = values[np.searchsorted(cumcount, 0.75 * count)]

        # The interquartile range (IQR), also called the midspread, middle 50%
        # or H-spread, is a measure of statistical dispersion equal to the
        # difference between the 75th and 25th percentiles: IQR = Q3 - Q1.
        # It is a trimmed estimator (the 25% trimmed range) and a commonly
        # used robust measure of scale.
        IQR = Q3 - Q1

        # The 1.5 * IQR rule adds 1.5 times the IQR to Q3 and subtracts it
        # from Q1. If the data are normally distributed, whiskers drawn this
        # way should cover about 99.3% of the data.
        # Note: these fences are computed for reference only; the stats
        # below draw the whiskers at the minimum and maximum.
        whishi = Q3 + 1.5 * IQR
        whislo = Q1 - 1.5 * IQR

        stats = [{
            'label': 'Scores',    # tick label for the boxplot
            'mean': mean,         # arithmetic mean value
            'iqr': Q3 - Q1,       # 5.0
            # 'cilo': 2.0,        # lower notch around the median
            # 'cihi': 4.0,        # upper notch around the median
            'whishi': maximum,    # end of the upper whisker
            'whislo': minimum,    # end of the lower whisker
            'fliers': [],         # outliers
            'q1': Q1,             # first quartile (25th percentile)
            'med': Q2,            # 50th percentile
            'q3': Q3              # third quartile (75th percentile)
        }]

        fs = 10  # fontsize
        _, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6), sharey=True)
        axes.bxp(stats)
        axes.set_title('Default', fontsize=fs)
        plt.show()

    boxplot(df['Score'], df['Count'])
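The function above computes the 1.5 * IQR fences but draws the whiskers at the minimum and maximum. One possible way to apply the fences to frequency data is sketched below; `tukey_whiskers` is a hypothetical helper, and it reports each distinct outlying value once, on the assumption that repeated outliers would overlap in the plot anyway.

```python
def tukey_whiskers(values, counts, q1, q3, whis=1.5):
    """Hypothetical helper: whisker ends and outliers for frequency data."""
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Only values that actually occur (count > 0) can end a whisker.
    observed = [v for v, c in zip(values, counts) if c > 0]
    inside = [v for v in observed if lower_fence <= v <= upper_fence]
    # Whiskers stop at the most extreme observed values inside the fences;
    # fall back to the quartiles if nothing lies inside.
    whislo = min(inside) if inside else q1
    whishi = max(inside) if inside else q3
    fliers = sorted(v for v in observed if v < lower_fence or v > upper_fence)
    return whislo, whishi, fliers

# Sample data from the question: every score lies inside the fences.
sample = tukey_whiskers([2, 4, 7, 8, 7], [20, 10, 5, 2, 5], q1=2, q3=7)

# Same data plus one made-up extreme score of 30 (illustration only).
with_outlier = tukey_whiskers([2, 4, 7, 8, 7, 30], [20, 10, 5, 2, 5, 1],
                              q1=2, q3=7)
```

For the sample data the whiskers coincide with the minimum and maximum, which is why the code above still gives a sensible plot; the two approaches differ only once a value falls outside the fences.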

Second:

    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib import cbook  # provides boxplot_stats

    data = {
        "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
        "Count": [20, 10, 5, 2, 5],
        "Score": [2, 4, 7, 8, 7]
    }
    df = pd.DataFrame(data)
    print(df)

    labels = ['Scores']
    # Expand the scores by their counts to get the raw sample.
    data = df['Score'].repeat(df['Count']).tolist()

    # compute the boxplot stats
    stats = cbook.boxplot_stats(data, labels=labels, bootstrap=10000)
    print(['stats :', stats])

    fs = 10  # fontsize
    fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6), sharey=True)
    axes.bxp(stats)
    axes.set_title('Boxplot', fontsize=fs)
    plt.show()

2022-06-02
