
How do I create a box plot from data with weights?

HUX布斯 2022-06-02 12:13:26
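I have the following data: a Name, the number of times each name occurs (Count), and a Score for each name. I want to create a box-and-whisker plot of Score in which each name's Score is weighted by its Count.

The result should be the same as if I had the data in raw (rather than frequency) form. But I don't want to actually convert the data to that form, because it would blow up in size very quickly.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}

df = pd.DataFrame(data)
print(df)

   Count   Name  Score
0     20   Sara      2
1     10   John      4
2      5   Mark      7
3      2  Peter      8
4      5   Kate      7

I'm not sure how to approach this in Python. Any help is appreciated!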

2 Answers

紅顏莎娜

Contributed 1842 experience points · received 13+ upvotes

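This answer is late, but in case it is useful to anyone who comes across it:

When your weights are integers, you can use reindex to expand the rows by count and then call boxplot directly. I have been able to do this on DataFrames of a few thousand rows that expand to a few hundred thousand without memory problems, especially if the reindexed DataFrame is built inside a second function, so the expanded copy is never bound to a variable of its own and can be freed once the plot is drawn.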


import pandas as pd
import seaborn as sns

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}
df = pd.DataFrame(data)


def reindex_df(df, weight_col):
    """Expand the dataframe to prepare for resampling.

    The result has one row per count per sample."""
    df = df.reindex(df.index.repeat(df[weight_col]))
    df.reset_index(drop=True, inplace=True)
    return df


# keep the original df intact so it can still be passed to other helpers
expanded = reindex_df(df, weight_col='Count')

sns.boxplot(x='Name', y='Score', data=expanded)

Or, if you are worried about memory:

def weighted_boxplot(df, weight_col):
    # the expanded frame only lives for the duration of this call
    sns.boxplot(x='Name',
                y='Score',
                data=reindex_df(df, weight_col=weight_col))


weighted_boxplot(df, 'Count')
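As a quick sanity check on the expansion step, the sketch below rebuilds the example frame and confirms that repeating the index by Count gives one row per observation. It uses `df.loc` with `Index.repeat` rather than the `reindex` call from the answer, and the variable names (`expanded`, `rows_per_name`, and so on) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sara", "John", "Mark", "Peter", "Kate"],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7],
})

# Repeat each index label Count times, then pull the matching rows.
expanded = df.loc[df.index.repeat(df["Count"])].reset_index(drop=True)

# One row per observation: the row total should equal the sum of the counts,
# and each name should appear exactly Count times.
total_rows = len(expanded)
rows_per_name = expanded.groupby("Name").size().to_dict()

# The median of the expanded scores is the count-weighted median.
expanded_median = expanded["Score"].median()
```

If `total_rows` does not match `df["Count"].sum()`, the weights were probably not integers or the frame had already been expanded once.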


Answered 2022-06-02

白豬掌柜的

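Contributed 1893 experience points · received 10+ upvotes

There are two ways to answer this question. The first is probably the one you expect, but it is not a good solution for computing confidence intervals of the median, which matplotlib does with the code below (an excerpt from matplotlib/cbook/__init__.py, shown with the example data in mind). The second is therefore much better than any custom code, because it relies on a well-tested routine that other custom code can be compared against.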


def boxplot_stats(X, whis=1.5, bootstrap=None, labels=None,
                  autorange=False):
    def _bootstrap_median(data, N=5000):
        # determine 95% confidence intervals of the median
        M = len(data)
        percentiles = [2.5, 97.5]

        bs_index = np.random.randint(M, size=(N, M))
        bsData = data[bs_index]
        estimate = np.median(bsData, axis=1, overwrite_input=True)
        # (excerpt; the quoted code stops here)
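The excerpt above breaks off before the interval is returned. As a rough, standalone illustration of the same idea (a percentile bootstrap of the median), here is a minimal sketch. It is not matplotlib's code: the function name `bootstrap_median_ci`, the `seed` argument, and the use of `np.random.default_rng` are assumptions made for this example.

```python
import numpy as np


def bootstrap_median_ci(data, n_resamples=5000, seed=0):
    """Percentile-bootstrap 95% confidence interval for the median."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    # Each row of idx selects one resample (with replacement) of the data.
    idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
    medians = np.median(data[idx], axis=1)
    # The 2.5th and 97.5th percentiles of the resampled medians bound the CI.
    lo, hi = np.percentile(medians, [2.5, 97.5])
    return lo, hi


# Raw-form scores rebuilt from the frequency table in the question.
scores = np.repeat([2, 4, 7, 8, 7], [20, 10, 5, 2, 5])
ci_lo, ci_hi = bootstrap_median_ci(scores)
```

Note that the resampling works on the expanded observations, which is why the frequency-only approach below has no cheap way to fill in the notch values (`cilo`, `cihi`).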

First:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}

df = pd.DataFrame(data)
print(df)


def boxplot(values, freqs):
    values = np.array(values)
    freqs = np.array(freqs)
    # sort the values and keep each frequency aligned with its value
    arg_sorted = np.argsort(values)
    values = values[arg_sorted]
    freqs = freqs[arg_sorted]
    count = freqs.sum()
    fx = values * freqs
    mean = fx.sum() / count
    variance = ((freqs * values ** 2).sum() / count) - mean ** 2
    variance = count / (count - 1) * variance  # dof correction for sample variance
    std = np.sqrt(variance)
    cumcount = np.cumsum(freqs)

    print([std, variance])

    # weighted quartiles: first value whose cumulative count reaches the target
    Q1 = values[np.searchsorted(cumcount, 0.25 * count)]
    Q2 = values[np.searchsorted(cumcount, 0.50 * count)]
    Q3 = values[np.searchsorted(cumcount, 0.75 * count)]

    # The interquartile range (IQR), also called the midspread, middle 50% or
    # H-spread, is a measure of statistical dispersion equal to the difference
    # between the 75th and 25th percentiles: IQR = Q3 - Q1. It is a trimmed
    # estimator (the 25% trimmed range) and a commonly used robust measure of
    # scale.
    IQR = Q3 - Q1

    # The whiskers reach at most 1.5 times the IQR above the 75th percentile
    # (Q3) and 1.5 times the IQR below the 25th percentile (Q1). They should
    # include 99.3% of the data if it comes from a normal distribution. So the
    # 6 foot tall man from the example would be inside the whisker, but my
    # 6 foot 2 inch girlfriend would be at the top whisker or past it.
    inside = (values >= Q1 - 1.5 * IQR) & (values <= Q3 + 1.5 * IQR)
    whishi = values[inside].max()  # most extreme value within the upper fence
    whislo = values[inside].min()  # most extreme value within the lower fence
    fliers = values[~inside]       # distinct values beyond the fences

    stats = [{
        'label': 'Scores',  # tick label for the boxplot
        'mean': mean,  # arithmetic mean value
        'iqr': IQR,
        # 'cilo': 2.0,  # lower notch around the median
        # 'cihi': 4.0,  # upper notch around the median
        'whishi': whishi,  # end of the upper whisker
        'whislo': whislo,  # end of the lower whisker
        'fliers': fliers,  # outliers
        'q1': Q1,  # first quartile (25th percentile)
        'med': Q2,  # 50th percentile
        'q3': Q3  # third quartile (75th percentile)
    }]

    fs = 10  # fontsize
    _, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6), sharey=True)
    axes.bxp(stats)
    axes.set_title('Default', fontsize=fs)
    plt.show()


boxplot(df['Score'], df['Count'])
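To see that the cumulative-count lookup really matches the raw-form data, the following sketch compares it with quantiles taken directly from the expanded array. The helper name `weighted_quantile` is hypothetical, and the reference value uses the "inverted CDF" convention (the smallest observation whose rank reaches q * n), which is an assumption chosen to mirror the `searchsorted` lookup. Libraries that interpolate between observations by default may give slightly different numbers on other data.

```python
import math

import numpy as np


def weighted_quantile(values, freqs, q):
    """Smallest value whose cumulative frequency reaches the fraction q."""
    values = np.asarray(values)
    freqs = np.asarray(freqs)
    order = np.argsort(values)
    values, freqs = values[order], freqs[order]
    cum = np.cumsum(freqs)
    return values[np.searchsorted(cum, q * cum[-1])]


values = [2, 4, 7, 8, 7]
freqs = [20, 10, 5, 2, 5]

# Raw-form data, built here only to check the frequency-based result.
expanded = np.sort(np.repeat(values, freqs))

results = {}
for q in (0.25, 0.5, 0.75):
    from_freqs = weighted_quantile(values, freqs, q)
    # Inverted-CDF quantile on the raw data: the ceil(q * n)-th observation.
    from_raw = expanded[math.ceil(q * expanded.size) - 1]
    results[q] = (int(from_freqs), int(from_raw))
```

Each pair in `results` should hold two equal numbers, which is the property the question asks for: the same quartiles as the raw data without ever plotting from the expanded form.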


Second:

import pandas as pd
import matplotlib.cbook as cbook
import matplotlib.pyplot as plt

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}

df = pd.DataFrame(data)
print(df)

labels = ['Scores']

# expand the frequency table into raw observations
scores = df['Score'].repeat(df['Count']).tolist()

# compute the boxplot stats
stats = cbook.boxplot_stats(scores, labels=labels, bootstrap=10000)

print(['stats :', stats])

fs = 10  # fontsize

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6), sharey=True)
axes.bxp(stats)
axes.set_title('Boxplot', fontsize=fs)

plt.show()


Answered 2022-06-02