2 Answers

Answer 1
Here is the formula for the R2 score:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y_i})^2}{\sum_i (y_i - \bar{y})^2}

where \hat{y_i} is the predicted value for the i-th observation y_i and \bar{y} is the mean of all observations.

A negative R2 therefore means that someone who knew only the mean of your y_test sample, and always used it as the "prediction", would be more accurate than your model.
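To make this concrete, here is a minimal sketch using scikit-learn's r2_score (the same metric that LinearRegression.score reports); the toy arrays are made up for illustration:

import numpy as np
from sklearn.metrics import r2_score

y_test = np.array([10.0, 20.0, 30.0])

# Always predicting the mean of y_test yields R2 = 0 exactly...
mean_pred = np.full_like(y_test, y_test.mean())
print(r2_score(y_test, mean_pred))  # 0.0

# ...so any model less accurate than that constant gets a negative R2.
bad_pred = np.array([30.0, 10.0, 20.0])
print(r2_score(y_test, bad_pred))   # -2.0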
Turning to your dataset (thanks to @Prayson W. Daniel for the convenient loading script), let us take a quick look at the data.
df.population.plot()
It looks like a log transform could help: the growth is roughly exponential, so the logarithm of the population should be nearly linear in the year.
import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()
Now let us perform a linear regression using OpenTURNS.
import openturns as ot
sam = ot.Sample(np.array(df_log)) # convert DataFrame to openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)
Output:
Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469
This is an almost exact fit.
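As a usage note, the fitted line can also be evaluated at new points through the result's metamodel; a small sketch (the year 2020 is just an illustrative input, and the output must go through np.exp because the model was fitted on the log scale):

metamodel = linreg_result.getMetaModel()  # the fitted linear function
log_pred = metamodel([2020.0])[0]         # prediction on the log scale
print("Predicted population in 2020 = {:.0f}".format(np.exp(log_pred)))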
EDIT
As suggested by @Prayson W. Daniel, here is the model fit after transforming back to the original scale.
# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)
# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam) # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0])) # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1]) # compute error
r2 = 1.0 - (error**2).mean() / df.population.var() # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))
# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph
Output:
R2 score in original scale = 0.9979032805107133
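A side note in case you run this as a plain script rather than in a notebook: the graph object on the last line will not render by itself. One way to display it (assuming matplotlib is installed) is OpenTURNS' viewer:

import matplotlib.pyplot as plt
from openturns.viewer import View

View(graph)  # renders the OpenTURNS Graph with matplotlib
plt.show()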

Answer 2
Scikit-learn's LinearRegression score uses the R2 score. A negative R2 means that the model fits your data very poorly. Since R2 compares the model's fit against that of the null hypothesis (a horizontal straight line at the mean), R2 is negative whenever the model fits worse than that horizontal line.
R2 = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))
So if SUM((y - ypred)**2) is greater than SUM((y - AVG(y))**2), R2 will be negative.
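A tiny made-up example shows how: with y = [1, 2, 3] (so AVG(y) = 2) and the constant prediction ypred = [3, 3, 3], SUM((y - ypred)**2) = 4 + 1 + 0 = 5 while SUM((y - AVG(y))**2) = 1 + 0 + 1 = 2, giving R2 = 1 - 5/2 = -1.5.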
Causes and how to fix them
Problem 1: You are performing a random split of time-series data. A random split ignores the time dimension.
Solution: preserve the flow of time (see the code below).
Problem 2: The target values are too large.
Solution: unless we use tree-based models, you will have to do some target feature engineering to scale the data into a range the model can learn from.
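As an aside, beyond the single chronological split used below, scikit-learn also provides TimeSeriesSplit for time-ordered cross-validation; a minimal sketch, assuming df is the year/population frame loaded as shown at the end of this answer:

from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on the past and tests on the future, never the reverse.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(df):
    train_fold, test_fold = df.iloc[train_idx], df.iloc[test_idx]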
Here is a code example. Using LinearRegression's default parameters and a log|exp transformation of the target values, my attempt yields an R2 score of about 87%:
import pandas as pd
import numpy as np
# we need to transform/feature-engineer our target:
# np.log and np.exp from numpy make the values learnable
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
# your data, df
# transform year to reference
df = df.assign(ref_year = lambda x: x.year - 1960)
df.population = df.population.astype(int)
split = int(df.shape[0] *.9) #split at 90%, 10%-ish
df = df[['ref_year', 'population']]
train_df = df.iloc[:split]
test_df = df.iloc[split:]
X_train = train_df[['ref_year']]
y_train = train_df.population
X_test = test_df[['ref_year']]
y_test = test_df.population
# regressor
regressor = LinearRegression()
lr = TransformedTargetRegressor(
    regressor=regressor,
    func=np.log, inverse_func=np.exp)
lr.fit(X_train,y_train)
print(lr.score(X_test,y_test))
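To use the fitted pipeline for a forecast, predictions come back on the original scale, since TransformedTargetRegressor applies np.exp to the regressor's output automatically. For instance (2020 is just an illustrative year; ref_year 60 = 2020 - 1960):

future = pd.DataFrame({'ref_year': [60]})
print(lr.predict(future))  # population forecast on the original scale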
For anyone interested in making this better, here is one way to read the dataset:
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
'''))
Result: an R2 score of about 0.87, as noted above.