2 Answers

Answer 1
Here is the formula for the R2 score:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y_i})^2}{\sum_i (y_i - \bar{y})^2}

where \hat{y_i} is the predicted value for the i-th observation y_i and \bar{y} is the mean of all observations.

A negative R2 therefore means that someone who knew only the mean of your y_test sample, and always used it as the "prediction", would be more accurate than your model.
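To make this concrete, here is a minimal sketch using scikit-learn's r2_score (the same metric that LinearRegression.score reports); the toy arrays are made up for illustration:

import numpy as np
from sklearn.metrics import r2_score

y_test = np.array([10.0, 20.0, 30.0])

# Always predicting the mean of y_test yields R2 = 0 exactly...
mean_pred = np.full_like(y_test, y_test.mean())
print(r2_score(y_test, mean_pred))  # 0.0

# ...so any model less accurate than that constant gets a negative R2.
bad_pred = np.array([30.0, 10.0, 20.0])
print(r2_score(y_test, bad_pred))   # -2.0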
Turning to your dataset (thanks to @Prayson W. Daniel for the convenient loading script), let us take a quick look at the data.
df.population.plot()
It looks like a log transform could help: the growth is roughly exponential, so the logarithm of the population should be nearly linear in the year.
import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()
Now let us perform a linear regression using OpenTURNS.
import openturns as ot
sam = ot.Sample(np.array(df_log)) # convert DataFrame to openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)
Output:
Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469
This is an almost exact fit.
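As a usage note, the fitted line can also be evaluated at new points through the result's metamodel; a small sketch (the year 2020 is just an illustrative input, and the output must go through np.exp because the model was fitted on the log scale):

metamodel = linreg_result.getMetaModel()  # the fitted linear function
log_pred = metamodel([2020.0])[0]         # prediction on the log scale
print("Predicted population in 2020 = {:.0f}".format(np.exp(log_pred)))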
EDIT
As suggested by @Prayson W. Daniel, here is the model fit after transforming back to the original scale.
# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)
# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam) # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0])) # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1]) # compute error
r2 = 1.0 - (error**2).mean() / df.population.var() # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))
# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph
Output:
R2 score in original scale = 0.9979032805107133
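A side note in case you run this as a plain script rather than in a notebook: the graph object on the last line will not render by itself. One way to display it (assuming matplotlib is installed) is OpenTURNS' viewer:

import matplotlib.pyplot as plt
from openturns.viewer import View

View(graph)  # renders the OpenTURNS Graph with matplotlib
plt.show()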

Answer 2
Scikit-learn's LinearRegression score uses the R2 score. A negative R2 means that the model fits your data very poorly. Since R2 compares the model's fit against that of the null hypothesis (a horizontal straight line at the mean), R2 is negative whenever the model fits worse than that horizontal line.
R2 = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))
So if SUM((y - ypred)**2) is greater than SUM((y - AVG(y))**2), R2 will be negative.
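A tiny made-up example shows how: with y = [1, 2, 3] (so AVG(y) = 2) and the constant prediction ypred = [3, 3, 3], SUM((y - ypred)**2) = 4 + 1 + 0 = 5 while SUM((y - AVG(y))**2) = 1 + 0 + 1 = 2, giving R2 = 1 - 5/2 = -1.5.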
Causes and how to fix them
Problem 1: You are performing a random split of time-series data. A random split ignores the time dimension.
Solution: preserve the flow of time (see the code below).
Problem 2: The target values are too large.
Solution: unless we use tree-based models, you will have to do some target feature engineering to scale the data into a range the model can learn from.
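As an aside, beyond the single chronological split used below, scikit-learn also provides TimeSeriesSplit for time-ordered cross-validation; a minimal sketch, assuming df is the year/population frame loaded as shown at the end of this answer:

from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on the past and tests on the future, never the reverse.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(df):
    train_fold, test_fold = df.iloc[train_idx], df.iloc[test_idx]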
Here is a code example. Using LinearRegression's default parameters and a log|exp transformation of the target values, my attempt yields an R2 score of about 87%:
import pandas as pd
import numpy as np
# we need to transform/feature-engineer our target:
# np.log and np.exp from numpy make the values learnable
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
# your data, df
# transform year to reference
df = df.assign(ref_year = lambda x: x.year - 1960)
df.population = df.population.astype(int)
split = int(df.shape[0] *.9) #split at 90%, 10%-ish
df = df[['ref_year', 'population']]
train_df = df.iloc[:split]
test_df = df.iloc[split:]
X_train = train_df[['ref_year']]
y_train = train_df.population
X_test = test_df[['ref_year']]
y_test = test_df.population
# regressor
regressor = LinearRegression()
lr = TransformedTargetRegressor(
    regressor=regressor,
    func=np.log, inverse_func=np.exp)
lr.fit(X_train,y_train)
print(lr.score(X_test,y_test))
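To use the fitted pipeline for a forecast, predictions come back on the original scale, since TransformedTargetRegressor applies np.exp to the regressor's output automatically. For instance (2020 is just an illustrative year; ref_year 60 = 2020 - 1960):

future = pd.DataFrame({'ref_year': [60]})
print(lr.predict(future))  # population forecast on the original scale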
For anyone interested in making this better, here is one way to read the dataset:
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
'''))
Result: an R2 score of about 0.87, as noted above.