4 Answers

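Contributor with 1942 experience points, more than 3 upvotes
Your new data may differ substantially from the first dataset you used to train and test the model. Preprocessing techniques and statistical analysis will help you characterize your data and compare the datasets. Poor performance on new data can have several causes, including:
- Your initial dataset is not statistically representative of the larger population (for example, your dataset is an edge case).
- Overfitting: the model was trained too closely on the training data and has learned its idiosyncrasies (noise).
- Different preprocessing methods were applied to the two datasets.
- An imbalanced training dataset. ML techniques work best with balanced datasets (the classes occur in roughly equal proportions in the training set).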
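Two of the causes listed above, class imbalance and a new dataset that looks different from the training data, can be checked with a few lines of code. The following is only a minimal sketch using the standard library: the helper names `class_balance` and `out_of_vocabulary_rate` and the toy texts are invented for illustration, and whitespace tokenization is an assumption, not the asker's actual preprocessing.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the dataset (values sum to 1)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def out_of_vocabulary_rate(train_texts, new_texts):
    """Fraction of tokens in new_texts that never occur in train_texts."""
    vocab = {tok for text in train_texts for tok in text.lower().split()}
    new_tokens = [tok for text in new_texts for tok in text.lower().split()]
    if not new_tokens:
        return 0.0
    unseen = sum(1 for tok in new_tokens if tok not in vocab)
    return unseen / len(new_tokens)


# Toy data, invented for illustration only.
train_texts = ["good movie", "bad movie", "good plot", "good cast"]
train_labels = ["pos", "neg", "pos", "pos"]
new_texts = ["good film", "awful film"]

balance = class_balance(train_labels)
oov = out_of_vocabulary_rate(train_texts, new_texts)
print("Class shares in training data:", balance)
print("Share of unseen tokens in new data:", oov)
```

A strongly skewed class share or a high share of unseen tokens does not prove which cause is at work, but it points to where to look first.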

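Contributor with 1799 experience points, more than 8 upvotes
I ran a comparative study of how different classifiers perform on sentiment analysis. On one particular Twitter dataset I trained logistic regression (LR), naive Bayes (NB), support vector machine (SVM), k-nearest neighbors (KNN), and decision tree (DT) models. On that dataset, logistic regression and naive Bayes gave the best accuracy in every type of test, followed by SVM and then the decision tree. KNN had the lowest accuracy score. Overall, logistic regression and naive Bayes were the two best models for sentiment analysis and prediction.

Sentiment classifier   Accuracy score   RMSE
LR                     78.3541          1.053619
NB                     76.764706        1.064738
SVM                    73.5835          1.074752
DT                     69.2941          1.145234
KNN                    62.9476          1.376589

Feature extraction is critical in these cases.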
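For readers unfamiliar with the two metrics reported above, here is a small worked example of how an accuracy score and an RMSE can be computed from true and predicted labels. It is a sketch under stated assumptions: the answer does not say how the sentiment labels were encoded, so the integer codes below and the function names `accuracy_percent` and `rmse` are hypothetical.

```python
import math


def accuracy_percent(y_true, y_pred):
    """Percentage of predictions that match the true label exactly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; only meaningful for numeric labels."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical integer-coded sentiment labels, invented for illustration.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 0, 2, 2]

acc = accuracy_percent(y_true, y_pred)
err = rmse(y_true, y_pred)
print("Accuracy score:", acc)
print("RMSE:", err)
```

Note that RMSE depends on the numeric distance between label codes, so it penalizes confusing code 0 with code 2 more than confusing 1 with 2. Accuracy treats every mistake the same.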

Contributor with 2039 experience points, more than 8 upvotes
# Import the essentials
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import time

# Load the dataset; 'content' holds the text and 'sentiment' holds the label
df = pd.read_csv('FilePath', header=0)
X = df['content']
y = df['sentiment']

def lrSentimentAnalysis(n):
    # n is the fraction of the data held out as the test set
    # Using CountVectorizer to convert text into tokens/features
    vect = CountVectorizer(ngram_range=(1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=n)
    # Using training data to transform text into counts of features for each message
    vect.fit(X_train)
    X_train_dtm = vect.transform(X_train)
    X_test_dtm = vect.transform(X_test)
    # Hyperparameter grid for the search
    # dual = [True, False]
    max_iter = [100, 110, 120, 130, 140, 150]
    C = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5]
    solvers = ['newton-cg', 'lbfgs', 'liblinear']
    param_grid = dict(max_iter=max_iter, C=C, solver=solvers)
    LR1 = LogisticRegression(penalty='l2')
    grid = GridSearchCV(estimator=LR1, param_grid=param_grid, cv=10, n_jobs=-1)
    grid_result = grid.fit(X_train_dtm, y_train)
    # Summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    y_pred = grid_result.predict(X_test_dtm)
    accuracy = metrics.accuracy_score(y_test, y_pred) * 100
    print('Accuracy Score: ', accuracy, '%')
    # print('Confusion Matrix: ',metrics.confusion_matrix(y_test,y_pred))
    # print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
    # print('MSE:', metrics.mean_squared_error(y_test, y_pred))
    # RMSE requires the sentiment labels to be numeric
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    best_params = grid_result.best_estimator_.get_params()
    return [n, accuracy, best_params['max_iter'], best_params['C'], best_params['solver']]

def darwConfusionMetrix(accList):
    # accList is one result row: [test_size, accuracy, max_iter, C, solver]
    # Using CountVectorizer to convert text into tokens/features
    vect = CountVectorizer(ngram_range=(1, 1))
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=accList[0])
    # Using training data to transform text into counts of features for each message
    vect.fit(X_train)
    X_train_dtm = vect.transform(X_train)
    X_test_dtm = vect.transform(X_test)
    # Refit Logistic Regression with the best hyperparameters found earlier
    LR = LogisticRegression(penalty='l2', max_iter=accList[2], C=accList[3], solver=accList[4])
    LR.fit(X_train_dtm, y_train)
    y_pred = LR.predict(X_test_dtm)
    # creating a heatmap for confusion matrix
    # Use every label seen in y_test or y_pred so the matrix and its axes always match
    labels = np.unique(np.concatenate([y_test, y_pred]))
    data = metrics.confusion_matrix(y_test, y_pred, labels=labels)
    df_cm = pd.DataFrame(data, columns=labels, index=labels)
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    plt.figure(figsize=(10, 7))
    sns.set(font_scale=1.4)  # for label size
    sns.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})  # font size
    fig0 = plt.gcf()
    fig0.show()
    fig0.savefig('FilePath', dpi=100)

def findModelWithBestAccuracy(accList):
    # Each entry is [test_size, accuracy, max_iter, C, solver]; pick the highest accuracy
    accuracyList = []
    for item in accList:
        accuracyList.append(item[1])
    N = accuracyList.index(max(accuracyList))
    print('Best Model:', accList[N])
    return accList[N]

accList = []
print('Logistic Regression')
print('grid search method for hyperparameter tuning (accuracy by cross validation) ')
# Try test-set fractions 0.2, 0.3, 0.4, 0.5 and 0.6
for i in range(2, 7):
    n = i / 10.0
    print("\nsplit ", i - 1, ": n=", n)
    accList.append(lrSentimentAnalysis(n))
# Plot the confusion matrix for the split with the best test accuracy
darwConfusionMetrix(findModelWithBestAccuracy(accList))

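Contributor with 1794 experience points, more than 8 upvotes
Preprocessing is an important part of building a classifier that performs well. When there is such a large gap between training-set and test-set performance, it is very likely that something went wrong in the preprocessing of your (test set) data.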
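One common way to rule out a preprocessing mismatch is to bundle the vectorizer and the classifier into a single scikit-learn `Pipeline`, so that new text always passes through exactly the steps that were fitted on the training data. The sketch below assumes scikit-learn is installed. The toy sentences, labels, and step names `vect` and `clf` are invented for illustration and are not taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only (1 = positive, 0 = negative).
train_texts = [
    "I love this phone",
    "great service, very happy",
    "I hate the delay",
    "terrible support, very sad",
]
train_labels = [1, 1, 0, 0]
new_texts = ["LOVE it, great!", "hate it, terrible"]

# The vectorizer is fitted on the training text only; predict() reuses the
# same fitted vocabulary and lowercasing for any new text.
model = Pipeline([
    ("vect", CountVectorizer(lowercase=True, ngram_range=(1, 1))),
    ("clf", LogisticRegression(max_iter=200)),
])
model.fit(train_texts, train_labels)

predictions = model.predict(new_texts)
print(list(predictions))
```

Because the raw strings go into `model.predict`, there is no separate test-set preprocessing step that could drift away from what was done at training time.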
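You can also use a classifier without doing any programming.
You can visit the Insight Classifiers web service and try building one for free first.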