
TfidfVectorizer and SelectKBest error

www wrote on 2023-09-05 20:23:04

I am trying to follow this tutorial to do some sentiment analysis, and I am fairly sure my code is identical to it so far. However, my BOW values differ substantially from the tutorial's.

https://www.tensorscience.com/nlp/sentiment-analysis-tutorial-in-python-classifying-reviews-on-movies-and-products

Here is my code so far.

import nltk
import pandas as pd
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def openFile(path):
    #param path: path/to/file.ext (str)
    #Returns contents of file (str)
    with open(path) as file:
        data = file.read()
    return data

imdb_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/imdb_labelled.txt')
amzn_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/amazon_cells_labelled.txt')
yelp_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/yelp_labelled.txt')

datasets = [imdb_data, amzn_data, yelp_data]

combined_dataset = []
# separate samples from each other
for dataset in datasets:
    combined_dataset.extend(dataset.split('\n'))

# separate each label from each sample
dataset = [sample.split('\t') for sample in combined_dataset]

df = pd.DataFrame(data=dataset, columns=['Reviews', 'Labels'])
df = df[df["Labels"].notnull()]
df = df.sample(frac=1)

labels = df['Labels']
vectorizer = TfidfVectorizer(min_df=15)
bow = vectorizer.fit_transform(df['Reviews'])
len(vectorizer.get_feature_names())

selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
vectorizer = TfidfVectorizer(min_df=15, vocabulary=selected_features)
bow = vectorizer.fit_transform(df['Reviews'])
bow

This is my result. [image]

This is the tutorial's result. [image]

I have been trying to work out what might be going wrong, but have not made any progress yet.

1 Answer

LEATH

The problem is that you are passing column indices as the vocabulary. Pass the actual vocabulary terms instead.
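To see why the indices fail, here is a minimal sketch on a made-up four-sentence corpus. The corpus, the selected array (a stand-in for what get_support(indices=True) returns), and all variable names are assumptions for illustration, not the asker's data. As far as I can tell, a vectorizer given integers as its vocabulary looks for tokens equal to those integers, finds none, and quietly returns a matrix of the right shape with nothing stored in it.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the review data from the question.
docs = [
    "good movie great acting",
    "bad movie poor acting",
    "great phone good battery",
    "poor phone bad battery",
]

base = TfidfVectorizer()
base.fit(docs)
# Terms in column order, rebuilt from the fitted vocabulary_ mapping.
terms = sorted(base.vocabulary_, key=base.vocabulary_.get)

# Stand-in for the array returned by get_support(indices=True).
selected = np.array([1, 3])

# Integer column indices used as the vocabulary: no text token equals an integer.
by_index = TfidfVectorizer(vocabulary=selected).fit_transform(docs)

# The terms found at those indices used as the vocabulary.
by_term = TfidfVectorizer(vocabulary=[terms[i] for i in selected]).fit_transform(docs)

# Same shape in both cases; compare the number of stored elements.
print(by_index.shape, by_index.nnz)
print(by_term.shape, by_term.nnz)
```

Both matrices have one column per vocabulary entry, so the shape alone does not reveal the mistake. The stored-element count does.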


Try this:


import numpy as np

selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
# Map the selected column indices back to the terms they stand for.
# On scikit-learn 1.0 and later, use get_feature_names_out();
# get_feature_names() was removed in 1.2.
vocabulary = np.array(vectorizer.get_feature_names())[selected_features]
vectorizer = TfidfVectorizer(min_df=15, vocabulary=vocabulary) # you need to supply a real vocab here
bow = vectorizer.fit_transform(df['Reviews'])
bow

<3000x200 sparse matrix of type '<class 'numpy.float64'>'
    with 12916 stored elements in Compressed Sparse Row format>
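As a side note, here is a hedged sketch of two variations that are not part of the answer above: a feature-name lookup that should work on both older and newer scikit-learn releases, and slicing the already-computed matrix with the selector's transform() instead of refitting the vectorizer. The toy corpus, the labels, and k=4 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: "1" marks a positive review, "0" a negative one.
docs = [
    "good movie great acting",
    "bad movie poor acting",
    "great phone good battery",
    "poor phone bad battery",
]
labels = ["1", "0", "1", "0"]

vectorizer = TfidfVectorizer()
bow = vectorizer.fit_transform(docs)

selector = SelectKBest(chi2, k=4).fit(bow, labels)
mask = selector.get_support()  # boolean mask over the columns of bow

# Prefer get_feature_names_out() where it exists, else fall back to the old name.
get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
kept_terms = np.asarray(get_names())[mask]

# Variation A: keep the selected columns of the matrix that already exists.
bow_sliced = selector.transform(bow)

# Variation B: refit on the kept terms, as in the answer above.
bow_refit = TfidfVectorizer(vocabulary=kept_terms).fit_transform(docs)

print(sorted(kept_terms))
print(bow_sliced.shape, bow_refit.shape)
```

The two matrices have the same shape but not the same values, because the refit vectorizer normalizes each row over the reduced vocabulary only. Which of the two matches the tutorial is something to check against its code.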


Answered 2023-09-05