

ValueError: Length of values does not match length of index, in a nested loop


九州編程 2022-11-29 17:05:41
I am trying to remove the stopwords from every row of a column. I have already word_tokenized the column with nltk, so each row is now a list of tokens. I tried to remove the stopwords with this nested list comprehension, but it raises ValueError: Length of values does not match length of index. How can I fix this?

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
                   encoding="latin-1")
data = data[['v1', 'v2']]
data = data.rename(columns={'v1': 'label', 'v2': 'text'})

stopwords = set(stopwords.words('english'))

data['text'] = data['text'].str.lower()
data['new'] = [word_tokenize(row) for row in data['text']]
data['new'] = [word for new in data['new'] for word in new if word not in stopwords]

My text data:

data['text'].head(5)
Out[92]:
0    go until jurong point, crazy.. available only ...
1                        ok lar... joking wif u oni...
2    free entry in 2 a wkly comp to win fa cup fina...
3    u dun say so early hor... u c already then say...
4    nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object

After I word_tokenized it with nltk:

data['new'].head(5)
Out[89]:
0    [go, until, jurong, point, ,, crazy.., availab...
1             [ok, lar, ..., joking, wif, u, oni, ...]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, ..., u, c, alrea...
4    [nah, i, do, n't, think, he, goes, to, usf, ,,...
Name: new, dtype: object

Traceback:

runfile('D:/python projects/NLP_nltk_first.py', wdir='D:/python projects')
Traceback (most recent call last):
  File "D:\python projects\NLP_nltk_first.py", line 36, in <module>
    data['new'] = [new for new in data['new'] for word in new if word not in stopwords]
  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3487, in __setitem__
    self._set_item(key, value)

1 Answer

子衿沉夜

1828 contributions · 3+ upvotes

Read the error message carefully:


ValueError: Length of values does not match length of index


In this case, the "values" are whatever appears on the right-hand side of the =:


values = [word for new in data['new'] for word in new if word not in stopwords]

The "index" in this case is the DataFrame's row index:


index = data.index

Here index always has the same number of rows as the DataFrame itself.


The problem is that values is too long for index, i.e. the flattened list has more elements than the DataFrame has rows. If you check your code, this should be obvious. If you still can't see the problem, try the following:


data['text_tokenized'] = [word_tokenize(row) for row in data['text']]

values = [word for new in data['text_tokenized'] for word in new if word not in stopwords]

print('N rows:', data.shape[0])
print('N new values:', len(values))
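To see the mismatch concretely, here is a minimal sketch with a tiny made-up two-row frame (the sentences and the stopword set are invented, and str.split stands in for word_tokenize so no NLTK data download is needed):

```python
import pandas as pd

stop = {"the", "a"}
data = pd.DataFrame({"text": ["the cat sat", "a dog ran"]})
data["new"] = [row.split() for row in data["text"]]  # stand-in for word_tokenize

# Flattening produces one value per *token*, not one per *row*:
values = [w for toks in data["new"] for w in toks if w not in stop]
print("N rows:", data.shape[0])      # N rows: 2
print("N new values:", len(values))  # N new values: 4
```

Assigning those 4 values to a 2-row column is exactly what raises the ValueError above.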

As for how to fix the problem, that depends entirely on what you are trying to achieve. One option is to "explode" the data (note also the use of .map instead of a list comprehension):


data['text_tokenized'] = data['text'].map(word_tokenize)


# Flatten the token lists without a nested list comprehension
tokens_flat = data['text_tokenized'].explode()

# Join your labels w/ your flattened tokens, if desired
data_flat = data[['label']].join(tokens_flat)

# Add a 2nd index level to track token appearance order,
# which might make your life easier (group data_flat, not data,
# so the counter runs over the repeated row labels)
data_flat['token_id'] = data_flat.groupby(level=0).cumcount()
data_flat = data_flat.set_index('token_id', append=True)
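Under the same assumptions as before (a tiny invented two-row frame, str.split standing in for word_tokenize), the explode approach can be sketched end to end:

```python
import pandas as pd

data = pd.DataFrame({"label": ["ham", "spam"],
                     "text": ["ok see you", "win free cash"]})
data["text_tokenized"] = data["text"].map(str.split)  # stand-in for word_tokenize

# One row per token; the original row label is kept in the index
tokens_flat = data["text_tokenized"].explode()
data_flat = data[["label"]].join(tokens_flat)

# token_id counts tokens within each original row: 0, 1, 2, ...
data_flat["token_id"] = data_flat.groupby(level=0).cumcount()
data_flat = data_flat.set_index("token_id", append=True)

print(data_flat)
```

The result has one row per token with a (row, token_id) MultiIndex, so the "values" and the index now have the same length by construction, and stopwords can be dropped with an ordinary boolean mask on the flat frame.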

As an unrelated tip, you can make the CSV loading more efficient by reading only the columns you need, like this:


data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
                   encoding="latin-1",
                   usecols=["v1", "v2"])

