
ValueError: Length of values does not match length of index in nested loop

Python

九州編程 2022-11-29 17:05:41

I am trying to remove stop words from every row of a column. The column contains many rows of text, and I have already run nltk's word_tokenize on it, so each row now holds a list of tokens. I tried to remove the stop words with this nested list comprehension, but it fails with ValueError: Length of values does not match length of index. How can I fix this?

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data = pd.read_csv(r"D:/python projects/read_files/spam.csv", encoding = "latin-1")
data = data[['v1','v2']]
data = data.rename(columns = {'v1': 'label', 'v2': 'text'})

stopwords = set(stopwords.words('english'))

data['text'] = data['text'].str.lower()
data['new'] = [word_tokenize(row) for row in data['text']]
data['new'] = [word for new in data['new'] for word in new if word not in stopwords]

My text data:

data['text'].head(5)
Out[92]:
0    go until jurong point, crazy.. available only ...
1    ok lar... joking wif u oni...
2    free entry in 2 a wkly comp to win fa cup fina...
3    u dun say so early hor... u c already then say...
4    nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object

After tokenizing with nltk's word_tokenize:

data['new'].head(5)
Out[89]:
0    [go, until, jurong, point, ,, crazy.., availab...
1    [ok, lar, ..., joking, wif, u, oni, ...]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, ..., u, c, alrea...
4    [nah, i, do, n't, think, he, goes, to, usf, ,,...
Name: new, dtype: object

Traceback:

runfile('D:/python projects/NLP_nltk_first.py', wdir='D:/python projects')
Traceback (most recent call last):
  File "D:\python projects\NLP_nltk_first.py", line 36, in <module>
    data['new'] = [new for new in data['new'] for word in new if word not in stopwords]
  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3487, in __setitem__
    self._set_item(key, value)


1 Answer

子衿沉夜


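Read the error message carefully:

ValueError: Length of values does not match length of index

In this case, the "values" are whatever is on the right-hand side of the =: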

values = [word for new in data['new'] for word in new if word not in stopwords]

The "index" in this case is the DataFrame's row index:

index = data.index

Here index always has the same number of entries as the DataFrame has rows.
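As a minimal sketch of that rule, assigning a list whose length differs from the number of rows raises the same error. The two-row frame and its contents are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical two-row frame; the real data has one row per message.
df = pd.DataFrame({"text": ["free entry now", "ok lar"]})

message = None
try:
    # Five values for a two-row index: pandas refuses the assignment.
    df["new"] = ["free", "entry", "now", "ok", "lar"]
except ValueError as exc:
    message = str(exc)

print(len(df.index), message)
```

The exact wording of the message varies slightly between pandas versions, but it always reports that the values and the index differ in length.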

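The problem is that values is too long for index, that is, it holds more items than the DataFrame has rows. The comprehension loops over every token of every row and collects them into one flat list, so it produces one value per token rather than one per row. If you still cannot see the problem, try the following: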

data['text_tokenized'] = [word_tokenize(row) for row in data['text']]

values = [word for new in data['text_tokenized'] for word in new if word not in stopwords]

print('N rows:', data.shape[0])

print('N new values:', len(values))
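The same comparison can be sketched with plain lists, which also shows one possible fix if the goal is to keep a filtered token list per row (the question suggests this but does not say so outright). The toy tokens and the tiny stop-word set below are assumptions standing in for the tokenized column and nltk's English list.

```python
# Toy stand-ins for the tokenized column and nltk's stop-word list (assumptions).
rows = [["go", "until", "jurong", "point"], ["ok", "lar", "joking"]]
stop_words = {"until", "ok"}

# A single flat comprehension yields one value per surviving token.
flat = [word for tokens in rows for word in tokens if word not in stop_words]

# An inner comprehension per row yields one filtered list per row.
per_row = [[word for word in tokens if word not in stop_words] for tokens in rows]

print(len(rows), len(flat), len(per_row))
```

Because per_row has exactly one entry per input row, a list built this way has the right length to be assigned back as a DataFrame column.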

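As for how to fix it, that depends entirely on what you are trying to achieve. One option is to "explode" the data (also note the use of .map instead of a list comprehension):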

data['text_tokenized'] = data['text'].map(word_tokenize)

# Flatten the token lists without a nested list comprehension

tokens_flat = data['text_tokenized'].explode()

# Join your labels w/ your flattened tokens, if desired

data_flat = data[['label']].join(tokens_flat)

# Add a 2nd index level to track token appearance order,

# might make your life easier

data_flat['token_id'] = data_flat.groupby(level=0).cumcount()

data_flat = data_flat.set_index('token_id', append=True)
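Here is a self-contained sketch of the exploded layout that also drops stop words, a step the snippet above leaves out. It makes several assumptions so that it runs without nltk data: the two-row frame is invented, str.split stands in for word_tokenize, and stop_words is a tiny hand-written set.

```python
import pandas as pd

# Invented sample rows; the real data comes from spam.csv.
data = pd.DataFrame({
    "label": ["ham", "spam"],
    "text": ["ok lar joking", "free entry in a comp"],
})
stop_words = {"in", "a"}  # assumption: stand-in for nltk's English stop words

# str.split stands in for word_tokenize so the sketch runs without nltk data.
data["text_tokenized"] = data["text"].str.split()

# One row per token, keeping the original row label as the index.
tokens_flat = data["text_tokenized"].explode()

# Drop stop words after flattening.
tokens_flat = tokens_flat[~tokens_flat.isin(stop_words)]

# Attach labels, then number the tokens within each original row.
data_flat = data[["label"]].join(tokens_flat)
data_flat["token_id"] = data_flat.groupby(level=0).cumcount()
data_flat = data_flat.set_index("token_id", append=True)

print(data_flat)
```

Filtering after explode keeps the operation vectorized. Note that with this left join, a row whose tokens are all stop words would still appear once with a missing token.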

As an unrelated tip, you can make your CSV processing more efficient by loading only the columns you need, like so:

data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
                   encoding="latin-1",
                   usecols=["v1", "v2"])
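To try usecols without the original file, the sketch below reads a small in-memory CSV. The CSV text, including the extra column, is invented for illustration; only the v1 and v2 column names come from the question.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for spam.csv, with one unused column.
csv_text = "v1,v2,extra\nham,ok lar,x\nspam,free entry,y\n"

# Only the listed columns are parsed; "extra" is skipped at read time.
data = pd.read_csv(io.StringIO(csv_text), usecols=["v1", "v2"])
data = data.rename(columns={"v1": "label", "v2": "text"})

print(list(data.columns))
```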

2022-11-29
