2 Answers

Contributed 1934 experience points · earned 2+ upvotes
If you are working with a dataframe, I suggest you store the result of each preprocessing step in a new column. That way you can always inspect the output, and you can always create a list of lists to use as input for a model afterwards with a single line of code. Another advantage of this approach is that you can easily visualise the preprocessing pipeline and add extra steps where needed without getting confused.
Regarding your code, it can be optimised (for example, you can perform stopword removal and tokenisation at the same time), and the steps you perform are a bit muddled. For instance, you lemmatise multiple times, using different libraries as well, which makes no sense. In my opinion nltk works just fine; personally, I use other libraries to preprocess tweets only to handle emojis, URLs and hashtags, all the things specifically related to tweets.
# imports needed for this snippet
import pandas as pd
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# define a new column to store the processed tweets
df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index)

tknzr = TweetTokenizer()
lmtzr = WordNetLemmatizer()

stop_words = set(stopwords.words("english"))
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

# iterate through each tweet
for ind, row in df_tweet1.iterrows():
    # get the initial tweet as a string: 'This is the initial tweet'
    tweet = row['Tweet Content']
    # tokenisation, stopword removal and lemmatisation all at once
    # out: ['initial', 'tweet']
    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet)
             if i.lower() not in new_stopwords_list]
    # POS tagging; no need to lemmatise again after this
    # out: [('initial', 'JJ'), ('tweet', 'NN')]
    tweet = nltk.pos_tag(tweet)
    # save the processed tweet into the new column
    df_tweet1.loc[ind, 'Tweet Content Clean'] = tweet
So, in total, you only need 4 lines: one to get the tweet string, two to preprocess the text, and another to store the tweet. You can add extra processing steps, paying attention to the output of each step (e.g. tokenisation returns a list of strings, while POS tagging returns a list of tuples, which is the reason you were running into trouble).
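To make that difference concrete, here is what those two intermediate outputs look like (the sample sentence is made up, and the exact tags are approximate):

tokens = tknzr.tokenize('This is the initial tweet')
# out: ['This', 'is', 'the', 'initial', 'tweet']  -- a list of strings
tagged = nltk.pos_tag(tokens)
# out (roughly): [('This', 'DT'), ('is', 'VBZ'), ('the', 'DT'), ('initial', 'JJ'), ('tweet', 'NN')]  -- a list of tuples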
If you want, you can then create a list of lists containing all the tweets in the dataframe:
# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
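As a sketch of the extra tweet-specific handling mentioned earlier (URLs and hashtags), a plain-regex version could be slotted in before tokenisation. The answer does not say which library it uses for this, so this is only an illustration (the clean_tweet name is made up, and emoji handling is omitted since it would need a dedicated library):

import re

def clean_tweet(raw_tweet):
    # drop URLs (http/https links)
    without_urls = re.sub(r'https?://\S+', '', raw_tweet)
    # drop user mentions
    without_mentions = re.sub(r'@\w+', '', without_urls)
    # keep the hashtag word but drop the '#' sign
    return re.sub(r'#(\w+)', r'\1', without_mentions)

# inside the loop above, before tokenisation:
# tweet = clean_tweet(row['Tweet Content'])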

Contributed 1829 experience points · earned 13+ upvotes
The first part is that new_test is a list of strings. lemmatize_sentence needs a single string, so passing new_test will raise an error like the one you got. You have to pass each string separately and then create a list from each lemmatized string. So:
text = new_test
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]
should create a list of lemmatized sentences.
Actually, I once did a project that looks similar to the one you are working on. I made the following function to lemmatize strings:
import re
import lemmy

def remove_stopwords(lst):
    with open('stopwords.txt', 'r') as sw:
        # read the stopwords file
        stopwords = sw.read().split('\n')
    return [word for word in lst if word not in stopwords]

def lemmatize_strings(body_text, language='da', remove_stopwords_=True):
    """Function to lemmatize a string or a list of strings, i.e. reduce words to their lemmas. Also removes punctuation.

    -- body_text: string or list of strings
    -- language: language of the passed string(s), e.g. 'en', 'da' etc.
    """
    if isinstance(body_text, str):
        body_text = [body_text]  # convert a single string to a list so both cases are handled the same way

    if not hasattr(body_text, '__iter__'):
        raise TypeError('Passed argument should be a sequence.')

    lemmatizer = lemmy.load(language)  # load the lemmatizing dictionary

    lemma_list = []  # list to store each lemmatized string

    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+')  # all letters and digits, i.e. all possible words

    for string in body_text:
        # remove punctuation and split into words
        matches = word_regex.findall(string)

        # lowercase words unless they are all caps
        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]

        # remove words that are in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        # lemmatize each word and choose the shortest of the suggested lemmatizations
        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]

        # remove stopwords again, since lemmatization may have produced new stopwords
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        lemma_list.append(' '.join(lemmatized_string))

    # return a list if more than one string was processed, else a single string
    return lemma_list if len(lemma_list) > 1 else lemma_list[0]
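A quick usage sketch (assumptions: the lemmy package is installed, a stopwords.txt file with one word per line exists in the working directory, and the Danish sample sentences are made up):

# single string in -> single string out
print(lemmatize_strings('Kattene løber på gaden'))

# list of strings in -> list of lemmatized strings out
print(lemmatize_strings(['Kattene løber på gaden', 'Hunden sover i haven'], language='da'))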
You're welcome to take a look if you like, but don't feel obligated. I would be really happy if it helps you get some ideas; I spent a lot of time trying to figure it out myself!
Let me know :-)