
How to lemmatize with NLTK or pywsd


三國紛爭 2022-09-13 19:43:44
I know my explanation is long, but I feel it's necessary. Hopefully someone out there has the patience and a helpful soul :) I'm doing a sentiment analysis project atm and I'm stuck at the preprocessing part. I imported the csv file, turned it into a dataframe, and converted the variables/columns to the right data types. Then I did the tokenization like this, picking the variable to tokenize (the tweet content) from the dataframe (df_tweet1):

# Tokenization
tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]

for i in tokenized_sents:
    print(i)

The output is a list of lists containing the words (tokens). Then I perform stop word removal:

# Stop word removal
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

# add words that aren't in the NLTK stopwords list
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

clean_sents = []
for m in tokenized_sents:
    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]
    clean_sents.append(stop_m)

The output is the same, but without the stop words. The next two steps confuse me (POS tagging and lemmatization). I tried two things:

1) Turning the previous output into a list of strings

new_test = [' '.join(x) for x in clean_sents]

because I thought that would let me do both steps at once with this code:

from pywsd.utils import lemmatize_sentence

text = new_test
lemm_text = lemmatize_sentence(text, keepWordPOS=True)

I got this error: TypeError: expected string or bytes-like object

2) Doing POS tagging and lemmatization separately. First POS, using clean_sents as input:

# PART-OF-SPEECH
def process_content(clean_sents):
    try:
        tagged_list = []
        for lst in clean_sents[:500]:
            for item in lst:
                words = nltk.word_tokenize(item)
                tagged = nltk.pos_tag(words)
                tagged_list.append(tagged)
        return tagged_list
    except Exception as e:
        print(str(e))

output_POS_clean_sents = process_content(clean_sents)

The output is a list of lists of words with a tag attached. Then I want to lemmatize this output, but how? I tried two modules, but both gave me errors:

from pywsd.utils import lemmatize_sentence

lemmatized = [[lemmatize_sentence(output_POS_clean_sents) for word in s]
              for s in output_POS_clean_sents]

# AND

from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in s]
              for s in output_POS_clean_sents]
print(lemmatized)

The errors are, respectively:

TypeError: expected string or bytes-like object
AttributeError: 'tuple' object has no attribute 'endswith'

2 Answers

撒科打諢


If you're working with a dataframe, I suggest you store the result of each preprocessing step in a new column. That way you can always inspect the output, and you can always build a list of lists to feed your model afterwards in a single line of code. Another advantage of this approach is that the preprocessing pipeline is easy to visualise, and you can add further steps when needed without getting confused.


As for your code, it can be optimised (for example, you can do stop word removal and tokenisation at the same time), and the steps you perform are a bit muddled. For instance, you lemmatise multiple times, using different libraries too, which makes no sense. In my opinion nltk works perfectly well; personally I use other libraries to preprocess tweets only to handle emojis, URLs and hashtags, all the things specific to tweets.


# I won't write all the imports, you get them from your code

# define new column to store the processed tweets

df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index)


tknzr = TweetTokenizer()

lmtzr = WordNetLemmatizer()


stop_words = set(stopwords.words("english"))

new_stopwords = ['!', ',', ':', '&', '%', '.', '’']

new_stopwords_list = stop_words.union(new_stopwords)


# iterate through each tweet

for ind, row in df_tweet1.iterrows():


    # get initial tweet: ['This is the initial tweet']

    tweet = row['Tweet Content']


    # tokenisation, stopwords removal and lemmatisation all at once

    # out: ['initial', 'tweet']

    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]


    # pos tag, no need to lemmatise again after.

    # out: [('initial', 'JJ'), ('tweet', 'NN')]

    tweet = nltk.pos_tag(tweet)


    # save processed tweet into the new column

    df_tweet1.loc[ind, 'Tweet Content Clean'] = tweet

So overall you only need 4 lines: one to get the tweet string, two to preprocess the text, and one more to store the tweet. You can add extra processing steps, keeping an eye on the output of each one (for example, tokenisation returns a list of strings, while POS tagging returns a list of tuples; that is the reason you were running into trouble).
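
If you ever do want to lemmatise after POS tagging instead, note that WordNetLemmatizer cannot consume the (word, tag) tuples directly; that is exactly where the 'tuple' object has no attribute 'endswith' error comes from. A rough sketch of one way to bridge the two, mapping Penn Treebank tags onto WordNet POS constants (the penn_to_wordnet helper is my own illustration, not part of nltk):

from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()

def penn_to_wordnet(tag):
    # map Penn Treebank tags (JJ..., VB..., RB...) onto WordNet POS constants
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # WordNetLemmatizer's default POS

tagged = [('initial', 'JJ'), ('tweets', 'NNS')]

# unpack each (word, tag) tuple instead of passing the tuple whole
# out: ['initial', 'tweet']
lemmas = [lmtzr.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]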


If you want, you can then create a list of lists containing all the tweets in the dataframe:


# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]

all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
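
Equivalently, pandas can hand you that list directly:

all_tweets = df_tweet1['Tweet Content Clean'].tolist()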


烙印99


The first part: new_test is a list of strings. lemmatize_sentence needs a string, so passing new_test will raise an error like the one you got. You have to pass each string separately and then build a list from each lemmatized string. So:


text = new_test

lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]

This should create a list of lemmatized sentences.
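
If you want to sanity-check the call on a single string first, you could try something like this (the exact return shape with keepWordPOS=True can differ between pywsd versions, so print it and see):

print(lemmatize_sentence('My cats were running', keepWordPOS=True))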


Actually, I once worked on a project that seems similar to yours. I wrote the following function to lemmatize strings:


import lemmy, re


def remove_stopwords(lst):

    with open('stopwords.txt', 'r') as sw:

        #read the stopwords file 

        stopwords = sw.read().split('\n')

        return [word for word in lst if not word in stopwords]


def lemmatize_strings(body_text, language = 'da', remove_stopwords_ = True):

    """Function to lemmatize a string or a list of strings, i.e. remove prefixes. Also removes punctuations.


    -- body_text: string or list of strings

    -- language: language of the passed string(s), e.g. 'en', 'da' etc.

    """


    if isinstance(body_text, str):

        body_text = [body_text] #Convert whatever passed to a list to support passing of single string


    if not hasattr(body_text, '__iter__'):

        raise TypeError('Passed argument should be a sequence.')


    lemmatizer = lemmy.load(language) #load lemmatizing dictionary


    lemma_list = [] #list to store each lemmatized string 


    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+') # all characters and digits, i.e. all possible words


    for string in body_text:

        #remove punctuation and split words

        matches = word_regex.findall(string)


        #split words and lowercase them unless they are all caps

        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]


        #remove words that are in the stopwords file

        if remove_stopwords_:

            lemmatized_string = remove_stopwords(lemmatized_string)


        #lemmatize each word and choose the shortest word of suggested lemmatizations

        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]


        #remove words that are in the stopwords file

        if remove_stopwords_:

            lemmatized_string = remove_stopwords(lemmatized_string)


        lemma_list.append(' '.join(lemmatized_string))


    return lemma_list if len(lemma_list) > 1 else lemma_list[0] #return list if list was passed, else return string
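
A quick usage sketch (the Danish input sentence and its lemmas are just an illustration; with remove_stopwords_=False the stopwords.txt file doesn't need to exist):

# lemmatize a single Danish string without stopword removal
print(lemmatize_strings('Katten løber gennem haven', remove_stopwords_=False))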

You're welcome to take a look if you like, but don't feel obliged. I'd be really glad if it helps you get any ideas; I spent a long time trying to figure this out myself!


Let me know :-)

