2 Answers

Contributed 1934 experience points · earned 2+ upvotes
If you are working with a dataframe, I suggest you store the result of each preprocessing step in a new column. That way you can always inspect the output, and you can always create a list of lists to use as input for a model afterwards with a single line of code. Another advantage of this approach is that you can easily visualise the preprocessing pipeline and add extra steps where needed without getting confused.
Regarding your code, it can be optimised (for example, you can perform stopword removal and tokenisation at the same time), and the steps you perform are a bit muddled. For instance, you lemmatise multiple times, using different libraries as well, which makes no sense. In my opinion nltk works just fine; personally, I use other libraries to preprocess tweets only to handle emojis, URLs and hashtags, all the things specifically related to tweets.
# imports needed for this snippet
import pandas as pd
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# define a new column to store the processed tweets
df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index)

tknzr = TweetTokenizer()
lmtzr = WordNetLemmatizer()

stop_words = set(stopwords.words("english"))
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

# iterate through each tweet
for ind, row in df_tweet1.iterrows():
    # get the initial tweet as a string: 'This is the initial tweet'
    tweet = row['Tweet Content']
    # tokenisation, stopword removal and lemmatisation all at once
    # out: ['initial', 'tweet']
    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet)
             if i.lower() not in new_stopwords_list]
    # POS tagging; no need to lemmatise again after this
    # out: [('initial', 'JJ'), ('tweet', 'NN')]
    tweet = nltk.pos_tag(tweet)
    # save the processed tweet into the new column
    df_tweet1.loc[ind, 'Tweet Content Clean'] = tweet
So, in total, you only need 4 lines: one to get the tweet string, two to preprocess the text, and another to store the tweet. You can add extra processing steps, paying attention to the output of each step (e.g. tokenisation returns a list of strings, while POS tagging returns a list of tuples, which is the reason you were running into trouble).
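To make that difference concrete, here is what those two intermediate outputs look like (the sample sentence is made up, and the exact tags are approximate):

tokens = tknzr.tokenize('This is the initial tweet')
# out: ['This', 'is', 'the', 'initial', 'tweet']  -- a list of strings
tagged = nltk.pos_tag(tokens)
# out (roughly): [('This', 'DT'), ('is', 'VBZ'), ('the', 'DT'), ('initial', 'JJ'), ('tweet', 'NN')]  -- a list of tuples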
If you want, you can then create a list of lists containing all the tweets in the dataframe:
# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
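As a sketch of the extra tweet-specific handling mentioned earlier (URLs and hashtags), a plain-regex version could be slotted in before tokenisation. The answer does not say which library it uses for this, so this is only an illustration (the clean_tweet name is made up, and emoji handling is omitted since it would need a dedicated library):

import re

def clean_tweet(raw_tweet):
    # drop URLs (http/https links)
    without_urls = re.sub(r'https?://\S+', '', raw_tweet)
    # drop user mentions
    without_mentions = re.sub(r'@\w+', '', without_urls)
    # keep the hashtag word but drop the '#' sign
    return re.sub(r'#(\w+)', r'\1', without_mentions)

# inside the loop above, before tokenisation:
# tweet = clean_tweet(row['Tweet Content'])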

Contributed 1829 experience points · earned 13+ upvotes
The first part is that new_test is a list of strings. lemmatize_sentence needs a single string, so passing new_test will raise an error like the one you got. You have to pass each string separately and then create a list from each lemmatized string. So:
text = new_test
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]
should create a list of lemmatized sentences.
Actually, I once did a project that looks similar to the one you are working on. I made the following function to lemmatize strings:
import re
import lemmy

def remove_stopwords(lst):
    with open('stopwords.txt', 'r') as sw:
        # read the stopwords file
        stopwords = sw.read().split('\n')
    return [word for word in lst if word not in stopwords]

def lemmatize_strings(body_text, language='da', remove_stopwords_=True):
    """Function to lemmatize a string or a list of strings, i.e. reduce words to their lemmas. Also removes punctuation.

    -- body_text: string or list of strings
    -- language: language of the passed string(s), e.g. 'en', 'da' etc.
    """
    if isinstance(body_text, str):
        body_text = [body_text]  # convert a single string to a list so both cases are handled the same way

    if not hasattr(body_text, '__iter__'):
        raise TypeError('Passed argument should be a sequence.')

    lemmatizer = lemmy.load(language)  # load the lemmatizing dictionary

    lemma_list = []  # list to store each lemmatized string

    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+')  # all letters and digits, i.e. all possible words

    for string in body_text:
        # remove punctuation and split into words
        matches = word_regex.findall(string)

        # lowercase words unless they are all caps
        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]

        # remove words that are in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        # lemmatize each word and choose the shortest of the suggested lemmatizations
        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]

        # remove stopwords again, since lemmatization may have produced new stopwords
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        lemma_list.append(' '.join(lemmatized_string))

    # return a list if more than one string was processed, else a single string
    return lemma_list if len(lemma_list) > 1 else lemma_list[0]
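A quick usage sketch (assumptions: the lemmy package is installed, a stopwords.txt file with one word per line exists in the working directory, and the Danish sample sentences are made up):

# single string in -> single string out
print(lemmatize_strings('Kattene løber på gaden'))

# list of strings in -> list of lemmatized strings out
print(lemmatize_strings(['Kattene løber på gaden', 'Hunden sover i haven'], language='da'))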
You're welcome to take a look if you like, but don't feel obligated. I would be really happy if it helps you get some ideas; I spent a lot of time trying to figure it out myself!
Let me know :-)