
My code for removing @user and punctuation doesn't work


Python

慕標5832272 2022-12-20 12:34:08

I wrote the code below for a dataset of tweets that I want to preprocess. Removing # and website links works, but my code for removing @user and punctuation does not. I'm new to Python; can anyone help?

from nltk.corpus import stopwords
import spacy, re
nlp = spacy.load('en')
stop_words = [w.lower() for w in stopwords.words()]

def sanitize(input_string):
    """ Sanitize one string """
    # normalize to lowercase
    string = input_string.lower()
    # spacy tokenizer
    string_split = [token.text for token in nlp(string)]
    # in case the string is empty
    if not string_split:
        return ''
    names = re.compile('@[A-Za-z0-9_][A-Za-z0-9_]+')
    string = [re.sub(names, '@USER', tweet) for tweet in input_string()]
    #remove # and @
    for punc in '":!#':
        string = string.replace(punc, '')
    # remove 't.co/' links
    string = re.sub(r'http//t.co\/[^\s]+', '', string, flags=re.MULTILINE)
    # removing stop words
    string = ' '.join([w for w in string.split() if w not in stop_words])
    #punctuation
    # string = [''.join(w for w in string.split() if w not in
    #string.punctuation) for w in string]
    return string

list = ['@Jeff_Atwood Thank you for #stackoverflow', 'All hail @Joel_Spolsky t.co/Gsb7V1oVLU #stackoverflow' ]
list_sanitized = [sanitize(string) for string in tweets[:300]]
list_sanitized[:50]


2 Answers

千萬里不及你

Contributed 1784 experience points, received 9+ likes

The regular expression needs fixing. Try something like:

names = re.compile('@[A-Za-z0-9_]+')
string = re.sub(names, '@USER', input_string)

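input_string is a variable, not a function, so it cannot be called with (). It is also a single string, so you don't want to iterate over it. You can see the pattern in action here: https://regexr.com/55u44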
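To make the fix above concrete, here is a minimal sketch that isolates just the mention-replacement step. The helper name replace_mentions is made up for illustration and does not appear in the question. In the original code, input_string() raises TypeError because a str is not callable, and the pattern [A-Za-z0-9_][A-Za-z0-9_]+ only matches handles of two or more characters.

```python
import re

# One or more word characters after '@' (same pattern as in the answer above).
names = re.compile(r'@[A-Za-z0-9_]+')

def replace_mentions(text):
    """Replace every @mention in a single string with the placeholder '@USER'.

    Hypothetical helper for illustration; 'text' is one tweet, not a list.
    """
    return names.sub('@USER', text)

print(replace_mentions('@Jeff_Atwood Thank you for #stackoverflow'))
# prints: @USER Thank you for #stackoverflow
print(replace_mentions('All hail @Joel_Spolsky t.co/Gsb7V1oVLU #stackoverflow'))
# prints: All hail @USER t.co/Gsb7V1oVLU #stackoverflow
```

Inside sanitize, the same call would probably be applied to the lowercased string rather than to input_string, so that the earlier normalization is not thrown away.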

Your punctuation removal works fine; see: https://ideone.com/zScVPJ
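The loop in the question strips only four characters (", :, ! and #). If the goal is to drop all ASCII punctuation, one possible approach is str.translate with a deletion table, sketched below. The helper name remove_punctuation is an assumption for illustration. Note that inside sanitize the local variable named string shadows the string module, so string.punctuation would fail there; importing the constant directly avoids that clash.

```python
from string import punctuation  # imported by name so a local 'string' variable cannot shadow it

# Translation table that maps every ASCII punctuation character to "delete".
_DELETE_PUNCT = str.maketrans('', '', punctuation)

def remove_punctuation(text):
    """Return text with all ASCII punctuation removed (hypothetical helper)."""
    return text.translate(_DELETE_PUNCT)

print(remove_punctuation('Thank you for #stackoverflow!'))
# prints: Thank you for stackoverflow
```

Because punctuation includes '@', '_', '.' and '/', this step would also mangle '@USER' placeholders and t.co links, so it likely belongs after mention and link handling.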

Answered 2022-12-20

Helenr

Contributed 1780 experience points, received 4+ likes

Try calling sub on the compiled pattern directly: string = names.sub('@USER', input_string)
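A list comprehension is still useful here, but over the list of tweets at the call site rather than inside sanitize, which receives one string at a time. The sketch below is a trimmed-down sanitize that keeps only the lowercasing and mention steps. The variable names tweets and tweets_sanitized are assumptions: the question defines its sample data as list (which shadows the built-in) but then iterates over an undefined tweets.

```python
import re

names = re.compile(r'@[A-Za-z0-9_]+')

def sanitize(input_string):
    """Sanitize exactly one tweet (a single str); simplified for illustration."""
    text = input_string.lower()
    # Pattern.sub(repl, text) is equivalent to re.sub(pattern, repl, text).
    return names.sub('@USER', text)

# Sample data from the question, bound to a name that does not shadow 'list'.
tweets = [
    '@Jeff_Atwood Thank you for #stackoverflow',
    'All hail @Joel_Spolsky t.co/Gsb7V1oVLU #stackoverflow',
]

# The loop lives here: one sanitize call per tweet.
tweets_sanitized = [sanitize(tweet) for tweet in tweets]
print(tweets_sanitized)
# prints: ['@USER thank you for #stackoverflow', 'all hail @USER t.co/gsb7v1ovlu #stackoverflow']
```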

Answered 2022-12-20
