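What code splits a sentence into a list of its constituent words and punctuation marks? Most text preprocessing tools tend to strip punctuation out.
For example, if I input: "Punctuations to be included as its own unit."
the expected output is: result = ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
Thank you very much!
2 Answers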

慕村9548890
Contributed 1884 experience points · received more than 4 likes
You may want to consider the Natural Language Toolkit, nltk.
Try this:
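import nltk

# word_tokenize needs the Punkt tokenizer data. If it is missing, download it once
# with nltk.download('punkt') (newer NLTK releases ask for 'punkt_tab' instead).
sentence = "Punctuations to be included as its own unit."
tokens = nltk.word_tokenize(sentence)
print(tokens)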
Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']

郎朗坤
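Contributed 1921 experience points · received more than 9 likes
The snippet below uses a regular expression to pull the words and punctuation marks out into a list.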
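import re
import string

# Escape the punctuation so that characters such as \ ] ^ - are taken literally inside the character class.
punctuations = re.escape(string.punctuation)
# Match either a run of word characters or a single punctuation character.
regularExpression = r"\w+|[" + punctuations + "]"
content = "Punctuations to be included as its own unit."
splittedWords_Puncs = re.findall(regularExpression, content)
print(splittedWords_Puncs)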
Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
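
As a variation on the regex answer above, here is a minimal sketch that does not rely on string.punctuation (which only covers ASCII). It assumes you want every character that is neither a word character nor whitespace to become its own token. The names TOKEN_PATTERN and split_words_and_punctuation are made up for this example.

```python
import re

# \w+ matches a run of letters, digits or underscores;
# [^\w\s] matches one character that is neither a word character nor whitespace.
# TOKEN_PATTERN is a hypothetical name chosen for this sketch.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def split_words_and_punctuation(text):
    """Return the words and punctuation marks of text as separate list items."""
    return TOKEN_PATTERN.findall(text)


print(split_words_and_punctuation("Punctuations to be included as its own unit."))
# ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
```

Note that this splits contractions and repeated marks character by character ("isn't" becomes 'isn', "'", 't', and "..." becomes three '.' tokens), so nltk.word_tokenize is probably the better fit if you need linguistically sensible tokens.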