Could you share the source code?
2019-11-13
From: Python Data Preprocessing (II) - Cleaning Text Data, 1-7
""" Description:正則清洗HTML數據 Author: Prompt:?code?in?python3?env """ """ ???re.I???使匹配對大小寫不敏感 ???re.L???做本地化識別(locale-aware)匹配 ???re.M???多行匹配,影響^(開頭)和$(結尾) ???re.S???匹配包含換行在內的所有字符 ???re.U???根據Unicode字符集解析字符,這個標志影響?\w,?\W,?\b,?\B ???re.X???該標志通過給予你更靈活的格式以便你將正則表達式寫得更加 """ import?re #?處理HTML標簽文本 #?@param?htmlstr?html字符串 def?filter_tags(htmlstr): ???#?過濾doc_type ???htmlstr?=?'?'.join(htmlstr.split()) ???re_doctype?=?re.compile(r'<!DOCTYPE?.*?>',?re.S) ???res?=?re_doctype.sub('',?htmlstr) ???#?過濾CDATA ???re_cdata?=?re.compile(?r'//<!CDATA\[[?>]?//\]?>',?re.I) ???res?=?re_cdata.sub('',?res) ???#?Script ???re_script?=?re.compile('<\s*script[^>]*>[^<]*<\s*/\s*script\s*>',?re.I) ???res?=?re_script.sub('',?res) ???#?注釋 ???re_script?=?re.compile('<!--.*?-->',?0) ???res?=?re_script.sub('',?res) ???#?換行符 ???re_br?=?re.compile('<br\n*?/?>') ???res?=?re_br.sub('\n',?res) ???#?HTML?標簽 ???re_lable?=?re.compile('</?\w[^>]*>') ???res?=?re_lable.sub('',?res) ???#?轉義字符 ???re_esc?=?re.compile('&.*?;') ???res?=?re_esc.sub('',?res) ???#?空格處理 ???re_blank?=?re.compile('\s+')?#?\s包含?\t?\n?\r?\f?\v ???res?=?re_blank.sub('?',?res) ???#?超鏈接處理 ???re_http?=?re.compile(r'(http://.+.html)') ???res?=?re_http.sub('?',?res) ???d?=?lambda?pattern,?flags=0:?re.compile(pattern,?flags) ???for?re_type?in?re_mate: ??????re_type?=?d(*re_type) ??????res?=?re_type.sub('?',?res) ???return?res def?read_file(read_path): ???str_doc?=?'' ???with?open(read_path,?'r',?encoding='utf-8')?as?f: ??????str_doc?=?f.read() ???return?str_doc if?__name__?==?'__main__': ???str_doc?=?read_file(r'../data/html/re.html') ???res?=?filter_tags(str_doc) ???#?print(res) ???with?open(r'../data/html/test.html',?'w',?encoding='utf-8')?as?f: ??????f.write(res) ???print('No?Exception')?#?我是通過另一個編輯器進行打開預覽的
These are my notes.
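The flag notes at the top of the code are easier to follow when tried on a small string. Below is a minimal sketch, assuming a made-up `sample` string (it is not from the course material) and illustrative variable names. It only shows how `re.I`, `re.S`, `re.M` and `re.X` change the results, and is not meant as the course's own code.

```python
import re

# Hypothetical sample text, invented for this illustration.
sample = "<P>first\nline</P>\n<p>second</p>"

# re.I: <p> and <P> are treated the same.
tags_default = re.findall(r"<p>", sample)
tags_ignorecase = re.findall(r"<p>", sample, flags=re.I)

# re.S: '.' also matches '\n', so a tag body may span several lines.
body_default = re.findall(r"<p>(.*?)</p>", sample, flags=re.I)
body_dotall = re.findall(r"<p>(.*?)</p>", sample, flags=re.I | re.S)

# re.M: '^' matches at the start of every line, not only at the start of the string.
starts_default = re.findall(r"^<p>", sample, flags=re.I)
starts_multiline = re.findall(r"^<p>", sample, flags=re.I | re.M)

# re.X: whitespace and comments inside the pattern are ignored,
# so a pattern can be laid out and annotated piece by piece.
tag_pattern = re.compile(r"""
    </?      # '<' followed by an optional '/'
    \w       # the tag name starts with a word character
    [^>]*    # anything up to the closing bracket
    >
""", re.X)
stripped = tag_pattern.sub("", sample)

print(tags_default, tags_ignorecase)
print(body_default, body_dotall)
print(starts_default, starts_multiline)
print(repr(stripped))
```

Without `re.S` the first paragraph is skipped because its body contains a line break, which is probably why the posted code collapses all whitespace before running its patterns.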
weixin_慕尼黑7100639
https://github.com/bainingchao/DataProcess
Answer dates: the code notes above were posted on 2020-02-09, and the GitHub link on 2020-01-24.