Fastest way to strip \n, \, \t, \xa0, â\x80\x93 characters from text in Python

Python

哈士奇WWW 2022-07-05 15:45:37

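I am using BeautifulSoup to convert HTML data, collecting the text of every "p" tag and joining it into one string. I do this with:

source = BeautifulSoup(response.text, "html.parser")
content = ""
for section in source.findAll('p'):
    content += section.get_text()

However, after the conversion the characters listed in the title are scattered throughout the string. I have tried several ways of removing them from the string, for example:

unicodedata.normalize('NFKC', text)
content = u" ".join(content.split())
text.strip(), text.rstrip()

Is there a library that can remove these characters from a string? Some of these methods fix part of the problem, but most of the characters are still there.

Edit: here is a sample string: https://pastebin.com/2DGECKXa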

2 Answers

搖曳的薔薇

Contributed 1793 experience points, received over 6 likes

You can write a function that does this with the .replace method.

unwanted_chars = ['\n', '\t', '\r', '\xa0', 'â\x80\x93']  # Edit this to include all characters you want to remove

def clean_up_text(text, unwanted_chars=unwanted_chars):
    """Return text with every sequence in unwanted_chars removed."""
    for char in unwanted_chars:
        text = text.replace(char, '')
    return text

You can then apply the clean_up_text function to remove all the unwanted characters.

new_text = clean_up_text(old_text)
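
If the loop of .replace calls turns out to be a bottleneck, one option worth timing is a translation table, which handles all the single characters in one pass. The following is only a minimal sketch: the name strip_unwanted is hypothetical, and turning whitespace characters into a space and the â\x80\x93 sequence into a hyphen (rather than deleting them) is an assumption, not something the original posts specify.

```python
# Single characters are handled by one translation table.
# Whitespace-like characters become a space; a literal backslash is deleted.
_TABLE = str.maketrans({
    "\n": " ",
    "\t": " ",
    "\r": " ",
    "\xa0": " ",   # non-breaking space
    "\\": None,    # None means "delete this character"
})

# Multi-character sequences cannot go in the table, so replace them first.
_MOJIBAKE_DASH = "â\x80\x93"


def strip_unwanted(text):
    """Hypothetical helper: clean text in one replace plus one translate."""
    text = text.replace(_MOJIBAKE_DASH, "-")
    text = text.translate(_TABLE)
    # Collapse the runs of spaces left behind by the substitutions.
    return " ".join(text.split())


print(strip_unwanted("first\nsecond\tthird\xa0fourth \\ â\x80\x93 fifth"))
```

Whether this is actually faster than chained .replace calls depends on the input, so it is worth measuring with timeit on your own data.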

Answered 2022-07-05

森欄

Contributed 1810 experience points, received over 5 likes

See if this works:

from simplified_scrapy.simplified_doc import SimplifiedDoc

doc = SimplifiedDoc(response.text)
content = ""
for section in doc.ps:
    content += section.text
    # content += section.unescape()
print(content)
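
A side note on the â\x80\x93 sequence from the question: it has the shape of an en dash whose UTF-8 bytes (E2 80 93) were decoded as Latin-1, so it may be an encoding problem rather than stray characters. If that is what happened, the text can be repaired instead of stripped. The sketch below assumes that diagnosis; repair_mojibake is a hypothetical helper name, not part of any library mentioned here.

```python
def repair_mojibake(text):
    """Hypothetical helper: undo UTF-8 text that was mis-decoded as Latin-1."""
    try:
        # Turn the characters back into the original bytes, then decode properly.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not purely mis-decoded Latin-1; leave it unchanged.
        return text


fixed = repair_mojibake("2010 â\x80\x93 2020")
print(repr(fixed))
```

If the page is fetched with requests, it may be simpler to avoid the problem at the source, for example by passing response.content (bytes) to BeautifulSoup so that it detects the encoding itself, or by setting response.encoding before reading response.text.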

Answered 2022-07-05
