
Wrong accented characters when using Beautiful Soup in Python on a local HTML file

Html5

慕桂英4014372 2023-09-18 17:18:04

I am quite familiar with Beautiful Soup in Python and have always used it to scrape live websites. Now I am scraping a local HTML file (link, if you want to test the code), and the only problem is that accented characters are not represented correctly (this never happened to me when scraping live sites).

This is a simplified version of the code:

import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('AH.html'), "html.parser")
tables = soup.find_all('table')
titles = tables[0].find_all('tr')
print(titles[55].text)

It prints the following output:

2:22 - Il Destino ?? Gi? Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]

whereas the correct output should be:

2:22 - Il Destino è Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]

I looked for a solution, read many questions and answers, and found this answer, which I implemented as follows:

import requests, urllib.request, time, unicodedata, csv
from bs4 import BeautifulSoup
import codecs

response = open('AH.html')
content = response.read()
html = codecs.decode(content, 'utf-8')
soup = BeautifulSoup(html, "html.parser")

However, it fails with the following error:

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\score.py", line 8, in <module>
    html = codecs.decode(content, 'utf-8')
TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'str')

I guess this is easy to fix, but how?


2 Answers

慕姐8265434


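Opening the file with open('AH.html') decodes it using the platform's default encoding, which may not be the encoding the file was saved in. BeautifulSoup understands HTML headers; in particular, the following tag declares that the file is UTF-8 encoded: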

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Open the file in binary mode and let BeautifulSoup work out the encoding:

with open("AH.html","rb") as f:
    soup = BeautifulSoup(f, 'html.parser')

Sometimes a website declares its encoding incorrectly. In that case, if you know what the encoding should be, you can specify it yourself:

with open("AH.html",encoding='utf8') as f:
    soup = BeautifulSoup(f, 'html.parser')
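The two approaches can be compared side by side with a small, self-contained sketch. It assumes `bs4` is installed; the sample markup, the temporary file, and the helper names `parse_binary`, `parse_with_encoding`, and `demo` are made up for illustration and are not part of the original question.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample page; its meta tag declares UTF-8, like the one quoted above.
SAMPLE_HTML = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    '</head><body><table><tr><td>Il Destino è Già Scritto</td></tr></table>'
    '</body></html>'
)


def parse_binary(path):
    # Hand raw bytes to BeautifulSoup so it can detect the encoding itself.
    with open(path, "rb") as f:
        return BeautifulSoup(f, "html.parser")


def parse_with_encoding(path, encoding="utf-8"):
    # Decode explicitly instead of relying on the platform default.
    with open(path, encoding=encoding) as f:
        return BeautifulSoup(f, "html.parser")


def demo():
    # Write the sample page as UTF-8 to a temporary file, then parse it both ways.
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(SAMPLE_HTML)
        from_bytes = parse_binary(path).find("td").text
        from_text = parse_with_encoding(path).find("td").text
        return from_bytes, from_text
    finally:
        os.remove(path)
```

Both helpers should return the accented title intact, regardless of the platform's default encoding, because neither relies on that default.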

2023-09-18

夢里花落0921


from bs4 import BeautifulSoup

with open("AH.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)

I have to say that your first snippet is actually fine and should work.

As for the second snippet, you are trying to decode a str, which is wrong, because the decode function is meant for bytes objects.
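A minimal sketch of the str versus bytes distinction. The sample string and the `try_decode` helper are made up for illustration: decoding bytes succeeds, while passing an already-decoded str to `codecs.decode` raises the `TypeError` seen in the traceback.

```python
import codecs

text = "è Già"               # str: text that has already been decoded
raw = text.encode("utf-8")   # bytes: what is actually stored on disk


def try_decode(value):
    """Return the decoded text, or the name of the exception that was raised."""
    try:
        return codecs.decode(value, "utf-8")
    except TypeError as exc:
        return type(exc).__name__
```

Calling `try_decode(raw)` gives back the original text, while `try_decode(text)` reports `TypeError`. A file opened in text mode already yields str, so there is nothing left to decode.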

I believe you are on Windows, where the default encoding is cp1252 rather than UTF-8.
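The effect of that mismatch can be reproduced without any file. This is a sketch on a made-up sample string: UTF-8 bytes decoded as cp1252 turn each accented letter into two wrong characters, and reversing the two steps recovers the text.

```python
original = "è Già"
utf8_bytes = original.encode("utf-8")   # b'\xc3\xa8 Gi\xc3\xa0'

# Decoding UTF-8 bytes as cp1252 maps every byte to its own character,
# so each accented letter becomes two unrelated characters.
garbled = utf8_bytes.decode("cp1252")

# Undoing the wrong decode and applying the right one restores the text.
repaired = garbled.encode("cp1252").decode("utf-8")
```

Here `garbled` is seven characters long instead of five, which matches the pattern in the question's output, where "è" became two characters.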

Could you run the following code:

import sys
print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)

and check whether your output is UTF-8 or cp1252.
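One caveat worth hedging: in Python 3, `sys.getdefaultencoding()` always reports `utf-8`, so it does not show what `open()` uses for text files. In the Python 3.7 mentioned in the traceback, that default comes from `locale.getpreferredencoding(False)`. A small sketch that collects the relevant values (the `report_encodings` helper is made up for illustration):

```python
import locale
import sys


def report_encodings():
    """Collect the encodings relevant to reading files and printing text."""
    return {
        # Always 'utf-8' in Python 3; not the default used by open().
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # The default open() uses for text mode when no encoding is given.
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
        # What print() uses when writing to the console (may be None if redirected).
        "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
    }
```

Printing `report_encodings()` on the affected machine would show whether the locale value is cp1252, which is the value that matters for `open('AH.html')`.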

Note that if you are using VSCode with Code-Runner, you should run your code from the terminal as py code.py

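Solution (from chat)

(1) If you are on Windows 10

Open Control Panel and change the view to small icons
Click Region
Click the Administrative tab
Click Change system locale...
Tick the "Beta: Use Unicode UTF-8..." box
Click OK and restart your PC

(2) If you are not on Windows 10, or simply do not want to change that setting, then in the first snippet change open("AH.html") to open("AH.html", encoding="UTF-8"), that is, write: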

from bs4 import BeautifulSoup

with open("AH.html", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)

2023-09-18
