
BeautifulSoup does not find all div tags


Html5

慕桂英4014372 2023-10-30 19:47:12

I have started a private project: web scraping with Python and BeautifulSoup in Visual Studio Code (1.41.0). I was able to scrape another website that has the same structure as my "problem website". Now, however, BeautifulSoup does not find all the div tags: there should be 20 per page, but I only find 3 of them. I get every <div class="css-15dj4ut"></div> that sits inside a <div class="css-fh99y9 excbu0j0">...</div>, but none of those inside a <div class="css-roynbj excbu0j0"></div>. Do you know why?

Iterating over each URL to access each page:

for i in range(0, endIndex):
    try:
        if i == 0:
            urls.append(basicUrl)
            page = urllib.request.urlopen(urls[i])
            soup = BeautifulSoup(page, 'html.parser')
            getSurgeonName(soup)
        else:
            urls.append(basicUrl + urlAddon + str(i + 1))
            page = urllib.request.urlopen(urls[i])
            soup = BeautifulSoup(page, 'html.parser')
            getSurgeonName(soup)
    except:
        print("An URL request error occured.")

Function version 1:

def getSurgeonName(soup):
    # gets just first 3 surgeons of site
    docName = re.compile('css-15dj4ut')
    docNameTags = soup.find_all('div', attrs={'class': docName})
    for a in docNameTags:
        docNameList.append(a.getText())

Function version 2:

def getSurgeonName(soup):
    parentClass = re.compile('css-fh99y9 excbu0j0')
    parentItems = soup.find_all('div', attrs={'class': parentClass})
    for parent in parentItems:
        children = parent.findChildren('div', {"class": "css-15dj4ut"})
        docNameList.append(children[0].getText())

    parentClass = re.compile('css-roynbj excbu0j0')
    parentItems = soup.find_all('div', attrs={'class': parentClass})
    for parent in parentItems:
        children = parent.findChildren('div', {'class': 'css-15dj4ut'})
        docNameList.append(children[0].getText())


1 Answer

大話西游666


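Actually, the data you want is loaded dynamically by JavaScript when the page loads, and the requests package does not execute JavaScript, so it cannot render that content. However, I was able to find the script tag that holds the data and load its contents with json into a dict.

From there you can parse out whatever you want :).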

import json

import requests
from bs4 import BeautifulSoup

# Fetch the page; the doctor list is embedded as JSON in a <script> tag.
r = requests.get("https://www.comparis.ch/gesundheit/arzt/pathologie")
soup = BeautifulSoup(r.content, 'html.parser')

# The script tag with id "__NEXT_DATA__" holds the page data as a JSON string.
script = soup.find("script", {'id': '__NEXT_DATA__'}).text
data = json.loads(script)
print(data.keys())  # top-level keys of the JSON dict

dumper = json.dumps(data, indent=4)
print(dumper)  # to see it in a human-readable format
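The lookup above raises an AttributeError when the page has no such script tag, because soup.find then returns None. A minimal sketch of a more defensive variant follows. The helper name extract_next_data and the sample HTML are made up for illustration, and the sketch assumes the page embeds its data in a script tag with the id __NEXT_DATA__, as described above.

```python
import json

from bs4 import BeautifulSoup


def extract_next_data(html):
    """Return the JSON embedded in <script id="__NEXT_DATA__"> as a dict, or None.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    # find() returns None when the tag is absent; .string is None when it is empty.
    if tag is None or not tag.string:
        return None
    return json.loads(tag.string)


# Made-up sample page standing in for the real response body.
sample_html = """
<html><body>
<div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"doctorResults": {"doctorModels": [{"name": "Dr. Example"}]}}}}
</script>
</body></html>
"""

data = extract_next_data(sample_html)
print(data["props"]["pageProps"]["doctorResults"]["doctorModels"][0]["name"])
```

Returning None instead of raising lets the calling loop skip a page that lacks the embedded data and carry on with the next one.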

For example:

for item in data['props']['pageProps']['doctorResults']['doctorModels']:
    print(item['name'])

Output:

Mohamed Abdou

Dr. med. Heiner Adams

Dr. med. Franziska Aebersold

Prof. Dr. med. Adriano Aguzzi

Dr. med. Maria Ammann

Prosper Anani

Dr. med. Max Arnaboldi

Dr. med. Walter Arnold

Dr. med. Irena Baltisser

Dr. med. Fridolin Bannwart

Dr. med. Yara Banz

Dr. med. André Barghorn

Dr. Jessica Barizzi

Prof. Dr. med. Daniel Baumhoer

Audrey Baur Chaubert

Dr. med. Christian Georg Bayerl

Dr. med. Marc Beer

Dr. med. Sabina Berezowska

Dr. med. Steffen Bergelt

Dr. med. Barbara Elisabeth Berger-Denzler
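To tie this back to the loop in the question, the same key path could be applied to the parsed data of every page. A rough sketch, assuming each page uses the key path shown above: the function names, the example.com URL, the "?page=" suffix and the sample dict are all hypothetical, and the suffix in particular only stands in for the urlAddon value, which the question does not show.

```python
def page_urls(basic_url, url_addon, page_count):
    """Build page URLs the way the loop in the question does.

    The first page is the base URL; page n (n >= 2) is
    basic_url + url_addon + str(n).
    """
    urls = [basic_url]
    for page_number in range(2, page_count + 1):
        urls.append(basic_url + url_addon + str(page_number))
    return urls


def doctor_names(data):
    """Collect names from props -> pageProps -> doctorResults -> doctorModels.

    Returns an empty list if any key along the path is missing.
    """
    node = data
    for key in ("props", "pageProps", "doctorResults"):
        node = node.get(key, {}) if isinstance(node, dict) else {}
    models = node.get("doctorModels", []) if isinstance(node, dict) else []
    return [model["name"] for model in models if "name" in model]


# Made-up data in the same shape as the parsed __NEXT_DATA__ dict.
sample = {
    "props": {
        "pageProps": {
            "doctorResults": {
                "doctorModels": [
                    {"name": "Dr. Example A"},
                    {"name": "Dr. Example B"},
                ]
            }
        }
    }
}

print(page_urls("https://example.com/doctors", "?page=", 3))
print(doctor_names(sample))
print(doctor_names({}))
```

Keeping URL building and name extraction separate from the HTTP request means both can be checked on sample data before any page is fetched.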

2023-10-30
