
Find a nested JS object value from a specific script tag with Beautiful Soup

JavaScript

蠱毒傳說 2022-05-22 10:31:58

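I am scraping sites with Beautiful Soup to grab images. This has worked for every site so far, and I have even managed to create some custom case types. One particular site is causing me problems, though, because it returns all of its images in a JavaScript object that is embedded inline in a script tag.

The object is very large because it contains all of the product info. The specific bit I am looking for is nested in productArticleDetails > [product id] > normalImages > thumbnail > [image path]. Like so:

<script>var productArticleDetails = { ... '0399310001': { ... 'normalImages': [ { 'thumbnail': '//image-path.jpg', ... } ] }}

So I would like to extract just the image path. It is also not the only thing wrapped in a script tag within the returned "soup"; there are many other script tags in the page.

So far I have saved the HTML to a variable and then run:

soup = BeautifulSoup(html)
scripts = soup.find_all('script')

That leaves me with an object containing all of the <script> elements from html.

Somehow, within that scripts object, I need to find that specific node in the correct chunk of JS and return the value of thumbnail, which is nested under the normalImages node, which in turn is nested under a string of digits, all of which is ultimately assigned to the productArticleDetails var.

I presume I need a for loop over the scripts object, but I am having no luck figuring out how to extract the specific piece of data. Everything else I have seen assumes there is only one piece of JavaScript and that the value you are looking for is not nested.

Can anyone help? Cheers.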


2 Answers

蕪湖不蕪

1796 contributions · 7+ upvotes

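If you can make a simplifying assumption, for example that the object you want to parse ends with a } flush against the start of a line, this is easy: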

import ast
import re

from bs4 import BeautifulSoup

html = """
<script>
// we don't care about this script tag
</script>
<script>
var productArticleDetails = {
  '0399310001': {
    'normalImages': [
      {
        'thumbnail': '//image-path.jpg',
      }
    ]
  }
}
var someOtherThing = 42;
</script>
"""

soup = BeautifulSoup(html, "lxml")

for script in soup.find_all("script"):
    # Lazily capture from the assignment up to the first "}" at the start of a line.
    pattern = r"^var productArticleDetails = (.+?^})"

    if m := re.search(pattern, script.text, re.M | re.S):
        # This particular object is also a valid Python literal, so ast can read it.
        data = ast.literal_eval(m.group(1))
        break

print(data["0399310001"]["normalImages"][0]["thumbnail"])

Output:

//image-path.jpg

But if you can't make that assumption, perhaps you can make a different one, such as "take everything up to the next blank line as the object":

pattern = r"^var productArticleDetails = (.+?^\s*$)"
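As a quick check of the blank-line variant, here is a minimal sketch that works on the script text directly (no Beautiful Soup involved). The sample text is an assumption made for illustration: the closing brace is indented, so the "flush }" pattern would not match it, and a blank line follows the object.

```python
import ast
import re

# Illustrative script text (an assumption, not the real site's markup):
# the closing brace is indented and a blank line follows the object.
script_text = """var productArticleDetails = {
  '0399310001': {'normalImages': [{'thumbnail': '//image-path.jpg'}]}
  }

var someOtherThing = 42;
"""

# Capture lazily until the first line that is empty or whitespace-only.
pattern = r"^var productArticleDetails = (.+?^\s*$)"

m = re.search(pattern, script_text, re.M | re.S)
data = ast.literal_eval(m.group(1))
print(data["0399310001"]["normalImages"][0]["thumbnail"])  # //image-path.jpg
```

This still breaks if the object itself contains a blank line, so it trades one formatting assumption for another.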

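If this is still too brittle and the object could be in any form, then we have a balanced-bracket detection problem, which regex is not suited for. You can use a stack to determine when the object ends (be careful if the data contains } inside strings, but this is a tractable parsing problem).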
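The stack idea could look roughly like the sketch below. The function name extract_object and the sample text are hypothetical, introduced only for illustration. It skips over single- and double-quoted strings so that a } inside a string does not end the object, but it assumes the script contains no JS comments, template literals, or regex literals with brackets in them.

```python
import ast

PAIRS = {"}": "{", "]": "["}

def extract_object(source: str, var_name: str) -> str:
    """Return the text of the object literal assigned to `var_name`.

    Hypothetical helper: scans character by character, keeping a stack of
    open brackets and ignoring anything inside quoted strings.
    """
    marker = f"var {var_name} ="
    # str.index raises ValueError if the marker or the opening brace is missing.
    start = source.index("{", source.index(marker) + len(marker))
    stack = []
    quote = None      # quote character of the string we are inside, if any
    escaped = False   # True right after a backslash inside a string
    for i in range(start, len(source)):
        ch = source[i]
        if quote is not None:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            if not stack or stack.pop() != PAIRS[ch]:
                raise ValueError("mismatched brackets")
            if not stack:
                return source[start:i + 1]
    raise ValueError("object literal is never closed")

js = """var productArticleDetails = {
  '0399310001': {
    'normalImages': [{'thumbnail': '//image-}path.jpg'}]
  }
};
var someOtherThing = {'x': 1};"""

data = ast.literal_eval(extract_object(js, "productArticleDetails"))
print(data["0399310001"]["normalImages"][0]["thumbnail"])
```

In the scraping loop, this would be called on each script's text in place of the re.search call, catching ValueError for scripts that do not contain the variable.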

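Note that ast.literal_eval() will fail if the JS object has keys without quotes around them, so you may need to prepare for that case as well. It isn't clear whether you need a static one-off parse or a robust solution that can withstand any JS object format.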
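To make the unquoted-key failure concrete, here is a small sketch. quote_bare_keys is a hypothetical helper, not part of any library, and the sample object is invented for illustration. It is regex-based and deliberately naive: a string value that itself contains something like ", word:" would be corrupted, and JS-only values such as true, false, or null would still be rejected by ast.literal_eval().

```python
import ast
import re

# Matches an identifier that follows "{" or "," and precedes ":".
BARE_KEY = re.compile(r"([{,]\s*)([A-Za-z_$][\w$]*)(\s*:)")

def quote_bare_keys(js_object: str) -> str:
    """Hypothetical helper: wrap bare identifier keys in double quotes."""
    return BARE_KEY.sub(r'\1"\2"\3', js_object)

js_object = "{ thumbnail: '//image-path.jpg', width: 120 }"

try:
    ast.literal_eval(js_object)
except ValueError:
    # Bare keys parse as Python names, which literal_eval refuses to evaluate.
    print("unquoted keys are rejected")

data = ast.literal_eval(quote_bare_keys(js_object))
print(data["thumbnail"])
```

For anything beyond a one-off scrape, a proper JavaScript-aware parser is likely a safer choice than patching the text with regexes.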

json.loads is close to useless here because it assumes strictly valid JSON. JS objects rarely take that form, and the one shown here does not.
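A short sketch of that difference, using a one-line version of the object from the question: the single quotes and the trailing comma are both invalid JSON, but both are fine in a Python literal.

```python
import ast
import json

text = "{'0399310001': {'normalImages': [{'thumbnail': '//image-path.jpg',}]}}"

try:
    json.loads(text)
    json_ok = True
except json.JSONDecodeError:
    # JSON requires double-quoted strings and forbids trailing commas.
    json_ok = False

data = ast.literal_eval(text)

print(json_ok)  # False
print(data["0399310001"]["normalImages"][0]["thumbnail"])  # //image-path.jpg
```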

2022-05-22

慕尼黑的夜晚無繁華

1864 contributions · 6+ upvotes

import json

from bs4 import BeautifulSoup

html = """<script type="application/ld+json">
var productArticleDetails = {
"@context" : "https://schema.org",
"@type" : "BreadcrumbList",
"itemListElement": [ {"@type":"ListItem","thumbnail":"//image-path.jpg","item":{"@id":"https://www.myntra.com/","name":"Home"}},{"@type":"ListItem","position":2,"item":{"@id":"https://www.myntra.com/clothing","name":"Clothing"}},{"@type":"ListItem","position":3,"item":{"@id":"https://www.myntra.com/men-clothing","name":"Men Clothing"}},{"@type":"ListItem","position":4,"item":{"@id":"https://www.myntra.com/shirts","name":"Shirts"}},{"@type":"ListItem","position":5,"item":{"@id":"https://www.myntra.com/formal-shirts-for-men","name":"Formal Shirts For Men"}} ]
}
</script>"""

soup = BeautifulSoup(html, 'html.parser')

sc = soup.find("script").text

# Keep everything after the first "=". This only works when the object is
# strict JSON and nothing follows it in the script.
data = sc.split("=", 1)[1]
ld = json.loads(data)

# print(json.dumps(ld, indent=4))
print(ld["itemListElement"][0]["thumbnail"])

Output:

//image-path.jpg

2022-05-22
