
Remove HTML tags and parse a JSON array into a key/value object

Python

炎炎設計 2023-10-31 15:22:59

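I'm working with a JSON array payload that I want to extract into a separate object for downstream processing. The payload is dynamic and can have multiple levels of nesting within the JSON array, but the first level always has an id field that serves as a unique identifier.

[{'field1': [], 'field2': {'field2_1': string, 'field2_2': 'string', 'field2_3': 'string'}, 'field3': '<html> strings <html>', 'id': 1},
 {'field1': [], 'field2': {'field2_1': string, 'field2_2': 'string', 'field2_3': 'string'}, 'field3': '<html> strings <html>', 'id': 2},
 {'field1': [], 'field2': {'field2_1': string, 'field2_2': 'string', 'field2_3': 'string'}, 'field3': '<html> strings <html>', 'id': 3},
 {'field1': [], 'field2': {'field2_1': string, 'field2_2': 'string', 'field2_3': 'string'}, 'field3': '<html> strings <html>', 'id': 4}]

The payload is not limited to this structure: it may have more fields, or more nested fields holding different types of data. The id field, however, will always be attached to every object in the payload. I want to create a dictionary (I'm open to other suggestions for the data type) that maps each id to everything else in that object as a cleaned string, without any brackets, HTML tags, and so on.

The output should look something like this (depending on the data type):

{1: string string string strings,
 2: string string string strings,
 3: string string string strings,
 4: string string string strings}

This is a very generic example. I'm having trouble navigating the JSON array with all its nesting and content, and I just want to extract the id and the rest of the content in a clean way. Any help is appreciated!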


1 Answer

白板的微信


You can use BeautifulSoup to strip the tags from the strings. For example:

from bs4 import BeautifulSoup

lst = [{'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 1},
       {'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 2},
       {'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 3},
       {'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 4}]


def flatten(d):
    # Recursively yield every string found in nested dicts and lists.
    # Non-string leaves (such as the integer id) are skipped.
    if isinstance(d, dict):
        for v in d.values():
            yield from flatten(v)
    elif isinstance(d, list):
        for v in d:
            yield from flatten(v)
    elif isinstance(d, str):
        yield d


out = {}
for d in lst:
    # Join all strings of one object, parse the result as HTML,
    # and keep only the text nodes (tags are dropped).
    soup = BeautifulSoup(' '.join(flatten(d)), 'html.parser')
    # string=True replaces the deprecated text=True argument.
    out[d['id']] = ' '.join(map(str.strip, soup.find_all(string=True)))

print(out)

Prints:

{1: 'string1 string2 string3 strings4', 2: 'string1 string2 string3 strings4', 3: 'string1 string2 string3 strings4', 4: 'string1 string2 string3 strings4'}
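If installing BeautifulSoup is not an option, roughly the same idea can be sketched with only the standard library. This is a variation on the answer above, not the answer's own method: it swaps BeautifulSoup for html.parser.HTMLParser, and the names TextExtractor, strip_tags and clean_payload are hypothetical helpers made up for this illustration. It also assumes the payload arrives as JSON text and that the id key should be left out of the cleaned string even when it is a string.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())


def strip_tags(text):
    """Return the text content of an HTML fragment as one string."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ' '.join(parser.chunks)


def flatten(value):
    """Yield every string leaf found in nested dicts and lists."""
    if isinstance(value, dict):
        for v in value.values():
            yield from flatten(v)
    elif isinstance(value, list):
        for v in value:
            yield from flatten(v)
    elif isinstance(value, str):
        yield value


def clean_payload(payload):
    """Map each object's id to the cleaned text of its other fields."""
    out = {}
    for obj in payload:
        # Assumption: the id itself should not appear in the cleaned text.
        rest = {k: v for k, v in obj.items() if k != 'id'}
        parts = (strip_tags(s) for s in flatten(rest))
        out[obj['id']] = ' '.join(p for p in parts if p)
    return out


raw = ('[{"field1": [], "field2": {"a": "string1", "b": "string2"}, '
       '"field3": "<html> strings4 <html>", "id": 1}]')
result = clean_payload(json.loads(raw))
print(result)  # {1: 'string1 string2 strings4'}
```

Cleaning each string separately, rather than joining first and parsing once, keeps a stray `<` in one field from affecting how the neighbouring fields are parsed. HTMLParser is more basic than BeautifulSoup, so badly malformed markup may be handled differently.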

2023-10-31
