I am trying to scrape some data from the website https://www.cellartracker.com/m/wines/12344. I can't work out how to get the values that are not wrapped in any tag or class of their own. Here is the part of the page markup I am interested in:

<ul class="twin-set-list">
<li><span>Vintage</span> 2000</li>
<li><span>Type</span> Red</li>
<li><span>Producer</span> Balnaves of Coonawarra</li>
<li><span>Varietal</span> Cabernet Sauvignon</li>
<li><span>Designation</span> The Tally Reserve</li>
<li><span>Vineyard</span> n/a</li>
<li><span>Country</span> Australia</li>
<li><span>Region</span> South Australia</li>
<li><span>SubRegion</span> Limestone Coast</li>
<li><span>Appellation</span> Coonawarra</li>
</ul>

Values such as 2000 and Red have no class, so how can I get at them? I tried the following code in Python (only the relevant HTML fragment is included):

from bs4 import BeautifulSoup

html = """<ul class="twin-set-list">
<li><span>Vintage</span> 2000</li>
<li><span>Type</span> Red</li>
<li><span>Producer</span> Balnaves of Coonawarra</li>
<li><span>Varietal</span> Cabernet Sauvignon</li>
<li><span>Designation</span> The Tally Reserve</li>
<li><span>Vineyard</span> n/a</li>
<li><span>Country</span> Australia</li>
<li><span>Region</span> South Australia</li>
<li><span>SubRegion</span> Limestone Coast</li>
<li><span>Appellation</span> Coonawarra</li>
</ul>"""

soup = BeautifulSoup(html, 'html.parser')
need = {}
for li_tag in soup.find_all('ul', {'class':'twin-set-list'}):
    for span_tag in li_tag.find_all('li'):
        field = span_tag.find('span').text
        value = span_tag.find('span').text
        need[field] = value
print(need)

Can anyone suggest how to extract this data?
3 Answers

狐的傳說
Contributed 1804 experience points · received more than 3 likes
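You can replace the two assignments inside your inner loop with:
field = span_tag.find('span').text
value = span_tag.text.replace(field, '').strip()  # strip() drops the space left after the label
It is not very clean, but it works with your code.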
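A possible alternative to string replacement, offered only as a sketch: the value is the bare text node that follows the closing </span>, so it can be read through the span's next_sibling. The shortened html string below is an assumption made for illustration, and the code assumes every li contains a span directly followed by text.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (illustrative only).
html = """<ul class="twin-set-list">
<li><span>Vintage</span> 2000</li>
<li><span>Type</span> Red</li>
<li><span>Region</span> South Australia</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
need = {}
for li in soup.find("ul", class_="twin-set-list").find_all("li"):
    span = li.find("span")
    # next_sibling is the text node immediately after </span>, e.g. " 2000"
    need[span.get_text(strip=True)] = str(span.next_sibling).strip()

print(need)
```

Because the label and the value are read from separate nodes, nothing has to be cut out of a combined string.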

慕桂英4014372
Contributed 1871 experience points · received more than 13 likes
You can iterate over the contents of each bs4 tag object (iterating over a tag yields its children):
from bs4 import BeautifulSoup as soup
d = [[getattr(c, 'text', c).strip() for c in i] for i in soup(html, 'html.parser').find_all('li')]
Output:
[['Vintage', '2000'], ['Type', 'Red'], ['Producer', 'Balnaves of Coonawarra'], ['Varietal', 'Cabernet Sauvignon'], ['Designation', 'The Tally Reserve'], ['Vineyard', 'n/a'], ['Country', 'Australia'], ['Region', 'South Australia'], ['SubRegion', 'Limestone Coast'], ['Appellation', 'Coonawarra']]
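If a dict like the need mapping from the question is wanted instead of a list of pairs, one way to sketch it is with the stripped_strings generator, which yields each text fragment of a tag with surrounding whitespace removed. The shortened html string is an assumption for illustration, and the length check assumes that a well-formed row has exactly one label and one value.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (illustrative only).
html = """<ul class="twin-set-list">
<li><span>Vintage</span> 2000</li>
<li><span>Producer</span> Balnaves of Coonawarra</li>
<li><span>Vineyard</span> n/a</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
need = {}
for li in soup.find_all("li"):
    # e.g. ['Vintage', '2000']: the span text, then the trailing text node
    parts = list(li.stripped_strings)
    if len(parts) == 2:  # skip rows that do not look like label + value
        field, value = parts
        need[field] = value

print(need)
```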

一只甜甜圈
Contributed 1836 experience points · received more than 5 likes
Maybe you can try this:
for li_tag in soup.find_all('ul', {'class':'twin-set-list'}):
    for span_tag in li_tag.find_all('li'):
        field = span_tag.find('span').text
        value = span_tag.text
        value = value[len(field)+1:]  # skip the label and the single space after it
        need[field] = value
In case the field text also appears inside the value, don't use replace; take a substring instead.
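To see the difference between the two approaches, here is a small sketch with a made-up row (not taken from the real page) whose value happens to contain the label text.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the label "Region" also occurs inside the value.
html = "<li><span>Region</span> Wine Region 5</li>"

li = BeautifulSoup(html, "html.parser").find("li")
field = li.find("span").text   # "Region"
text = li.text                 # "Region Wine Region 5"

# replace() removes every occurrence of the label, damaging the value
replaced = text.replace(field, "").strip()

# slicing only removes the leading label, leaving the value intact
sliced = text[len(field):].strip()

print(repr(replaced))
print(repr(sliced))
```

Slicing by the length of the label only touches the start of the string, which is why it is the safer choice here.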