

Web scraping and splitting the retrieved data into separate lines


揚帆大魚 2021-12-09 15:22:03
I am trying to gather the event dates, times, and venues. They come out successfully, but the result is not reader-friendly. How do I make the date, time, and venue appear separately for each event, like:

- event
  Date:
  Time:
  Venue:
- event
  Date:
  Time:
  Venue:

I was thinking of splitting, but I ended up with a lot of [], which made it look even uglier. I also thought of stripping with my regex, but it doesn't seem to do anything. Any suggestions?

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = urllib.request.urlopen(url_toscrape)
info_type = response.info()
responseData = response.read()
soup = BeautifulSoup(responseData, 'lxml')

events_absFirst = soup.find_all("div",{"class": "ntu_event_summary_title_first"})
date_absAll = tr.find_all("div",{"class": "ntu_event_summary_date"})
events_absAll = tr.find_all("div",{"class": "ntu_event_summary_title"})

for first in events_absFirst:
    print('-',first.text.strip())
    print (' ',date)

for tr in soup.find_all("div",{"class":"ntu_event_detail"}):
    date_absAll = tr.find_all("div",{"class": "ntu_event_summary_date"})
    events_absAll = tr.find_all("div",{"class": "ntu_event_summary_title"})
    for events in events_absAll:
        events = events.text.strip()
    for date in date_absAll:
        date = date.text.strip('^Time.*')
    print ('-',events)
    print (' ',date)
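For context on why the stripping attempt does nothing: str.strip() treats its argument as a plain set of characters to trim from both ends of the string, not as a regular expression, so date.text.strip('^Time.*') only removes leading or trailing ^, T, i, m, e, ., and * characters. A minimal sketch of a regex-based split instead, assuming the combined detail text lives in a hypothetical variable named raw:

import re

raw = "Date : 14 Jan 2019  to 11 Apr 2019 Time : 9:00am to 8:00pm Venue: NIE Art gallery"
# split on the labels; the capturing group keeps the labels in the result list
parts = re.split(r'(Date :|Time :|Venue:)', raw)
pairs = [f'{label.strip(" :")}: {value.strip()}' for label, value in zip(parts[1::2], parts[2::2])]
print('\n'.join(pairs))
# Date: 14 Jan 2019  to 11 Apr 2019
# Time: 9:00am to 8:00pm
# Venue: NIE Art gallery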

2 Answers

Qyouu


You can iterate over the divs that contain the event information, store the results, and then print each one:


import requests, re
from bs4 import BeautifulSoup as soup

# parse the events page
d = soup(requests.get('https://www.ntu.edu.sg/events/Pages/default.aspx').text, 'html.parser')

# for each event block, grab the title (first or regular) and the detail text, falling back to 'N/A'
results = [[getattr(i.find('div', {'class':re.compile('ntu_event_summary_title_first|ntu_event_summary_title')}), 'text', 'N/A'),
            getattr(i.find('div', {'class':'ntu_event_summary_detail'}), 'text', 'N/A')]
           for i in d.find_all('div', {'class':'ntu_event_articles'})]

# pull the Date / Time / Venue fragments out of each detail string
new_results = [[a, re.findall(r'Date : .*?(?=\sTime)|Time : .*?(?=Venue)|Time : .*?(?=$)|Venue: [\w\W]+', b)] for a, b in results]

print('\n\n'.join('-{}\n{}'.format(a, '\n'.join(f'  {h}:{i}' for h, i in zip(['Date', 'Time', 'Venue'], b))) for a, b in new_results))

Output:


-7th ASEF Rectors' Conference and Students' Forum (ARC7)
  Date:Date : 29 Nov 2018  to 14 May 2019
  Time:Time : 9:00am to 5:00pm

-Be a Youth Corps Leader
  Date:Date : 1 Dec 2018  to 31 Mar 2019
  Time:Time : 9:00am to 5:00pm

-NIE Visiting Artist Programme January 2019
  Date:Date : 14 Jan 2019  to 11 Apr 2019
  Time:Time : 9:00am to 8:00pm
  Venue:Venue: NIE Art gallery

-Exercise Classes for You: Healthy Campus@NTU
  Date:Date : 21 Jan 2019  to 18 Apr 2019
  Time:Time : 6:00pm to 7:00pm
  Venue:Venue: The Wave @ Sports & Recreation Centre

-[eLearning Course] Information & Media Literacy (From January 2019)
  Date:Date : 23 Jan 2019  to 31 May 2019
  Time:Time : 9:00am to 5:00pm
  Venue:Venue: NTULearn

...
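If the doubled labels in that output (Date:Date :, Time:Time :, Venue:Venue:) are unwanted, one possible cleanup, assuming the new_results list built in the snippet above is still in scope, is to strip the label that the regex kept in each match before printing:

# assumes `re` and `new_results` from the snippet above
# drop the leading "Date : " / "Time : " / "Venue: " from every captured fragment
cleaned = [[a, [re.sub(r'^(Date|Time|Venue)\s*:\s*', '', x).strip() for x in b]] for a, b in new_results]
print('\n\n'.join('-{}\n{}'.format(a, '\n'.join(f'  {h}: {i}' for h, i in zip(['Date', 'Time', 'Venue'], b))) for a, b in cleaned))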


米脂


You can use requests and test the length of stripped_strings:


import requests
from bs4 import BeautifulSoup
import pandas as pd

url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = requests.get(url_toscrape)

soup = BeautifulSoup(response.content, 'lxml')

# every title div whose class starts with 'ntu_event_summary_title'
events = [item.text for item in soup.select("[class^='ntu_event_summary_title']")]
data = soup.select('.ntu_event_summary_date')

dates = []
times = []
venues = []

# each date div yields one to three stripped strings: date, optional time, optional venue
for item in data:
    strings = [string for string in item.stripped_strings]
    if len(strings) == 3:
        dates.append(strings[0])
        times.append(strings[1])
        venues.append(strings[2])
    elif len(strings) == 2:
        dates.append(strings[0])
        times.append(strings[1])
        venues.append('N/A')
    elif len(strings) == 1:
        dates.append(strings[0])
        times.append('N/A')
        venues.append('N/A')

results = list(zip(events, dates, times, venues))
df = pd.DataFrame(results)
print(df)
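As a small optional follow-up: the DataFrame above prints with numeric column headers. If named columns are preferred, one way (reusing the results list from the snippet above, with hypothetical column names) is:

# optional: readable column names; assumes `results` and `pd` from the snippet above
df = pd.DataFrame(results, columns=['Event', 'Date', 'Time', 'Venue'])
print(df.head())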

