

I am scraping a website with selenium and BeautifulSoup.

米脂 2022-01-18 21:19:04
I am using Selenium WebDriver and Beautiful Soup to scrape a website that has a variable number of pages. I navigate by XPath. Five page links are shown at a time; after counting off five, I click the Next button and reset the XPath count to pick up the next set of five pages. To do this I need the site's total page count, obtained through code, or a better way of navigating to the different pages. I believe the page uses Angular JavaScript for its navigation. The code is as follows:

import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.maximize_window()
spg_index = ' '
url = "https://www.bseindia.com/corporates/ann.html"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
html = soup.prettify()
with open('bseann.txt', 'w', encoding='utf-8') as f:
    f.write(html)
time.sleep(1)

i = 1  # index of pages navigated; kept at a maximum of 31 at present
k = 1  # goes up to 5, the maximum number of navigation pages shown at one time
while i < 31:
    next_pg = 9  # xpath index pointing at the "next" page link
    if i > 5:
        next_pg = 10  # in later sets of pages there is an additional option
    snext_pg = str(next_pg).strip()
    if (i == 6) or (i == 11) or (i == 16):  # reset xpath index for each new set of pages
        k = 2
        path = '/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li[' + snext_pg + ']/a'
        next_page_btn_list = driver.find_elements_by_xpath(path)
        next_page_btn = next_page_btn_list[0]
        next_page_btn.click()  # click next page
        time.sleep(1)
    pg_index = k + 2
    spg_index = str(pg_index).strip()
    path = '/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li[' + spg_index + ']/a'
    next_page_btn_list = driver.find_elements_by_xpath(path)
    next_page_btn = next_page_btn_list[0]
    next_page_btn.click()  # click the specific page number
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    html = soup.prettify()
    i = i + 1
    k = k + 1
    with open('bseann.txt', 'a', encoding='utf-8') as f:
        f.write(html)
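The two counters in the question's loop (i for the absolute page number, k for the position within the visible set of five links) can be derived from one another, which shrinks the Selenium loop considerably. A minimal sketch under the question's own assumptions: the li-index arithmetic (pg_index = k + 2, "Next" at li[10] once the extra option appears) and the XPath prefix are taken from the code above, everything else is illustrative and untested against the live site.

```python
def li_slot(page_no, set_size=5):
    """Position (1..set_size) of an absolute page number within the
    currently visible set of pagination links."""
    return (page_no - 1) % set_size + 1

def scrape_pages(driver, total_pages):
    """Yield page_source for each page, clicking through the pager."""
    # Imported here so the pure helper above stays usable without selenium.
    from selenium.webdriver.common.by import By

    pager = "/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li[{}]/a"  # from the question
    for page_no in range(1, total_pages + 1):
        slot = li_slot(page_no)
        if page_no > 1 and slot == 1:
            # Entered a new set of five: click "Next" first (li index 10 once
            # the additional option appears, per the question).
            driver.find_element(By.XPATH, pager.format(10)).click()
        driver.find_element(By.XPATH, pager.format(slot + 2)).click()  # pg_index = k + 2
        yield driver.page_source
```

After driver.get(url), iterating over scrape_pages(driver, 17) would then replace the hand-maintained while loop and its i/k resets.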

2 Answers

小唯快跑啊

Contributed 1863 experience points · earned 2+ upvotes

There is no need to use Selenium here, as you can access the information from the API. This pulls 247 announcements:


import requests
from pandas.io.json import json_normalize

url = 'https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
payload = {
    'strCat': '-1',
    'strPrevDate': '20190423',
    'strScrip': '',
    'strSearch': 'P',
    'strToDate': '20190423',
    'strType': 'C'}

jsonData = requests.get(url, headers=headers, params=payload).json()

df = json_normalize(jsonData['Table'])
df['ATTACHMENTNAME'] = '=HYPERLINK("https://www.bseindia.com/xml-data/corpfiling/AttachLive/' + df['ATTACHMENTNAME'] + '")'
df.to_csv('C:/filename.csv', index=False)

Output:


...


GYSCOAL ALLOYS LTD. - 533275 - Announcement under Regulation 30 (LODR)-Code of Conduct under SEBI (PIT) Regulations, 2015

https://www.bseindia.com/xml-data/corpfiling/AttachLive/82f18673-de98-4a88-bbea-7d8499f25009.pdf


INDIAN SUCROSE LTD. - 500319 - Certificate Under Regulation 40(9) Of Listing Regulation For The Half Year Ended 31.03.2019

https://www.bseindia.com/xml-data/corpfiling/AttachLive/2539d209-50f6-4e56-a123-8562067d896e.pdf


Dhanvarsha Finvest Ltd - 540268 - Reply To Clarification Sought From The Company

https://www.bseindia.com/xml-data/corpfiling/AttachLive/f8d80466-af58-4336-b251-a9232db597cf.pdf


Prabhat Telecoms (India) Ltd - 540027 - Signing Of Framework Supply Agreement With METRO Cash & Carry India Private Limited

https://www.bseindia.com/xml-data/corpfiling/AttachLive/acfb1f72-efd3-4515-a583-2616d2942e78.pdf


...
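Note that pandas.io.json.json_normalize, used above, has since been deprecated; on current pandas the same function is exposed at the top level as pd.json_normalize. A small sketch of the same hyperlink step with an inline stand-in record (the ATTACHMENTNAME field matches the code above; real records come from the AnnGetData response):

```python
import pandas as pd

# Inline stand-in for jsonData['Table']; in the answer above these records
# come from the AnnGetData API response.
records = [{"ATTACHMENTNAME": "82f18673-de98-4a88-bbea-7d8499f25009.pdf"}]

df = pd.json_normalize(records)  # replaces pandas.io.json.json_normalize
df["ATTACHMENTNAME"] = (
    '=HYPERLINK("https://www.bseindia.com/xml-data/corpfiling/AttachLive/'
    + df["ATTACHMENTNAME"] + '")'
)
print(df.loc[0, "ATTACHMENTNAME"])
```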


Answered 2022-01-18
慕慕森

Contributed 1856 experience points · earned 17+ upvotes

A bit more information about your use case would have helped to answer your question better. However, to extract the total number of pages within the website, you can open the site, click on the item with the text Last, and extract the required data using the following solution:


Code block:


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--disable-extensions")
# options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.bseindia.com/corporates/ann.html")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-last ng-scope']/a[@class='ng-binding' and text()='Last']"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-page ng-scope active']/a[@class='ng-binding']"))).get_attribute("innerHTML"))

Console output:


17
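The chrome_options= and executable_path= keywords above were removed in Selenium 4, and the find_element_by_* helpers in later 4.x releases; the current form is webdriver.Chrome(options=...) with find_element(By.XPATH, ...), and Selenium Manager resolves the chromedriver binary automatically. A sketch of the same page-count read under that API; the simplified locators are assumptions about the page, not verified against it.

```python
def page_count(label):
    """Parse the active page link's innerHTML (e.g. ' 17 ') into a number."""
    return int(label.strip())

def read_total_pages(url="https://www.bseindia.com/corporates/ann.html"):
    """Click 'Last' and read the active page number, Selenium 4 style."""
    # Requires Selenium 4; imported here so page_count stays usable without it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    driver = webdriver.Chrome(options=options)  # options=, not chrome_options=
    driver.get(url)
    wait = WebDriverWait(driver, 20)
    wait.until(EC.element_to_be_clickable((
        By.XPATH, "//li[contains(@class,'pagination-last')]/a[text()='Last']"))).click()
    label = wait.until(EC.visibility_of_element_located((
        By.XPATH, "//li[contains(@class,'pagination-page') and contains(@class,'active')]/a"
    ))).get_attribute("innerHTML")
    driver.quit()
    return page_count(label)
```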


Answered 2022-01-18
• 2 answers
• 0 following
• 214 views