I'm trying to build a web scraper to get the trending stocks from the TSX page. I currently have all the trending links, and now I'm trying to scrape the information on the individual pages. Based on my code, when I try to print `quote_wrapper` in `getStockDetails()`, it returns an empty list. I suspect it's because the JavaScript hasn't rendered on the page yet? I'm not sure if that's a real thing. In any case, I tried printing all of the HTML on the page to debug, and I don't see it there either. I read that the only way to "render" the JavaScript is to use Selenium and call `browser.execute_script("return document.documentElement.outerHTML")`. That works for the index page, so I tried using it on the other pages too. I've also left a comment about this in the code. Thanks for any help.

```python
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup
from urllib2 import urlopen as uReq
import time
import random
import requests


def getTrendingQuotes(source_code):
    # grabs all the trending quotes for that day
    links = []
    page_soup = soup(source_code, "lxml")
    trendingQuotes = page_soup.findAll("div", {"id": "trendingQuotes"})
    all_trendingQuotes = trendingQuotes[0].findAll('a')
    for link in all_trendingQuotes:
        url = link.get('href')
        name = link.text
        # print(name)
        links.append(url)
    return links


def getStockDetails(url, browser):
    print(url)
    source_code = browser.execute_script(
        "return document.documentElement.outerHTML")
    # What is the correct syntax here?
    # I'm trying to get the innerHTML of the whole page in the Selenium driver.
    # It seems I can only access the JavaScript for the entire page this way.
    # source_code = browser.execute_script(
    #     "return" + url + ".documentElement.outerHTML")
    page_soup = soup(source_code, "html.parser")
    # print(page_soup)
    quote_wrapper = page_soup.findAll("div", {"class": "quoteWrapper"})
    print(quote_wrapper)


def trendingBot(browser):
    while True:
        source_code = browser.execute_script(
            "return document.documentElement.outerHTML")
        trending = getTrendingQuotes(source_code)
        for trend in trending:
            browser.get(trend)
            getStockDetails(trend, browser)
            break
            # print(trend)
```
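A sketch of a likely fix, not tested against the live site: the empty list is consistent with the `quoteWrapper` div being injected by JavaScript after the page loads, so the usual remedy is an explicit wait (`WebDriverWait` with `expected_conditions`) before grabbing the rendered HTML. Note also that `execute_script` always runs in the context of the page the browser is currently on; you cannot pass a URL into the script string, so the commented-out `"return" + url + "..."` variant cannot work. The class name `quoteWrapper` is taken from the question and assumed correct; the parsing step is split into its own helper so it can be exercised without a browser.

```python
from bs4 import BeautifulSoup


def extract_quote_wrappers(html):
    """Parse a rendered HTML string and return all div.quoteWrapper elements."""
    page = BeautifulSoup(html, "html.parser")
    return page.find_all("div", {"class": "quoteWrapper"})


def fetch_stock_details(url, browser, timeout=10):
    """Navigate to url, wait for the JS-rendered quote divs, then parse them."""
    # Selenium imports are kept local so the parsing helper above stays
    # usable (and testable) without a browser installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    browser.get(url)
    # Block until at least one div.quoteWrapper is present, or raise
    # TimeoutException after `timeout` seconds.
    WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, "quoteWrapper")))
    html = browser.execute_script("return document.documentElement.outerHTML")
    return extract_quote_wrappers(html)
```

With this shape, `trendingBot` would call `fetch_stock_details(trend, browser)` instead of navigating and parsing separately.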