3 Answers

Answerer: 1848 experience points, 10+ upvotes
A slightly modified approach should get you everything you need from that site without any problems. All you have to do is store the target links as a list inside the get_links() method and use return or yield when issuing the callback to get_inner_content(). You can also disable images to make the script a bit faster.
The following attempt should give you all the results:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.crawler import CrawlerProcess

class FortuneSpider(scrapy.Spider):
    name = 'fortune'
    url = 'http://fortune.com/fortune500/list/'

    def start_requests(self):
        option = webdriver.ChromeOptions()
        chrome_prefs = {}
        option.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        self.driver = webdriver.Chrome(options=option)
        self.wait = WebDriverWait(self.driver, 10)
        yield scrapy.Request(self.url, callback=self.get_links)

    def get_links(self, response):
        self.driver.get(response.url)
        item_links = [item.get_attribute("href") for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"] a[class*="searchResults__cellWrapper--"]')))]
        return [scrapy.Request(link, callback=self.get_inner_content) for link in item_links]

    def get_inner_content(self, response):
        self.driver.get(response.url)
        chief_executive = self.wait.until(EC.presence_of_element_located((By.XPATH, '//tr[td[.="CEO"]]//td[contains(@class,"dataTable__value--")]/div'))).text
        yield {'CEO': chief_executive}

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FortuneSpider)
    process.start()
Or, using yield:
    def get_links(self, response):
        self.driver.get(response.url)
        item_links = [item.get_attribute("href") for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"] a[class*="searchResults__cellWrapper--"]')))]
        for link in item_links:
            yield scrapy.Request(link, callback=self.get_inner_content)
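One detail worth noting with either variant: the shared WebDriver is never quit, so a Chrome process stays behind after the crawl. Scrapy calls a spider's closed(reason) method when the spider finishes, so a cleanup hook can be added there. A minimal sketch (the scrapy.Spider base class is omitted so the fragment stands alone; in the real spider this method just joins the others):

```python
class FortuneSpider:
    # start_requests / get_links / get_inner_content as shown above ...

    def closed(self, reason):
        # Scrapy invokes closed(reason) once the crawl finishes;
        # quitting here makes sure no Chrome process is left behind.
        driver = getattr(self, "driver", None)
        if driver is not None:
            driver.quit()
```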

Answerer: 1863 experience points, 2+ upvotes
To parse the names of the CEOs of the different companies from the webpage https://fortune.com/fortune500/search/, Selenium on its own is sufficient. You need to:
Scroll down to the last item on the webpage.
Collect the href attributes and store them in a list.
Switch focus to each newly opened tab and induce WebDriverWait for visibility_of_element_located().
You can use the following Locator Strategies:
Code block:
# -*- coding: UTF-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://fortune.com/fortune500/search/")
driver.execute_script("arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Explore Lists from Other Years']"))))
my_hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[starts-with(@class, 'searchResults__cellWrapper--') and contains(@href, 'fortune500')][.//span/div]")))]
windows_before = driver.current_window_handle
for my_href in my_hrefs:
    driver.execute_script("window.open('" + my_href + "');")
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
    windows_after = driver.window_handles
    new_window = [x for x in windows_after if x != windows_before][0]
    driver.switch_to.window(new_window)
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table/tbody/tr//td[starts-with(@class, 'dataTable__value')]/div"))).text)
    driver.close()  # close the child window
    driver.switch_to.window(windows_before)  # switch back to the parent window handle
driver.quit()
Console output:
C. Douglas McMillon
Darren W. Woods
Timothy D. Cook
Warren E. Buffett
Jeffrey P. Bezos
David S. Wichmann
Brian S. Tyler
Larry J. Merlo
Randall L. Stephenson
Steven H. Collis
Michael K. Wirth
James P. Hackett
Mary T. Barra
W. Craig Jelinek
Larry Page
Michael C. Kaufmann
Stefano Pessina
James Dimon
Hans E. Vestberg
W. Rodney McMullen
H. Lawrence Culp Jr.
Hugh R. Frater
Greg C. Garland
Joseph W. Gorder
Brian T. Moynihan
Satya Nadella
Craig A. Menear
Dennis A. Muilenburg
C. Allen Parker
Michael L. Corbat
Gary R. Heminger
Brian L. Roberts
Gail K. Boudreaux
Michael S. Dell
Marc Doyle
Michael L. Tipsord
Alex Gorsky
Virginia M. Rometty
Brian C. Cornell
Donald H. Layton
David P. Abney
Marvin R. Ellison
Robert H. Swan
Michel A. Khalaf
David S. Taylor
Gregory J. Hayes
Frederick W. Smith
Ramon L. Laguarta
Juan R. Luciano
...

Answerer: 1824 experience points, 5+ upvotes
Here is how you can get the company details much faster and more easily without using Selenium at all.
See how I get company_name and change_the_world; the other details can be extracted the same way.
import requests
from bs4 import BeautifulSoup
import re
import html

with requests.Session() as session:
    response = session.get("https://content.fortune.com/wp-json/irving/v1/data/franchise-search-results?list_id=2611932")
    items = response.json()[1]["items"]
    for item in items:
        company_name = html.unescape(list(filter(lambda x: x['key'] == 'name', item["fields"]))[0]["value"])
        change_the_world = list(filter(lambda x: x['key'] == 'change-the-world-y-n', item["fields"]))[0]["value"]
        response = session.get(item["permalink"])
        preload_data = BeautifulSoup(response.text, "html.parser").select_one("#preload").text
        ceo = re.search('"ceo","value":"(.*?)"', preload_data).groups()[0]
        print(f"Company: {company_name}, CEO: {ceo}, Change The World: {change_the_world}")
Results:
Company: Carvana, CEO: Ernest C. Garcia, Change The World: no
Company: ManTech International, CEO: Kevin M. Phillips, Change The World: no
Company: NuStar Energy, CEO: Bradley C. Barron, Change The World: no
Company: Shutterfly, CEO: Ryan O'Hara, Change The World: no
Company: Spire, CEO: Suzanne Sitherwood, Change The World: no
Company: Align Technology, CEO: Joseph M. Hogan, Change The World: no
Company: Herc Holdings, CEO: Lawrence H. Silber, Change The World: no
...