

Unable to scrape Google Images with Selenium

慕標5832272 2022-10-06 19:54:52
I have the following script, and I want it to scrape Google Images. It first clicks an image and then clicks the next (>) button to switch to the next image. It downloads the first image, but when it gets to the second image it throws an error.

Traceback (most recent call last):
  File "c:/Users/intel/Desktop/Scrappr/image_scrape.pyw", line 40, in <module>
    attribute_value = WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CLASS_NAME, 'n3VNCb'))).get_attribute("src")
  File "C:\Users\intel\AppData\Local\Programs\Python\Python38\lib\site-packages\selenium\webdriver\support\wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

My code:

import requests
import shutil
import time
import urllib
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as Soup
from selenium.webdriver.chrome.options import Options
from selenium import webdriver

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/80.0.3987.132 Safari/537.36'
options = Options()
#options.add_argument("--headless")
options.add_argument(f'user-agent={user_agent}')
options.add_argument("--disable-web-security")
options.add_argument("--allow-running-insecure-content")
options.add_argument("--allow-cross-origin-auth-prompt")

driver = webdriver.Chrome(executable_path=r"C:\Users\intel\Downloads\setups\chromedriver.exe", options=options)
driver.get("https://www.google.com/search?q=mac+beautiful+ui&tbm=isch&ved=2ahUKEwiL3ILMveToAhWGCHIKHVPNAScQ2-cCegQIABAA&oq=mac+beautiful+ui&gs_lcp=CgNpbWcQAzoECAAQQzoCCAA6BQgAEIMBOgYIABAFEB46BggAEAgQHlDPI1iEUWCgU2gAcAB4AIAByAKIAd8dkgEHMC40LjkuM5gBAKABAaoBC2d3cy13aXotaW1n&sclient=img&ei=Q9-TXsuuMoaRyAPTmoe4Ag&bih=657&biw=1360")
driver.find_element_by_class_name("rg_i").click()

2 Answers

慕蓋茨4494581


I have tidied up and refactored some of the code. The end result can scrape n images for the keywords you choose (see SEARCH_TERMS):


import base64
import os
import requests
import time

from io import BytesIO
from PIL import Image
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium import webdriver


CHROME_DRIVER_LOCATION = r'C:\Users\intel\Downloads\setups\chromedriver.exe'
SEARCH_TERMS = ['very', 'hot', 'chicks']
TARGET_SAVE_LOCATION = os.path.join(r'c:\test', '_'.join([x.capitalize() for x in SEARCH_TERMS]),  r'{}.{}')
if not os.path.isdir(os.path.dirname(TARGET_SAVE_LOCATION)):
    os.makedirs(os.path.dirname(TARGET_SAVE_LOCATION))


def check_if_result_b64(source):
    possible_header = source.split(',')[0]
    if possible_header.startswith('data') and ';base64' in possible_header:
        image_type = possible_header.replace('data:image/', '').replace(';base64', '')
        return image_type
    return False


def get_driver():

    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) ' \
                 'Chrome/80.0.3987.132 Safari/537.36'
    options = Options()
    #options.add_argument("--headless")
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument("--disable-web-security")
    options.add_argument("--allow-running-insecure-content")
    options.add_argument("--allow-cross-origin-auth-prompt")

    new_driver = webdriver.Chrome(executable_path=CHROME_DRIVER_LOCATION, options=options)
    new_driver.get(f"https://www.google.com/search?q={'+'.join(SEARCH_TERMS)}&source=lnms&tbm=isch&sa=X")
    return new_driver




driver = get_driver()

first_search_result = driver.find_elements_by_xpath('//a/div/img')[0]
first_search_result.click()

right_panel_base = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, f'''//*[@data-query="{' '.join(SEARCH_TERMS)}"]''')))
first_image = right_panel_base.find_elements_by_xpath('//*[@data-noaft="1"]')[0]
magic_class = first_image.get_attribute('class')
image_finder_xp = f'//*[@class="{magic_class}"]'

# initial wait for the first image to be loaded
# this part could be improved but I couldn't find a proper way of doing it
time.sleep(3)

# initial thumbnail for "to_be_loaded image"
thumbnail_src = driver.find_elements_by_xpath(image_finder_xp)[-1].get_attribute("src")


for i in range(10):

    # issue 4: All image elements share the same class. Assuming that you always click "next":
    # the last element is the base64-encoded thumbnail of the "next image",
    # and the [-2] element is the one currently displayed
    target = driver.find_elements_by_xpath(image_finder_xp)[-2]

    # you need to wait until the image is completely loaded:
    # first the base64-encoded thumbnail is displayed,
    # so we check whether the displayed element's src matches the cached thumbnail src.
    # However, sometimes the final result is the base64 content, so the wait is capped
    # at 5 seconds.
    wait_time_start = time.time()
    while (target.get_attribute("src") == thumbnail_src) and time.time() < wait_time_start + 5:
        time.sleep(0.2)
    thumbnail_src = driver.find_elements_by_xpath(image_finder_xp)[-1].get_attribute("src")
    attribute_value = target.get_attribute("src")
    print(attribute_value)

    # issue 1: if the image is base64, requests.get won't work because the src is not a URL
    is_b64 = check_if_result_b64(attribute_value)
    if is_b64:
        image_format = is_b64
        # split on ';base64,' so the payload does not start with a stray comma
        content = base64.b64decode(attribute_value.split(';base64,')[1])
    else:
        resp = requests.get(attribute_value, stream=True)
        temp_for_image_extension = BytesIO(resp.content)
        image = Image.open(temp_for_image_extension)
        image_format = image.format
        content = resp.content
    # issue 2: if you 'open' a file, later you have to close it. Use a "with" pattern instead
    with open(TARGET_SAVE_LOCATION.format(i, image_format), 'wb') as f:
        f.write(content)
    # issue 3: this XPath is bad """//*[@id="Sva75c"]/div/div/div[3]/div[2]/div/div[1]/div[1]/div/div[1]/a[2]/div""" if the page layout changes, this path breaks instantly
    svg_arrows_xpath = '//div[@jscontroller]//a[contains(@jsaction, "click:trigger")]//*[@viewBox="0 0 24 24"]'
    next_arrow = driver.find_elements_by_xpath(svg_arrows_xpath)[-3]
    next_arrow.click()
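
The base64 branch (issue 1) can also be tried on its own, without a browser. The following is only a minimal sketch: `decode_data_uri` is a hypothetical helper name, not part of the script above, and it assumes the src has the form data:image/<type>;base64,<payload>.

```python
import base64


def decode_data_uri(src):
    """Return (image_type, raw_bytes) for a base64 image data URI, else None.

    Hypothetical helper for illustration; assumes the form
    'data:image/<type>;base64,<payload>'.
    """
    header, sep, payload = src.partition(',')
    if not sep:
        # No comma at all, so this cannot be a data URI with a payload.
        return None
    if not (header.startswith('data:image/') and header.endswith(';base64')):
        return None
    image_type = header[len('data:image/'):-len(';base64')]
    return image_type, base64.b64decode(payload)


# A tiny made-up payload (the bytes b"hello"), not a real image.
print(decode_data_uri('data:image/png;base64,aGVsbG8='))
# An ordinary URL is not a data URI, so the helper returns None.
print(decode_data_uri('https://example.com/picture.jpg'))
```

Returning None for ordinary URLs lets the caller fall back to the requests.get branch, in the same way the script uses the result of check_if_result_b64.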


Answered 2022-10-06
米脂


Disclaimer: I doubt that Google allows scraping of its search results. You should check https://www.google.com/robots.txt to find out.
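
One way to run that check from Python is the standard library's urllib.robotparser. The sketch below parses a small sample rule set offline; the rules in SAMPLE_ROBOTS_TXT are made up for illustration and are not a copy of Google's actual robots.txt, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample rules for illustration only; these are NOT
# Google's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.google.com/search?q=mac+beautiful+ui&tbm=isch"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.google.com/imghp"))
```

To check the live file instead of a sample string, RobotFileParser also provides set_url() and read(), which download and parse the file at the given URL.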

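That said, I think there is a problem with your WebDriverWait approach, although I'm not sure exactly what it is. Since you already make the driver wait with time.sleep before that point, I simply tried finding the element directly, and it worked: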

i = 0
while i < 10:
    i += 1
    time.sleep(5)
    attribute_value = driver.find_element_by_css_selector("img.n3VNCb").get_attribute("src") # NEW LINE
    print(attribute_value)
    resp = requests.get(attribute_value, stream=True)
    resp.raw.decode_content = True
    # use a "with" block so the file is closed after each download
    with open(r'C:/users/intel/desktop/local_image'+ str(i) + '.jpg', 'wb') as local_file:
        shutil.copyfileobj(resp.raw, local_file)
    del resp
    driver.find_element_by_xpath("""//*[@id="Sva75c"]/div/div/div[3]/div[2]/div/div[1]/div[1]/div/div[1]/a[2]/div""").click()
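
The fixed time.sleep(5) always waits the full five seconds. A possible alternative is to poll until a condition holds, with a cap on the total wait. The sketch below is not part of the answer's code: `wait_until` is a hypothetical helper, and `fake_src` is a stand-in for a real check such as reading the src attribute through the driver.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.2):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or None if `timeout` seconds pass first.
    Hypothetical helper for illustration.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for a browser check: becomes ready on the third call.
calls = []


def fake_src():
    calls.append(1)
    return "https://example.com/image.jpg" if len(calls) >= 3 else None


print(wait_until(fake_src, timeout=1.0, interval=0.01))
```

In the loop above, the predicate could wrap the existing driver.find_element_by_css_selector("img.n3VNCb").get_attribute("src") call, so the loop continues as soon as a src is available instead of always sleeping for the full interval.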



Answered 2022-10-06