
Selenium: scraping according to the number of pages in each category of a website

Python

繁花如伊 2023-09-12 19:53:35

I am scraping this website: http://www.legorafi.fr/. It works for each category (politics, etc.), but for every category I loop over the same number of pages. I would like to scrape all the pages according to the number of pages each category actually has on the site. This is what I do to loop through the pages:

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import newspaper
import requests
from newspaper.utils import BeautifulSoup
from newspaper import Article

#categories = ['france/politique','france/societe', 'monde-libre', 'france/economie/', 'culture', 'people', 'sports', 'hi-tech', 'sciences']
papers = []
driver = webdriver.Chrome(executable_path="/Users/name/Downloads/chromedriver 4")
#driver.get('http://www.legorafi.fr/')

for category in categories:
    url = 'http://www.legorafi.fr/category/' + category
    #WebDriverWait(self.driver, 10)
    driver.get(url)
    Foo()
    time.sleep(2)
    pagesToGet = 120

pagesToGet = 120
title = []
content = []
for page in range(1, pagesToGet+1):
    print('Processing page :', page)
    #url = 'http://www.legorafi.fr/category/france/politique/page/'+str(page)
    print(driver.current_url)
    #print(url)
    raw_html = requests.get(url)
    soup = BeautifulSoup(raw_html.text, 'html.parser')
    for articles_tags in soup.findAll('div', {'class': 'articles'}):
        for article_href in articles_tags.find_all('a', href=True):
            if not str(article_href['href']).endswith('#commentaires'):
                urls_set.add(article_href['href'])
                papers.append(article_href['href'])

I want to loop over all of these categories, using the number of pages that each category has.

categories = ['france/politique','france/societe', 'monde-libre', 'france/economie/', 'culture', 'people', 'sports', 'hi-tech', 'sciences']

How can I do this?


1 Answer

慕斯709654


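The code below iterates over all the categories and extracts the data. Instead of a fixed page count, it keeps following the "Suivant" (next) link until a category has no further pages. It certainly needs more testing and more robust error handling.

P.S. Good luck with this coding project.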

import time
from random import randint

import requests
from bs4 import BeautifulSoup
from newspaper import Article
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--test-type")
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--disable-extensions')
chrome_options.add_argument('disable-infobars')
chrome_options.add_argument("--incognito")
# chrome_options.add_argument('--headless')
# window size as an argument is required in headless mode
# chrome_options.add_argument('window-size=1920x1080')

# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'),
                          options=chrome_options)

papers = []
urls_set = set()


def get_articles():
    """Harvest every page of the category currently loaded in the driver."""
    while True:
        # Fetch the page the browser is on now, not the category's first page
        raw_html = requests.get(driver.current_url)
        soup = BeautifulSoup(raw_html.text, 'html.parser')
        for articles_tags in soup.find_all('div', {'class': 'articles'}):
            for article_href in articles_tags.find_all('a', href=True):
                article_url = str(article_href['href'])
                # Skip comment anchors and articles that were already collected
                if article_url.endswith('#commentaires') or article_url in urls_set:
                    continue
                urls_set.add(article_url)
                article = Article(article_url)
                article.download()
                article.parse()
                # publish_date is None when newspaper cannot detect a date
                publish_date = (article.publish_date.strftime('%Y-%m-%d')
                                if article.publish_date else None)
                papers.append({
                    'url': article_url,
                    'title': article.title,
                    'publish_date': publish_date,
                    'text': article.text.replace('\n', ''),
                })

        # Look for the "Suivant" link only after the current page has been
        # harvested, so that the last page of a category is not skipped
        try:
            next_link = driver.find_element(By.LINK_TEXT, "Suivant")
        except NoSuchElementException:
            return
        driver.execute_script("arguments[0].scrollIntoView(true);", next_link)
        next_link.click()
        # Initiates a random wait to prevent the
        # harvesting operation from starting before
        # the page has completely loaded
        time.sleep(randint(2, 4))


legorafi_urls = {'monde-libre': 'http://www.legorafi.fr/category/monde-libre',
                 'politique': 'http://www.legorafi.fr/category/france/politique',
                 'societe': 'http://www.legorafi.fr/category/france/societe',
                 'economie': 'http://www.legorafi.fr/category/france/economie',
                 'culture': 'http://www.legorafi.fr/category/culture',
                 'people': 'http://www.legorafi.fr/category/people',
                 'sports': 'http://www.legorafi.fr/category/sports',
                 'hi-tech': 'http://www.legorafi.fr/category/hi-tech',
                 'sciences': 'http://www.legorafi.fr/category/sciences',
                 'ledito': 'http://www.legorafi.fr/category/ledito/'
                 }

driver.implicitly_wait(30)
try:
    for category, url in legorafi_urls.items():
        driver.get(url)
        get_articles()
finally:
    # Close the browser once every category has been processed
    driver.quit()
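The "follow the next link until there is none" logic can also be separated from the browser, which makes it easier to test. The following is only a minimal sketch under assumptions: `crawl_category`, `fetch_page`, and `max_pages` are hypothetical names that do not appear in the code above. `fetch_page` stands for any callable that returns the article links found on a page together with the URL of the next page, or `None` when there is no "Suivant" link.

```python
from typing import Callable, List, Optional, Tuple

# A page is modelled as (article_links, next_page_url_or_None).
Page = Tuple[List[str], Optional[str]]


def crawl_category(start_url: str,
                   fetch_page: Callable[[str], Page],
                   max_pages: int = 1000) -> List[str]:
    """Collect the article links from every page of one category.

    fetch_page is a hypothetical, caller-supplied callable; it is not part
    of Selenium, requests, or newspaper.
    """
    links: List[str] = []
    seen_pages = set()
    url: Optional[str] = start_url
    # Stop when there is no next page, when a page repeats (pagination
    # loop), or when the safety limit max_pages is reached.
    while url is not None and url not in seen_pages and len(seen_pages) < max_pages:
        seen_pages.add(url)
        page_links, next_url = fetch_page(url)
        # Harvest the current page before moving on, so the last page
        # of the category is included.
        for link in page_links:
            if not link.endswith('#commentaires') and link not in links:
                links.append(link)
        url = next_url
    return links


if __name__ == '__main__':
    # Fake two-page category used only to demonstrate the traversal.
    fake_site = {
        'cat': (['a1', 'a1#commentaires', 'a2'], 'cat/page/2'),
        'cat/page/2': (['a2', 'a3'], None),
    }
    print(crawl_category('cat', fake_site.__getitem__))
```

In a real run, `fetch_page` could wrap `driver.get(url)` plus the BeautifulSoup parsing from the answer, returning the `href` of the "Suivant" link when it exists. The number of pages per category then never has to be known in advance.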

Answered 2023-09-12
