
Using ProcessPoolExecutor for Web Scraping

Python

回首憶惘然 2022-01-18 15:31:59

I wrote a program that crawls a single website and scrapes certain data. I would like to speed up its execution by using ProcessPoolExecutor. However, I am having trouble understanding how to convert it from single-threaded to concurrent.

Specifically, when creating a job (via ProcessPoolExecutor.submit()), can I pass a class/object and its arguments instead of a function and its arguments? And, if so, how do I return the data from those jobs to a queue that tracks visited pages and to a structure that holds the scraped content?

I have been using this as a starting point, and I have read the Queue and concurrent.futures docs (the latter, frankly, left me somewhat overwhelmed). I have also Googled/YouTubed/SO'ed a lot, to no avail.

    from queue import Queue, Empty
    from concurrent.futures import ProcessPoolExecutor


    class Scraper:
        """
        Scrapes a single url
        """

        def __init__(self, url):
            self.url = url  # url of page to scrape
            self.internal_urls = None
            self.content = None
            self.scrape()

        def scrape(self):
            """
            Method(s) to request a page, scrape links from that page
            to other pages, and finally scrape actual content from the current page
            """
            # assume that code in this method would yield urls linked in current page
            self.internal_urls = set(scraped_urls)
            # and that code in this method would scrape a bit of actual content
            self.content = {'content1': content1, 'content2': content2, 'etc': etc}


    class CrawlManager:
        """
        Manages a multiprocess crawl and scrape of a single site
        """

        def __init__(self, seed_url):
            self.seed_url = seed_url
            self.pool = ProcessPoolExecutor(max_workers=10)
            self.processed_urls = set([])
            self.queued_urls = Queue()
            self.queued_urls.put(self.seed_url)
            self.data = {}

        def crawl(self):
            while True:
                try:
                    # get a url from the queue
                    target_url = self.queued_urls.get(timeout=60)
                    # check that the url hasn't already been processed
                    if target_url not in self.processed_urls:
                        # add url to the processed list
                        self.processed_urls.add(target_url)
                        print(f'Processing url {target_url}')
                        # passing an object to the
                        # ProcessPoolExecutor... can this be done?
                        job = self.pool.submit(Scraper, target_url)


1 Answer

白衣染霜花


For anyone who comes across this page: I was able to figure this out myself.

Following @brad-solomon's advice, I switched from ProcessPoolExecutor to ThreadPoolExecutor to manage the concurrency in this script (see his comment for more details).
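On the original question of whether a class can be submitted instead of a function: submit accepts any callable, and a class is a callable, so with a thread pool the future's result is simply the constructed instance and no pickling is involved. A minimal sketch, assuming a hypothetical PageInfo class that stands in for Scraper and a placeholder URL:

```python
from concurrent.futures import ThreadPoolExecutor


class PageInfo:
    """Hypothetical stand-in for Scraper; it does its work in __init__."""

    def __init__(self, url):
        self.url = url
        # placeholder for real scraped content
        self.content = {"length": len(url)}


with ThreadPoolExecutor(max_workers=2) as pool:
    # A class is a callable: the worker thread runs PageInfo(url)
    future = pool.submit(PageInfo, "https://example.com/")
    # result() blocks until the instance has been built, then returns it
    page = future.result()
```

Here page is an ordinary PageInfo object, so its attributes can be read directly in the calling thread.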

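With respect to the original question, the key was to use the add_done_callback method of the Future object returned by ThreadPoolExecutor.submit, combined with a modified Scraper.scrape and a new method, CrawlManager.proc_scraper_results, as shown below: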

    from queue import Queue, Empty
    from concurrent.futures import ThreadPoolExecutor


    class Scraper:
        """
        Scrapes a single url
        """

        def __init__(self, url):
            self.url = url  # url of page to scrape
            self.internal_urls = None
            self.content = None
            # scrape() is not called here: CrawlManager.crawl submits it
            # to the pool so that it runs once, in a worker thread

        def scrape(self):
            """
            Method(s) to request a page, scrape links from that page
            to other pages, and finally scrape actual content from the current page
            """
            # assume that code in this method would yield urls linked in current page
            self.internal_urls = set(scraped_urls)
            # and that code in this method would scrape a bit of actual content
            self.content = {'content1': content1, 'content2': content2, 'etc': etc}
            # these three items will be passed to the callback
            # function within a future object
            return self.internal_urls, self.url, self.content


    class CrawlManager:
        """
        Manages a multithreaded crawl and scrape of a single website
        """

        def __init__(self, seed_url):
            self.seed_url = seed_url
            self.pool = ThreadPoolExecutor(max_workers=10)
            self.processed_urls = set()
            self.queued_urls = Queue()
            self.queued_urls.put(self.seed_url)
            self.data = {}

        def proc_scraper_results(self, future):
            # get the items of interest from the future object
            # (result() is the public accessor; it re-raises if scrape failed)
            internal_urls, url, content = future.result()
            # assign scraped data/content
            self.data[url] = content
            # also add scraped links to queue if they
            # aren't already queued or already processed
            for link_url in internal_urls:
                if link_url not in self.queued_urls.queue and link_url not in self.processed_urls:
                    self.queued_urls.put(link_url)

        def crawl(self):
            while True:
                try:
                    # get a url from the queue
                    target_url = self.queued_urls.get(timeout=60)
                    # check that the url hasn't already been processed
                    if target_url not in self.processed_urls:
                        # add url to the processed list
                        self.processed_urls.add(target_url)
                        print(f'Processing url {target_url}')
                        # add a job to the ThreadPoolExecutor (note: unlike the
                        # original question, we pass a method, not a class)
                        job = self.pool.submit(Scraper(target_url).scrape)
                        # to add_done_callback we send another function, this one
                        # from CrawlManager; when it is called, it will be passed
                        # the finished `future` object
                        job.add_done_callback(self.proc_scraper_results)
                except Empty:
                    # no new url arrived within the timeout: the crawl is finished
                    print("All done.")
                    return
                except Exception as e:
                    print(e)


    if __name__ == '__main__':
        crawler = CrawlManager('www.mywebsite.com')
        crawler.crawl()

The result was a significant reduction in the program's running time.
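The submit plus add_done_callback pattern can be tried end to end without network access. The following is only a minimal sketch under stated assumptions: FAKE_SITE is a hypothetical in-memory site, the module-level scrape function stands in for Scraper.scrape, and the short queue timeout is chosen for the demo rather than taken from the code above.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty

# Hypothetical in-memory "site": each page maps to the links found on it
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


def scrape(url):
    """Stand-in for Scraper.scrape: returns (links, url, content)."""
    return set(FAKE_SITE[url]), url, {"title": f"page {url}"}


class CrawlManager:
    def __init__(self, seed_url):
        self.pool = ThreadPoolExecutor(max_workers=4)
        self.processed_urls = set()
        self.queued_urls = Queue()
        self.queued_urls.put(seed_url)
        self.data = {}

    def proc_scraper_results(self, future):
        # Called when a job finishes; result() returns what scrape returned
        internal_urls, url, content = future.result()
        self.data[url] = content
        for link in internal_urls:
            if link not in self.processed_urls:
                self.queued_urls.put(link)

    def crawl(self, timeout=0.5):
        while True:
            try:
                target_url = self.queued_urls.get(timeout=timeout)
            except Empty:
                break  # nothing new arrived within the timeout
            if target_url in self.processed_urls:
                continue  # duplicates are filtered when dequeued
            self.processed_urls.add(target_url)
            job = self.pool.submit(scrape, target_url)
            job.add_done_callback(self.proc_scraper_results)
        # wait for any job that is still running before returning
        self.pool.shutdown(wait=True)
        return self.data


data = CrawlManager("/").crawl()
print(sorted(data))
```

The callback normally runs in the worker thread that completed the job, so it shares self.data and the queue with the main loop. Queue is thread-safe; for the plain dict and set, a real crawler may want a threading.Lock around the updates.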

2022-01-18
