I set up a handler on the spider_idle signal to feed the spider another batch of URLs. This seems to work fine at first, but then the Crawled (200)... messages appear less and less often and eventually stop altogether. I have 115 test URLs to hand out, yet Scrapy reports Crawled 38 pages.... Below is the spider code and the scrapy log. In general, I am implementing a two-stage crawl: the first pass only collects the URLs into a urls.jl file, and the second pass performs the actual scraping of those URLs. I am now working on the code for the second spider.

```python
import json
import scrapy
import logging
from scrapy import signals
from scrapy.http.request import Request
from scrapy.exceptions import DontCloseSpider


class A2ndexample_comSpider(scrapy.Spider):
    name = '2nd_example_com'
    allowed_domains = ['www.example.com']

    def parse(self, response):
        pass

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle_consume, signals.spider_idle)
        return spider

    def __init__(self, crawler):
        self.crawler = crawler
        # read from file
        self.urls = []
        with open('urls.jl', 'r') as f:
            for line in f:
                self.urls.append(json.loads(line))
        # How many urls to return from start_requests()
        self.batch_size = 5

    def start_requests(self):
        for i in range(self.batch_size):
            if 0 == len(self.urls):
                return
            url = self.urls.pop(0)
            yield Request(url["URL"])

    def idle_consume(self):
        # Every time the spider is about to close, check our urls
        # buffer to see if we have something left to crawl
        reqs = self.start_requests()
        if not reqs:
            return
        logging.info('Consuming batch... [left: %d])' % len(self.urls))
        for req in reqs:
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider
```

I expected the spider to crawl all 115 URLs, not just 38. Besides, if it does not want to crawl any more and the signal-handler function did not raise DontCloseSpider, shouldn't it at least close then?
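For comparison only, here is a minimal sketch of the same idle handler with the batch materialized into a list before it is tested; it is not claimed to be the fix for the 38-page behaviour. It is meant to drop into the A2ndexample_comSpider class above (which already imports logging and DontCloseSpider), and the only change is the list() call: start_requests() returns a generator object, which is truthy even when it will yield nothing, so the original `if not reqs:` guard can never take the early-return path.

```python
    def idle_consume(self):
        # Sketch only: force the generator into a list so an exhausted
        # URL buffer can actually be detected as an empty (falsy) list.
        reqs = list(self.start_requests())
        if not reqs:
            return  # nothing left: let spider_idle close the spider normally
        logging.info('Consuming batch... [left: %d]' % len(self.urls))
        for req in reqs:
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider
```

Whether scheduling via engine.schedule() and then raising DontCloseSpider is the right way to re-feed the engine is exactly what the question is about, so the sketch deliberately leaves that part unchanged.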
Why does my spider_idle / on-demand URL feeding seem to shut down gradually?
慕碼人8056858
2021-10-26 15:45:42