If you want to customize the starting requests, you can also override the `start_requests` method of the `scrapy.Spider` class; we won't go into detail here.
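For reference, here is a minimal sketch of what such an override might look like; the spider name and URLs are illustrative, not prescriptive:

```python
import scrapy

class HuabanSpider(scrapy.Spider):
    name = 'huaban'

    def start_requests(self):
        # Illustrative seed URLs; replace with the pages you actually want.
        urls = ['http://huaban.com/', 'http://huaban.com/discovery/']
        for url in urls:
            # Each yielded Request is scheduled by the engine; parse() is
            # the default callback unless another one is specified.
            yield scrapy.Request(url=url, callback=self.parse)
```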
The `parse` function is the default callback: after the downloader fetches a page, Scrapy calls it to parse the result, and `response` holds the response data for that request. For parsing page content, Scrapy ships with several built-in selectors (`Selector`), covering XPath expressions, CSS selectors, and regular-expression matching. Below are some usage examples to give you a more intuitive feel for how selectors work.
```python
# xpath selector
response.xpath('//a')
response.xpath('./img').extract()
response.xpath('//*[@id="huaban"]').extract_first()
response.xpath('//*[@id="Profile"]/div[1]/a[2]/text()').extract_first()

# css selector
response.css('a').extract()
response.css('#Profile > div.profile-basic').extract_first()
response.css('a[href="test.html"]::text').extract_first()

# re selector
response.xpath('.').re('id:\s*(\d+)')
response.xpath('//a/text()').re_first('username: \s(.*)')
```
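For context, here is a minimal sketch of how selectors like these typically sit inside a `parse` callback; the item fields and XPath expressions are illustrative rather than taken from a real page:

```python
def parse(self, response):
    # Each xpath()/css() call returns a SelectorList; values only become
    # plain strings once .extract() or .extract_first() is called.
    for link in response.xpath('//a'):
        yield {
            'text': link.xpath('./text()').extract_first(),
            'href': link.xpath('./@href').extract_first(),
        }
```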
Note that `response` cannot call `re` or `re_first` directly; a regular expression can only be applied to the result of an `xpath()` or `css()` query, as shown above.
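To make the distinction concrete, a quick sketch (the patterns here are illustrative):

```python
# response.re(r'id:\s*(\d+)')                  # AttributeError: Response has no .re
response.xpath('//p/text()').re(r'id:\s*(\d+)')         # OK: regex on an xpath result
response.css('a::text').re_first(r'username:\s*(\w+)')  # OK: chained after css()
```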
scrapy crawl
Once the spider is written, you can launch the crawl with the `scrapy crawl` command.
Inside a created Scrapy project directory, running `scrapy -h` prints more help than it does outside a project, including the `scrapy crawl` command used to launch crawl jobs:
```
$ scrapy -h
Scrapy 1.5.0 - project: huaban

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
```
```
$ scrapy crawl -h
Usage
=====
  scrapy crawl [options] <spider>

Run a spider

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  dump scraped items into FILE (use - for stdout)
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items with -o

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
```
As the help information shows, `scrapy crawl` accepts many optional arguments, but only one is required: `spider`, the name of the spider to run, which corresponds to each spider's `name` attribute. For example:
```
$ scrapy crawl huaban
```
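The `-a` option listed in the help passes arguments through to the spider. As a sketch, a spider can receive them via its constructor; the argument name `board` and the URL pattern below are made up for illustration:

```python
import scrapy

class HuabanSpider(scrapy.Spider):
    name = 'huaban'

    def __init__(self, board=None, *args, **kwargs):
        # Invoked as: scrapy crawl huaban -a board=photography
        super().__init__(*args, **kwargs)
        self.start_urls = ['http://huaban.com/boards/%s/' % board]
```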
That completes the walkthrough of creating and running a Scrapy crawl job; concrete examples will appear in later posts.
scrapy shell
Finally, a brief note on the `scrapy shell` command. It opens an interactive shell, similar to a command-line Python interpreter. When you are just learning Scrapy, or starting to crawl an unfamiliar site, it is a handy place to get familiar with the various functions and selectors, letting you experiment and correct mistakes quickly until you have Scrapy's features well in hand.
```
$ scrapy shell www.huaban.com
2018-05-29 23:58:49 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-05-29 23:58:49 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 17:26:49) [MSC v.1900 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-05-29 23:58:49 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-05-29 23:58:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-29 23:58:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-29 23:58:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-29 23:58:50 [scrapy.core.engine] INFO: Spider opened
2018-05-29 23:58:50 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://huaban.com/> from <GET http://www.huaban.com>
2018-05-29 23:58:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://huaban.com/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x03385CB0>
[s]   item       {}
[s]   request    <GET http://www.huaban.com>
[s]   response   <200 http://huaban.com/>
[s]   settings   <scrapy.settings.Settings object at 0x04CC4D10>
[s]   spider     <DefaultSpider 'default' at 0x4fa6bf0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

In [1]: view(response)
Out[1]: True

In [2]: response.xpath('//a')
Out[2]:
[<Selector xpath='//a' data='<a id="elevator" class="off" onclick="re'>,
 <Selector xpath='//a' data='<a class="plus"></a>'>,
 <Selector xpath='//a' data='<a onclick="app.showUploadDialog();">添加采'>,
 <Selector xpath='//a' data='<a class="add-board-item">添加画板<i class="'>,
 <Selector xpath='//a' data='<a href="/about/goodies/">安装采集工具<i class'>,
 <Selector xpath='//a' data='<a class="huaban_security_oauth" logo_si'>]

In [3]: response.xpath('//a').extract()
Out[3]:
['<a id="elevator" class="off" onclick="return false;" title="回到顶部"></a>',
 '<a class="plus"></a>',
 '<a onclick="app.showUploadDialog();">添加采集<i class="upload"></i></a>',
 '<a class="add-board-item">添加画板<i class="add-board"></i></a>',
 '<a href="/about/goodies/">安装采集工具<i class="goodies"></i></a>',
 '<a class="huaban_security_oauth" logo_size="124x47" logo_type="realname" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB/AAffA0nNPuCLAAAAAElFTkSuQmCC" data-original="http://static.anquan.org/static/outer/js/aq_auth.js"></script> </a>']

In [4]: response.xpath('//img')
Out[4]: [<Selector xpath='//img' data='<img class="lazyload" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB/AAffA0nNPuCLAAAAAElFTkSuQmCC" data-original="https://d5nxst8fruw4z.cloudfro'>]

In [5]: response.xpath('//a/text()')
Out[5]:
[<Selector xpath='//a/text()' data='添加采集'>,
 <Selector xpath='//a/text()' data='添加画板'>,
 <Selector xpath='//a/text()' data='安装采集工具'>,
 <Selector xpath='//a/text()' data=' '>,
 <Selector xpath='//a/text()' data=' '>]

In [6]: response.xpath('//a/text()').extract()
Out[6]: ['添加采集', '添加画板', '安装采集工具', ' ', ' ']

In [7]: response.xpath('//a/text()').extract_first()
Out[7]: '添加采集'
```
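If you want to run the same experiments outside the Scrapy shell, the selector API is also available standalone through parsel, the library Scrapy's selectors are built on. A minimal sketch, with a made-up HTML snippet:

```python
from parsel import Selector

# Illustrative markup, not from a real site.
html = '<a id="top" href="/index.html">home</a><a class="plus"></a>'
sel = Selector(text=html)

sel.xpath('//a/text()').extract_first()           # 'home'
sel.css('a.plus').extract()                       # ['<a class="plus"></a>']
sel.xpath('//a/@href').re_first(r'/(\w+)\.html')  # 'index'
```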
Author: litreily
Link: https://juejin.im/post/5b0d7b8b5188253bdb1b3c90
Source: Juejin (掘金)
Copyright belongs to the author. For commercial reproduction, please contact the author for authorization; for non-commercial reproduction, please credit the source.