Python Crawler in Practice: Scraping Tubatu (土巴兔) with the Scrapy Framework (Part 5)

Tags: Python

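In the previous article, Python Crawler in Practice: Scraping Tubatu with the Scrapy Framework (Part 4), we defined the concrete crawl rules for the project. The next step is to process the scraped information further and persist it.

This article covers how to use an Item Pipeline in Scrapy to process the Items collected by the Spider.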

1. Downloading images

Each Item collected by the Spider carries the URL of an image. What remains is to download that image and save it.

# ImageService and DesignPictureService are helper classes from this project.
class DesignPicturePipeline(object):
    def __init__(self):
        self.design_picture_service = DesignPictureService()

    def process_item(self, item, spider):
        img_url = item['img_url']
        # Generate the file name
        img_name = ImageService.generate_name(img_url)
        # Build the file path
        file_path = ImageService.file_path(img_name)
        # Build the thumbnail path
        thumb_path = ImageService.thumb_path(img_name)
        # Download the image and save it
        ImageService.download_img(img_url, file_path)
        # Save the thumbnail
        ImageService.save_thumbnail(file_path, thumb_path)
        item['img_name'] = img_name
        # Save to MongoDB
        self.design_picture_service.handle_item(item)
        # Scrapy expects process_item to return the item (or raise DropItem)
        return item
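
The article does not show ImageService.generate_name. Judging from the img_name value in the MongoDB sample later in this article (/tubatu/<date>/<32 hex characters>), it may combine the crawl date with a hash of the URL. The following is a minimal sketch under that assumption; the use of MD5, the hard-coded "/tubatu" prefix, and the `today` parameter are guesses made for illustration, not the project's actual code.

```python
import hashlib
from datetime import date
from typing import Optional


def generate_name(img_url: str, today: Optional[date] = None) -> str:
    """Build a name such as /tubatu/2017-03-17/<md5 hex digest of the URL>."""
    # Use the supplied date (handy for tests) or fall back to the current date
    day = (today or date.today()).isoformat()
    # Hashing the URL gives the same name every time the same image is seen
    digest = hashlib.md5(img_url.encode("utf-8")).hexdigest()
    return "/tubatu/%s/%s" % (day, digest)
```

Because the name depends only on the URL and the date, downloading the same image twice on one day would overwrite the file rather than create a duplicate.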

To download the image we use requests, an HTTP client library.

import requests

# config and proxy_pool are modules from this project.


def download_img(img_url, file_path):
    proxies = None
    proxy = ''
    if config.USE_PROXY:
        proxy = proxy_pool.random_choice_proxy()
        proxies = {
            'http': "http://%s" % proxy,
        }
    try:
        # stream=True lets us write the body to disk chunk by chunk
        response = requests.get(img_url, stream=True, proxies=proxies)
        if response.status_code == 200:
            with open(file_path, 'wb') as f:
                for chunk in response.iter_content(1024):
                    f.write(chunk)
        else:
            if config.USE_PROXY:
                proxy_pool.add_failed_time(proxy)
    except (requests.RequestException, OSError):
        # Network or file errors count as a failure of the proxy in use
        if config.USE_PROXY:
            proxy_pool.add_failed_time(proxy)
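
The proxy pool itself lives elsewhere in the project and is not shown here. To make the two calls used above concrete, here is a hypothetical in-memory stand-in. Only the method names random_choice_proxy and add_failed_time come from the article; the class name, the max_failures threshold, and the rule of dropping a proxy after repeated failures are assumptions.

```python
import random


class SimpleProxyPool:
    """Hypothetical in-memory stand-in for the project's proxy_pool."""

    def __init__(self, proxies, max_failures=3):
        # Map each "host:port" string to its failure count
        self._failures = {proxy: 0 for proxy in proxies}
        self._max_failures = max_failures

    def random_choice_proxy(self):
        """Return a random live proxy, or '' when none are left."""
        if not self._failures:
            return ''
        return random.choice(list(self._failures))

    def add_failed_time(self, proxy):
        """Record one failure and drop the proxy once it fails too often."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            del self._failures[proxy]
```

Counting failures instead of removing a proxy on the first error tolerates occasional timeouts while still retiring proxies that are consistently dead.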

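Thumbnails are created with Pillow, an open-source image-processing library. The image is first scaled down (thumbnail() keeps the aspect ratio; it does not crop) and then saved to the given path.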

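from PIL import Image

IMAGE_SIZE = 500, 500


def save_thumbnail(file_path, thumb_path):
    if thumb_path is None:
        return
    # The context manager closes the image file when we are done
    with Image.open(file_path) as image:
        # thumbnail() shrinks the image in place, preserving the aspect ratio
        image.thumbnail(IMAGE_SIZE)
        image.save(thumb_path)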
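
To see what a 500 x 500 bound means for a non-square picture, the scaling arithmetic can be sketched in plain Python. This is only an approximation for illustration: fit_within is a hypothetical helper, and Pillow's own rounding may differ by a pixel depending on the version.

```python
def fit_within(size, box):
    """Return `size` scaled down to fit inside `box`, keeping the aspect ratio."""
    width, height = size
    max_width, max_height = box
    # Never enlarge: a scale factor above 1 is capped at 1
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# The sample image in this article is 780 x 458 pixels
print(fit_within((780, 458), (500, 500)))  # (500, 294)
```

The longer side is reduced to the bound and the shorter side follows proportionally, so nothing is cut off.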

Scrapy also provides …

Format of the data stored in MongoDB:

{
    "_id" : ObjectId("58cbd1f8ca6e1d0be0b1f329"),
    "img_height" : "458",
    "img_url" : "http://pic.to8to.com/case/1702/13/20170213_57a89af924138e15ab34ftkozxmlx8li.jpg",
    "description" : "北欧风慵懒休闲感一居室图片",
    "html_url" : "http://xiaoguotu.to8to.com/getxgtjson.php?a2=0&a12=&a11=10043653&a1=0",
    "id" : "91de71b0-0b0a-11e7-9909-94de802721e7",
    "create_time" : "2017-03-17T12:09:28.294Z",
    "img_name" : "/tubatu/2017-03-17/41ec3e630289177dda3a91ad89a1ba72",
    "sub_title" : "北欧风慵懒休闲感一居室图片",
    "title" : "北欧风慵懒休闲感一居室图片",
    "tags" : [
        "北欧",
        "一居"
    ],
    "img_width" : "780",
    "fid" : "8a42487a-0b0a-11e7-9f80-94de802721e7"
}
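
The id and create_time fields do not come from the scraped page. In the sample they look like a version 1 UUID and an ISO 8601 UTC timestamp that are attached before saving. The sketch below shows one way such fields could be added; build_document is a hypothetical helper, and the project's handle_item may work differently.

```python
import uuid
from datetime import datetime, timezone


def build_document(item):
    """Copy the item into a dict and add id and create_time fields."""
    document = dict(item)
    # uuid1() yields time-based IDs shaped like the sample's "91de71b0-0b0a-11e7-..."
    document["id"] = str(uuid.uuid1())
    now = datetime.now(timezone.utc)
    # Millisecond precision with a trailing "Z", e.g. 2017-03-17T12:09:28.294Z
    stamp = now.isoformat(timespec="milliseconds")
    document["create_time"] = stamp.replace("+00:00", "Z")
    return document
```

The resulting dict could then be handed to a MongoDB client for insertion; that step is left out here because it needs a running database.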

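3. Starting Scrapy

Scrapy provides a shell command for starting a crawler, but when you need custom logic it is better to start it from a Python file.
To keep the crawler working over a long period, a scheduled task checks whether the crawler is still running and starts it again if the crawl has finished.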

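import os
import sys
import threading
import time
from os.path import dirname

# dispatcher comes from PyDispatcher, which is installed as a Scrapy dependency
from pydispatch import dispatcher
from schedule import Scheduler
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

from tubatu import config
# DesignPictureSpider is the spider built earlier in this series;
# import it from its module in your project.

# Files in the outer folder are needed, so add the project root to sys.path
path = dirname(os.path.abspath(os.path.dirname(__file__)))
sys.path.append(path)


class Runner(object):
    def __init__(self):
        self.is_running = False
        # Bind the engine-stopped signal to our own handler
        dispatcher.connect(self.pause_crawler, signals.engine_stopped)
        # Load the configuration from settings.py
        self.setting = get_project_settings()
        self.process = None

    def start_scrapy(self):
        # Create the crawler process
        self.process = CrawlerProcess(self.setting)
        self.crawl()
        reactor.run()

    def pause_crawler(self):
        self.is_running = False
        print("============ Crawler stopped ===================")

    # Start the crawl
    def crawl(self):
        self.is_running = True
        # Pass the spider class; recent Scrapy versions reject spider instances
        self.process.crawl(DesignPictureSpider)

    # Start the IP proxy pool
    def start_proxy_pool(self):
        from msic.proxy.proxy_pool import proxy_pool
        if config.USE_PROXY:
            proxy_pool.start()
        else:
            proxy_pool.drop_proxy()

    def run(self):
        self.start_proxy_pool()
        self.start_scrapy()


if __name__ == '__main__':
    runner = Runner()

    def thread_task():
        def task():
            if not runner.is_running:
                print("============ Restarting crawl ===================")
                # Twisted is not thread-safe, so hand the call to the reactor thread
                reactor.callFromThread(runner.crawl)

        # Scheduled job: every 30 minutes, check whether the crawler has
        # stopped and restart it if so
        schedule = Scheduler()
        schedule.every(30).minutes.do(task)
        while True:
            schedule.run_pending()
            time.sleep(1)

    thread = threading.Thread(target=thread_task)
    thread.start()

    runner.run()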
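
The periodic check can be isolated from Scrapy and from the schedule library to make the restart logic easier to reason about and to test. The sketch below uses only the standard library; CrawlWatchdog and its parameters are hypothetical names, and the 1800-second default simply mirrors the 30-minute interval above.

```python
import threading


class CrawlWatchdog:
    """Periodically restart a job when it is no longer running."""

    def __init__(self, is_running, restart, interval_seconds=1800.0):
        self._is_running = is_running  # callable returning True while crawling
        self._restart = restart        # callable that starts a new crawl
        self._interval = interval_seconds
        self._stop = threading.Event()
        self.restarts = 0

    def check_once(self):
        """Restart the job if it has stopped; return True when a restart happened."""
        if self._is_running():
            return False
        self.restarts += 1
        self._restart()
        return True

    def run_forever(self):
        # Event.wait returns False on timeout and True once stop() is called
        while not self._stop.wait(self._interval):
            self.check_once()

    def stop(self):
        self._stop.set()
```

With the Runner above, this could be wired up as CrawlWatchdog(lambda: runner.is_running, lambda: reactor.callFromThread(runner.crawl)) and run_forever started in a background thread. Waiting on an Event rather than sleeping lets stop() end the loop promptly.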

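Conclusion

This concludes the tutorial series Python Crawler in Practice: Scraping Tubatu with the Scrapy Framework. Thank you for reading; I hope the articles have been helpful.

Appendix:

The complete project is on GitHub. If you find it useful, please give it a Star.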

Author: imflyn
Link: https://www.jianshu.com/p/6345dbb1ad41
