沈無奇 · Updated 2020-09-11

Scraping Toutiao with Scrapy: Fetching Daily Hot News

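Today we build a news crawler on top of the Scrapy framework. In this section we go a step further with Scrapy and scrape data loaded through asynchronous Ajax requests, and we also look at Scrapy's logging configuration and mail-sending features.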

1. Analyzing how Toutiao's hot news data is fetched

Today's target is the hot news on Toutiao (今日頭條). The video below shows how to find the HTTP request that the Toutiao site uses to fetch hot news:

From the video, an example of the endpoint the site uses to fetch news is:

https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1&max_behot_time=1597152177&max_behot_time_tmp=1597152177&tadrequire=true&as=A1955F33D209BD8&cp=5F32293B3DE80E1&_signature=_02B4Z6wo0090109cl1gAAIBCcqbHy0H-dDdPWZPAAIzuFTZSh6NBsUuEpf13PktqrmxS-ZD4dEDZ6Ezcpyjo31hg62slsekkigwdRlS0FHfPsOvx.KRyeJBdEf5QI8nLcwEMyziL1YdPK6VD8f

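An HTTP request like this is hard to simulate. We need to know how every parameter in the request is produced, and any that are encrypted have to be traced in the front-end code and reimplemented by hand. In this URL the first few parameters are fixed values, while as, cp and _signature are very hard to obtain and call for strong front-end skills. Others have analyzed and reverse-engineered how these values are generated and published the results online, but that is not our focus here.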

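One last question: a single request returns about 10 news items, so how do we issue the follow-up requests that load more news, as in the video? A close look at consecutive refresh requests reveals one parameter of interest in the URL above: max_behot_time.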

Figure: on the first request, max_behot_time is 0

Figure: max_behot_time in next equals the behot_time of the last item

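From this parameter we learn two things:

On the first request for hot news, the parameter is 0;
On every later request, max_behot_time is set to the value of the max_behot_time key inside the next field of the previous response. It is a timestamp marking the point in the feed from which the next batch of hot news continues.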
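
The two rules above amount to cursor-style pagination. Here is a minimal sketch, assuming a hypothetical fetch_page() that stands in for the real HTTP call and returns canned data in the same shape as the feed response (a data list plus next.max_behot_time). All titles and timestamps are made up:

```python
# Hypothetical canned responses keyed by the cursor sent with the request.
# The timestamps and titles are made up for illustration.
FAKE_FEED = {
    0: {"data": [{"title": "a", "behot_time": 300}, {"title": "b", "behot_time": 200}],
        "next": {"max_behot_time": 200}},
    200: {"data": [{"title": "c", "behot_time": 150}, {"title": "d", "behot_time": 100}],
          "next": {"max_behot_time": 100}},
}


def fetch_page(max_behot_time):
    """Stand-in for the real request; looks up a canned response."""
    return FAKE_FEED[max_behot_time]


def collect(refresh_count):
    """Follow next.max_behot_time for refresh_count requests."""
    max_behot_time = 0  # the first request always sends 0
    items = []
    for _ in range(refresh_count):
        page = fetch_page(max_behot_time)
        items.extend(page["data"])
        # the cursor for the next request comes from the current response
        max_behot_time = page["next"]["max_behot_time"]
    return items, max_behot_time
```

Calling collect(2) walks both canned pages in order and returns the four items together with the cursor for a third request.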

With this information, let's fetch Toutiao's hot news by hand using the requests library. The crawler is built in the following steps:

Prepare the basic variables: the base URL, the request parameters, the request headers and so on;

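# Imports needed by the script in this section
import codecs
import copy
import hashlib
import json
import time
from urllib.parse import urlencode

import requests

hotnews_url = "https://www.toutiao.com/api/pc/feed/?"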

params = {
    'category': 'news_hot',
    'utm_source': 'toutiao',
    'widen': 1,
    'max_behot_time': '',
    'max_behot_time_tmp': '',
}

headers = {
    'referer': 'https://www.toutiao.com/ch/news_hot/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
}
cookies = {'tt_webid':'6856365980324382215'} 
max_behot_time = '0'

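Note: the tt_webid value in the cookies above can be found by right-clicking the page in the browser and inspecting it, though it is of little use.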

Figure: obtaining the tt_webid value

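Prepare three functions: get_request_data(), get_as_cp() and save_to_json(). The second is a translation, found online, of the Toutiao JS code that generates the as and cp parameters; for now it still seems to work;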

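def get_request_data(url, headers, cookies=None):
    # cookies is optional so the function also works when none are passed
    response = requests.get(url=url, headers=headers, cookies=cookies)
    return json.loads(response.text)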


def get_as_cp():  
    # Builds the as and cp parameters; based on Toutiao's obfuscated JS file home_4abea46.js
    zz = {}
    now = round(time.time())
    e = hex(int(now)).upper()[2:] 
    a = hashlib.md5() 
    a.update(str(int(now)).encode('utf-8'))
    i = a.hexdigest().upper()
    if len(e) != 8:
        zz = {'as':'479BB4B7254C150',
        'cp':'7E0AC8874BB0985'}
        return zz
    n = i[:5]
    a = i[-5:]
    r = ''
    s = ''
    for i in range(5):
        s = s + n[i] + e[i]
    for j in range(5):
        r = r + e[j + 3] + a[j]
    zz ={
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }
    return zz


def save_to_json(datas, file_path, key_list):
    """
    Save the selected keys of each news item as one JSON object per line
    """
    print('Writing {} news items to file {}!'.format(len(datas), file_path))
    with codecs.open(file_path, 'a+', 'utf-8') as f:
        for d in datas:
            cleaned_data = {}
            for key in key_list:
                if key in d:
                    cleaned_data[key] = d[key]
            print(json.dumps(cleaned_data, ensure_ascii=False))
            f.write("{}\n".format(json.dumps(cleaned_data, ensure_ascii=False)))
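
To see what get_as_cp() produces, it helps to pass the timestamp in rather than read the clock. The sketch below restates the same interleaving under a hypothetical name, make_as_cp(now). It is not Toutiao's own code, and it cannot show whether the server still accepts these values:

```python
import hashlib


def make_as_cp(now):
    """Build the as/cp pair for a given Unix timestamp (seconds)."""
    e = hex(int(now)).upper()[2:]  # timestamp as upper-case hex, e.g. '5F329BD8'
    digest = hashlib.md5(str(int(now)).encode("utf-8")).hexdigest().upper()
    if len(e) != 8:
        # same fixed fallback as the original function
        return {"as": "479BB4B7254C150", "cp": "7E0AC8874BB0985"}
    head, tail = digest[:5], digest[-5:]
    # as: first five md5 characters interleaved with the first five hex digits
    s = "".join(head[k] + e[k] for k in range(5))
    # cp: last five hex digits interleaved with the last five md5 characters
    r = "".join(e[k + 3] + tail[k] for k in range(5))
    return {"as": "A1" + s + e[-3:], "cp": e[:3] + r + "E1"}


pair = make_as_cp(1597152216)
```

The timestamp 1597152216 is 5F329BD8 in hex, and those digits are the ones embedded in the as and cp values of the sample URL shown earlier, which suggests that URL was generated at that second.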

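The last step is to simulate the refresh requests. Each request uses the max_behot_time value from the previous response, so hot news can be fetched continuously, mimicking scrolling down the Toutiao page;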

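# Fields to keep for each news item (assumed here to match the Item fields of the Scrapy version in section 2)
key_list = ['title', 'abstract', 'source', 'source_url', 'comments_count', 'behot_time']

# Simulate scrolling down 5 times to fetch news data
refresh_count = 5
for _ in range(refresh_count):
    new_params = copy.deepcopy(params)
    zz = get_as_cp()
    new_params['as'] = zz['as']
    new_params['cp'] = zz['cp']
    new_params['max_behot_time'] = max_behot_time
    new_params['max_behot_time_tmp'] = max_behot_time
    request_url = "{}{}".format(hotnews_url, urlencode(new_params))
    print(f'max_behot_time for this request = {max_behot_time}')
    datas = get_request_data(request_url, headers=headers, cookies=cookies)
    # The cursor for the next request comes from the current response
    max_behot_time = datas['next']['max_behot_time']
    save_to_json(datas['data'], "result.json", key_list)

    time.sleep(2)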
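
How one request URL in the loop is put together can be checked offline with urllib.parse alone. In this sketch the as and cp values are fixed placeholders rather than output of get_as_cp():

```python
from urllib.parse import parse_qs, urlencode, urlsplit

hotnews_url = "https://www.toutiao.com/api/pc/feed/?"
params = {
    "category": "news_hot",
    "utm_source": "toutiao",
    "widen": 1,
    "max_behot_time": "0",
    "max_behot_time_tmp": "0",
    "as": "PLACEHOLDER_AS",  # placeholder, not a real value
    "cp": "PLACEHOLDER_CP",  # placeholder, not a real value
}
request_url = "{}{}".format(hotnews_url, urlencode(params))

# Parse the URL back to confirm every parameter survived the round trip
query = parse_qs(urlsplit(request_url).query)
```

urlencode() keeps the dictionary's insertion order and converts the integer 1 to text, so the resulting query string has the same layout as the sample URL.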

Finally, here is the complete hot-news script running:

2. Scraping Toutiao hot news with the Scrapy framework

We follow the usual routine. The first step is to create the hot-news project with the startproject command:

[root@server ~]# cd scrapy-test/
[root@server scrapy-test]# pyenv activate scrapy-test
pyenv-virtualenv: prompt changing will be removed from future release. configure `export PYENV_VIRTUALENV_DISABLE_PROMPT=1' to simulate the behavior.
(scrapy-test) [root@server scrapy-test]# scrapy startproject toutiao_hotnews
New Scrapy project 'toutiao_hotnews', using template directory '/root/.pyenv/versions/3.8.1/envs/scrapy-test/lib/python3.8/site-packages/scrapy/templates/project', created in:
    /root/scrapy-test/toutiao_hotnews

You can start your first spider with:
    cd toutiao_hotnews
    scrapy genspider example example.com
(scrapy-test) [root@server scrapy-test]#

Next, define the Item according to the news fields we want to scrape:

import scrapy


class ToutiaoHotnewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    abstract = scrapy.Field()
    source = scrapy.Field()  
    source_url = scrapy.Field()
    comments_count = scrapy.Field()
    behot_time = scrapy.Field()

With the Item in place, we need a new Spider. It can be generated with the genspider command or written by hand as a Python file, with the following content:

# Location: toutiao_hotnews/toutiao_hotnews/spiders/hotnews.py
import copy
import hashlib
from urllib.parse import urlencode
import json
import time

from scrapy import Request, Spider

from toutiao_hotnews.items import ToutiaoHotnewsItem


hotnews_url = "https://www.toutiao.com/api/pc/feed/?"
params = {
    'category': 'news_hot',
    'utm_source': 'toutiao',
    'widen': 1,
    'max_behot_time': '',
    'max_behot_time_tmp': '',
    'as': '',
    'cp': ''
}
headers = {
    'referer': 'https://www.toutiao.com/ch/news_hot/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
}
cookies = {'tt_webid':'6856365980324382215'} 
max_behot_time = '0'

def get_as_cp():  
    # Builds the as and cp parameters; based on Toutiao's obfuscated JS file home_4abea46.js
    zz = {}
    now = round(time.time())
    e = hex(int(now)).upper()[2:] 
    a = hashlib.md5() 
    a.update(str(int(now)).encode('utf-8'))
    i = a.hexdigest().upper()
    if len(e) != 8:
        zz = {'as':'479BB4B7254C150',
        'cp':'7E0AC8874BB0985'}
        return zz
    n = i[:5]
    a = i[-5:]
    r = ''
    s = ''
    for i in range(5):
        s = s + n[i] + e[i]
    for j in range(5):
        r = r + e[j + 3] + a[j]
    zz ={
        'as': 'A1' + s + e[-3:],
        'cp': e[0:3] + r + 'E1'
    }
    return zz


class HotnewsSpider(Spider):
    name = 'hotnews'
    allowed_domains = ['www.toutiao.com']
    start_urls = ['http://www.toutiao.com/']
    # Count the requests so that we know when to stop
    count = 0

    def _get_url(self, max_behot_time):
        new_params = copy.deepcopy(params)
        zz = get_as_cp()
        new_params['as'] = zz['as']
        new_params['cp'] = zz['cp']
        new_params['max_behot_time'] = max_behot_time
        new_params['max_behot_time_tmp'] = max_behot_time
        return  "{}{}".format(hotnews_url, urlencode(new_params))
       
    def start_requests(self):
        """
        First request
        """
        request_url = self._get_url(max_behot_time)
        self.logger.info(f"we get the request url : {request_url}")
        yield Request(request_url, headers=headers, cookies=cookies, callback=self.parse)

    def parse(self, response):
        """
        Yield items from the response, then issue the next request based on it
        """
        self.count += 1
        datas = json.loads(response.text)
        data = datas['data']
        for d in data:
            item = ToutiaoHotnewsItem()
            item['title'] = d['title']
            item['abstract'] = d.get('abstract', '')
            item['source'] = d['source']
            item['source_url'] = d['source_url']
            item['comments_count'] = d.get('comments_count', 0)
            item['behot_time'] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(d['behot_time']))
            self.logger.info(f'got item={item}')
            yield item

        # REFRESH_COUNT is read from settings.py; fall back to 5 if it is not set
        if self.count < self.settings.getint('REFRESH_COUNT', 5):
            max_behot_time = datas['next']['max_behot_time']
            self.logger.info("we get the next max_behot_time: {}, and the count is {}".format(max_behot_time, self.count))
            yield Request(self._get_url(max_behot_time), headers=headers, cookies=cookies)
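
The field mapping in parse() can be exercised without Scrapy. The sketch below applies the same rules to a hand-written payload (the sample values are made up). It uses time.gmtime() instead of time.localtime() only so that the result does not depend on the machine's time zone:

```python
import json
import time


def extract_items(text):
    """Turn a feed response body into plain dicts, mirroring parse()."""
    payload = json.loads(text)
    items = []
    for d in payload["data"]:
        items.append({
            "title": d["title"],
            "abstract": d.get("abstract", ""),  # optional field
            "source": d["source"],
            "source_url": d["source_url"],
            "comments_count": d.get("comments_count", 0),  # optional field
            "behot_time": time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(d["behot_time"])),
        })
    return items, payload["next"]["max_behot_time"]


sample = json.dumps({
    "data": [{"title": "t1", "source": "s1", "source_url": "/group/1/", "behot_time": 0}],
    "next": {"max_behot_time": 0},
})
items, cursor = extract_items(sample)
```

Missing abstract and comments_count keys fall back to defaults, while a missing title, source or source_url would raise KeyError, just as in the spider.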

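The spider follows the same pattern as before: the first Request is built in start_requests(), and each subsequent request uses the max_behot_time value taken from the previous response. I also use a class-level counter, count, to track the number of refreshes and cap the number of hot-news requests so the spider does not go on forever. In addition, Scrapy gives every spider instance a logger, so wherever we need to log we can simply use self.logger. Its logging configuration is as follows: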

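# Location: toutiao_hotnews/settings.py
# Remember to set a download delay
DOWNLOAD_DELAY = 5
# ...
# Whether logging is enabled; defaults to True
LOG_ENABLED = True
LOG_ENCODING = 'UTF-8'
# Log output file; if None, logs go to the console
LOG_FILE = 'toutiao_hotnews.log'
# Log level; defaults to DEBUG
LOG_LEVEL = 'INFO'
# Date format used in log lines
LOG_DATEFORMAT = "%Y-%m-%d %H:%M:%S"
# Redirect stdout to the log; defaults to False. If True, all standard output (e.g. print calls) is written to the log
LOG_STDOUT = False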
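
The spider's logger is backed by Python's standard logging module, so the effect of a level like INFO and of the date format can be previewed without Scrapy. This is a minimal sketch, assuming an in-memory stream in place of LOG_FILE and a simplified format string rather than Scrapy's own:

```python
import io
import logging

stream = io.StringIO()  # stands in for the log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(name)s] %(levelname)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
))

logger = logging.getLogger("hotnews_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger
logger.addHandler(handler)

logger.debug("dropped")  # below INFO, so it is filtered out
logger.info("kept")
output = stream.getvalue()
```

With the level set to INFO, only the second message reaches the stream, which is why the spider logs its progress with self.logger.info().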

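Next comes the Item Pipelines part. This time the scraped news is saved to a MySQL database. There is one more requirement: pick the 10 latest news items and mail them to myself, so that with a scheduled run the latest headlines arrive every morning. I want the mail to be HTML and to list the 10 latest items, so the first step is to prepare an HTML template page for the hot news: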

# Location: toutiao_hotnews/html_template.py
hotnews_template_html = """
<!DOCTYPE html>
<html>
<head>
	<title>Toutiao Hot News Overview</title>
</head>
<style type="text/css">
</style>
<body>
<div class="container">
<h3 style="margin-bottom: 10px">Toutiao Hot News Overview</h3>
$news_list
</div>
</body>
</html>
"""

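The $news_list placeholder is meant to be filled with string.Template. Here is a minimal sketch with a shortened template and made-up rows. Escaping titles with html.escape() is an addition here, not something the tutorial's code does:

```python
import html
from string import Template

# Shortened stand-in for hotnews_template_html
page = Template('<div class="container">\n$news_list\n</div>')

# Made-up (title, relative URL) pairs
rows = [("Title <1>", "/group/1/"), ("Title 2", "/group/2/")]
news_list = "".join(
    '<div>{}. <a href="https://www.toutiao.com{}">{}</a></div>'.format(
        n, url, html.escape(title))
    for n, (title, url) in enumerate(rows, start=1)
)
body = page.substitute(news_list=news_list)
```

substitute() raises KeyError if a placeholder is left without a value, which catches a misspelled key early.
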
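One thing to note: by default, Scrapy's mail facility sends the body as plain text (mimetype='text/plain'). To send HTML content, I subclass the original MailSender class and modify its send() method slightly: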

# Location: mail.py

import logging 
from email import encoders as Encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.nonmultipart import MIMENonMultipart
from email.mime.text import MIMEText
from email.utils import COMMASPACE, formatdate

from scrapy.mail import MailSender
from scrapy.utils.misc import arg_to_iter

logger = logging.getLogger(__name__)

class HtmlMailSender(MailSender):
    def send(self, to, subject, body, cc=None, attachs=(), mimetype='text/plain', charset=None, _callback=None):
        from twisted.internet import reactor

        ##### The attachs-handling branches are removed; the body is always sent as HTML.
        ##### attachs stays in the signature because the code below still refers to it.
        msg = MIMEText(body, 'html', 'utf-8')
        ##########################################################

        to = list(arg_to_iter(to))
        cc = list(arg_to_iter(cc))

        msg['From'] = self.mailfrom
        msg['To'] = COMMASPACE.join(to)
        msg['Date'] = formatdate(localtime=True)
        msg['Subject'] = subject
        rcpts = to[:]
        if cc:
            rcpts.extend(cc)
            msg['Cc'] = COMMASPACE.join(cc)

        if charset:
            msg.set_charset(charset)

        if _callback:
            _callback(to=to, subject=subject, body=body, cc=cc, attach=attachs, msg=msg)

        if self.debug:
            logger.debug('Debug mail sent OK: To=%(mailto)s Cc=%(mailcc)s '
                         'Subject="%(mailsubject)s" Attachs=%(mailattachs)d',
                         {'mailto': to, 'mailcc': cc, 'mailsubject': subject,
                          'mailattachs': len(attachs)})
            return

        dfd = self._sendmail(rcpts, msg.as_string().encode(charset or 'utf-8'))
        dfd.addCallbacks(
            callback=self._sent_ok,
            errback=self._sent_failed,
            callbackArgs=[to, cc, subject, len(attachs)],
            errbackArgs=[to, cc, subject, len(attachs)],
        )
        reactor.addSystemEventTrigger('before', 'shutdown', lambda: dfd)
        return dfd
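
The key change is building the message with MIMEText(body, 'html', 'utf-8'). What that produces can be checked with the standard library alone, without sending anything. The addresses below are placeholders:

```python
from email.mime.text import MIMEText
from email.utils import COMMASPACE, formatdate

body = "<h3>Hot news</h3><div>1. <a href='https://www.toutiao.com/'>example</a></div>"
msg = MIMEText(body, "html", "utf-8")
msg["From"] = "sender@example.com"  # placeholder address
msg["To"] = COMMASPACE.join(["a@example.com", "b@example.com"])  # placeholders
msg["Date"] = formatdate(localtime=True)
msg["Subject"] = "test"

content_type = msg.get_content_type()
charset = msg.get_content_charset()
raw = msg.as_string()  # the text that would be handed to the SMTP client
```

The resulting message is labeled text/html, so a mail client renders the links instead of showing the markup.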

Next is the code in our pipelines.py file:

import logging
from string import Template
from itemadapter import ItemAdapter
import pymysql


from toutiao_hotnews.mail import HtmlMailSender
from toutiao_hotnews.items import ToutiaoHotnewsItem
from toutiao_hotnews.html_template import hotnews_template_html
from toutiao_hotnews import settings

class ToutiaoHotnewsPipeline:
    logger = logging.getLogger('pipelines_log')

    def open_spider(self, spider):
        # Use our own MailSender subclass
        self.mailer = HtmlMailSender.from_settings(spider.settings)
        # Open the database connection
        self.db = pymysql.connect(
            host=spider.settings.get('MYSQL_HOST', 'localhost'),                 
            user=spider.settings.get('MYSQL_USER', 'root'),
            password=spider.settings.get('MYSQL_PASS', '123456'),
            port=spider.settings.get('MYSQL_PORT', 3306),
            db=spider.settings.get('MYSQL_DB_NAME', 'mysql'),
            charset='utf8'
        ) 
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # SQL insert statement
        sql = "insert into toutiao_hotnews(title, abstract, source, source_url, comments_count, behot_time) values (%s, %s, %s, %s, %s, %s)"
        if item and isinstance(item, ToutiaoHotnewsItem):
            self.cursor.execute(sql, (item['title'], item['abstract'], item['source'], item['source_url'], item['comments_count'], item['behot_time']))
        return item

    def query_data(self, sql):
        data = {}
        try:
            self.cursor.execute(sql)
            data = self.cursor.fetchall()
        except Exception as e:
            logging.error('database operate error:{}'.format(str(e)))
            self.db.rollback()
        return data

    def close_spider(self, spider):
        # desc puts the newest items first; without it the 10 oldest rows would be returned
        sql = "select  title, source_url, behot_time from toutiao_hotnews where 1=1 order by behot_time desc limit 10"
        # Fetch the 10 latest hot news items
        data = self.query_data(sql)
        news_list = ""
        # Build the HTML body
        for i in range(len(data)):
            news_list += "<div><span>{}. <a href=https://www.toutiao.com{}>{} [{}]</a></span></div>".format(i + 1, data[i][1], data[i][0], data[i][2])
        msg_content = Template(hotnews_template_html).substitute({"news_list": news_list})
        self.db.commit()
        self.cursor.close()
        self.db.close()
        self.logger.info("sending the summary mail")
        # The return is required, otherwise an error is raised
        return self.mailer.send(to=["[email protected]"], subject="This is a test", body=msg_content, cc=["[email protected]"])

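Here the MySQL configuration lives in settings.py and is read through spider.settings. open_spider() initializes the database connection, process_item() builds the SQL statement and executes the insert, and close_spider() commits the database transaction, closes the connection and sends the hot-news summary mail. Below we enable this Pipeline in settings.py and configure the database and mail server. ROBOTSTXT_OBEY is also turned off, which this crawler needs in order to run.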


ROBOTSTXT_OBEY = False

# Enable the pipeline
ITEM_PIPELINES = {
   'toutiao_hotnews.pipelines.ToutiaoHotnewsPipeline': 300,
}

# Database configuration
MYSQL_HOST = "180.76.152.113"
MYSQL_PORT = 9002
MYSQL_USER = "store"
MYSQL_PASS = "your database password"
MYSQL_DB_NAME = "ceph_check"

# Mail configuration
MAIL_HOST = 'smtp.qq.com'
MAIL_PORT = 25
MAIL_FROM = '[email protected]'
MAIL_PASS = 'your authorization code'
MAIL_USER = '[email protected]'
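
The pipeline's insert-then-query pattern can be tried locally. The sketch below uses the standard library's sqlite3 with an in-memory database as a stand-in for MySQL, an assumption made only to keep it self-contained. Note that sqlite3 uses ? placeholders where pymysql uses %s, and that the table layout is guessed from the insert statement:

```python
import sqlite3

db = sqlite3.connect(":memory:")
cursor = db.cursor()
# Table layout guessed from the columns in the pipeline's insert statement
cursor.execute(
    "create table toutiao_hotnews("
    "title text, abstract text, source text, source_url text, "
    "comments_count integer, behot_time text)"
)

insert_sql = ("insert into toutiao_hotnews(title, abstract, source, source_url, "
              "comments_count, behot_time) values (?, ?, ?, ?, ?, ?)")
# Twelve made-up rows, one minute apart
for n in range(12):
    cursor.execute(insert_sql, (f"news {n}", "", "src", f"/group/{n}/", n,
                                f"2020-08-11 21:{n:02d}:00"))
db.commit()

# Newest first, capped at 10 rows
cursor.execute("select title, source_url, behot_time from toutiao_hotnews "
               "order by behot_time desc limit 10")
latest = cursor.fetchall()
db.close()
```

Because behot_time is stored in the zero-padded "%Y-%m-%d %H:%M:%S" form, sorting it as text gives the same order as sorting by time.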

Let's see how this Toutiao news crawler performs; the video demonstration is as follows:

3. Summary

In this section we worked through another hands-on Scrapy project and covered Scrapy's logging configuration and mail sending along the way. Did you get something out of it?
