沈無奇 · Updated 2020-08-27

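Scrapy's Default Web Page Parser: XPath

XPath is the default web page parser in the Scrapy framework. Only once we are comfortable with XPath selectors can we quickly extract the data we want from a page's elements.

1. Introduction to XPath selectors

Let's start with the textbook definition of XPath:

XPath, the XML Path Language, is a language for addressing parts of an XML document. It is built on the tree structure of XML and provides the ability to locate nodes in that tree. XQuery and XPointer are both built on XPath expressions.

Here are the most commonly used XPath path expression rules: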

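Expression	Description
nodename	Selects the child elements named nodename of the current node
/	Selects from the root node
//	Selects matching nodes anywhere below the current node, regardless of depth
.	The current node
..	The parent of the current node
@	Selects attributes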

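Here are a few examples:

Path expression	Meaning
p	Selects all p child elements of the current node
//body	Selects all body elements in the document
//*[@class="red-color"]/..	Selects the parent of every node whose class attribute equals "red-color"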
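The rules in the two tables above can be tried out on a small document. The following is a minimal sketch using lxml; the HTML string is a hypothetical sample written for illustration, not part of the page parsed later in this tutorial.

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
html = """
<html><body>
  <div class="box"><p class="red-color">first</p><p>second</p></div>
  <p id="outer">third</p>
</body></html>
"""
root = etree.HTML(html)

# "/" starts at the document root; each step names a child element.
top_level_p = root.xpath('/html/body/p/text()')

# "//" matches at any depth.
all_p = root.xpath('//p/text()')

# "@" selects an attribute; the result is a list of strings.
classes = root.xpath('//p/@class')

# ".." steps up to the parent of each matched node.
parents = [el.tag for el in root.xpath('//*[@class="red-color"]/..')]

# A bare name is relative to the context node: here, the p children of <body>.
body = root.xpath('//body')[0]
body_children = body.xpath('p/text()')

print(top_level_p, all_p, classes, parents, body_children)
```

Note that the bare `p` expression only finds the one p that is a direct child of body, while `//p` finds all three.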

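XPath also lets us use wildcards to select nodes:

Path expression	Meaning
//*	Selects every element node in the document
//*[@*]	Matches any element that has at least one attribute
//*[@class="red-color"]	Selects all nodes whose class attribute equals "red-color"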
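As a quick check of the wildcard rules, here is a small sketch with lxml. The fragment is a hypothetical example; keep in mind that `etree.HTML` wraps a fragment in html and body elements, so those show up in the `//*` result too.

```python
from lxml import etree

# Hypothetical fragment used only for this illustration.
html = '<div><p class="red-color">a</p><span id="x">b</span><em>c</em></div>'
root = etree.HTML(html)  # lxml adds the <html> and <body> wrappers

# "//*" matches every element in the document, wrappers included.
all_tags = [el.tag for el in root.xpath('//*')]

# "//*[@*]" keeps only elements that carry at least one attribute.
with_attrs = [el.tag for el in root.xpath('//*[@*]')]

# A wildcard step combined with an attribute test.
red = [el.tag for el in root.xpath('//*[@class="red-color"]')]

print(all_tags, with_attrs, red)
```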

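In addition, XPath provides operators that help with selecting nodes:

Path expression	Meaning
//div | //p	Selects all div elements and all p elements
//p[1 + 1]/text()	Gets the text of each p element that is the second p child of its parent
//*[@value > 10]	Finds all nodes whose value attribute is greater than 10

Besides the arithmetic operators +, -, *, div and mod, XPath supports the comparison operators =, !=, >=, <=, > and <, as well as the boolean operators and and or.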
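The operators can be exercised in the same way. This is a minimal sketch on a hypothetical fragment (the input elements and their name and value attributes are invented for the example):

```python
from lxml import etree

# Hypothetical fragment used only for this illustration.
html = """
<div>
  <p>one</p><p>two</p><p>three</p>
  <input name="a" value="5"/><input name="b" value="20"/>
  <span>s</span>
</div>
"""
root = etree.HTML(html)

# "|" unions two node sets.
union = [el.tag for el in root.xpath('//p | //span')]

# Arithmetic inside a predicate: p[1 + 1] is the same as p[2].
second_p = root.xpath('//p[1 + 1]/text()')

# Comparison: the attribute value is converted to a number before comparing.
big = root.xpath('//*[@value > 10]/@name')

# "and" combines two conditions inside one predicate.
small = root.xpath('//input[@value > 1 and @value < 10]/@name')

print(union, second_p, big, small)
```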

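XPath also has the concept of an axis, which denotes a set of nodes relative to the current node. The basic axes are defined as follows:

Axis name	Meaning
ancestor	All ancestors of the current node (parent, grandparent, and so on)
ancestor-or-self	All ancestors of the current node (parent, grandparent, and so on) plus the current node itself
attribute	All attributes of the current node
child	All children of the current node
descendant	All descendants of the current node (children, grandchildren, and so on)
descendant-or-self	All descendants of the current node (children, grandchildren, and so on) plus the current node itself
following	All nodes in the document after the closing tag of the current node
following-sibling	All siblings after the current node
parent	The parent of the current node
preceding	All nodes in the document before the opening tag of the current node, excluding its ancestors
preceding-sibling	All siblings before the current node
self	The current node

An axis is used in the form axisname::nodetest. Here are a few examples: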

Path expression	Meaning
//body/div[2]/following-sibling::*	All sibling nodes after the second div under body
//body/p[1]/child::span[last()]/text()	The text of the last span child of the first p under body
//body/p[1]/span/child::text()	The text of all span children of the first p under body
//body/p/attribute::*	All attributes of every p node under body
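To see the axes in action, the sketch below runs a few of these expressions with lxml. The document is a hypothetical sample, and the li, ul, id and class names in it are assumptions made for illustration.

```python
from lxml import etree

# Hypothetical sample document used only for this illustration.
html = """
<body>
  <div>d1</div>
  <div>d2</div>
  <p class="intro" id="p1"><span>s1</span><span>s2</span></p>
  <ul><li>item</li></ul>
</body>
"""
root = etree.HTML(html)

# following-sibling: the siblings that come after the second div.
after_div2 = [el.tag for el in root.xpath('//body/div[2]/following-sibling::*')]

# child + last(): the final span child of the first p.
last_span = root.xpath('//body/p[1]/child::span[last()]/text()')

# attribute::* returns the attribute values as strings.
attrs = root.xpath('//body/p/attribute::*')

# ancestor: every element above the li, up to the root.
ancestors = [el.tag for el in root.xpath('//li/ancestor::*')]

# preceding-sibling: the div siblings that come before the p.
before_p = root.xpath('//p/preceding-sibling::div/text()')

print(after_div2, last_span, attrs, ancestors, before_p)
```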

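Finally, XPath provides some helpers that make searching for nodes easier:

Function	Meaning
starts-with()	Tests whether a string, such as an attribute value or a node's text, starts with a given prefix
contains()	Tests whether a string contains a given substring; works on attribute values, text, and so on
text()	Selects the text of a node (strictly speaking a node test rather than a function)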
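Examples of these helpers:

Path expression	Meaning
//p[contains(@class, "red")]	All p nodes whose class attribute contains "red"
//span[contains(text(), "藍色")]/text()	The text of all span nodes whose text contains "藍色" (blue)
//span[starts-with(text(), "藍")]/text()	The text of all span nodes whose text starts with "藍"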
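A short sketch of these helpers with lxml follows. The fragment is hypothetical and uses the English word "blue" in place of the Chinese text in the table, purely to keep the example readable.

```python
from lxml import etree

# Hypothetical fragment used only for this illustration.
html = """
<div>
  <p class="text red-color">warm</p>
  <p class="blue">cool</p>
  <span>blue sky</span>
  <span>deep blue</span>
  <span>green</span>
</div>
"""
root = etree.HTML(html)

# contains() on an attribute: "red" appears inside "text red-color".
red_p = root.xpath('//p[contains(@class, "red")]/text()')

# contains() on text: matches "blue" anywhere in the span's text.
has_blue = root.xpath('//span[contains(text(), "blue")]/text()')

# starts-with() on text: only spans whose text begins with "blue".
starts_blue = root.xpath('//span[starts-with(text(), "blue")]/text()')

print(red_p, has_blue, starts_blue)
```

Because contains() is a plain substring test, `contains(@class, "red")` would also match a class such as "bordered", so it is worth choosing the substring carefully.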

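There is more to XPath than what is covered here, and further details will come up in the hands-on chapters. Make sure you know the basics above well and can apply them flexibly, since they are enough for most page data extraction. Now let's move on to practice and try out XPath path expressions in Python.

2. XPath parsing in practice

lxml is a Python parsing library that supports both HTML and XML, supports XPath, and parses very efficiently. In this section we install it, parse an HTML document, and extract data from it.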

[store@server2 ~]$ sudo pip3 install lxml
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting lxml
  Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/55/6f/c87dffdd88a54dd26a3a9fef1d14b6384a9933c455c54ce3ca7d64a84c88/lxml-4.5.1-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
    100% |████████████████████████████████| 5.5MB 82.9MB/s 
Installing collected packages: lxml
Successfully installed lxml-4.5.1

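First we prepare the material, namely the HTML document to parse. To make this more realistic, I use the data of the imooc (慕課網) wiki page directly, obtained as shown in the figure below:

[Figure: Getting the HTML data of the imooc wiki page]

Save the HTML to a file named test.html, then prepare a piece of Python code: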

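from lxml import etree

# Parse the saved page once; HTMLParser tolerates HTML that is not well-formed XML.
tree = etree.parse('test.html', etree.HTMLParser(encoding='utf8'))

def print_result(exp, results):
    # The message reads: "The XPath expression is: {}, and its matches are:"
    print('xpath表達式為:{}，其匹配結果為:'.format(exp))
    for res in results:
        # text() and attribute matches are strings; element matches have no strip()
        print(res.strip() if isinstance(res, str) else res)
    print('')

def test_xpath_expression(exp):
    # Evaluate the expression against the parsed page and print every match
    results = tree.xpath(exp)
    print_result(exp, results)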

Name this Python file test_xpath.py and put it in the same directory as test.html:

[store@server2 ~]$ ls
shen  test.html  test_xpath.py

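Now for the exciting part: testing. Let's do a simple experiment:

[Figure: Data to extract from the imooc wiki page]

The goal of the first experiment is to get three pieces of data for each tutorial in the JavaScript category: the title, the total number of sections, and the visit count. After inspecting the HTML structure with F12, we can start with the following XPath expression, which lists the category titles on the page: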

Python 3.6.8 (default, Apr  2 2020, 13:34:55) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from test_xpath import test_xpath_expression
>>> exp1 = '//h2[@class="language-title"]/text()'
>>> test_xpath_expression(exp1)
xpath表達式為://h2[@class="language-title"]/text()，其匹配結果為:
JavaScript
HTML & CSS
服務器
開發工具
其他后端語言
基礎應用
框架應用
基礎應用
Python Web 開發
MySQL

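Next, look at the element structure:

[Figure: Node structure of the JavaScript category]

The JavaScript category title is an h2 node. It has a sibling div, and the four div nodes under that div are the four tutorials. Let's first match these four tutorial elements: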

>>> exp1 = '//h2[contains(text(), "JavaScript")]/following-sibling::div/div[@class="course-card"]'
>>> test_xpath_expression(exp1)
xpath表達式為://h2[contains(text(), "JavaScript")]/following-sibling::div/div[@class="course-card"]，其匹配結果為:
<Element div at 0x7f7015bf8808>
<Element div at 0x7f700c656788>
<Element div at 0x7f700c6567c8>
<Element div at 0x7f700c656808>

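Now let's look inside each div to see how to get the tutorial title, the total number of sections, and the visit count:

[Figure: Getting the tutorial data]

Two levels below the div nodes found above there is a div whose class attribute is text, and all the data lives in that node:

Title: the text of the first a node under that div;
Total sections: the text of the first span element in the first p node under that div;
Total visits: the text of the second span element in the first p node under that div.

With that, we can write the XPath path expressions that extract each piece of data. The tests are as follows: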

>>> exp1 = '//h2[contains(text(), "JavaScript")]/following-sibling::div/div[@class="course-card"]/child::div/div[@class="text"]/a[1]/text()'
>>> test_xpath_expression(exp1)
xpath表達式為://h2[contains(text(), "JavaScript")]/following-sibling::div/div[@class="course-card"]/child::div/div[@class="text"]/a[1]/text()，其匹配結果為:
Javascript 入門教程
TypeScript 入門教程
Vue 入門教程
Ajax 入門教程

>>> exp2 = '//h2[contains(text(), "JavaScript")]/following-sibling::div/div[@class="course-card"]/child::div/div[@class="text"]/p/span[1]/text()'
>>> test_xpath_expression(exp2)
xpath表達式為://h2[contains(text(), "JavaScript")]/following-sibling::div/div[@class="course-card"]/child::div/div[@class="text"]/p/span[1]/text()，其匹配結果為:
56小節
38小節
39小節
9小節

>>> exp3 = '//h2[contains(text(), "JavaScript")]/following-sibling::div/div[@class="course-card"]/child::div/div[@class="text"]/p/span[2]/text()'
>>> test_xpath_expression(exp3)
xpath表達式為://h2[contains(text(), "JavaScript")]/following-sibling::div/div[@class="course-card"]/child::div/div[@class="text"]/p/span[2]/text()，其匹配結果為:
9832
3547
3628
1800
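The three expressions above share one long prefix. As a sketch, the prefix can be kept in a single variable and the three result lists zipped into rows. The fragment below is hypothetical: its tags and class names mirror the structure described in this section, but the real page markup may differ, and the course names and numbers are invented.

```python
from lxml import etree

# Hypothetical fragment that mirrors the structure described in this section.
html = """
<div class="language-card">
  <h2 class="language-title">JavaScript</h2>
  <div>
    <div class="course-card">
      <div><div class="text">
        <a>Course A</a>
        <p><span>10 sections</span><span>100</span></p>
      </div></div>
    </div>
    <div class="course-card">
      <div><div class="text">
        <a>Course B</a>
        <p><span>20 sections</span><span>200</span></p>
      </div></div>
    </div>
  </div>
</div>
"""
root = etree.HTML(html)

# Common prefix: from the category title down to each tutorial's "text" div.
base = ('//h2[contains(text(), "JavaScript")]/following-sibling::div'
        '/div[@class="course-card"]/child::div/div[@class="text"]')

titles = root.xpath(base + '/a[1]/text()')
chapters = root.xpath(base + '/p/span[1]/text()')
visits = root.xpath(base + '/p/span[2]/text()')

# Pair the three lists up, one row per tutorial.
rows = list(zip(titles, chapters, visits))
print(rows)
```

Zipping parallel lists assumes every tutorial has all three fields. If a field can be missing, it is safer to loop over the course-card elements and query each one with a relative expression.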

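Next we tidy up the Python code so that it parses every tutorial on the wiki page and organizes the data into a JSON-like structure. The expected final result looks like this: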

{
    '前端開發': {
        'JavaScript': [
            {'course_title': 'Javascript 入門教程', 'course_total_chaps': '56小節', 'course_total_visit_count': 9832},
            {...},
            {...},
            {...}
        ],
        'HTML & CSS': [ ... ]
    },
    '服務端相關': {
        ...
    },
    ...
}

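This is a step up in difficulty, but the core of getting the data is the same as above. I won't walk through how the remaining data is extracted; if you are interested, study the code carefully and then try it yourself. Without further ado, here is the code: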

# Code file: test_xpath2.py

from lxml import etree


def get_direction_data(direction_tree):
    """
    Get the course data under one direction (a top-level section of the page).
    :return: dict mapping each category title to a list of course dicts
    """
    direction_data = {}
    cards = direction_tree.xpath('.//div[@class="language-card"]')
    for card in cards:
        title = card.xpath('.//h2[@class="language-title"]/text()')[0]
        course_list = card.xpath('.//div[@class="course-card"]')
        courses = []
        for course in course_list:
            course_title = course.xpath('.//div[@class="text"]/a[1]/text()')[0]
            course_total_chaps = course.xpath('.//div[@class="text"]/p/span[1]/text()')[0]
            course_total_visit_count = course.xpath('.//div[@class="text"]/p/span[2]/text()')[0]
            courses.append({
                'course_title': course_title.strip(),
                'course_total_chaps': course_total_chaps.strip(),
                'course_total_visit_count': int(course_total_visit_count.strip())
            })
        direction_data[title] = courses
    return direction_data


def get_all_data():
    """
    Parse the imooc wiki data.
    :return: dict mapping each direction name to its category data
    """
    result = {}
    html = etree.parse('test.html', etree.HTMLParser(encoding='utf8'))
    directions = html.xpath('//div[@class="direction-con"]')
    for direction in directions:
        # Extract the direction name to use as the key. The leading dot is
        # required: it makes the search start from the current element.
        direction_name = direction.xpath('./div[@class="title-con"][1]/text()')
        if direction_name:
            result[direction_name[0]] = get_direction_data(direction)
    return result
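The comment about the leading dot deserves a closer look. Below is a minimal sketch of the difference, run on a hypothetical two-section fragment (the section names and h2 contents are made up): calling xpath() on an element with an expression that starts with // still searches the whole document, while .// restricts the search to that element's descendants.

```python
from lxml import etree

# Hypothetical fragment with two sections, used only for this illustration.
html = """
<div class="direction-con"><div class="title-con">Frontend</div><h2>JS</h2></div>
<div class="direction-con"><div class="title-con">Backend</div><h2>Go</h2></div>
"""
root = etree.HTML(html)
first = root.xpath('//div[@class="direction-con"]')[0]

# With the leading dot the search is limited to descendants of `first`.
scoped = first.xpath('.//h2/text()')

# Without it, "//" starts again from the document root.
unscoped = first.xpath('//h2/text()')

print(scoped, unscoped)
```

Forgetting the dot inside a loop is a common source of duplicated results, because every iteration then returns matches from the entire page.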

The result of running it is as follows:

[store@server2 ~]$ python3
Python 3.6.8 (default, Apr  2 2020, 13:34:55) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from test_xpath2 import get_all_data
>>> get_all_data()
{'前端開發': {'JavaScript': [{'course_title': 'Javascript 入門教程', 'course_total_chaps': '56小節', 'course_total_visit_count': 9832}, {'course_title': 'TypeScript 入門教程', 'course_total_chaps': '38小節', 'course_total_visit_count': 3547}, {'course_title': 'Vue 入門教程', 'course_total_chaps': '39小節', 'course_total_visit_count': 3628}, {'course_title': 'Ajax 入門教程', 'course_total_chaps': '9小節', 'course_total_visit_count': 1800}], 'HTML & CSS': [{'course_title': 'CSS3 入門教程', 'course_total_chaps': '32小節', 'course_total_visit_count': 1512}, {'course_title': 'Less 入門教程', 'course_total_chaps': '22小節', 'course_total_visit_count': 364}, {'course_title': '雪碧圖入門教程', 'course_total_chaps': '24小節', 'course_total_visit_count': 915}]}, '服務端相關': {'服務器': [{'course_title': 'Nginx 入門教程', 'course_total_chaps': '24小節', 'course_total_visit_count': 4500}, {'course_title': 'HTTP 入門教程', 'course_total_chaps': '16小節', 'course_total_visit_count': 456}, {'course_title': 'Docker 入門教程', 'course_total_chaps': '25小節', 'course_total_visit_count': 1067}, {'course_title': 'Shell 入門教程', 'course_total_chaps': '17小節', 'course_total_visit_count': 2060}, {'course_title': 'Linux 入門教程', 'course_total_chaps': '25小節', 'course_total_visit_count': 1430}], '開發工具': [{'course_title': 'Gradle 入門教程', 'course_total_chaps': '12小節', 'course_total_visit_count': 1121}, {'course_title': 'Vim 入門教程', 'course_total_chaps': '14小節', 'course_total_visit_count': 1491}, {'course_title': 'RESTful 規范教程', 'course_total_chaps': '13小節', 'course_total_visit_count': 1316}, {'course_title': 'Markdown 入門教程', 'course_total_chaps': '31小節', 'course_total_visit_count': 733}, {'course_title': 'Maven 入門教程', 'course_total_chaps': '17小節', 'course_total_visit_count': 155}, {'course_title': 'GitHub 入門教程', 'course_total_chaps': '9小節', 'course_total_visit_count': 261}], '其他后端語言': [{'course_title': 'C 語言入門教程', 'course_total_chaps': '45小節', 'course_total_visit_count': 1933}, {'course_title': 'Go 入門教程', 'course_total_chaps': '36小節', 'course_total_visit_count': 691}, 
{'course_title': 'Ruby 入門教程', 'course_total_chaps': '26小節', 'course_total_visit_count': 410}]}, 'Java': {'基礎應用': [{'course_title': 'Java 入門教程', 'course_total_chaps': '39小節', 'course_total_visit_count': 5229}, {'course_title': 'Android 入門教程', 'course_total_chaps': '29小節', 'course_total_visit_count': 553}, {'course_title': '算法入門教程', 'course_total_chaps': '11小節', 'course_total_visit_count': 628}], '框架應用': [{'course_title': 'Spring Boot 入門教程', 'course_total_chaps': '25小節', 'course_total_visit_count': 4861}, {'course_title': 'Spring 入門教程', 'course_total_chaps': '21小節', 'course_total_visit_count': 850}, {'course_title': 'Hibernate 入門教程', 'course_total_chaps': '23小節', 'course_total_visit_count': 619}, {'course_title': 'MyBatis 入門教程', 'course_total_chaps': '23小節', 'course_total_visit_count': 895}]}, 'Python': {'基礎應用': [{'course_title': 'Python 入門語法教程', 'course_total_chaps': '24小節', 'course_total_visit_count': 3617}, {'course_title': 'Python 原生爬蟲教程', 'course_total_chaps': '19小節', 'course_total_visit_count': 2001}, {'course_title': 'Python 進階應用教程', 'course_total_chaps': '29小節', 'course_total_visit_count': 726}], 'Python Web 開發': [{'course_title': 'Django 入門教程', 'course_total_chaps': '33小節', 'course_total_visit_count': 668}, {'course_title': 'NumPy 入門教程', 'course_total_chaps': '21小節', 'course_total_visit_count': 152}]}, '數據庫': {'MySQL': [{'course_title': 'MySQL 入門教程', 'course_total_chaps': '32小節', 'course_total_visit_count': 3638}, {'course_title': 'SQL 入門教程', 'course_total_chaps': '47小節', 'course_total_visit_count': 2406}]}}

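Did that achieve the expected result? Crawling a web page and parsing its data follows much the same process. Once you have mastered today's material, you have mastered one of the core steps of writing a crawler.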
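The value returned by get_all_data() is a Python dict, and the section counts are still strings such as '56小節'. As an optional follow-up, the sketch below converts those strings to integers and serializes the structure to real JSON using only the standard library. The helper name chapters_to_int is hypothetical, and the sample dict holds just two entries copied from the output above.

```python
import json
import re

# Small sample shaped like the get_all_data() result (two entries from the output above).
data = {
    '前端開發': {
        'JavaScript': [
            {'course_title': 'Javascript 入門教程', 'course_total_chaps': '56小節',
             'course_total_visit_count': 9832},
            {'course_title': 'TypeScript 入門教程', 'course_total_chaps': '38小節',
             'course_total_visit_count': 3547},
        ]
    }
}


def chapters_to_int(text):
    """Return the first run of digits in a string such as '56小節' as an int, or None."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


# Walk direction -> category -> course and convert the section counts in place.
for categories in data.values():
    for courses in categories.values():
        for course in courses:
            course['course_total_chaps'] = chapters_to_int(course['course_total_chaps'])

# ensure_ascii=False keeps the Chinese text readable instead of \u escapes.
as_json = json.dumps(data, ensure_ascii=False, indent=2)
print(as_json)
```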

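3. Summary

In this section we covered the basics of XPath selectors, including the general path expression rules, operators, the concept of axes, and the helper functions commonly used in XPath selectors. We then worked through a hands-on demonstration with an HTML document and Python code to get a better feel for XPath selectors. That's all for this section; I hope you found it useful.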
