
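The request has been fulfilled and a new resource has been created as a result. The URI of the new resource is returned in the Location header.

Typical use: an API request creates a resource object, and the response carries the address of the new resource. In current practice, most endpoints that create a resource return only its ID, and the client then queries the details by that ID. Many HTTP status codes are defined in fine detail, and real-world implementations do not always follow the specification that closely.

Client

    POST /add-article HTTP/1.1
    Content-Type: application/json

    { "article": "http" }

Server

    HTTP/1.1 201 Created
    Location: /article/01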
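As a rough illustration of the exchange above, the following Python sketch assembles the head of a 201 response. The function name `build_created_response` and the extra `Content-Length: 0` header are assumptions made for the example; they do not come from the original text.

```python
def build_created_response(resource_path: str) -> str:
    """Return the head of a minimal HTTP/1.1 201 response for a new resource."""
    status_line = "HTTP/1.1 201 Created"
    # Location tells the client where the newly created resource lives.
    # Content-Length: 0 is an assumption: this sketch sends no body.
    headers = {"Location": resource_path, "Content-Length": "0"}
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # A blank line terminates the header section.
    return "\r\n".join([status_line, *header_lines, "", ""])


if __name__ == "__main__":
    print(build_created_response("/article/01"))
```

A client that receives this response can follow the Location value to fetch the new resource instead of issuing a separate query by ID.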

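5.1 Drafting a RESTful API Specification from the Requirements

Since we are building a goods browsing page, we simply implement create, read, update, and delete for goods. Designing a RESTful API is not difficult; normally the project team agrees on it together. Here we define the following:

| Verb   | Meaning                          | URL                           |
|--------|----------------------------------|-------------------------------|
| GET    | Query the goods item with id=1   | http://127.0.0.1:8080/goods/1 |
| GET    | Query the list of goods          | http://127.0.0.1:8080/goods   |
| POST   | Add a goods item                 | http://127.0.0.1:8080/goods   |
| PUT    | Update the goods item with id=1  | http://127.0.0.1:8080/goods/1 |
| DELETE | Delete the goods item with id=1  | http://127.0.0.1:8080/goods/1 |

Tips: The RESTful style distinguishes the type of operation by the HTTP verb (GET / POST / PUT / DELETE) and keeps the URL format fairly fixed. That is all there is to it.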
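To make the "verb plus fixed URL" idea concrete, here is a minimal Python sketch of how a request could be mapped to an operation. The `ROUTES` table, the action names, and the `resolve` function are hypothetical names chosen for this illustration; a real framework does this routing for you.

```python
import re

# (HTTP verb, path pattern, action name); the action names are illustrative only.
ROUTES = [
    ("GET", r"^/goods/(\d+)$", "get_one"),
    ("GET", r"^/goods$", "list"),
    ("POST", r"^/goods$", "create"),
    ("PUT", r"^/goods/(\d+)$", "update"),
    ("DELETE", r"^/goods/(\d+)$", "delete"),
]


def resolve(method, path):
    """Return (action, goods_id) for a request, or None if nothing matches."""
    for verb, pattern, action in ROUTES:
        match = re.match(pattern, path)
        if verb == method.upper() and match:
            # Collection routes have no capture group, so there is no id.
            goods_id = int(match.group(1)) if match.groups() else None
            return action, goods_id
    return None


if __name__ == "__main__":
    print(resolve("GET", "/goods/1"))
    print(resolve("POST", "/goods"))
```

The same path `/goods/1` resolves to three different actions depending only on the verb, which is the point of the table.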

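Next Lesson Preview

Besides HTML, CSS, and JavaScript, learning Web development also requires some understanding of the HTTP protocol, which is equally essential groundwork. In the next lesson we will study the HTTP protocol and the role it plays in Web development. We will also introduce the common concepts of Web development, to give you a clearer overall picture.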

ES6+ Object.is()

2. Parsing URL Parameters in Flask

After the server receives the data sent by the client, it wraps that data in a request object. In Flask, the request object is the module-level variable flask.request, and it carries a large number of attributes. Suppose the URL is http://localhost/query?userId=123; the attributes related to the URL are then:

| Attribute | Value                             |
|-----------|-----------------------------------|
| url       | http://localhost/query?userId=123 |
| base_url  | http://localhost/query            |
| host      | localhost                         |
| host_url  | http://localhost/                 |
| path      | /query                            |
| full_path | /query?userId=123                 |

The following Flask program, request.py, prints the URL-related attributes of request:

    #!/usr/bin/python3
    from flask import Flask
    from flask import request

    app = Flask(__name__)

    def echo(key, value):
        print('%-10s = %s' % (key, value))

    @app.route('/query')
    def query():
        echo('url', request.url)
        echo('base_url', request.base_url)
        echo('host', request.host)
        echo('host_url', request.host_url)
        echo('path', request.path)
        echo('full_path', request.full_path)
        print()
        print(request.args)
        print('userId = %s' % request.args['userId'])
        return 'hello'

    if __name__ == '__main__':
        app.run(port = 80)

The @app.route('/query') decorator registers query() as the handler for the path /query. The six echo() calls print the URL-related attributes of the request object. The query parameters of the URL are stored in request.args, and the last print() call in query() outputs the value of the userId parameter.

Enter http://localhost/query?userId=123 in the browser, and the Flask program prints the following in the terminal:

    url        = http://localhost/query?userId=123
    base_url   = http://localhost/query
    host       = localhost
    host_url   = http://localhost/
    path       = /query
    full_path  = /query?userId=123

    ImmutableMultiDict([('userId', '123')])
    userId = 123
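The attribute table can also be reproduced without Flask, which may help clarify how each value relates to the URL. The sketch below uses only the standard library's `urllib.parse`; the function name `describe_url` is an assumption for this example, and it is not how Flask computes these values internally (Flask derives them from the WSGI environment).

```python
from urllib.parse import parse_qs, urlsplit


def describe_url(url):
    """Split a URL into parts named after the Flask request attributes."""
    parts = urlsplit(url)
    return {
        "url": url,
        "base_url": f"{parts.scheme}://{parts.netloc}{parts.path}",
        "host": parts.netloc,
        "host_url": f"{parts.scheme}://{parts.netloc}/",
        "path": parts.path,
        "full_path": f"{parts.path}?{parts.query}",
        # parse_qs maps each parameter name to a list of values.
        "args": parse_qs(parts.query),
    }


if __name__ == "__main__":
    for key, value in describe_url("http://localhost/query?userId=123").items():
        print("%-10s = %s" % (key, value))
```

Note that `parse_qs` returns a list per key because a parameter may appear more than once, which is also why Flask stores the arguments in a multi-dict.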

1.3 Server-Side Support

The server must answer the client's HTTP request appropriately. The key step is to set the Content-Type field of the HTTP response header to text/event-stream.
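As a hedged sketch of what the server side has to produce, the Python snippet below shows the response headers and the wire format of a single event. `SSE_HEADERS` and `format_sse` are hypothetical names for this illustration, and the `Cache-Control: no-cache` header is a common companion setting rather than something stated in the text above.

```python
# Headers a server-sent-events response typically carries.
# Cache-Control is an assumption added for this sketch.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
}


def format_sse(data, event=None):
    """Encode one event in the text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload needs its own "data:" field.
    for chunk in data.splitlines() or [""]:
        lines.append(f"data: {chunk}")
    # A blank line marks the end of the event.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    print(SSE_HEADERS)
    print(repr(format_sse("hello", event="greeting")))
```

The server keeps the connection open and writes one such block per event; the browser's EventSource splits the stream on the blank lines.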

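1.4 The server Directive

    Syntax:  server { ... }
    Default: -
    Context: http

The context of server is http, which means a server block may only appear inside an http block; anywhere else it is an error. A server block is itself a collection of directives, such as listen, which sets the port that listens for HTTP requests, as well as server_name, root, index, and others.

    ...
    http {
        server {
            # Listening port
            listen 8089;
            server_name localhost;
            # Root path of the static resources
            root /data/yum_source;
            # Enable directory listing
            autoindex on;
            # Default page of the site: look for index.html or index.htm
            index index.html index.htm;
        }
        ...
    }
    ...

Next we take a first look at Nginx configuration for a few scenarios, using only simple directives.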

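4.9 Implementing the Goods Controller Class

We keep to the RESTful style used earlier and define the backend endpoints as follows:

| Verb   | Meaning                          | URL                           |
|--------|----------------------------------|-------------------------------|
| GET    | Query the goods item with id=1   | http://127.0.0.1:8080/goods/1 |
| GET    | Query the list of goods          | http://127.0.0.1:8080/goods   |
| POST   | Add a goods item                 | http://127.0.0.1:8080/goods   |
| PUT    | Update the goods item with id=1  | http://127.0.0.1:8080/goods/1 |
| DELETE | Delete the goods item with id=1  | http://127.0.0.1:8080/goods/1 |

Based on this list, the controller class is implemented as follows:

Example:

    import java.util.List;

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.web.bind.annotation.DeleteMapping;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.PutMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RestController;

    /**
     * Goods controller class
     */
    @RestController
    public class GoodsController {
        @Autowired
        private GoodsService goodsService;

        /**
         * Get one goods item by id
         */
        @GetMapping("/goods/{id}")
        public GoodsDo getOne(@PathVariable("id") long id) {
            return goodsService.getById(id);
        }

        /**
         * Get the list of goods
         */
        @GetMapping("/goods")
        public List<GoodsDo> getList() {
            return goodsService.getList();
        }

        /**
         * Add a goods item
         */
        @PostMapping("/goods")
        public void add(@RequestBody GoodsDo goods) {
            goodsService.add(goods);
        }

        /**
         * Edit a goods item
         */
        @PutMapping("/goods/{id}")
        public void update(@PathVariable("id") long id, @RequestBody GoodsDo goods) {
            // Update the goods item with the given id
            goods.setId(id);
            goodsService.edit(goods);
        }

        /**
         * Remove a goods item
         */
        @DeleteMapping("/goods/{id}")
        public void delete(@PathVariable("id") long id) {
            goodsService.remove(id);
        }
    }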

2.1 Creating the Spring Boot Web Server Application

The project directory structure is as follows:

    OAuth2ResourceServer/
      src/
        main/
          java/imooc/springsecurity/oauth2/server/
            config/
              OAuth2ResourceServerController.java             # controller that plays the role of the resource
              OAuth2ResourceServerSecurityConfiguration.java  # all resource server configuration lives here
            OAuth2ResourceServerApplication.java              # application entry point
          resources/
            application.yml                                   # address of the OAuth 2.0 authorization server, etc.
        test/
      pom.xml

Add the dependencies to pom.xml. Compared with the "username and password authentication example", note the added OAuth2 auto-configuration dependencies: spring-security-oauth2-autoconfigure and spring-security-oauth2-resource-server. The complete pom.xml is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <parent>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-parent</artifactId>
            <version>2.3.1.RELEASE</version>
            <relativePath/>
        </parent>
        <groupId>org.example</groupId>
        <artifactId>OAuth2ResourceServer</artifactId>
        <version>0.0.1-SNAPSHOT</version>
        <properties>
            <java.version>1.8</java.version>
        </properties>
        <dependencies>
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-web</artifactId>
            </dependency>
            <dependency>
                <groupId>org.springframework.security</groupId>
                <artifactId>spring-security-oauth2-resource-server</artifactId>
                <version>5.3.2.RELEASE</version>
            </dependency>
            <dependency>
                <groupId>org.springframework.security.oauth.boot</groupId>
                <artifactId>spring-security-oauth2-autoconfigure</artifactId>
                <version>2.2.5.RELEASE</version>
            </dependency>
        </dependencies>
        <build>
            <plugins>
                <plugin>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-maven-plugin</artifactId>
                </plugin>
            </plugins>
        </build>
    </project>

Create the Spring Security OAuth2 resource server configuration class, src/main/java/imooc/springsecurity/oauth2/server/config/OAuth2ResourceServerSecurityConfiguration.java. It extends org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter and carries the @EnableResourceServer annotation, which declares the class as the configuration basis of the OAuth2 resource server. Access rules for the resources are set in the configure(HttpSecurity http) method; in this example every resource requires an authenticated user. The complete code is as follows:

    package imooc.springsecurity.oauth2.server.config;

    import org.springframework.context.annotation.Configuration;
    import org.springframework.security.config.annotation.web.builders.HttpSecurity;
    import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;
    import org.springframework.security.oauth2.config.annotation.web.configuration.EnableResourceServer;

    @Configuration
    @EnableResourceServer
    public class OAuth2ResourceServerSecurityConfiguration extends WebSecurityConfigurerAdapter {

        @Override
        protected void configure(HttpSecurity http) throws Exception {
            http
                .authorizeRequests(authorizeRequests ->
                    authorizeRequests.anyRequest().authenticated()
                )
                .csrf().disable();
        }
    }

The application.yml file must contain the details of the OAuth 2.0 authorization server:

    server:
      port: 8081

    security:
      oauth2:
        client:
          client-id: reader        # client identifier, same as on the authorization server
          client-secret: secret    # client secret, same as on the authorization server
          user-authorization-uri: http://localhost:8080/oauth/authorize  # client authorization endpoint
          access-token-uri: http://localhost:8080/oauth/token            # endpoint where the client obtains a token
        resource:
          id: reader               # resource server identifier, chosen to suit the business case
          token-info-uri: http://localhost:8080/oauth/check_token        # endpoint for validating a token

With that, the core of the resource server is fully configured.

206 Partial Content

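The client requested only part of a resource, the server processed the request normally, and the response carries the entity content for the range stated in Content-Range.

Client

    GET /imooc/video.mp4 HTTP/1.1
    Range: bytes=1048576-2097152

Server

    HTTP/1.1 206 Partial Content
    Content-Range: bytes 1048576-2097152/3145728
    Content-Type: video/mp4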
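The relation between the request's Range header and the response's Content-Range header can be sketched in Python as follows. The function name `content_range` is hypothetical, and the sketch deliberately covers only a single range with an explicit start (`bytes=start-end` or `bytes=start-`); suffix ranges and multiple ranges are left out.

```python
import re


def content_range(range_header, total_size):
    """Build the Content-Range value for a single-range request, or None."""
    match = re.fullmatch(r"bytes=(\d+)-(\d*)", range_header.strip())
    if not match:
        return None
    start = int(match.group(1))
    # An open-ended range ("bytes=100-") runs to the last byte.
    end = int(match.group(2)) if match.group(2) else total_size - 1
    # Byte positions are zero-based, so the last valid one is total_size - 1.
    end = min(end, total_size - 1)
    if start > end:
        return None  # unsatisfiable; a server would answer 416 here
    return f"bytes {start}-{end}/{total_size}"


if __name__ == "__main__":
    print(content_range("bytes=1048576-2097152", 3145728))
```

With the values from the example above, the function returns the same `bytes 1048576-2097152/3145728` that the server sends.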

4. Worked Example

We add the following logging configuration to nginx.conf:

    ...
    http {
        log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" "$http_x_forwarded_for"';

        map $status $loggable {
            ~^[34]  0;
            default 1;
        }

        access_log logs/access.log main if=$loggable;

        server {
            listen 8000;
            return 200 '8000, server\n';
        }

        server {
            listen 8001;
            return 300 '8001, server\n';
        }

        server {
            listen 8002;
            return 401 '8002, server\n';
        }
        ...
    }
    ...

This combines what we covered earlier, but here we only test the if parameter of the logging configuration. We configure Nginx so that requests whose response code is 3xx or 4xx are not logged. Next, start or hot-reload Nginx, send an HTTP request to each of the three ports, and watch the access.log file:

    [shen@shen ~]$ curl http://180.76.152.113:8000 -I
    HTTP/1.1 200 OK
    Server: nginx/1.17.6
    Date: Tue, 04 Feb 2020 13:31:03 GMT
    Content-Type: application/octet-stream
    Content-Length: 13
    Connection: keep-alive

    [shen@shen ~]$ curl http://180.76.152.113:8001 -I
    HTTP/1.1 300
    Server: nginx/1.17.6
    Date: Tue, 04 Feb 2020 13:31:06 GMT
    Content-Type: application/octet-stream
    Content-Length: 13
    Connection: keep-alive

    [shen@shen ~]$ curl http://180.76.152.113:8002 -I
    HTTP/1.1 401 Unauthorized
    Server: nginx/1.17.6
    Date: Tue, 04 Feb 2020 13:31:08 GMT
    Content-Type: application/octet-stream
    Content-Length: 13
    Connection: keep-alive

    # On the Nginx host, access.log shows that only the request answered with 200 was logged
    [root@server nginx]# tail -f logs/access.log
    171.82.186.225 - - [04/Feb/2020:21:33:24 +0800] "HEAD / HTTP/1.1" 200 0 "-" "curl/7.29.0" "-"
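The decision made by the `map $status $loggable` block can be mimicked in a few lines of Python, which may make the regular expression easier to read. The function name `loggable` simply mirrors the Nginx variable; this is an illustration of the matching rule, not of how Nginx evaluates maps internally.

```python
import re


def loggable(status):
    """Return 0 for 3xx/4xx status codes and 1 otherwise, like the map block."""
    # "~^[34]" in the Nginx map is a regex match: status starts with 3 or 4.
    return 0 if re.search(r"^[34]", str(status)) else 1


if __name__ == "__main__":
    for code in (200, 300, 401, 500):
        print(code, loggable(code))
```

Only the codes for which the function returns 1 would be written by `access_log ... if=$loggable`, which matches the single 200 entry seen in access.log.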

HTTP Protocol Security

4.1 Maven Document Declaration

This piece of configuration has a fixed format; it declares that the current document is a Maven configuration document.

Example:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    </project>

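1.3 Returning Data

Once the business logic has produced the data to return, the server builds the response message according to the HTTP protocol format. The browser, in turn, interprets and renders the data it receives according to the HTTP protocol.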

4.2 Writing the Layout

Menus themselves need no layout code. We only need two Views: one bound to the Context Menu and one to the Popup Menu:

    <?xml version="1.0" encoding="utf-8"?>
    <LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
        xmlns:app="http://schemas.android.com/apk/res-auto"
        xmlns:tools="http://schemas.android.com/tools"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:orientation="vertical" >

        <TextView
            android:id="@+id/tv_context"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:paddingBottom="30dp"
            android:text="I have a Context Menu"
            android:textSize="20sp" />

        <Button
            android:id="@+id/bt_popup"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:onClick="pop"
            android:text="I have a Popup Menu" />
    </LinearLayout>

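1.5 The method Attribute

Submitting a form actually just sends one HTTP request. HTTP offers several request methods, and the method attribute selects the one used for the submission. The usual values are post and get; when the attribute is not set, get is the default.

Beyond these, HTTP defines further methods that a server may support. An HTML form cannot send them directly; they are issued from scripts or other clients:

- options: lets the client inspect what the server supports;
- head: fetches only the headers, with no body;
- delete: asks the server to delete the specified page;
- put: asks the server to replace the content of a document;
- trace: used to diagnose the server;
- connect: turns the connection into a tunnel through a proxy server, defined in HTTP/1.1.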

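WebSocket

The great majority of requests in a web page use the HTTP protocol. HTTP is a stateless application-layer protocol that works out of the box, and each request is independent of the others. For infrequent requests this is an advantage, because no request context has to be set up. For frequent or real-time traffic such as IM chat, however, HTTP can struggle, because every new HTTP request costs the server noticeable resources. In that case we can turn to a better-performing network protocol, Socket, whose web version is called WebSocket.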

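2.6 Tracing Access

Run the startup class and visit http://127.0.0.1:8080/login?username=imooc&password=123. The console output is as follows:

[Figure: console output]

As you can see, we have fully traced one access to the http://127.0.0.1:8080/login endpoint.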

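1. Introduction

In the previous section we discussed how Spring Security defends against CSRF attacks. In this section we discuss the simplest way to raise the security of a Spring Security web project: HTTP security response headers, and how to enable them.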

1. Introduction to the HTTP Protocol

4.2 Home Page Layout

HttpURLConnection needs something to trigger it, so we place a Button on the home page layout to start the HTTP request:

    <?xml version="1.0" encoding="utf-8"?>
    <LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
        android:layout_width="match_parent"
        android:layout_height="match_parent"
        android:orientation="vertical">

        <Button
            android:id="@+id/start_http"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:layout_gravity="center"
            android:layout_marginTop="100dp"
            android:text="Send HTTP request" />
    </LinearLayout>

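2. About the HTTP Firewall

The Servlet specification defines several properties on HttpServletRequest that are read through getter methods and used for matching: contextPath, servletPath, pathInfo, and queryString. Spring Security is only interested in the paths within the application and ignores contextPath.

The Servlet specification, however, does not define exactly what servletPath and pathInfo contain. For example, every path segment of a URL may carry parameters, yet the specification does not say whether these parameters belong in the servletPath or pathInfo values, and Servlet containers differ in how they handle them. When an application is deployed in a container that does not strip path parameters, an attacker can add them to the request URL and cause a pattern match to succeed or fail unexpectedly.

In another case, the path may contain traversal sequences such as /../ or several consecutive forward slashes //, which can also defeat pattern matching. Some containers normalize these before performing the Servlet mapping, but not all of them do. To keep behaviour consistent across environments, Spring Security manages the security filter chains through FilterChainProxy, which applies an HTTP firewall: by default, requests that are not normalized are rejected automatically, and path parameters and duplicate slashes are removed for matching purposes. Note also that servletPath and pathInfo are decoded by the container, so the application should avoid valid paths that contain semicolons.

The default path matching strategy is Ant style, which is also the most common. It is implemented by the class AntPathRequestMatcher, which uses Spring's AntPathMatcher to perform a case-insensitive pattern match against servletPath and pathInfo; the queryString is not considered. When a more complex strategy is needed, such as regular expressions, use RegexRequestMatcher.

URL matching is not suitable as the only access control strategy; method security in the service layer is needed as well. Because URLs vary so much, it is hard to cover every case. The best approach is an allowlist: permit only the addresses that are known to be valid.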
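To give a feel for the kind of check described above, here is a small Python sketch that flags paths containing path parameters, duplicate slashes, or traversal segments. It is only an illustration under simplifying assumptions: the function name `is_normalized` is hypothetical, and the logic is not the implementation Spring Security uses.

```python
def is_normalized(path):
    """Return False for paths a strict firewall would likely reject."""
    # Semicolons introduce path parameters; "//" is a duplicate slash.
    if ";" in path or "//" in path:
        return False
    # "." and ".." segments indicate path traversal.
    segments = path.split("/")
    return "." not in segments and ".." not in segments


if __name__ == "__main__":
    for candidate in ("/goods/1", "/admin/../goods", "/goods//1", "/goods;x=1/1"):
        print(candidate, is_normalized(candidate))
```

Rejecting such requests outright, instead of trying to repair them, keeps the matching rules independent of how a particular container decodes the path.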

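HTTP Status Codes: 4XX

4XX status codes mean that the request failed, most likely because of a problem on the client side. Client-side problems come in many kinds and can be complicated, so the status codes defined below sometimes give only a rough indication and are not always precise.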

3.1 Fetching the baidu.com, taobao.com, and qq.com Home Pages Serially

Write the program serial.py, which fetches the baidu, taobao, and qq home pages one after another:

    from datetime import datetime, timedelta
    import requests

    def fetch(url):
        response = requests.get(url)
        print('Get %s: %s' % (url, response))

    time0 = datetime.now()

    fetch("https://www.baidu.com/")
    fetch("https://www.taobao.com/")
    fetch("https://www.qq.com/")

    time1 = datetime.now()
    elapsed = time1 - time0
    # Whole duration in microseconds; elapsed.microseconds alone would drop full seconds
    print(elapsed // timedelta(microseconds=1))

The function fetch retrieves the page at the given url by calling the get method of the requests module. time0 records the start time. The three fetch calls then retrieve the baidu, taobao, and qq home pages serially. Finally, time1 records the end time, and the difference is printed as the total time taken, in microseconds.

Run serial.py, and the output is:

    Get https://www.baidu.com/: <Response [200]>
    Get https://www.taobao.com/: <Response [200]>
    Get https://www.qq.com/: <Response [200]>
    683173

In the output, <Response [200]> is the response object showing the status code returned by the server; 200 means the request succeeded. The baidu, taobao, and qq home pages were all fetched, and the total time was 683173 microseconds.
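One caveat when timing code with `datetime`: the `microseconds` attribute of a `timedelta` holds only the sub-second component, not the whole duration. The following sketch shows the difference on a made-up duration of 2.683173 seconds; the value is an assumption chosen for illustration and is not a measurement.

```python
from datetime import timedelta

# A made-up duration for illustration: 2 seconds plus 683173 microseconds.
elapsed = timedelta(seconds=2, microseconds=683173)

# Only the sub-second component; the 2 full seconds are not included.
sub_second_part = elapsed.microseconds

# The whole duration expressed in microseconds.
total_microseconds = elapsed // timedelta(microseconds=1)

# The whole duration expressed in seconds, as a float.
total_seconds = elapsed.total_seconds()

if __name__ == "__main__":
    print(sub_second_part)      # 683173
    print(total_microseconds)   # 2683173
    print(total_seconds)        # 2.683173
```

For a run that finishes in under one second the two microsecond values coincide, which is why the problem tends to go unnoticed until the requests take longer.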

2.2 101 Switching Protocols

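The server agrees to switch to another protocol at the client's request. The most common case is a WebSocket connection.

Client

    GET /websocket HTTP/1.1
    Host: www.imocc.com
    Upgrade: websocket
    Connection: Upgrade
    Sec-WebSocket-Protocol: chat, superchat
    Sec-WebSocket-Version: 13

The client asks to upgrade the connection from HTTP/1.1 to the WebSocket protocol.

Server

    HTTP/1.1 101 Switching Protocols
    Upgrade: websocket
    Connection: Upgrade

The server returns 101 to indicate that the protocol switch succeeded.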
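The example above omits two headers that a complete WebSocket handshake carries under RFC 6455: the client's `Sec-WebSocket-Key` and the server's `Sec-WebSocket-Accept`. As a hedged sketch, the Python code below shows how the accept value is derived from the key; the function name `accept_key` is hypothetical, and the sample key is the one used in the RFC rather than anything from the original example.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_key(client_key):
    """Derive the Sec-WebSocket-Accept value from a Sec-WebSocket-Key."""
    # Concatenate key and GUID, hash with SHA-1, then base64-encode the digest.
    digest = hashlib.sha1((client_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    # Sample key from RFC 6455.
    print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))
```

The client verifies this value before treating the connection as upgraded, which guards against servers or intermediaries that answer 101 without actually understanding WebSocket.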

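3.3 Testing

First request http://127.0.0.1:8080/info directly. Because we are not logged in yet, the request is intercepted, and the page shows:

[Figure: access intercepted]

If we first call the login method at http://127.0.0.1:8080/login?username=imooc&password=123 and then visit http://127.0.0.1:8080/info, the page shows:

[Figure: after a successful login, the request passes through the interceptor normally]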

3.2 Code Integration

Enable saml2Login() support:

    @EnableWebSecurity
    public class SecurityConfig extends WebSecurityConfigurerAdapter {

        @Override
        protected void configure(HttpSecurity http) throws Exception {
            http
                .authorizeRequests()
                    .anyRequest().authenticated()
                    .and()
                .saml2Login() // enable SAML2 authentication support
            ;
        }
    }

Configure the authentication environment for SAML 2.0:

    @EnableWebSecurity
    public class SecurityConfig extends WebSecurityConfigurerAdapter {

        @Override
        protected void configure(HttpSecurity http) throws Exception {
            http
                .authorizeRequests()
                    .anyRequest().authenticated()
                    .and()
                .saml2Login()
                    .relyingPartyRegistrationRepository(...) // configure the authentication environment
            ;
        }
    }

In SAML 2.0, the SP and the IDP are both trusted parties, and their mapping is stored in a RelyingPartyRegistration object. The RelyingPartyRegistration values are supplied through the .saml2Login().relyingPartyRegistrationRepository() method on the HttpSecurity instance.

With that, the most basic SAML 2.0 authentication configuration is complete.

3.1 Redirecting to HTTPS

When a client sends a request to the server over HTTP, Spring Security can automatically switch it to an HTTPS connection. For example, the following code forces every HTTP request to be redirected to HTTPS:

    @Configuration
    @EnableWebSecurity
    public class WebSecurityConfig extends WebSecurityConfigurerAdapter {

        @Override
        protected void configure(HttpSecurity http) throws Exception {
            http.requiresChannel(channel ->
                channel.anyRequest().requiresSecure()
            );
        }
    }

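2.2 Attributes of meta

| Attribute  | Purpose                                        |
|------------|------------------------------------------------|
| name       | Describes the web page                         |
| content    | Helps search engines find and classify the page |
| http-equiv | Sets an HTTP header                            |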

3. Book Crawler: Code Implementation

Based on the analysis above, we now write the code. The first step is to get all the computer book categories and their URLs:

    import requests
    from lxml import etree

    # headers (the request headers) is assumed to be defined at module level; it is not shown here

    def get_all_computer_book_urls(page_url):
        """
        Get the URLs of all computer book categories
        :return: (list of category names, list of category URLs)
        """
        response = requests.get(url=page_url, headers=headers)
        if response.status_code != 200:
            return [], []
        response.encoding = 'gbk'
        tree = etree.fromstring(response.text, etree.HTMLParser())
        # Extract the list of category names
        c = tree.xpath("//div[@id='wrap']/ul[1]/li[@class='li']/a/text()")
        # Extract the list of category URLs
        u = tree.xpath("//div[@id='wrap']/ul[1]/li[@class='li']/a/@href")
        return c, u

A quick test of this function:

    [store@server2 chap06]$ python3
    Python 3.6.8 (default, Apr 2 2020, 13:34:55)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from china_pub_crawler import get_all_computer_book_urls
    >>> get_all_computer_book_urls('http://www.china-pub.com/Browse/')
    (['IT圖書網絡出版 [59-00]', '計算機科學理論與基礎知識 [59-01]', '計算機組織與體系結構 [59-02]',
      '計算機網絡 [59-03]', '安全 [59-04]', '軟件與程序設計 [59-05]', '軟件工程及軟件方法學 [59-06]',
      '操作系統 [59-07]', '數據庫 [59-08]', '硬件與維護 [59-09]', '圖形圖像、多媒體、網頁制作 [59-10]',
      '中文信息處理 [59-11]', '計算機輔助設計與工程計算 [59-12]', '辦公軟件 [59-13]', '專用軟件 [59-14]',
      '人工智能 [59-15]', '考試認證 [59-16]', '工具書 [59-17]', '計算機控制與仿真 [59-18]',
      '信息系統 [59-19]', '電子商務與計算機文化 [59-20]', '電子工程 [59-21]', '期刊 [59-22]',
      '游戲 [59-26]', 'IT服務管理 [59-27]', '計算機文化用品 [59-80]'],
     ['http://product.china-pub.com/cache/browse2/59/1_1_59-00_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-01_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-02_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-03_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-04_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-05_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-06_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-07_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-08_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-09_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-10_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-11_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-12_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-13_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-14_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-15_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-16_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-17_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-18_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-19_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-20_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-21_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-22_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-26_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-27_0.html',
      'http://product.china-pub.com/cache/browse2/59/1_1_59-80_0.html'])

The function returns what we wanted. Next we need a function that collects all the books under a given category. Before that, we write the method that parses a single book list page:

    def parse_books_page(html_data):
        books = []
        tree = etree.fromstring(html_data, etree.HTMLParser())
        result_tree = tree.xpath("//div[@class='search_result']/table/tr/td[2]/ul")
        for result in result_tree:
            try:
                book_info = {}
                book_info['title'] = result.xpath("./li[@class='result_name']/a/text()")[0]
                book_info['book_url'] = result.xpath("./li[@class='result_name']/a/@href")[0]
                # Author, publisher, ISBN and publish date share one '|'-separated field
                info = result.xpath("./li[2]/text()")[0]
                book_info['author'] = info.split('|')[0].strip()
                book_info['publisher'] = info.split('|')[1].strip()
                book_info['isbn'] = info.split('|')[2].strip()
                book_info['publish_date'] = info.split('|')[3].strip()
                book_info['vip_price'] = result.xpath("./li[@class='result_book']/ul/li[@class='book_dis']/text()")[0]
                book_info['price'] = result.xpath("./li[@class='result_book']/ul/li[@class='book_price']/text()")[0]
                # print(f'Parsed book info: {book_info}')
                books.append(book_info)
            except Exception as e:
                print("Failed to parse a record, skipping!")
        return books

This function parses one page of the book list. It again uses xpath to locate the relevant elements and then extracts the data we want. Because some pieces of information are combined in one field, we split them after extraction to obtain the individual values. To test the function, we can copy the HTML straight from the web page:

[Figure: extracting the page data of the book list]

We save the page as test.html in the same directory as the code and then work from the command line:

    >>> from china_pub_crawler import parse_books_page
    >>> f = open('test.html', 'r+')
    >>> html_content = f.read()
    >>> parse_books_page(html_content)
    [{'title': '(特價書)零基礎學ASP.NET 3.5', 'book_url': 'http://product.china-pub.com/216269', 'author': '王向軍;王欣惠（著）', 'publisher': '機械工業出版社', 'isbn': '9787111261414', 'publish_date': '2009-02-01出版', 'vip_price': 'VIP會員價：', 'price': '￥58.00'},
     {'title': 'Objective-C 2.0 Mac和iOS開發實踐指南(原書第2版)', 'book_url': 'http://product.china-pub.com/3770704', 'author': '(美)Robert Clair （著）', 'publisher': '機械工業出版社', 'isbn': '9787111484561', 'publish_date': '2015-01-01出版', 'vip_price': 'VIP會員價：', 'price': '￥79.00'},
     {'title': '(特價書)ASP.NET 3.5實例精通', 'book_url': 'http://product.china-pub.com/216272', 'author': '王院峰（著）', 'publisher': '機械工業出版社', 'isbn': '9787111259794', 'publish_date': '2009-01-01出版', 'vip_price': 'VIP會員價：', 'price': '￥55.00'},
     {'title': '(特價書)CSS+HTML語法與范例詳解詞典', 'book_url': 'http://product.china-pub.com/216275', 'author': '符旭凌（著）', 'publisher': '機械工業出版社', 'isbn': '9787111263647', 'publish_date': '2009-02-01出版', 'vip_price': 'VIP會員價：', 'price': '￥39.00'},
     {'title': '(特價書)Java ME 游戲編程(原書第2版)', 'book_url': 'http://product.china-pub.com/216296', 'author': '(美)Martin J. Wells; John P. Flynt （著）', 'publisher': '機械工業出版社', 'isbn': '9787111264941', 'publish_date': '2009-03-01出版', 'vip_price': 'VIP會員價：', 'price': '￥49.00'},
     {'title': '(特價書)Visual Basic實例精通', 'book_url': 'http://product.china-pub.com/216304', 'author': '柴相花（著）', 'publisher': '機械工業出版社', 'isbn': '9787111263296', 'publish_date': '2009-04-01出版', 'vip_price': 'VIP會員價：', 'price': '￥59.80'},
     {'title': '高性能電子商務平臺構建：架構、設計與開發[按需印刷]', 'book_url': 'http://product.china-pub.com/3770743', 'author': 'ShopNC產品部（著）', 'publisher': '機械工業出版社', 'isbn': '9787111485643', 'publish_date': '2015-01-01出版', 'vip_price': 'VIP會員價：', 'price': '￥79.00'},
     {'title': '[套裝書]Java核心技術卷Ⅰ 基礎知識（原書第10版）+Java核心技術卷Ⅱ高級特性（原書第10版）', 'book_url': 'http://product.china-pub.com/7008447', 'author': '（美）凱S.霍斯特曼（Cay S. Horstmann）????（美）凱S. 霍斯特曼（Cay S. Horstmann）（著）', 'publisher': '機械工業出版社', 'isbn': '9787007008447', 'publish_date': '2017-08-01出版', 'vip_price': 'VIP會員價：', 'price': '￥258.00'},
     {'title': '(特價書)Dojo構建Ajax應用程序', 'book_url': 'http://product.china-pub.com/216315', 'author': '(美)James E.Harmon （著）', 'publisher': '機械工業出版社', 'isbn': '9787111266648', 'publish_date': '2009-05-01出版', 'vip_price': 'VIP會員價：', 'price': '￥45.00'},
     {'title': '(特價書)編譯原理第2版.本科教學版', 'book_url': 'http://product.china-pub.com/216336', 'author': '(美)Alfred V. Aho;Monica S. Lam;Ravi Sethi;Jeffrey D. Ullman （著）', 'publisher': '機械工業出版社', 'isbn': '9787111269298', 'publish_date': '2009-05-01出版', 'vip_price': 'VIP會員價：', 'price': '￥55.00'},
     {'title': '(特價書)用Alice學編程(原書第2版)', 'book_url': 'http://product.china-pub.com/216354', 'author': '(美)Wanda P.Dann;Stephen Cooper;Randy Pausch （著）', 'publisher': '機械工業出版社', 'isbn': '9787111274629', 'publish_date': '2009-07-01出版', 'vip_price': 'VIP會員價：', 'price': '￥39.00'},
     {'title': 'Java語言程序設計(第2版)', 'book_url': 'http://product.china-pub.com/50051', 'author': '趙國玲;王宏;柴大鵬（著）', 'publisher': '機械工業出版社*', 'isbn': '9787111297376', 'publish_date': '2010-03-01出版', 'vip_price': 'VIP會員價：', 'price': '￥32.00'},
     {'title': '從零開始學Python程序設計', 'book_url': 'http://product.china-pub.com/7017939', 'author': '吳惠茹（著）', 'publisher': '機械工業出版社', 'isbn': '9787111583813', 'publish_date': '2018-01-01出版', 'vip_price': 'VIP會員價：', 'price': '￥79.00'},
     {'title': '(特價書)匯編語言', 'book_url': 'http://product.china-pub.com/216385', 'author': '鄭曉薇（著）', 'publisher': '機械工業出版社', 'isbn': '9787111269076', 'publish_date': '2009-09-01出版', 'vip_price': 'VIP會員價：', 'price': '￥29.00'},
     {'title': '(特價書)Visual Basic.NET案例教程', 'book_url': 'http://product.china-pub.com/216388', 'author': '馬玉春;劉杰民;王鑫（著）', 'publisher': '機械工業出版社', 'isbn': '9787111272571', 'publish_date': '2009-09-01出版', 'vip_price': 'VIP會員價：', 'price': '￥30.00'},
     {'title': '小程序從0到1：微信全棧工程師一本通', 'book_url': 'http://product.china-pub.com/7017943', 'author': '石橋碼農（著）', 'publisher': '機械工業出版社', 'isbn': '9787111584049', 'publish_date': '2018-01-01出版', 'vip_price': 'VIP會員價：', 'price': '￥59.00'},
     {'title': '深入分布式緩存：從原理到實踐', 'book_url': 'http://product.china-pub.com/7017945', 'author': '于君澤（著）', 'publisher': '機械工業出版社', 'isbn': '9787111585190', 'publish_date': '2018-01-01出版', 'vip_price': 'VIP會員價：', 'price': '￥99.00'},
     {'title': '(特價書)ASP.NET AJAX服務器控件高級編程(.NET 3.5版)', 'book_url': 'http://product.china-pub.com/216397', 'author': '(美)Adam Calderon;Joel Rumerman （著）', 'publisher': '機械工業出版社', 'isbn': '9787111270966', 'publish_date': '2009-09-01出版', 'vip_price': 'VIP會員價：', 'price': '￥65.00'},
     {'title': 'PaaS程序設計', 'book_url': 'http://product.china-pub.com/3770830', 'author': '(美)Lucas Carlson （著）', 'publisher': '機械工業出版社', 'isbn': '9787111482451', 'publish_date': '2015-01-01出版', 'vip_price': 'VIP會員價：', 'price': '￥39.00'},
     {'title': 'Visual C++數字圖像處理[按需印刷]', 'book_url': 'http://product.china-pub.com/2437', 'author': '何斌馬天予王運堅朱紅蓮（著）', 'publisher': '人民郵電出版社', 'isbn': '711509263X', 'publish_date': '2001-04-01出版', 'vip_price': 'VIP會員價：', 'price': '￥72.00'}]

The book list information is extracted correctly, which shows that the function works. Parsing can still hit problems, such as a missing field, so we catch the exception and skip that record, letting the program carry on instead of stopping.

With that in place, we construct the URL for each page number to collect the data of every page, and so read all the books in the category. The complete code is as follows:

    import re
    import time

    def get_category_books(category, url):
        """
        Get the books of one category. The category is paginated, so we keep
        requesting pages until a page request fails (for example with 404)
        :return: list of book records
        """
        books = []
        page = 1
        regex = "(http://.*/)([0-9]+)_(.*).html"
        pattern = re.compile(regex)
        m = pattern.match(url)
        if not m:
            return []
        prefix_path = m.group(1)
        current_page = m.group(2)
        # m.group(2) is a string, so compare against "1" rather than the integer 1
        if current_page != "1":
            print("Extraction does not start from the first page; there may be a problem")
        suffix_path = m.group(3)
        current_page = page
        while True:
            # Build the URL of the page request
            book_url = f"{prefix_path}{current_page}_{suffix_path}.html"
            response = requests.get(url=book_url, headers=headers)
            print(f"Extracting page {current_page} of category [{category}]")
            if response.status_code != 200:
                print(f"[{category}] all books of this category have been extracted!")
                break
            response.encoding = 'gbk'
            # Add the data of this page to the list
            books.extend(parse_books_page(response.text))
            current_page += 1
            # Always pause, to avoid putting too much load on the remote service
            time.sleep(0.5)
        return books

Finally, we save the data to MongoDB. This step is simple; we have inserted documents into MongoDB before and can reuse that code directly:

    import pymongo

    # Credentials are passed to MongoClient; Database.authenticate() was removed in PyMongo 4
    client = pymongo.MongoClient(host='MongoDB的服務地址', port=27017,
                                 username="admin", password="shencong1992")
    db = client.scrapy_manual
    collection = db.china_pub

    # ...

    def save_to_mongodb(data):
        try:
            collection.insert_many(data)
        except Exception as e:
            print("Bulk insert failed: {}".format(str(e)))

Because the earlier steps already produce the records as a list of dictionaries, the collection's insert_many() method can insert all the collected data into MongoDB in one batch. At the end of the code we add a main section:

    # ...

    if __name__ == '__main__':
        page_url = "http://www.china-pub.com/Browse/"
        categories, urls = get_all_computer_book_urls(page_url)
        # print(categories)
        for i in range(len(urls)):
            books_category_data = get_category_books(categories[i], urls[i])
            print(f"Saving book data of [{categories[i]}] to mongodb")
            save_to_mongodb(books_category_data)
        print("Finished crawling the computer category of china-pub")

And that completes a simple crawler. What are you waiting for? Start it up!
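The pagination step relies on splitting a category URL into a prefix, a page number, and a suffix. The short Python sketch below isolates just that step so it can be tried without network access. The helper name `page_url` is hypothetical; the pattern is the one described in the text, with the dot before `html` escaped.

```python
import re

# Prefix up to the last slash, page number, then the remaining suffix.
PAGE_URL = re.compile(r"(http://.*/)([0-9]+)_(.*)\.html")


def page_url(first_page_url, page):
    """Return the URL of the given page number, or None if the URL does not fit."""
    m = PAGE_URL.match(first_page_url)
    if not m:
        return None
    prefix, _current_page, suffix = m.groups()
    return f"{prefix}{page}_{suffix}.html"


if __name__ == "__main__":
    first = "http://product.china-pub.com/cache/browse2/59/1_1_59-00_0.html"
    print(page_url(first, 2))
```

Because `.*/` is greedy, the prefix extends to the last slash, so only the leading number of the file name is treated as the page counter.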
