

I've checked it many times, but I still get this error: while self.urls.has_new_url(): ^ SyntaxError: invalid character in identifier. Could anyone help me figure it out? Thanks a lot!

The code of spider_main.py is as follows:
# coding: utf8
import urllib.request
from bs4 import BeautifulSoup  # a very handy parser
import re  # easy to leave out; the program doesn't complain right away if you do, which makes the mistake hard to spot
from baike_spider import url_manager, html_downloader, html_parser, html_outputer

class SpiderMain(object):
    def __init__(self):  # initialize each component
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()  # parser
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):  # drives the crawl
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d :%s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 100:
                    break
                count += 1
            except:
                print('crawl failed')
        self.outputer.output_html()

if __name__ == "__main__":  # entry point
    root_url = "http://baike.baidu.com/view/21087.htm"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)
The code of url_manager.py is as follows:
# coding: utf8

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    # check whether there are still uncrawled urls; returns a bool
    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()  # pop is a great fit here: it both fetches a value and removes it from the set
        self.old_urls.add(new_url)
        return new_url
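To see what the UrlManager above is doing, here is a small standalone check of its add/get cycle (the class is pasted inline so the snippet runs on its own; the URL is just the one from the question):

```python
# Standalone sketch of the UrlManager logic above: new urls go into
# new_urls, crawled urls move to old_urls, and duplicates are ignored.

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

urls = UrlManager()
urls.add_new_url("http://baike.baidu.com/view/21087.htm")
urls.add_new_url("http://baike.baidu.com/view/21087.htm")  # duplicate, ignored
print(urls.has_new_url())   # True: one url waiting
u = urls.get_new_url()
print(urls.has_new_url())   # False: the url moved to old_urls
urls.add_new_url(u)         # re-adding an already-crawled url is ignored
print(urls.has_new_url())   # False
```

This is why the crawl loop in spider_main.py terminates: once a url is fetched it can never re-enter the queue.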


2 Answers

Is the ':' after while self.urls.has_new_url() perhaps a full-width (Chinese) colon '：'?
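That is the most common cause of this error: a full-width character (colon '：', ideographic space, etc.) typed while a Chinese input method was active. As an illustration, a small script like the following can locate such characters; the list of suspects and the sample text are just examples, not exhaustive:

```python
# Sketch: scan source text for full-width characters that commonly
# trigger "SyntaxError: invalid character in identifier".

SUSPECTS = {
    '\uff1a': 'full-width colon ：',
    '\uff0c': 'full-width comma ，',
    '\uff08': 'full-width paren （',
    '\uff09': 'full-width paren ）',
    '\u3000': 'ideographic (full-width) space',
    '\xa0':   'non-breaking space',
}

def find_bad_chars(text):
    """Return (line_no, col_no, description) for each suspicious character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                hits.append((line_no, col_no, SUSPECTS[ch]))
    return hits

# sample reproducing the reported line, with a full-width colon and
# full-width indentation
sample = "while self.urls.has_new_url()\uff1a\n\u3000\u3000\u3000\u3000try:"
for line_no, col_no, desc in find_bad_chars(sample):
    print('line %d, col %d: %s' % (line_no, col_no, desc))
```

Running it on your real spider_main.py (read the file and pass its text in) would point at the exact line and column of the offending character.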



This looks like an indentation problem. Try letting your editor auto-format the file; if that doesn't work, you'll have to fix the indentation of each line by hand.
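If the bad indentation comes from full-width spaces, you don't have to retype every line; replacing them with normal spaces is enough. A minimal sketch (the sample string stands in for the file contents):

```python
# Sketch: swap full-width / non-breaking spaces for regular ASCII
# spaces so the source compiles again.

def normalize_spaces(src):
    # one ideographic space roughly equals one level-width of 4 spaces here;
    # adjust to taste
    return src.replace('\u3000', '    ').replace('\xa0', ' ')

bad = "def f():\n\u3000\u3000return 1\n"   # indented with ideographic spaces
try:
    compile(bad, '<spider_main.py>', 'exec')
except SyntaxError as e:
    print('before fix:', e.msg)

good = normalize_spaces(bad)
compile(good, '<spider_main.py>', 'exec')  # now compiles cleanly
```

For a real file you would read it, run it through normalize_spaces, and write the result back (keeping a backup of the original).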


Course: Python開發簡單爬蟲 (Developing a Simple Crawler in Python)
  • Learners: 227596
  • Questions answered: 1288

This tutorial lifts the veil on the magical technique of Python web crawling.

