How to loop through start URLs from a CSV file in Scrapy

Python

嚕嚕噠 2023-04-11 16:28:03

So basically, for some reason it worked the first time I ran the spider, but after that it only scraped one URL.

- My program scrapes the parts I want from a list.
- It converts the parts list to URLs in a file.
- It runs, grabs the data I want, and writes it to a csv file.

Problem: I only get output from one URL.

I don't know where to go from here. I checked other resources and tried writing a start_requests method, but the result was the same. So how can I make it use all of the start_urls and iterate over each of them, not just the last one?

Here is the spider:

    import csv
    import xlrd
    import scrapy

    wb = xlrd.open_workbook(r'C:\Users\Jatencio\PycharmProjects\testy\test.xlsx')
    ws = wb.sheet_by_index(0)
    mylist = ws.col_values(0)
    print(mylist)

    li = []
    for el in mylist:
        baseparts = el[:5]
        url1 = 'https://www.digikey.com/products/en/integrated-circuits-ics/memory/774?FV=-8%7C774%2C7%7C1&quantity=0&ColumnSort=0&page=1&k=' + baseparts + '&pageSize=500&pkeyword=' + baseparts
        li.append(url1)

    final = list(set(li))

    file = open('templist.csv','w+',newline='')
    with file:
        write = csv.writer(file, delimiter =',')
        write.writerows(x.split(',') for x in final)


    class DigikeSpider(scrapy.Spider):
        name = 'digike'
        allowed_domains = ['digikey.com']
        custom_settings = {
            "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
        }

        with open('templist.csv') as file:
            start_urls = [line.strip() for line in file]

        def parse(self, response):
            data = {}
            parts1 = []
            # parts=response.css('Table#productTable.productTable')
            for p in response.css('tbody#lnkPart > tr'):
                if p.css('td.tr-mfgPartNumber span::text').get() not in mylist:
                    continue
                else:
                    parts1 = p.css('td.tr-mfgPartNumber span::text').get()

                if p.css('td.tr-minQty.ptable-param span.desktop::text').get():
                    quantity = p.css('td.tr-minQty.ptable-param span.desktop::text').get()
                    quantity = quantity.strip()
                    cleaned_quantity = int(quantity.replace(',', ''))
                else:
                    quantity = 'No quantity'
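For reference, the "read the start URLs from a file" step can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name load_start_urls is hypothetical, and it assumes one URL in the first column of each row, as templist.csv is written in the question. Unlike list(set(...)), it keeps the original order while dropping duplicates and blank rows.

```python
import csv
from typing import List


def load_start_urls(path: str) -> List[str]:
    """Return the URLs found in the first column of a CSV file.

    Hypothetical helper (not part of the original spider): blank rows
    and duplicate URLs are skipped, and first-seen order is preserved.
    """
    urls: List[str] = []
    seen = set()
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # csv.reader yields [] for an empty line
            url = row[0].strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

Inside the spider class this could be used as start_urls = load_start_urls('templist.csv'); Scrapy's default start_requests then schedules one request per entry in start_urls.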

1 Answer

忽然笑

From the execution logs I can tell you that there are two problems in the spider, and neither of them seems to be related to start_urls.

First exception:

    File "C:\Users\Jatencio\PycharmProjects\testy\testdigi\testdigi\spiders\digike.py", line 93, in parse
      'Quantity': cleaned_quantity,
    UnboundLocalError: local variable 'cleaned_quantity' referenced before assignment

You are referencing cleaned_quantity before it has been defined. The problem is here:

    if p.css('td.tr-minQty.ptable-param span.desktop::text').get():
        quantity = p.css('td.tr-minQty.ptable-param span.desktop::text').get()
        quantity = quantity.strip()
        cleaned_quantity = int(quantity.replace(',', ''))
    else:
        quantity = 'No quantity'

If the if condition evaluates to false, cleaned_quantity is never defined, and an error is raised when you try to assemble your item:

    yield {
        'Part': parts1,
        'Quantity': cleaned_quantity,
        'Price': cleaned_price,
        'Stock': cleaned_stock,
        'Status': cleaned_status,
    }

This only happens in some iterations, not in all of them.
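One way to avoid this kind of UnboundLocalError is to make sure every branch produces a value. The sketch below is a minimal, standalone illustration rather than the original spider's code: the helper name clean_quantity and the choice of None as the fallback are assumptions.

```python
from typing import Optional


def clean_quantity(raw: Optional[str]) -> Optional[int]:
    """Convert text such as ' 1,000 ' to an int, or return None.

    Hypothetical helper: the name and the None fallback are assumptions
    made for illustration, not code from the original spider.
    """
    if raw is None:
        # .get() returns None when the selector matches nothing.
        return None
    text = raw.strip().replace(",", "")
    # Empty or non-numeric text also falls back to None instead of raising.
    return int(text) if text.isdecimal() else None


if __name__ == "__main__":
    print(clean_quantity(" 1,000 "))
    print(clean_quantity(None))
```

In parse, assigning cleaned_quantity = clean_quantity(p.css('td.tr-minQty.ptable-param span.desktop::text').get()) would leave the name bound on every iteration before the yield. The other cleaned_* fields may need the same treatment if they are assigned inside similar if blocks.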

Second exception:

    File "C:\Users\Jatencio\PycharmProjects\testy\testdigi\testdigi\spiders\digike.py", line 55, in parse
      p.css('td.tr-mfgPartNumber span::text').remove()
    [...]
    File "c:\users\jatencio\pycharmprojects\testy\venv\lib\site-packages\parsel\selector.py", line 371, in remove
      raise CannotRemoveElementWithoutRoot(
    parsel.selector.CannotRemoveElementWithoutRoot: The node you're trying to remove has no root, are you trying to remove a pseudo-element? Try to use 'li' as a selector instead of 'li::text' or '//li' instead of '//li/text()', for example.

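The problem here is that you are calling .remove() on what parsel calls a pseudo-element (the ::text part of the selector). The method can only remove actual elements from the HTML tree, so I believe this should fix it.

Change this:

    p.css('td.tr-mfgPartNumber span::text').remove()

To this:

    p.css('td.tr-mfgPartNumber span').remove()

The same applies to every line where you use the remove method.

Let me know if this solves your problem.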

2023-04-11
