I'd like to automate my HTML table scraping in Scrapy. This is what I have so far:

    import scrapy
    import pandas as pd

    class XGSpider(scrapy.Spider):
        name = 'expectedGoals'
        start_urls = [
            'https://fbref.com/en/comps/9/schedule/Premier-League-Scores-and-Fixtures',
        ]

        def parse(self, response):
            matches = []
            for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr'):
                match = {
                    'home': row.xpath('td[4]//text()').extract_first(),
                    'homeXg': row.xpath('td[5]//text()').extract_first(),
                    'score': row.xpath('td[6]//text()').extract_first(),
                    'awayXg': row.xpath('td[7]//text()').extract_first(),
                    'away': row.xpath('td[8]//text()').extract_first()
                }
                matches.append(match)
            x = pd.DataFrame(
                matches, columns=['home', 'homeXg', 'score', 'awayXg', 'away'])
            yield x.to_csv("xG.csv", sep=",", index=False)

It works fine, but as you can see, I'm hard-coding the keys of the match object (home, homeXg, etc.). I'd like to scrape the keys into a list automatically and then initialize the dict with the keys from that list. The problem is that I don't know how to loop over an XPath by index. For example:

    headers = []
    for row in response.xpath('//*[@id="sched_ks_3260_1"]/thead/tr'):
        yield {
            'first': row.xpath('th[1]/text()').extract_first(),
            'second': row.xpath('th[2]/text()').extract_first()
        }

Is it possible to put th[1], th[2], th[3], etc. into a for loop, using the number as the index, and append the values to a list? For example, row.xpath('th[i]/text()').extract_first()?
1 Answer

弒天下
Untested, but it should work:
    # Map each header name to its 1-based column position (XPath positions start at 1).
    column_index = 1
    columns = {}
    for column_node in response.xpath('//*[@id="sched_ks_3260_1"]/thead/tr/th'):
        column_name = column_node.xpath('./text()').extract_first()
        columns[column_name] = column_index
        column_index += 1

    # Build one dict per body row, reading each cell by its column position.
    matches = []
    for row in response.xpath('//*[@id="sched_ks_3232_1"]//tbody/tr'):
        match = {}
        for column_name in columns.keys():
            match[column_name] = row.xpath('./td[{index}]//text()'.format(index=columns[column_name])).extract_first()
        matches.append(match)
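The index-formatting idea can be tried without running a spider. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the table markup, the `SAMPLE` string, and the `table_to_dicts` helper are made up for the example (the real fbref markup will differ). In Scrapy, the same formatted path string would be passed to row.xpath(...) instead of find(...).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table used only for this illustration.
SAMPLE = """
<table id="demo">
  <thead>
    <tr><th>Home</th><th>xG</th><th>Score</th></tr>
  </thead>
  <tbody>
    <tr><td>Arsenal</td><td>1.8</td><td>2-1</td></tr>
    <tr><td>Everton</td><td>0.7</td><td>0-0</td></tr>
  </tbody>
</table>
"""


def table_to_dicts(table):
    """Return one dict per body row, keyed by the header cell text."""
    headers = [th.text for th in table.findall("./thead/tr/th")]
    rows = []
    for tr in table.findall("./tbody/tr"):
        record = {}
        # Positional predicates are 1-based, so start counting at 1.
        for index, name in enumerate(headers, start=1):
            cell = tr.find("./td[{}]".format(index))
            record[name] = cell.text if cell is not None else None
        rows.append(record)
    return rows


matches = table_to_dicts(ET.fromstring(SAMPLE))
print(matches)
```

The key point is that the index has to be formatted into the path string; a literal 'th[i]' is not read as the Python variable i. If a real table mixes th and td cells inside its body rows, the td positions may not line up with the header positions, so the offsets would need checking against the actual page.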