1 Answer

I think your scraping logic is correct, but in your loop you perform a GET plus a POST on every pass. Instead, you should do the GET only on the first iteration, and then issue a POST for each subsequent iteration (if you want 1 iteration = 1 page).
An example:
import requests
from bs4 import BeautifulSoup

res_url = 'https://www.brcdirectory.com/InternalSite//Siteresults.aspx?'
params = {
    'CountryId': '0',
    'CategoryId': '49bd499b-bc70-4cac-9a29-0bd1f5422f6f',
    'StandardId': '972f3b26-5fbd-4f2c-9159-9a50a15a9dde'
}
max_page = 20

def extract(page, soup):
    # Print the link of every result on the current page.
    for item_link in soup.select("h4 a.colorBlue"):
        print("for page {} - {}".format(page, item_link.get("href")))

def build_payload(page, soup):
    # Collect the form fields (__VIEWSTATE, __EVENTVALIDATION, ...) from the
    # previous response, then set the postback target/argument that the
    # grid's pager would normally send.
    payload = {}
    for input_item in soup.select("input"):
        if input_item.get("name"):  # skip inputs without a name attribute
            payload[input_item["name"]] = input_item.get("value", "")
    payload["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$gv_Results"
    payload["__EVENTARGUMENT"] = "Page${}".format(page)
    payload["ctl00$ContentPlaceHolder1$ddl_SortValue"] = "SiteName"
    return payload

with requests.Session() as s:
    for page in range(1, max_page + 1):  # +1 so page 20 is included
        if page > 1:
            # Subsequent pages: POST the postback payload built from the previous page.
            req = s.post(res_url, params=params, data=build_payload(page, soup))
        else:
            # First page: a plain GET.
            req = s.get(res_url, params=params)
        soup = BeautifulSoup(req.text, "lxml")
        extract(page, soup)
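The key idea in build_payload is that an ASP.NET page expects the hidden form fields from the previous response to be echoed back on each postback, with __EVENTTARGET/__EVENTARGUMENT naming the pager action. A self-contained sketch of that collection step, using only the standard library and hypothetical field values (no network needed):

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input> tags, mirroring what
    build_payload does with BeautifulSoup's soup.select("input")."""
    def __init__(self):
        super().__init__()
        self.payload = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr_dict = dict(attrs)
            if "name" in attr_dict:
                self.payload[attr_dict["name"]] = attr_dict.get("value", "")

# Hypothetical stand-in for the hidden fields the real page returns.
html = ('<input name="__VIEWSTATE" value="abc123">'
        '<input name="__EVENTVALIDATION" value="def456">')

collector = HiddenInputCollector()
collector.feed(html)

# Overwrite the postback target/argument to request page 3.
collector.payload["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$gv_Results"
collector.payload["__EVENTARGUMENT"] = "Page$3"

print(collector.payload["__VIEWSTATE"])      # abc123
print(collector.payload["__EVENTARGUMENT"])  # Page$3
```

The hidden fields must come from the most recent response, which is why the main loop reuses the soup parsed on the previous iteration when building the next POST.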