1 Answer

The following code returns the source of the news articles along with their metadata.
# Download and decompress one WARC segment first (shell commands):
#   wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
#   gunzip CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
# Install the reader library:
#   pip install warc3-wet
import warc

MAX_RECORDS = 12  # the original counter ran from -10 up to 1, i.e. 12 records
count = 0
with warc.open("CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc") as f:
    for record in f:
        if count >= MAX_RECORDS:
            break
        # record.from_response and record.write_to are methods, not data, so they are not printed.
        # The payload is a stream, so read it once and print the bytes.
        print(record.payload.read(), record.date, record.header, record.ip_address, record.offset, record.type, record.url)
        count += 1
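To make clearer where metadata such as record.url, record.date, record.type and record.ip_address comes from, here is a minimal sketch that parses a single uncompressed WARC record by hand, using only the standard library. It is not how the warc library is implemented. The helper name parse_warc_record and the sample record are made up for illustration, and the assumption is that those attributes correspond to the WARC-Target-URI, WARC-Date, WARC-Type and WARC-IP-Address header fields.

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, payload).

    Hypothetical helper for illustration; real files should be read with
    a proper WARC library.
    """
    # The header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the size of the payload in bytes.
    length = int(headers.get("Content-Length", 0))
    payload = rest[:length]
    return version, headers, payload


# Made-up sample record: an HTTP response wrapped in a WARC header block.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
SAMPLE = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Date: 2016-02-05T19:39:05Z\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"WARC-IP-Address: 192.0.2.1\r\n",
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n",
    b"\r\n",
    body,
    b"\r\n\r\n",  # records are followed by two CRLFs
])

version, headers, payload = parse_warc_record(SAMPLE)
print(version)
print(headers["WARC-Type"], headers["WARC-Date"], headers["WARC-Target-URI"])
print(payload)
```

For a response record the payload is the raw HTTP response, so the article HTML starts after the HTTP headers inside the payload.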