3 Answers

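Contributor with 1784 experience points, over 8 likes
Finding sentences in text is hard. Typically you look for characters that can end a sentence, such as '.' and '!'. But a period ('.') can also appear in the middle of a sentence, for example in the abbreviation of a person's name. I used a regular expression that looks for a period or exclamation mark followed by a single space or the end of the string. This works for the first three sentences of this page, but it will not work for arbitrary sentences.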
import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

# All <p> elements inside <section class="article_text ...">
paragraphs = soup.select('section.article_text p')

sentences = []
for paragraph in paragraphs:
    # Lazily match up to '.' or '!' followed by a space or end of string
    matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
    needed = 3 - len(sentences)
    found = len(matches)
    n = min(found, needed)
    for i in range(n):
        sentences.append(matches[i])
    # Stop as soon as three sentences have been collected
    if len(sentences) == 3:
        break

print(sentences)
Prints:
['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
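To make the limitation concrete, here is a minimal sketch that applies the same regular expression to two made-up strings. Both strings and the helper name first_sentences are illustrative assumptions, not part of the page being scraped. The second string contains initials, which the pattern wrongly treats as sentence endings.

```python
import re

# Same pattern as in the answer: lazily match up to '.' or '!'
# that is followed by a single space or the end of the string.
SENTENCE_RE = re.compile(r'(.+?[.!])(?: |$)')


def first_sentences(text, limit=3):
    """Return at most `limit` sentences found by the naive pattern (hypothetical helper)."""
    return SENTENCE_RE.findall(text)[:limit]


# Made-up sample with unambiguous sentence endings.
plain = "It works. It really does! Third one here. Fourth."
print(first_sentences(plain))
# ['It works.', 'It really does!', 'Third one here.']

# Made-up sample where periods follow initials inside a sentence.
tricky = "Written by J. K. Rowling in 1997. It sold well."
print(first_sentences(tricky))
# ['Written by J.', 'K.', 'Rowling in 1997.']
```

For text with abbreviations, initials, or decimal numbers, a dedicated sentence tokenizer is likely a better fit than a hand-written pattern.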

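Contributor with 1806 experience points, over 5 likes
Actually, with Beautiful Soup you can filter by the class "article_text post", which you can see in the page source:
myData = soup.find('section', class_="article_text post")
print(myData.p.text)
This gets the inner text of the first p element in that section.
Use it right after soup = BeautifulSoup(html_page, 'html.parser'); the soup object is still needed for the find call.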
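As a small offline sketch of this lookup, the snippet below runs the same find call against an inline HTML string instead of the live page. The markup is a hypothetical stand-in, and the real page may be structured differently. It also shows a CSS selector as an alternative: passing a space-separated string to class_ matches the exact value of the class attribute, whereas the selector matches both classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html_page = """
<section class="article_text post">
  <p>First paragraph. It has two sentences.</p>
  <p>Second paragraph.</p>
</section>
"""

soup = BeautifulSoup(html_page, 'html.parser')

# A space-separated class_ string matches the exact class attribute value.
section = soup.find('section', class_='article_text post')
first_by_find = section.p.text

# A CSS selector matches elements carrying both classes, in any order.
first_by_select = soup.select_one('section.article_text.post p').text

print(first_by_find)
print(first_by_select)
```

Both lookups return the text of the first paragraph here. The selector form may be the more robust choice if the order of the class names on the page ever changes.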

Contributor with 2051 experience points, over 10 likes
To grab the first three sentences, just add these lines to your code:
# Find the section tag with class "article_text post"
section = soup.find('section', class_="article_text post")
# Get the text of the first p tag inside that section
txt = section.p.text
print(txt)
Output (the text of the first paragraph):
Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
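The first paragraph need not contain three sentences (the output above shows two), so one option is to keep reading later paragraphs until three have been collected. The sketch below assumes the naive '.' or '!' pattern is good enough for the text at hand. The names iter_sentences and first_n_sentences are hypothetical, and the sample strings stand in for the p.text values of the section.

```python
import re
from itertools import islice

# Naive sentence pattern: up to '.' or '!' followed by a space or end of string.
SENTENCE_RE = re.compile(r'(.+?[.!])(?: |$)')


def iter_sentences(paragraph_texts):
    """Yield sentences paragraph by paragraph, in document order."""
    for text in paragraph_texts:
        yield from SENTENCE_RE.findall(text)


def first_n_sentences(paragraph_texts, n=3):
    """Collect at most n sentences, stopping as soon as n are found."""
    return list(islice(iter_sentences(paragraph_texts), n))


# Hypothetical texts standing in for [p.text for p in section.find_all('p')].
paragraph_texts = [
    "First sentence. Second sentence.",
    "Third sentence! Fourth sentence.",
]
print(first_n_sentences(paragraph_texts))
# ['First sentence.', 'Second sentence.', 'Third sentence!']
```

Because the generator is consumed lazily, paragraphs after the one that supplies the third sentence are never scanned.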
Hope this helps!