How to extract multiple names from the same string in Python

Python

qq_笑_17 2023-10-31 14:37:12

I am scraping data and trying to parse the names out of strings. For example, I am working with strings like these:

Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County

and

Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco

Is there code that can take this kind of text and turn it into a dataset, so that the data looks like this?

Name            Affiliation
Sharif Amlani   UC Davis Health
Joe Biden       UC San Francisco
Elton John      Public Health Director for Davis County
Winston Bishop  UC San Francisco
Usain Bolt      UC San Francisco

Thanks

4 answers

慕少森

2019 contribution points, 9+ likes

If your string always follows the format "name from place and name from place", you can do this:

import pandas as pd

# your consistently formatted string
s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"

rows = []  # collects one [name, affiliation] pair per person

# split on ' and ' / ' from ' with surrounding spaces so that words
# containing "and" or "from" are not cut apart
for row in s.split(' and '):  # each row looks like "name from affiliation"
    rows.append([part.strip() for part in row.split(' from ')])

# then create the DataFrame
df = pd.DataFrame(data=rows, columns=['Name', 'Affiliation'])

# the names still carry the "Dr." title; remove it afterwards if you do not want it
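The approach above keeps the "Dr." title in the name, while the desired table in the question has it removed. The following is a minimal sketch of one way to drop it using only the standard library. The helper name `parse_from_pairs` and the assumption that "Dr." is the only title that occurs are mine, not part of the original answer.

```python
import re

# Assumption: "Dr." is the only title that needs to be removed.
TITLE = re.compile(r'^Dr\.\s*')

def parse_from_pairs(s):
    """Hypothetical helper: parse 'name from place and name from place'
    into a list of (name, affiliation) tuples with the title removed."""
    pairs = []
    for row in s.split(' and '):
        # partition splits on the first ' from ' only
        name, _, affiliation = row.partition(' from ')
        pairs.append((TITLE.sub('', name.strip()), affiliation.strip()))
    return pairs

s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"
print(parse_from_pairs(s))
```

The resulting list of tuples can be passed to pd.DataFrame with columns=['Name', 'Affiliation'] in the same way as the list built in the loop.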

2023-10-31

眼眸繁星

1873 contribution points, 9+ likes

You can do a regex match and create a DataFrame from the result. Here is an example for one of the strings:

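import re
import pandas as pd

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
text = text.replace(', and', ',')  # turn the final ", and" into a plain comma

# [\w\s] does not match ".", so the "Dr." prefix is left out of each match
matches = re.findall(r"([\w\s]+),([\w\s]+)", text)
r = [(name.strip(), affiliation.strip()) for name, affiliation in matches]

df = pd.DataFrame(r)
df.columns = ["Name", "Affiliation"]
print(df)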

Output:

            Name                              Affiliation
0  Sharif Amlani                          UC Davis Health
1      Joe Biden                         UC San Francisco
2     Elton John  Public Health Director for Davis County
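The character class [\w\s] in the pattern above stops at any punctuation, so a name or affiliation containing a hyphen or an apostrophe would be cut short. As an alternative sketch (not the answerer's method), the same string can be split on commas and the pieces paired up. The function name `parse_comma_pairs` is hypothetical, and the code assumes that names and affiliations strictly alternate and that no affiliation itself contains a comma.

```python
import re

# Matches an optional leading "and " followed by an optional "Dr." title.
PREFIX = re.compile(r'^(?:and\s+)?(?:Dr\.\s*)?')

def parse_comma_pairs(text):
    """Hypothetical helper: split on commas, clean each piece, and pair
    even-indexed pieces (names) with odd-indexed pieces (affiliations)."""
    parts = [p.strip() for p in text.split(',')]
    cleaned = [PREFIX.sub('', p, count=1) for p in parts]
    return list(zip(cleaned[0::2], cleaned[1::2]))

text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
        "and Dr. Elton John, Public Health Director for Davis County")
for name, affiliation in parse_comma_pairs(text):
    print(name, '|', affiliation)
```

Because pairing is purely positional, one missing affiliation would shift every later pair, so this only suits input that is known to be regular.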

2023-10-31

侃侃無極

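2051 contribution points, 10+ likes

In scraping, everything comes down to pattern matching. It can be very painful when the strings are not formatted consistently, and unfortunately that seems to be the case here. I therefore suggest handling each format case by case.

One pattern I can see is that, with one exception, every name starts with "Dr.". You can use that to extract the names with a regular expression.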

import re

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

# "Dr." followed by one or more capitalized words; the dot is escaped so it only matches a literal "."
regex = r'((Dr\.)( [A-Z][a-z]+)+)'  # this will return three groups per match

# keep only the first group of each match, which is the full name including "Dr."
names = [match[0] for match in re.findall(regex, text)]

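You can apply this to the other strings, but as mentioned above, the limitation is that it only captures names that start with "Dr.". You can use a similar strategy for the affiliations. Note that "," separates names from affiliations, so we can make use of that.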

text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"

# split the text on commas, drop every piece that contains 'Dr.', and trim the surrounding spaces
affiliations = [term.strip() for term in text.split(',') if 'Dr.' not in term]

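Again, you will have to adapt the solution to your specific text, but hopefully this helps you think through the problem. Finally, you can use pandas to combine the results into a DataFrame: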

import pandas as pd

# pair each name with the affiliation at the same position
data = pd.DataFrame(list(zip(names, affiliations)), columns=['Name', 'Affiliation'])
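One caveat with zip is that it stops at the shorter list, so if the name pattern misses someone (for example a name without "Dr."), rows are dropped or mismatched without any error. The sketch below adds an explicit length check before pairing. The helper name `pair_up` is hypothetical, and the name pattern is a slightly simplified version of the one above.

```python
import re

text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
        "and Dr. Elton John, Public Health Director for Davis County")

# Names: "Dr." followed by one or more capitalized words.
names = [m.group(0) for m in re.finditer(r'Dr\.(?: [A-Z][a-z]+)+', text)]
# Affiliations: comma-separated pieces that do not contain "Dr.".
affiliations = [t.strip() for t in text.split(',') if 'Dr.' not in t]

def pair_up(names, affiliations):
    """Hypothetical helper: pair by position, refusing mismatched lengths."""
    if len(names) != len(affiliations):
        raise ValueError(
            f"{len(names)} names but {len(affiliations)} affiliations")
    return list(zip(names, affiliations))

rows = pair_up(names, affiliations)
print(rows)
```

The checked list can then be handed to pd.DataFrame exactly as in the answer.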

2023-10-31

慕沐林林

2016 contribution points, 9+ likes

Here is sample code for this example text:

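import pandas as pd

# fixed-width text: this assumes each name is padded with spaces to 16 characters
text = "\
Sharif Amlani   UC Davis Health\n\
Joe Biden       UC San Francisco\n\
Elton John      Public Health Director for Davis County\n\
Winston Bishop  UC San Francisco\n\
Usain Bolt      UC San Francisco"

lines = text.split('\n')

# slice each line at column 16 and stack the one-row frames
df = pd.concat([pd.DataFrame([[line[0:16].strip(), line[16:].strip()]]) for line in lines], ignore_index=True)
df.columns = ['Name', 'Affiliation']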
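Slicing at column 16 only works when every name is padded to exactly that width. A hedged alternative, assuming the two columns are separated by at least two spaces and that no name contains a double space, is to split each line on the first run of two or more whitespace characters. `split_columns` is a hypothetical helper name.

```python
import re

def split_columns(line):
    """Hypothetical helper: split a line into (name, affiliation) at the
    first run of two or more whitespace characters."""
    name, affiliation = re.split(r'\s{2,}', line.strip(), maxsplit=1)
    return name, affiliation

text = (
    "Sharif Amlani   UC Davis Health\n"
    "Joe Biden       UC San Francisco\n"
    "Winston Bishop  UC San Francisco"
)
rows = [split_columns(line) for line in text.split('\n')]
print(rows)
```

A line without a double-space gap raises a ValueError at the unpacking step, which at least makes malformed input visible.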

2023-10-31
