4 Answers

Contributed 2019 experience points, received 9+ upvotes
If your string always follows the format "name from place and name from place", you can do this:
import pandas as pd
# your consistently formatted string
s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"
l = list()  # rows of [name, affiliation]; there may well be a better way to do this
for row in s.split(' and '):  # each row looks like "name from affiliation"
    # split once on ' from ' and strip the whitespace around both parts
    l.append([n.strip() for n in row.split(' from ', 1)])
# then create the DataFrame
df = pd.DataFrame(data=l, columns=['Name', 'Affiliation'])
# strip() above already removes stray whitespace; you could instead clean the columns afterwards with a lambda applied to the DataFrame
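The answer hints that there may be a better way than the explicit loop. One possible sketch, assuming the same "name from place and name from place" format, lets pandas do the second split through `Series.str.split` with `expand=True`. The variable name `pairs` is only illustrative.

```python
import pandas as pd

s = "Dr. Winston Bishop from UC San Francisco and Usain Bolt from UC San Francisco"

# One Series element per "name from affiliation" chunk.
pairs = pd.Series(s.split(" and "))

# Split each chunk once on " from "; expand=True returns a two-column DataFrame.
df = pairs.str.split(" from ", n=1, expand=True)
df.columns = ["Name", "Affiliation"]

# Strip any leftover whitespace in every column.
df = df.apply(lambda col: col.str.strip())
print(df)
```

Like the loop, this breaks if a name or affiliation itself contains " and " or " from ", so it only suits input that really is consistently formatted.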

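Contributed 1873 experience points, received 9+ upvotes
You can use a regular expression match and build the df from the result. Here is an example approach for a single string: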
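import re
import pandas as pd
text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
text = text.replace(', and', ',')
# '.' is not in [\w\s], so the "Dr." prefix is left out of each captured name
r = re.findall(r"([\w\s]+),([\w\s]+)", text)
# each captured group carries a leading space, so strip it
r = [(name.strip(), affiliation.strip()) for name, affiliation in r]
df = pd.DataFrame(r)
df.columns = ["Name", "Affiliation"]
print(df)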
Output:
            Name                              Affiliation
0  Sharif Amlani                          UC Davis Health
1      Joe Biden                         UC San Francisco
2     Elton John  Public Health Director for Davis County
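If the "Dr." prefix should be kept, a regex is not strictly needed for this particular string. The following is a minimal sketch, assuming the text strictly alternates name, affiliation, name, affiliation and that no field contains a comma; the names `items` and `pairs` are illustrative.

```python
text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
        "and Dr. Elton John, Public Health Director for Davis County")

# Normalise the final ", and" so that every field is separated by a plain comma.
items = [part.strip() for part in text.replace(", and", ",").split(",")]

# Even positions hold names, odd positions hold affiliations.
pairs = list(zip(items[0::2], items[1::2]))

for name, affiliation in pairs:
    print(name, "|", affiliation)
```

The resulting list of tuples can be passed to pd.DataFrame with columns=["Name", "Affiliation"], in the same way as the regex result above.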

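Contributed 2051 experience points, received 10+ upvotes
In scraping, everything comes down to pattern matching. It can be very painful when the strings are not consistently formatted, and unfortunately that seems to be the case here. So I would suggest handling it case by case.
One pattern I can observe is that, with one exception, every name starts with "Dr.". You can use that to extract the names with a regular expression.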
import re
text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
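regex = r'((Dr\.)( [A-Z][a-z]+)+)'  # the dot is escaped so it only matches a literal '.'; findall returns a tuple of three groups per match
names = [match[0] for match in re.findall(regex, text)]  # keep only the first group, which is the full name
You can apply this to the other strings, but as mentioned above, the limitation is that it only captures names that start with "Dr.". You can use a similar strategy for the affiliations. Note that a "," separates names from affiliations, so we can make use of that.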
import re
text = "Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, and Dr. Elton John, Public Health Director for Davis County"
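affiliations = [term.strip() for term in text.split(',') if 'Dr.' not in term]  # split the text on commas, drop the pieces that contain 'Dr.', and strip whitespace
Again, you will have to adapt the solution to your specific text, but hopefully this helps you think about the problem. Finally, you can use pandas to combine the results into a DataFrame: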
import pandas as pd
data = pd.DataFrame(list(zip(names, affiliations)), columns = ['Name', 'Affiliation'])
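One caveat with zip is that it silently truncates to the shorter list, so a name without "Dr." would shift or drop rows without any warning. A small sketch of a guard is shown below; `extract_pairs` is a hypothetical helper, not part of the answer, and it reuses the same two extraction rules.

```python
import re

def extract_pairs(text):
    """Hypothetical helper: return (name, affiliation) tuples, or raise on a mismatch."""
    # Names: "Dr." followed by one or more capitalised words.
    names = [m.group(0) for m in re.finditer(r"Dr\.(?: [A-Z][a-z]+)+", text)]
    # Affiliations: comma-separated pieces that do not contain "Dr.".
    affiliations = [t.strip() for t in text.split(",") if "Dr." not in t]
    if len(names) != len(affiliations):
        raise ValueError(
            f"found {len(names)} names but {len(affiliations)} affiliations"
        )
    return list(zip(names, affiliations))

text = ("Dr. Sharif Amlani, UC Davis Health, Dr. Joe Biden, UC San Francisco, "
        "and Dr. Elton John, Public Health Director for Davis County")
pairs = extract_pairs(text)
print(pairs)
```

A string such as "Joe Biden, UC San Francisco" raises ValueError here, which flags the exceptional rows that need separate handling.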

Contributed 2016 experience points, received 9+ upvotes
Here is sample code for this sample text:
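import pandas as pd
# fixed-width text: the name occupies the first 16 characters of every line
text = "\
Sharif Amlani   UC Davis Health\n\
Joe Biden       UC San Francisco\n\
Elton John      Public Health Director for Davis County\n\
Winston Bishop  UC San Francisco\n\
Usain Bolt      UC San Francisco"
lines = text.split('\n')
# slice each line at column 16, strip the padding, and stack the one-row frames
df = pd.concat([pd.DataFrame([[line[0:16].strip(), line[16:].strip()]]) for line in lines], ignore_index=True)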
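For fixed-width text like this, pandas also provides `read_fwf`, which may be simpler than slicing by hand. The sketch below assumes the name column is 16 characters wide, as in the slicing code; the second width of 60 is an arbitrary upper bound chosen for illustration, and the rows are padded programmatically so the layout is explicit.

```python
import io
import pandas as pd

rows = [
    ("Sharif Amlani", "UC Davis Health"),
    ("Joe Biden", "UC San Francisco"),
    ("Elton John", "Public Health Director for Davis County"),
    ("Winston Bishop", "UC San Francisco"),
    ("Usain Bolt", "UC San Francisco"),
]
# Build the fixed-width text: each name is left-justified in a 16-character field.
text = "\n".join(f"{name:<16}{affiliation}" for name, affiliation in rows)

# widths gives the size of each column; header=None because the text has no header row.
df = pd.read_fwf(
    io.StringIO(text),
    widths=[16, 60],  # assumed column widths
    header=None,
    names=["Name", "Affiliation"],
)
print(df)
```

This only works when the columns really are aligned; if the scraped text has collapsed whitespace, one of the delimiter-based approaches in the other answers is a better fit.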