
Pandas: Split a string column into component columns based on a list of start/end split-point strings (overlapping)

Python

森欄 2022-03-09 20:55:55

In my Pandas DataFrame of strings, one column holds a large string that I want to split into separate strings, each on its own row of a new DataFrame. The second column is a label, and the same label should appear on every component string. The start and end split points should be determined by a set of strings: each component string begins where one of the strings in that set is encountered. The string that starts each component should go in its own column on that row and should not be part of the split text.

Here is an example. I have this set of strings:

listStrings = {
    '\nIntroduction', '\nCase', '\nLiterature', '\nBackground', '\nRelated',
    '\nMethods', '\nMethod', '\nTechniques', '\nMethodology', '\nResults',
    '\nResult', '\nExperimental', '\nExperiments', '\nExperiment', '\nDiscussion',
    '\nLimitations', '\nConclusion', '\nConclusions', '\nConcluding',
    'Introduction\n', 'Case\n', 'Literature\n', 'Background\n', 'Related\n',
    'Methods\n', 'Method\n', 'Techniques\n', 'Methodology\n', 'Results\n',
    'Result\n', 'Experimental\n', 'Experiments\n', 'Experiment\n', 'Discussion\n',
    'Limitations\n', 'Conclusion\n', 'Conclusions\n', 'Concluding\n',
    'INTRODUCTION', 'CASE', 'LITERATURE', 'BACKGROUND', 'RELATED',
    'METHODS', 'METHOD', 'TECHNIQUES', 'METHODOLOGY', 'RESULTS',
    'RESULT', 'EXPERIMENTAL', 'EXPERIMENTS', 'EXPERIMENT', 'DISCUSSION',
    'LIMITATIONS', 'CONCLUSION', 'CONCLUSIONS', 'CONCLUDING',
    'Introduction:', 'Case:', 'Literature:', 'Background:', 'Related:',
    'Methods:', 'Method:', 'Techniques:', 'Methodology:', 'Results:',
    'Result:', 'Experimental:', 'Experiments:', 'Experiment:', 'Discussion:',
    'Limitations:', 'Conclusion:', 'Conclusions:', 'Concluding:',
}

Until the string in column A reaches one of the strings in listStrings, nothing should be saved. Once it reaches one of the strings in listStrings, that listStrings string should be placed in its own separate column in a row of the new DataFrame. Everything after that listStrings string then goes in a new row, until the segment reaches another string in listStrings. The process then repeats: that string goes in a new column, a new row is created for the new segment, and so on.
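The procedure described above can be sketched for a single string in plain Python. This is only a minimal illustration, not the asker's code: the function name `split_on_markers`, the shortened `markers` list, and the sample text are all hypothetical, and the label column is left out.

```python
import re

def split_on_markers(text, markers):
    """Return (marker, segment) pairs; text before the first marker is dropped."""
    # Try longer markers first so that 'Results:' is preferred over 'Result:'.
    ordered = sorted(markers, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(m) for m in ordered))
    matches = list(pattern.finditer(text))
    pairs = []
    for current, following in zip(matches, matches[1:] + [None]):
        # A segment runs from the end of this marker to the start of the
        # next one, or to the end of the text for the last marker.
        end = following.start() if following is not None else len(text)
        pairs.append((current.group(), text[current.end():end]))
    return pairs

# Hypothetical sample input, not taken from the question's data.
markers = ['Methods:', 'Method:', 'Results:', 'Result:']
sample = 'Preamble. Methods: we measured X. Results: X went up.'
pairs = split_on_markers(sample, markers)
print(pairs)
```

Each pair corresponds to one row of the desired result: the marker goes in its own column and the segment in the next, while `'Preamble. '` is discarded because it precedes the first marker.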


1 Answer

大話西游666


Here is one approach; I'm not sure how efficient it is on large datasets:

import pandas as pd

# First, build one big regex pattern from the keywords.
# listStrings is a set, so the order of the alternatives (and therefore
# whether 'RESULT' or 'RESULTS' matches first) can differ between runs.
pat = '|'.join(listStrings)

# Find all keywords in each row of column A.
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1    [\nResults, \nConclusion]
# 2    [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# Find all the chunks by splitting each row's text on the keywords found in it.
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()

# Stack the keywords so they share the (row, position) index of the chunks.
keys = new_df.str.join(' ').str.split(' ', expand=True).stack()

# Build the returned DataFrame.
# Note that we shift the chunks by one to line them up with the keywords.
pd.DataFrame({'D': keys, 'E': chunks.groupby(level=0).shift(-1)})

Output:

     D             E
0 0  BACKGROUND    \nDiagnostic uncertainty in ALS has serious ma...
  1  METHODS       \nData from 75 ALS patients and 75 healthy con...
  2  RESULT        S\nFollowing predictor variable selection, a c...
  3  DISCUSSION    \nThis study evaluates disease-associated imag...
  4  NaN           NaN
1 0  \nResults     : The findings show ICT innovation was effecti...
  1  \nConclusion  : By evaluating the ICT innovation, empirical ...
  2  NaN           NaN
2 0  BACKGROUND     AND PURPOSE\nRotator cuff tears are associate...
  1  METHODS       \nSupraspinatus muscle biopsies were obtained ...
  2  RESULT        S\nDegenerative changes were present in both p...
  3  NaN           NaN
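The `RESULT` / `S\n...` split in the output above comes from how regex alternation works: the alternatives are tried left to right and the first one that matches wins, so when `RESULT` happens to come before `RESULTS` in the joined pattern, the trailing `S` is left in the text chunk. Because `listStrings` is a set, that order is arbitrary. One possible fix, sketched below with a hypothetical shortened keyword list and sample string, is to sort the keywords longest first (and pass them through `re.escape`) before joining.

```python
import re

# Hypothetical shortened keyword list; the question's listStrings is much longer.
keywords = ['RESULT', 'RESULTS', 'METHOD', 'METHODS']
text = 'METHODS\nWe did things. RESULTS\nIt worked.'

# Naive join: 'RESULT' precedes 'RESULTS', so the shorter keyword wins.
naive = re.compile('|'.join(keywords))

# Longest first: 'RESULTS' is tried before 'RESULT'.
ordered = sorted(keywords, key=len, reverse=True)
safe = re.compile('|'.join(re.escape(k) for k in ordered))

naive_found = naive.findall(text)
safe_found = safe.findall(text)
print(naive_found)  # ['METHOD', 'RESULT']
print(safe_found)   # ['METHODS', 'RESULTS']
```

In the answer's code this would amount to building `pat` from the sorted, escaped keywords instead of joining the set directly; the same ordering matters for the per-row pattern passed to `str.split`.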

Edit:

Here is a version of the solution that gives the exact output specified in the question:

import numpy as np
import pandas as pd

# First, build one big regex pattern from the keywords.
pat = '|'.join(listStrings)

# Find all keywords in each row of column A.
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1    [\nResults, \nConclusion]
# 2    [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# Find all the chunks by splitting each row's text on the keywords found in it.
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()

# Flatten the per-row keyword lists into one array.
keys = np.concatenate(new_df.values)

# Shift the chunks by one to line them up with the keywords, then drop the
# NaN left at the end of each row's group.
values = chunks.groupby(level=0).shift(-1).dropna().values

# Repeat each row's label from column B once per keyword found in that row.
labels = np.concatenate([len(val) * [testdf['B'][ind]] for ind, val in enumerate(new_df.values)])

# Build the returned DataFrame.
pd.DataFrame({'C': keys, 'D': values, 'E': labels})

Output:

   C             D                                                  E
0  BACKGROUND    \nDiagnostic uncertainty in ALS has serious ma...  Entry1
1  METHODS       \nData from 75 ALS patients and 75 healthy con...  Entry1
2  RESULTS       \nFollowing predictor variable selection, a cl...  Entry1
3  DISCUSSION    \nThis study evaluates disease-associated imag...  Entry1
4  \nResult      s: The findings show ICT innovation was effect...  Entry2
5  \nConclusion  : By evaluating the ICT innovation, empirical ...  Entry2
6  BACKGROUND     AND PURPOSE\nRotator cuff tears are associate...  Entry3
7  METHODS       \nSupraspinatus muscle biopsies were obtained ...  Entry3
8  RESULTS       \nDegenerative changes were present in both pa...  Entry3
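As an alternative to splitting each row and then realigning keywords and chunks with `shift`, `re.split` with a capturing group returns the keywords and the text between them in one interleaved list. The sketch below is an illustration under stated assumptions rather than the answer's method: `split_rows`, the short keyword list, and the sample records are hypothetical, and it builds plain dictionaries that reuse the `C`/`D`/`E` column names from the output above.

```python
import re

def split_rows(records, keywords):
    """Turn (text, label) records into one dict per keyword/segment pair."""
    ordered = sorted(keywords, key=len, reverse=True)  # longest keyword first
    # The capturing group makes re.split keep the keywords in its result.
    pattern = re.compile('(' + '|'.join(re.escape(k) for k in ordered) + ')')
    rows = []
    for text, label in records:
        parts = pattern.split(text)
        # parts = [text before first keyword, kw1, seg1, kw2, seg2, ...]
        for keyword, segment in zip(parts[1::2], parts[2::2]):
            rows.append({'C': keyword, 'D': segment, 'E': label})
    return rows

# Hypothetical sample data standing in for testdf's columns A and B.
records = [
    ('ignored BACKGROUND\nsome context METHODS\nsome steps', 'Entry1'),
    ('RESULTS\nsome findings', 'Entry2'),
]
rows = split_rows(records, ['BACKGROUND', 'METHODS', 'RESULTS', 'RESULT'])
for row in rows:
    print(row)
```

Passing the resulting list to `pd.DataFrame(rows)` would give a frame with columns `C`, `D`, and `E`. Rows whose text contains no keyword simply contribute nothing, which matches the "do not save anything before the first keyword" requirement.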

Answered 2022-03-09
