

Pandas: split a string column into component columns based on a list of start/end delimiter strings (with overlaps)


森欄 2022-03-09 20:55:55
I have a Pandas dataframe of strings. One column holds a large string that I want to split into separate strings, each on its own row of a new dataframe. The second column is a label, and the same label should appear on every component string.

The start and end split points are determined by a set of strings. Each component string begins where one of the strings in the set is encountered. That starting string should go in its own column on the component's row and should not be part of the split text.

Here is an example. I have this set of strings:

listStrings = {
    '\nIntroduction', '\nCase', '\nLiterature', '\nBackground', '\nRelated', '\nMethods', '\nMethod',
    '\nTechniques', '\nMethodology', '\nResults', '\nResult', '\nExperimental', '\nExperiments',
    '\nExperiment', '\nDiscussion', '\nLimitations', '\nConclusion', '\nConclusions', '\nConcluding',
    'Introduction\n', 'Case\n', 'Literature\n', 'Background\n', 'Related\n', 'Methods\n', 'Method\n',
    'Techniques\n', 'Methodology\n', 'Results\n', 'Result\n', 'Experimental\n', 'Experiments\n',
    'Experiment\n', 'Discussion\n', 'Limitations\n', 'Conclusion\n', 'Conclusions\n', 'Concluding\n',
    'INTRODUCTION', 'CASE', 'LITERATURE', 'BACKGROUND', 'RELATED', 'METHODS', 'METHOD',
    'TECHNIQUES', 'METHODOLOGY', 'RESULTS', 'RESULT', 'EXPERIMENTAL', 'EXPERIMENTS',
    'EXPERIMENT', 'DISCUSSION', 'LIMITATIONS', 'CONCLUSION', 'CONCLUSIONS', 'CONCLUDING',
    'Introduction:', 'Case:', 'Literature:', 'Background:', 'Related:', 'Methods:', 'Method:',
    'Techniques:', 'Methodology:', 'Results:', 'Result:', 'Experimental:', 'Experiments:',
    'Experiment:', 'Discussion:', 'Limitations:', 'Conclusion:', 'Conclusions:', 'Concluding:',
}

Nothing should be saved until the string in column A reaches one of the strings in listStrings. Once it does, put that listStrings string in its own column on a row of the new dataframe. Then put everything after that listStrings string on the same row as a new segment, up to the point where the text reaches another string from listStrings. Then repeat the process: put that string in the keyword column of a new row, start a new segment for it, and so on.
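The rule described above can be sketched in plain Python before bringing Pandas into it. This is only an illustration under stated assumptions: `keywords`, `text`, `label`, and `split_on_keywords` are hypothetical names, the keyword set is a shortened stand-in for `listStrings`, and no keyword in this toy set is a prefix of another.

```python
import re

# Hypothetical, shortened stand-ins for listStrings and one row of the dataframe.
keywords = {"BACKGROUND", "METHODS", "DISCUSSION"}
text = "Title line BACKGROUND\nSome context. METHODS\nWhat we did. DISCUSSION\nWhat it means."
label = "Entry1"

# re.escape keeps any regex metacharacters in a keyword literal.
pattern = re.compile("|".join(re.escape(k) for k in keywords))

def split_on_keywords(text, label):
    """Return (keyword, segment, label) rows; text before the first keyword is dropped."""
    matches = list(pattern.finditer(text))
    rows = []
    # Pair each match with the one after it; the last match is paired with None.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        rows.append((current.group(), text[current.end():end], label))
    return rows

rows = split_on_keywords(text, label)
for row in rows:
    print(row)
```

Each tuple corresponds to one row of the desired dataframe: the keyword, the text that follows it up to the next keyword, and the label carried over from the source row.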

1 Answer

大話西游666


Here is one approach. I'm not sure how efficient it is on large datasets:


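import pandas as pd

# Build one big regex alternation from the keywords.
# Note: the keywords are not regex-escaped, and a set has no guaranteed
# order, so a shorter keyword such as 'RESULT' can match before 'RESULTS'
# (this is why the output below shows 'RESULT' followed by a stray 'S').
pat = '|'.join(listStrings)

# Find all keywords in each row of column A
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# Find all the chunks by splitting each row's text on the keywords found in it
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()

# Stack the keywords (this relies on no keyword containing a space)
keys = new_df.str.join(' ').str.split(' ', expand=True).stack()

# Build the output DataFrame.
# The chunks are shifted by one so each keyword lines up with the text after it.
pd.DataFrame({'D': keys, 'E': chunks.groupby(level=0).shift(-1)})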

Output:


                D                                                  E
0 0    BACKGROUND  \nDiagnostic uncertainty in ALS has serious ma...
  1       METHODS  \nData from 75 ALS patients and 75 healthy con...
  2        RESULT  S\nFollowing predictor variable selection, a c...
  3    DISCUSSION  \nThis study evaluates disease-associated imag...
  4           NaN                                                NaN
1 0     \nResults  : The findings show ICT innovation was effecti...
  1  \nConclusion  : By evaluating the ICT innovation, empirical ...
  2           NaN                                                NaN
2 0    BACKGROUND   AND PURPOSE\nRotator cuff tears are associate...
  1       METHODS  \nSupraspinatus muscle biopsies were obtained ...
  2        RESULT  S\nDegenerative changes were present in both p...
  3           NaN                                                NaN
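The stray `S` after `RESULT` in the output comes from how regex alternation works: the first alternative that matches at a position wins, even if a longer one would also match. A minimal sketch of the effect and one possible remedy (sorting longest-first and applying `re.escape`) follows. The `keywords` list and `text` are hypothetical stand-ins, not the question's data.

```python
import re

# Hypothetical subset of listStrings, deliberately listing the shorter keyword first.
keywords = ["RESULT", "RESULTS", "METHODS"]
text = "METHODS\nWe measured. RESULTS\nWe found."

# Naive join: 'RESULT' is tried before 'RESULTS', so the 'S' stays in the text.
naive = re.compile("|".join(keywords))
naive_hits = naive.findall(text)
naive_chunks = naive.split(text)

# Longest-first ordering lets 'RESULTS' win; re.escape guards against
# keywords that happen to contain regex metacharacters.
ordered = sorted(keywords, key=len, reverse=True)
safe = re.compile("|".join(re.escape(k) for k in ordered))
safe_hits = safe.findall(text)
safe_chunks = safe.split(text)

print(naive_hits, naive_chunks)
print(safe_hits, safe_chunks)
```

The same `sorted(..., key=len, reverse=True)` step could be applied to `listStrings` before building `pat`, assuming the longest match is what is wanted whenever one keyword is a prefix of another.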

Edit:

Here is a version of the solution that gives the exact output specified in the question:


import numpy as np
import pandas as pd

# Build one big regex alternation from the keywords
pat = '|'.join(listStrings)

# Find all keywords in each row of column A
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# Find all the chunks by splitting each row's text on the keywords found in it
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()

# Flatten the per-row keyword lists into one array
keys = np.concatenate(new_df.values)

# Shift the chunks by one so each keyword lines up with the text after it,
# then drop the NaN left at the end of each row's group
values = chunks.groupby(level=0).shift(-1).dropna().values

# Repeat each row's label from column B once per keyword found in that row
# (.iloc is used so this does not depend on testdf having a default index)
labels = np.concatenate([len(val) * [testdf['B'].iloc[ind]] for ind, val in enumerate(new_df.values)])

# Build the output DataFrame
pd.DataFrame({'C': keys, 'D': values, 'E': labels})

Output:


   C             D                                                  E
0  BACKGROUND    \nDiagnostic uncertainty in ALS has serious ma...  Entry1
1  METHODS       \nData from 75 ALS patients and 75 healthy con...  Entry1
2  RESULTS       \nFollowing predictor variable selection, a cl...  Entry1
3  DISCUSSION    \nThis study evaluates disease-associated imag...  Entry1
4  \nResult      s: The findings show ICT innovation was effect...  Entry2
5  \nConclusion  : By evaluating the ICT innovation, empirical ...  Entry2
6  BACKGROUND     AND PURPOSE\nRotator cuff tears are associate...  Entry3
7  METHODS       \nSupraspinatus muscle biopsies were obtained ...  Entry3
8  RESULTS       \nDegenerative changes were present in both pa...  Entry3
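As an alternative to splitting and shifting, the keyword and the text after it can be captured in a single pass with `Series.str.extractall`. The sketch below is not the answer's method. It assumes a small hypothetical `testdf` and `keywords` set in place of the question's data, and it uses a lazy match with a lookahead so that each segment stops at the next keyword or at the end of the string.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the question's testdf and listStrings.
testdf = pd.DataFrame({
    "A": ["intro text BACKGROUND\nWhy. METHODS\nHow. RESULTS\nWhat.",
          "preamble\nResults: Found it.\nConclusion: Done."],
    "B": ["Entry1", "Entry2"],
})
keywords = {"BACKGROUND", "METHODS", "RESULT", "RESULTS",
            "\nResults", "\nResult", "\nConclusion"}

# Longest-first and escaped, so 'RESULTS' is preferred over 'RESULT'.
alternation = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))

# C = the keyword; D = everything up to the next keyword or the end of the text.
# (?s) lets '.' match newlines, since the segments span several lines.
pattern = rf"(?s)(?P<C>{alternation})(?P<D>.*?)(?={alternation}|\Z)"

# extractall returns one row per match, indexed by (original row, match number).
out = testdf["A"].str.extractall(pattern)

# Carry the label from column B over to every match of the same source row.
out["E"] = testdf["B"].reindex(out.index.get_level_values(0)).to_numpy()
out = out.reset_index(drop=True)
print(out)
```

Text before the first keyword is never captured, which matches the requirement that nothing be saved until a keyword is reached.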


Answered 2022-03-09