我正在嘗試做一些類似于如何循環遍歷 pandas df 列我的初始數據框logs包含以下內容:ID DB USER MDX_TEXT1 DB1 JOE SELECT [ATTR1].[ATTR1].[THE_ATTR1] ON COLUMNS,[ATTR2].[ATTR2].[THE_ATTR2] ON ROWS FROM [THE_DB] WHERE [ATTR3].[ATTR3].[THE_ATTR3]2 DB1 JAY SELECT [ATTR1].[ATTR1].[THE_ATTR1] ON COLUMNS, [ATTR3].[ATTR3].[THE_ATTR3] ON ROWS FROM [THE_DB] WHERE [ATTR3].[ATTR3].[THE_ATTR3]MDX_TEXT使用正則表達式,然后我提取每個的唯一實例ID# Step 1: Define a pattern to extract MDX regex patternimport repattern = re.compile (r'(\[[\w ]+\]\.\[[\w ]+\](?:\.(?:Members|\[Q\d\]))?)')# Step 2: Create a dataframe to store distinct list of MDX Query attributes, based on patternextrpat = ( logs['MDX_TEXT'].str.extractall(pattern) .drop_duplicates() .to_numpy() )# Step 3: Create a dataframe to store distinct list of attributes used in a queryattr= pd.DataFrame(data=extrpat)attr.rename({0: 'attrname'}, inplace=True, axis=1)# Preview the first 5 lines of the attributes dataframeattr.head()結果是:attrname[THE_ATTR1][THE_ATTR2][THE_ATTR3][THE_ATTR1][THE_ATTR3]我想要的是,除了在步驟 2 中提取唯一屬性之外,還提取IDand USER,如下所示:ID USER attrname1 JOE [THE_ATTR1]1 JOE [THE_ATTR2]1 JOE [THE_ATTR3]2 JAY [THE_ATTR1]2 JAY [THE_ATTR3]最后,加入attr和logs上的數據框ID。這個想法是引入第三個數據框users:USER LOCJOE NYJIL NJMAC CA......我將加入上述內容USER以結束:ID USER LOC attrname1 JOE NY [THE_ATTR1]1 JOE NY [THE_ATTR2]1 JOE NY [THE_ATTR3]2 JAY NJ [THE_ATTR1]2 JAY NJ [THE_ATTR3]
1 回答

猛跑小豬
TA貢獻1858條經驗 獲得超8個贊
模式片是一個好的開始,但是你必須將它與原始數據框合并/加入:
df.index.name = "inx"
pattern = re.compile (r'(\[[\w ]+\]\.\[[\w ]+\])')
# extract the attributes.
extracts = df.MDX_TEXT.str.extractall(pattern).rename(columns={0:"attrname"})
# join the result with the original dataframe.
res = df.join(extracts).reset_index()[["ID", "USER", "attrname"]].drop_duplicates()
# take just the last part of each attribute name.
res["attrname"] = res["attrname"].str.split(".", expand = True).iloc[:, -1]
結果是:
ID USER attrname
0 1 JOE [ATTR1]
1 1 JOE [ATTR2]
2 1 JOE [ATTR3]
3 2 JAY [ATTR1]
4 2 JAY [ATTR3]
添加回答
舉報
0/150
提交
取消